Published: Jul 23, 2023

Introduction

The COVID-19 pandemic has greatly stimulated the development of video conferencing in various areas of human activity, especially in education [1,2]. Video conferencing has been crucial in maintaining productivity and communication in many areas, including business, education, and healthcare.

In business, the use of video conferencing technologies has made it possible to hold meetings, discuss projects, and work together on tasks, reducing downtime and minimizing losses.

Online conferencing has allowed students and teachers to continue their education and share ideas and knowledge, reducing the pandemic's negative impact on educational processes. In medicine, online conferences have become an invaluable tool in the fight against the pandemic.

They have allowed doctors to consult patients remotely and exchange opinions and knowledge with colleagues around the world. Since online conferencing eliminates the need to travel to attend meetings or seminars in person, it reduces travel costs, and the time saved can be used more productively. People from different geographical regions can participate in online conferences, which increases social integration and the exchange of cultural and professional experience. These aspects are especially valuable for international organizations and projects. Thus, the importance of online conferences for modern society can hardly be overestimated. They help maintain social ties, promote scientific research, and ensure the continuity of business operations and the educational process.


This paper is devoted to expanding the standard capabilities of online conferencing for various purposes, with special attention paid to online conferencing in the educational process.

The authors considered and compared the capabilities of online conferencing services such as Zoom, Google Meet, and Microsoft Teams. They propose to extend the standard features of Zoom conferencing by adding the ability to verify a participant's identity based on video stream analysis, assess the participant's emotional state and activity in the meeting, and track compliance with the established rules during online tests or exams. This functionality will greatly improve the impact of various types of distance learning activities with a large number of participants, make it possible to monitor students' behavior during lectures and exams, and summarize and analyze activity over different periods.

The main aim of the work is to research the possibility of adding new features to the standard functionality of the Zoom platform and to develop an application based on the Zoom conference with additional monitoring capabilities during the online event and a generalized analysis of the participants' actions afterward. It is proposed to monitor the actions of conference participants based on the analysis of the Zoom video stream coming from the participants' cameras in real time, as well as by intercepting other standard Zoom conference events.

To achieve this aim, we had to solve the following tasks: study and compare the existing Zoom SDKs; research and select the most suitable methods of face and emotion recognition; investigate the possibility of tracking conference events such as raising a hand and turning the microphone on/off; design the application architecture, its services, and a database for saving events during the conference for subsequent analysis; and implement a prototype application.

1. Overview of modern online conferencing services and analysis of the possibilities of expanding the functionality of Zoom conferences based on video stream analysis

1.1 Comparative overview of online conferencing services

In a reference source [3], the popularity of online conferences in 118 countries was studied. The following services were considered: BigBlueButton, Bluejeans Meetings, ClickMeeting, Glip, Google Hangouts, Google Meet, GoToMeeting, Houseparty, Lifesize, Microsoft Teams, Nextiva, RingCentral Video, Skype, Slack, U Meeting, Webex, and Zoom (Fig. 1).

The study found that the top three were Zoom, Google Meet, and Microsoft Teams. Zoom leads by a significant margin: it was the most popular video calling app of 2022 in 80 countries (66% of all countries analyzed). In 2021, Zoom was the most popular platform in 44 countries, so its popularity almost doubled over the year. People in 28 countries chose Google Meet as the most used video calling platform, and Microsoft Teams is the most used in 7 countries. Compared to 2021, when Microsoft Teams was the most popular tool in 41 countries, its popularity has decreased significantly, by 83%. More details on the methodology and the research results for each of the 118 countries, as well as changes compared to 2020 and 2021, can be found in the source [3]. Below is a brief description of the leading platforms. A comparison of some parameters of the top platforms is shown in Table 1.

Zoom is video conferencing software that was developed for corporate use but is now one of the most popular video calling programs in general. The program allows meetings of up to 1,000 participants with up to 49 participants on the screen. Zoom allows users to easily record meetings and share them with those who were not present.

Files can also be shared during conferences. There have been significant security issues, but as the product has grown in popularity, the company has made great efforts to address them. Additional advantages include support for integration with Google Calendar, Facebook, DropBox, and many other third-party programs. Zoom also offers engagement features such as hand raising, polls, screen sharing, non-verbal feedback, and a wide range of video control options. The drawbacks are the threat of intruders invading the video conference and the numerous subscriptions and add-ons, which take time to understand.

Figure 1  The four most popular video conferencing platforms in the world (2023, January 4)

Source: The Most Popular Video Call Conferencing Platforms Worldwide [3].

Google Meet is a video conferencing tool designed specifically to meet the video meeting needs of businesses of all sizes. You need a Google account to use this software. The application is compatible with various Google products, such as Google Chat. Available features include screen sharing, call recording, full-screen viewing, subtitling, customizing the layout of elements, and more.

Users can join in various ways: via shared emails, links, or calendar invitations. The program supports meetings of up to 250 participants per call, as well as live broadcasting for up to 100,000 viewers within a domain. Additional advantages: no need to download software, and meetings can be recorded directly to Google Drive. On the downside, it is not possible to transfer multimedia documents, and the platform consumes a significant amount of hardware resources.

Microsoft Teams is a video conferencing application that can be a great choice for large companies. It is part of Microsoft 365. External participants using Microsoft 365 can join meetings without having to download Teams. Users can also easily share emails and attachments using this app. Many Microsoft programs, including Outlook and Office 365, are linked to the app. Teams has additional features such as background blurring, screen sharing, call recording, hand raising, improved noise cancellation, useful chatbots and add-ons, file search, and backup. Among the limitations: high memory consumption, notifications that are not always received, and a limited number of channels.

Table 1 Comparison of video conferencing services


Source: self-processing of information from websites: https://zoom.us; https://meet.google.com; https://www.microsoft.com/en-us/microsoft-teams/group-chat-software

Each of these video conferencing services has its advantages and disadvantages. In this work on adding functionality for monitoring the actions of conference participants, the Zoom platform was chosen for the following reasons:

  • the most popular service in 2022 and 2021;
  • the maximum possible number of participants (for the most expensive subscription);
  • a large number of Software Development Kits (SDKs).

An SDK is a set of software tools to develop applications for a particular platform or framework. These tools usually include a variety of libraries, Application Programming Interface (API) documentation, code samples, build processes, and other useful tools. SDKs help developers create software efficiently and quickly by providing them with ready-to-use functions and procedures.

1.2 Features of creating applications for Zoom conferences

You can integrate Zoom conferences into your app using the following solutions:

  • Zoom Meeting SDK [4];
  • Zoom Video SDK [5].

The Zoom Meeting SDK is a set of developer tools that uses the Zoom interface directly and allows you to build additional functionality into a Zoom conference for the following platforms: Android, iOS, macOS, Web, and Windows. When using the Zoom Meeting SDK, the ability to modify the user interface is minimal: it still looks like the standard Zoom interface. At the same time, this is an advantage, since conference participants are usually familiar with the Zoom interface and do not have to get used to a new one.

The Zoom Video SDK does not provide a user interface. Instead, it allows developers to create any user interface, depending on the purpose for which the video is used. Also, when using the Video SDK, it is not possible to join regular Zoom conferences: separate sessions are created for the Video SDK, which are routed through other Zoom servers. In addition to the interface, complete freedom is provided when working with the video/audio streams of meeting participants. A comparison of the Zoom SDKs is shown in Table 2.

Table 2  Zoom SDKs comparison


Source: authors' processing of sources [4,5].

Both SDKs provide the ability to receive video data from each participant as separate frames in the YUV420 format. Audio data can be received from each participant separately or as the audio of the entire meeting, i.e., what the participants hear. However, raw video data can be received with the Meeting SDK on macOS and Windows only.

Raw video data for the Video SDK can be received on all platforms that support the Video SDK with a limitation for the Web: no more than 25 participants at a time. The Video SDK provides functionality not only for receiving raw data, but also for transferring it to such streaming platforms as Facebook Live, YouTube Live, etc.

Another difference is that, unlike the Meeting SDK, the Video SDK cannot connect to regular Zoom meetings and does not come with the standard client software available on each platform. Instead, it uses "isolated" sessions that go through Zoom servers for media processing, recording, and live streaming, and you need to write the user interface yourself. This makes the Video SDK more flexible. In addition, the Video SDK allows you to trade off resolution against frame rate when network bandwidth is limited; if bandwidth is sufficient, the user receives the best quality video.

Thus, the Video SDK is the more appropriate choice when you need to process raw video, thanks to its wider functionality for working with video streams, or for commercial solutions that require their own interface and additional video conferencing functionality. The Meeting SDK, in turn, is better suited for small commercial solutions that need to quickly integrate Zoom conferencing into their project on a large number of platforms. It was decided to use the Video SDK to develop the application for monitoring the actions of Zoom video conference participants because of its wider functionality for working with video streams and other Zoom events, as well as the lack of restrictions when designing the interface.

1.3 The relevance of adding the ability to monitor user behavior based on webcam video stream analysis, and the statement of the research problem

The issue of online learning is extremely relevant in the modern world and has a significant impact on educational processes and the development of humanity as a whole. Over the past few years, online learning has significantly expanded its capabilities.

Due to the COVID-19 pandemic and the war in Ukraine, distance education has become the only available option for many students. The need to adopt distance, online, or blended education has become a major challenge for the education system and has required corresponding changes in approaches to teaching and the organization of the learning process. Thus, the issue of online learning is extremely relevant and requires further development of technological and methodological approaches to conducting classes and assessing knowledge.

A review of the capabilities of modern online conferencing has shown that platforms such as Zoom, Google Meet, and Microsoft Teams have almost the same functionality during a video conference: virtual waiting rooms, screen sharing, chat, a raised-hand feature, and others. On the whole, this functionality is very convenient.

But when using online conferencing in the educational process, where there are a large number of participants and individual conferences are combined into a series of lectures, workshops, and seminars in each discipline for each group of students, the following functions are lacking:

  • use of the video stream to automatically identify a participant by face (with the standard functionality, if a participant joins under a certain login, there is no guarantee that the invited person is actually the one who joined);
  • assigning and tracking meeting rules (for example, in an online exam or knowledge test, the participant must have the camera turned on, the face must be sufficiently illuminated for identification, only one participant may be in front of the camera, and so on);
  • monitoring the actions and emotional state of participants during conferences and the ability to summarize information by time periods, disciplines, types of classes, groups of students, etc. (recording the time of joining/leaving the meeting, turning on the microphone, using emojis, and analyzing the video stream to recognize emotions).

Such functionality would greatly help the conference organizer during the event, as well as allow them to assess the interest of participants and collect statistical information for further research on the relationship between conference behavior and the target outcome of the conference or series of conferences, which would help improve the learning process based on online events.

Thus, the object of the study is the problem of monitoring the activities of participants during a Zoom conference. The subject is the possibility of adding functionality to monitor participants' activity based on video stream analysis and the interception of internal Zoom events using the Zoom Video SDK.

The aim of the work is to investigate the possibility of adding new features to the Zoom standard functionality to monitor participants' activities based on video stream analysis and interception of internal Zoom events and to develop a prototype application based on the Zoom Video SDK. To achieve the aim, the following tasks need to be solved:

  • to study the problem of intercepting a video stream based on the Zoom Video SDK and the issue of converting Zoom frames to the standard RGB format for Computer Vision; to consider the issue of intercepting Zoom events, such as "raised hand", turning on the microphone, setting emojis;
  • to research and use image analysis methods and existing Computer Vision libraries for solving the problems of face verification and emotional state recognition;
  • to develop a requirements specification for an application with the following functionality:
    • to identify the participant's identity by analyzing the Zoom video stream for compliance with the image in the database for the entered login;
    • to track the activity of participants, such as "raised hand", microphone activation (based on Zoom events analysis), and the emotional state of participants (based on video stream analysis);
    • to set up conference rules for each session and monitor the compliance with these rules by the participant (one face in the frame, matching the face with an image of the logged-in profile, the constant presence of a face in the frame);
    • to collect all information in a database, summarize and visualize it;
  • to design an application architecture for monitoring the actions of Zoom conference participants; and to develop a prototype application;
  • to make a conclusion about the operation of the developed application based on the selected SDK, used image analysis methods, libraries, and frameworks.

2. Analyzing the video stream for face and emotion recognition

Currently, computer systems use a variety of color models to represent color in digital form: RGB, CMYK, HSV, LAB, YUV, etc. Zoom uses the YUV format to transmit video conference frames. These frames are obtained through the Zoom Video SDK, but most neural network models do not accept YUV input. Therefore, the frames need to be converted to the RGB format, which is more common for neural networks.
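For example, the conversion can be done with the OpenCV library, which the Image Analyzer already relies on. Below is a minimal sketch, assuming the Zoom Video SDK delivers a planar YUV420 (I420) buffer together with the frame dimensions (the function and variable names are illustrative, not part of the SDK):

    import cv2
    import numpy as np

    def yuv420_to_rgb(raw_bytes: bytes, width: int, height: int) -> np.ndarray:
        # An I420 frame stores a full-resolution Y plane followed by
        # quarter-resolution U and V planes: 1.5 bytes per pixel in total.
        yuv = np.frombuffer(raw_bytes, dtype=np.uint8).reshape(height * 3 // 2, width)
        # OpenCV converts the planar YUV layout to an (H, W, 3) RGB array.
        return cv2.cvtColor(yuv, cv2.COLOR_YUV2RGB_I420)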

2.1 Stages of face and emotion recognition

Modern methods of verifying or identifying a person by face are not a single model but a system: a sequence or set of models and algorithms. The face recognition process typically consists of four stages: face detection, face alignment, embedding (computing a vector of facial features), and verification or classification (Fig. 2).

Figure 2  Stages of face recognition

Source: created by authors

2.2 Face detection

The detection and alignment stages are important due to the variability of the photo or video stream (input data). For example, the face may not be frontal, it may be of different sizes, and these nuances are crucial for face recognition.

There are many methods of face detection. The first serious successes were achieved by the Haar cascade method, which analyzes the image using a set of primitives in a sliding window, and the Histogram of Oriented Gradients (HOG) method, which describes an object using a distribution of oriented gradients. These methods have become classics in object detection and provide high accuracy for frontal faces, but they are not recommended for faces at an angle or with local occlusions of part of the face.

The emergence of deep neural network architectures has significantly improved the quality of face detection. The main advantage of these networks is that a network trained on multi-class classification can independently identify the features that characterize a face and detect it. Moreover, if the datasets on which the networks were trained contain samples with rotated faces, faces of different sizes, and partially obscured faces (glasses, medical masks, etc.), the neural network learns to detect such complex face images. However, even today, there are tasks for which classical detection methods, or their combination with neural network methods, are used [6].

The paper [7] solves the problem of choosing the best face detection model for an enterprise security system based on the analysis of a video stream from surveillance cameras. In that paper, the following detection methods based on deep neural networks were studied and compared: Multi-Task Cascaded Convolutional Networks (MTCNN) [8], FaceBoxes [9], Dual Shot Face Detector (DSFD) [10], RetinaFace [11], CenterFace [12], and Single-stage Cascade Residual Face Detector (SCRFD) [13]. The methods were compared in terms of average precision (AP) [11-13], landmarks, the maximum face rotation angle, the minimum face size detected with a confidence of more than 0.9, and the average processing time per frame for VGA images (640×480 pixels). Based on the comparison, it was concluded that the RetinaFace-MobileNet0.25 model best meets the needs of a security system based on analyzing video from surveillance cameras, i.e., it handles face rotation and changes in face size and meets the need to work in real time.

The current work solves the face verification problem based on the analysis of the Zoom conference video stream. Therefore, the detection method requirements are not as strict as for a security system based on the analysis of video from surveillance cameras [7]. In the proposed monitoring system, only one person is expected to be in the frame, with the face close to the frontal position. This means that less powerful, lightweight models can be used to solve the current problems. It was decided to consider the Haar, MTCNN, RetinaFace-MobileNet0.25, and Single Shot MultiBox Detector (SSD) models [14]. In a reference source [14], it is stated that the accuracy calculated as AP on the WiderFace dataset for SSD is 0.83, while Haar and MTCNN on the same images showed results of 0.14 and 0.6, respectively.

In another source [7], the AP on the same dataset for RetinaFace-MobileNet0.25 is given as 0.78. Table 3 shows some examples of the performance of these models. The detection time was measured on a mobile NVIDIA GeForce 940MX video card after a preliminary "warm-up" of the network, skipping the first detections.

According to the results of the comparison of face detection methods, Haar proved to be a fast algorithm but not resistant to changes in face angle and local occlusions. MTCNN shows good results in terms of time and satisfactory accuracy. SSD is the fastest of the considered detection models; it detects large faces well, even at a fairly significant angle and with large local occlusions, but it is not suitable for detecting small faces. RetinaFace-MobileNet0.25 coped with face rotation and local occlusion and detected all the small faces, but its processing time is significantly higher than SSD's (SSD is about 20 times faster). Since the monitoring system expects a fairly large face in the frame, it was decided to use the SSD model for face detection.
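As an illustration of the chosen detector, below is a minimal sketch of SSD face detection using OpenCV's DNN module with the widely used ResNet-10 SSD Caffe model (the model file names are assumptions; the files must be obtained separately, e.g., from the OpenCV samples):

    import cv2

    # Load the pre-trained ResNet-10 SSD face detector (file names assumed).
    net = cv2.dnn.readNetFromCaffe("deploy.prototxt",
                                   "res10_300x300_ssd_iter_140000.caffemodel")

    def detect_faces(frame, conf_threshold=0.9):
        h, w = frame.shape[:2]
        # The model expects a 300x300 image with per-channel mean subtraction.
        blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0,
                                     (300, 300), (104.0, 177.0, 123.0))
        net.setInput(blob)
        detections = net.forward()  # shape: (1, 1, N, 7)
        boxes = []
        for i in range(detections.shape[2]):
            confidence = detections[0, 0, i, 2]
            if confidence >= conf_threshold:
                # Box coordinates are normalized to [0, 1]; scale to pixels.
                x1, y1, x2, y2 = detections[0, 0, i, 3:7] * [w, h, w, h]
                boxes.append((int(x1), int(y1), int(x2), int(y2)))
        return boxes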

Table 3  Example of face detection by Haar, MTCNN, SSD, and RetinaFace models


Source: created by authors

2.3 Aligning the face

Many algorithms can be applied to face normalization in an image. The simplest one is to calculate the angle of rotation of the face from the key points of the eyes, which equals (A - 90°). To do this, draw a line between the points corresponding to the eye positions and add lines to create a right triangle, as shown in Figure 3. In this work, the key points are obtained using the Haar cascade method, namely its implementation from the OpenCV library. Next, determine the lengths of these lines and calculate cos A = (b² + c² - a²) / (2bc). In addition, a scaling factor is needed so that all faces have the same size; it is calculated from the distance between the key points of the current face returned by the detector and the target distance.

Figure 3  Face alignment by eye detection (Haar cascade method)

Source: created by authors
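A common simplification of this construction computes the rotation angle directly as the arctangent of the eye line and applies the rotation and scaling in a single affine transform. A minimal sketch (the eye coordinates are assumed to come from a detector such as OpenCV's Haar eye cascade; the target eye distance of 70 pixels is an arbitrary illustrative value):

    import cv2
    import numpy as np

    def align_face(image, left_eye, right_eye, target_dist=70):
        dx = right_eye[0] - left_eye[0]
        dy = right_eye[1] - left_eye[1]
        # Angle between the eye line and the horizontal, in degrees.
        angle = np.degrees(np.arctan2(dy, dx))
        # Scaling factor so that all faces have the same eye distance.
        scale = target_dist / np.hypot(dx, dy)
        center = ((left_eye[0] + right_eye[0]) / 2.0,
                  (left_eye[1] + right_eye[1]) / 2.0)
        # Rotate around the midpoint between the eyes and scale in one step.
        M = cv2.getRotationMatrix2D(center, angle, scale)
        return cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))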

2.4 Identity verification

After alignment, the face needs to be described by a special vector, i.e., a face embedding. Face embeddings can be obtained using neural networks such as Facebook DeepFace, Google FaceNet, VGGFace, ArcFace, CosFace, SphereFace, Insightface, InsightfaceV2, Dlib ResNet, and others. The length of the embedding vector differs between networks: for example, Google FaceNet returns a vector of length 128, and VGGFace one of length 2622.

To solve the problem of classification based on face embeddings, it is necessary to assign the input face embedding to one of the classes. In this case, one of the following methods can be used: K-Nearest Neighbors (KNN), logistic regression, Support Vector Machines (SVM), Random Forest, and neural networks.

However, to identify a conference participant's face, it is sufficient to solve the verification problem, since the participant indicates a login when joining the conference, after which the system already knows which image is the sample. Then it is only necessary to compare two vectors: the embedding of the input face and the embedding of the sample image. The following metrics can be used to compare vectors: cosine similarity, Euclidean distance (L2), and Manhattan distance (L1).
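With cosine similarity, for instance, verification reduces to a threshold test on the distance between the two embeddings. A minimal sketch (the 0.4 threshold matches the one used in the experiments reported in Table 5):

    import numpy as np

    def cosine_distance(a, b):
        # Cosine distance: 0 for identical directions, up to 2 for opposite ones.
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def is_same_person(input_embedding, sample_embedding, threshold=0.4):
        # Accept the identity claim if the embeddings are close enough.
        return cosine_distance(input_embedding, sample_embedding) <= threshold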

Today, there are a lot of libraries for solving computer vision problems. Some are designed to solve specific tasks, while others are able to cover a vast range of issues. In this work, we decided to use the DeepFace library [15, 16]. The purpose of DeepFace is face processing. It can solve the following tasks: face verification, determination of emotional state, gender, and age of a person.

During face verification, the model implements the following stages: detection, alignment, embedding, and vector comparison. Moreover, the model is highly flexible; for each stage, it is possible to use quite a few variants of models.

To obtain embeddings, DeepFace supports the following face recognition models: Facebook DeepFace, VGG-Face, Google FaceNet, OpenFace, DeepID, Dlib, and ArcFace. For face detection: OpenCV, Dlib, SSD, MTCNN, and RetinaFace. DeepFace achieved an accuracy of 97.35% on the famous Labeled Faces in the Wild (LFW) dataset [17], approaching human accuracy of 97.53%.

This result was achieved by training a 9-layer model on 4 million face images. Works [16, 18] provide LFW Score values for some models that receive embeddings (Table 4). LFW Score is the accuracy calculated as the percentage of correctly identified pairs for images from the LFW dataset (for positive pairs, the model must determine that the faces belong to the same person, and for negative pairs, the model must determine that the faces belong to different persons).

Table 4 shows that the Facenet512, SFace, ArcFace, Dlib, Facenet, VGG-Face, and Facebook DeepFace models have an accuracy comparable to or higher than human verification.

Table 4  Scores on the Labeled Faces in the Wild dataset


Source: authors' processing of sources [16, 18].

Experiments were conducted with several embedding models, namely, VGG-Face, FaceNet, OpenFace, DeepFace, and ArcFace. In these experiments, verification was performed within a pipeline in which the SSD model was used for detection, the Haar cascade method for alignment, and cosine similarity for comparing embedding vectors. Some examples are shown in Table 5, where green indicates correct answers and red indicates incorrect answers.

The study showed that ArcFace was the best of all the models on this set of images, providing high accuracy and speed. The pipeline with ArcFace demonstrated the lowest average verification time, 0.62 s, with a declared accuracy of 99.41%, which fully meets the requirements of the monitoring system for Zoom conferences. FaceNet also showed good results, especially on high-quality images.

At the same time, the OpenFace model showed the lowest accuracy compared to the others. Based on the research, the ArcFace model was selected as part of the face verification pipeline for the monitoring system.
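With the DeepFace library, the selected pipeline can be expressed in a single call. A minimal sketch, assuming two image files on disk (the file names are placeholders; note that DeepFace performs alignment internally, whereas this work uses its own Haar-based alignment step):

    from deepface import DeepFace

    # Verify that the face in a Zoom frame matches the stored profile image,
    # using SSD for detection, ArcFace embeddings, and cosine distance.
    result = DeepFace.verify(
        img1_path="zoom_frame.jpg",       # placeholder file names
        img2_path="profile_sample.jpg",
        model_name="ArcFace",
        detector_backend="ssd",
        distance_metric="cosine",
    )
    print(result["verified"], result["distance"])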

2.5 Analyzing emotions

To detect emotions, one of the models trained on the FER2013 dataset [19, 20] can be used, for example, the Facial Expression Recognition (FER) model [21], the DeepFace model [16], or other models mentioned in the reference source [22]. As can be seen in the source [22], the best emotion recognition accuracy on the FER2013 set is currently 76.82%.

You can also try to train your own neural network. However, the main objective of this work is to investigate the possibility of using the Zoom video stream to monitor the activities and state of participants.

Therefore, it was decided to use the DeepFace model, which, as shown above, has already been used for image verification. DeepFace is also capable of analyzing emotions (anger, fear, neutrality, sadness, disgust, joy, and surprise).
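A minimal sketch of such an analysis with DeepFace (the file name is a placeholder; depending on the library version, the result is a dictionary or a list of dictionaries, one per detected face):

    from deepface import DeepFace

    # Analyze a participant frame for emotion, age, and gender.
    result = DeepFace.analyze(
        img_path="participant_frame.jpg",
        actions=["emotion", "age", "gender"],
        detector_backend="ssd",
        enforce_detection=False,  # do not raise an error if no face is found
    )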

In the paper [23], it is stated that on the FER2013 dataset, the DeepFace model recognizes emotions with an accuracy of 57%, and when using AutoKeras, the accuracy reaches 66%. The age model predicts age with a mean absolute error (MAE) of ±4.65 years, and the gender model achieved 97.44% accuracy, 96.29% precision, and 95.05% recall [16].

Studies have shown that the DeepFace model is worst at recognizing the disgust emotion, even with good illumination and satisfactory image quality. This can be explained by the fact that in the FER2013 dataset, the disgust expression has the fewest images, about 600, while other expressions have almost 5,000 samples each. In total, FER2013 consists of over 30,000 grayscale images of 48×48 pixels with different facial expressions [20].

Table 5  Examples of verification results based on the pipeline: SSD (detection), Haar (alignment), VGG-Face/FaceNet/OpenFace/DeepFace/ArcFace (embedding), Cosine similarity (verification, threshold 0.4)


Source: created by authors

3. Developing an application to monitor participants' activities

3.1 Specification of application requirements

The participant activity monitoring system should be able to conduct a regular Zoom online conference and provide the following additional functionality:

  • verify participants by matching the face captured from the webcam against the photo stored for the entered nickname;
  • record the emotions expressed on participants' faces, as well as their age and gender;
  • record the following participant activities: text messages, emoji reactions, and turning the microphone on/off;
  • monitor compliance with the requirements for participant behavior (camera turned on, only one person in the frame, etc.);
  • customize the requirements for a specific session (conference);
  • display the collected session information (results of verification and of processing other events), and summarize and visualize statistical information.

3.2 Specification of rules for conducting video conferences

A participant in a video conference must comply with the following rules:

  • there must be only one person in the frame; if no one or more than one person is in the frame, the conference owner is notified;
  • in the nickname field, the person must indicate the email that was provided to the conference owner along with a template image of this person; the template image must be of satisfactory quality and well illuminated, it must contain only the required person, and the face must occupy 30% or more of the image; if a person who does not match the template image appears in the frame, the conference owner is notified;
  • the participant's camera must transmit an image of satisfactory quality, the workplace must be well illuminated, and the face must be turned towards the camera; if the system cannot detect or verify the face due to insufficient quality, poor illumination, or the position of the face, the conference owner is notified.

3.3 Development of the monitoring system architecture

The system for monitoring participants' activities will consist of several parts:

  • an application through which the online conference takes place (Zoom Video SDK);
  • a system for video analysis (Image Analyzer);
  • storage for saving images (Image Storage);
  • a database for saving the facts of participants' activities and the results of their processing;
  • a service for organizing the interaction between the Zoom Video SDK, the Image Analyzer, and the storage components, as well as for generating a token for each session and generating statistics.

The application architecture shown in Figure 4 was developed considering the need to process and analyze the video stream in real time.

Figure 4  Application architecture

Source: created by authors

A UML sequence diagram of user activities and application services is shown in Figure 5. The diagram depicts the process of user interaction with the application, starting with generating a JWT token and joining a video conference, and ending with receiving the results of frame analysis and ending the meeting.

Figure 5  Diagram of the sequence of interaction between application services

Source: created by authors

Characteristics of components

The Zoom Video SDK component is a fully customizable Zoom client implemented in C++ and responsible for creating video conferences, connecting and disconnecting participants, transmitting participants' video and audio streams, chat, and other basic Zoom functionality. In addition, it saves the received video data to the Image Storage and records activities such as raising a hand and turning the microphone on/off to the Database;

Image Storage is a storage component that holds the frames of conference participants received by the Zoom Video SDK until the Session Observer retrieves them and transfers them for further processing and analysis;

Database is a database implemented using Microsoft SQL Server, where the Zoom Video SDK and the Session Observer save the results of face detection, verification, emotion analysis, Zoom events, etc. When these results need to be shown, the Session Observer extracts them to display statistics;

Session Observer is a service that works in parallel with the Zoom Video SDK. When new frames arrive in the Image Storage, it asynchronously sends them for analysis to the Image Analyzer using the RabbitMQ message broker, which maintains a message queue and guarantees that consumers receive the messages. After RabbitMQ returns the result, the Session Observer writes it to the database, checks whether the conference rules have been violated, and notifies the conference owner if they have. When statistics need to be shown, it retrieves the results from the database, then filters, sorts, and visualizes them. The display of current results, statistics, and notifications of conference rule violations is implemented in C#. Also, before the conference starts, the Session Observer is responsible for creating the JSON Web Token that users will need to authorize and join the video conference;

Image Analyzer is a service that contains the trained models for detecting, verifying, and analyzing the emotions of conference participants. Images are received via RabbitMQ, and verification and emotion analysis are performed in parallel (preceded by detection and alignment), which makes it possible not to wait for the other model to finish processing. The results are sent back via RabbitMQ. Parallel execution and isolation of the models increase throughput. The service is implemented in Python, with pre-trained models taken from the OpenCV and DeepFace libraries.
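To illustrate the message flow, below is a sketch of the Image Analyzer side of the RabbitMQ exchange using the pika client (the queue names and the analyze_frame helper are assumptions for illustration, not the actual implementation):

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="frames_to_analyze")
    channel.queue_declare(queue="analysis_results")

    def on_frame(ch, method, properties, body):
        # body carries a serialized frame; run detection, alignment,
        # verification, and emotion analysis on it (hypothetical helper).
        result = analyze_frame(body)
        # Publish the result back for the Session Observer to consume.
        ch.basic_publish(exchange="", routing_key="analysis_results", body=result)
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="frames_to_analyze", on_message_callback=on_frame)
    channel.start_consuming()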

3.4 Illustration of the application operation

A user who wants to create a video conference needs to run Session Observer to generate a JWT token and provide it to future participants. The JWT token is required for authentication when joining a video conference.

The Session Observer home page is shown in Figure 6 (top). After launching the Session Observer, the future owner of the conference goes to the JWT page and enters the "Key", "Secret", and "Session name" (Fig. 6, bottom).

"Key" and "Secret" are obtained when creating a Zoom Video SDK account. An account should be created not for all users who will attend the conference but only for its owner.

The maximum validity of this token is 48 hours, after which it must be regenerated. The token is generated using the HS256 hashing algorithm. The user then launches the Zoom Video SDK (Fig. 7, left), goes to the settings, and enters the token (Fig. 7, right).
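For illustration, a sketch of such token generation in Python with the PyJWT library is shown below (the application itself implements this step in the Session Observer). The payload fields follow Zoom's Video SDK authorization documentation at the time of writing and should be checked against the current docs; the key, secret, and session name are placeholders:

    import time
    import jwt  # PyJWT

    SDK_KEY, SDK_SECRET = "your_sdk_key", "your_sdk_secret"  # placeholders
    now = int(time.time())
    payload = {
        "app_key": SDK_KEY,
        "tpc": "my-session-name",   # session (topic) name
        "version": 1,
        "role_type": 1,             # 1 = host, 0 = participant
        "iat": now,
        "exp": now + 48 * 3600,     # maximum validity of 48 hours, as noted above
    }
    token = jwt.encode(payload, SDK_SECRET, algorithm="HS256")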

Figure 6  The Session Observer start page (top) and the token creation page (bottom)

Source: created by authors

Figure 7  Zoom Video SDK home page (left) and settings (right)

Source: created by authors

Next, the future conference owner selects "Create a Session", and a participant selects "Join a Session". After that, the conference owner can configure the parameters for a specific conference (session) (Fig. 8).

Figure 8  General settings window for setting up a session

Source: created by authors

These parameters can also be adjusted by the owner during the session in the corresponding windows (Fig. 9):

  • "Toolbar" — a section for manual control buttons if you need to perform any operation instantly and manually, and do not wait until the time to expire (by timer);
  • "Panels" — a section for controlling open tabs. This window allows you to open and close "Attendee status," "Alerts", "General settings" tabs;
  • "Status" — a section for reporting the status of various timers and the status of the connection to the system, and it is also possible to start or stop the analysis timer separately;
  • "Alerts" — a window for displaying real-time notifications: successful verification of participants, rule violations, time of updating statistics, or errors if they are detected;
  • "Attendee status" — a window for displaying the status of conference participants: name, activities, verification status, verification time, recognized emotions, age, gender, and race.

To start the face detection, face verification, and statistical data collection processes, "Face detection", "Face verification", and "Statistic" must be activated, respectively. Operations can also be performed manually, without the timer, by clicking the "Send detection frame", "Send verify frame", and "Send statistic frame" buttons.

As shown in Figure 9, before "Face detection", "Face verification", and "Statistic" are activated, the "Alerts" window is empty. After the verification and statistical information collection processes are activated, the participants' images are outlined with a colored frame: green if the verification was successful and red if it failed (the face does not match the nickname); the "Alerts" and "Attendee status" windows are filled with the results of information processing (Fig. 10).

Figure 9  Display of the conference window before the verification and statistical data collection processes are activated

Source: created by authors

Figure 10  Displaying the conference window after activating the verification and statistical data collection processes

Source: created by authors

Figure 11 shows the results of determining emotions, age, and gender: information about the participant is updated in the “Attendee status” window, and a corresponding message appears in the “Alerts” window.

Figure 11 Displaying changes in "Attendee status" and "Alerts" windows of some statistics (emotions, age, gender)

Source: created by authors

Figure 12 shows the case of violation of the requirements when the participant Oleg Babochkin disappeared from the camera's view and the event "Person is not detected on camera" occurred.

Figure 12  Displaying the system's response to a participant's disappearance from the camera's view

Source: created by authors

Figure 13 illustrates the system's reaction to two violations: the disappearance of participant Olha Koshlata from the camera's view and the detection of more than one person in the camera's view. Participants with violations are outlined in red.

Figure 13  Displaying the system's response to detecting more than one participant in the camera's view

Source: created by authors

All processed events are saved during the conference session for further review and analysis. Figure 14 shows the "Alert list" window, which is used to view the list of processed events during the session.

An image of the "Alert list" window for displaying events during the conference and their description to illustrate the article about adding functionality to Zoom online conferencing for identity verification and monitoring of participants’ activities.
Figure 14  "Alert list" window for displaying events during the conference and their description

Source: created by authors

Figure 15 shows the “Event timeline” window, where you can view the processed positive and negative events and the frame that caused the event.

An image of the "Event timeline" window and the frame that caused the event to illustrate the article about adding functionality to Zoom online conferencing for identity verification and monitoring of participants’ activities.
Figure 15  “Event timeline” window and the frame that caused the event

Source: created by authors

It is also possible to view various statistical information by different parameters. For example, Figure 16 shows the emotions of certain participants as percentages, the number of successful verifications, the average distance during verification, the time with the microphone on, and the number of raised hands in a selected period.

Figure 16  Window for displaying some personalized information for a selected period

Source: created by authors

Discussion and conclusions

The paper examined the possibility of expanding the basic functionality of the Zoom online conference service by adding automatic identity verification and monitoring the activities of conference participants. As a result of the research, a Zoom conference application with additional features based on the Zoom Video SDK was developed.

The following issues were considered in this paper:

  • the authors examine and compare various services for video conferences, study their functions and characteristics; investigate the relevance of adding functionality for monitoring the activities of participants; analyze what functionality is lacking in online conferences for use in the educational process, where there are a large number of participants and individual conferences are combined into a series of related events (lectures, workshops, seminars);
  • the popularity of online conference services and the technical capabilities for expanding their basic functionality were considered; it was recognized that Zoom is the most popular service and has sufficiently broad technical capabilities for adding functionality thanks to its large number of SDKs;
  • the technical features of creating applications for the Zoom conference were studied; the capabilities of the Zoom Meeting SDK and the Zoom Video SDK were compared; it was decided to use the Video SDK because of the richer functionality for working with video streams and other Zoom events, as well as the lack of restrictions for designing the user interface;
  • Computer Vision methods and libraries for solving the problems of face verification and emotional state recognition based on video stream analysis were investigated; it was decided to use the SSD model from the OpenCV library for face detection and the DeepFace library for verification and emotion recognition, where the ArcFace model was chosen to obtain face embeddings;
  • the architecture of the application for identity verification and monitoring of participants' activities was designed, and the prototype application was implemented.

As a result of testing the developed application, it was shown that Zoom Video SDK meets the requirements of the task set for this work. We would like to draw special attention to the fact that Zoom Video SDK provides video streams of sufficient resolution for the selected models and libraries to be able to perform face detection, face verification, and emotion analysis.

The proposed extension of the Zoom online conference functionality to monitor activities and verify the participant's identity can significantly improve the quality of online events in general, and especially in the educational process:

  • ensure the authenticity of participants, preventing possible fraud or misidentification. This feature is particularly important in an academic context, where instructors need to know the real identities of video conference participants and ensure that grades are assigned to the correct students;
  • using video stream analysis to check compliance with the rules during exams will improve the fairness and transparency of the educational process;
  • capturing the emotional state and "raising hand" activities will allow teachers to understand how students respond to the material or teaching; teachers will be able to analyze and compare the effectiveness of different teaching strategies and adjust their approach to maximize student engagement and understanding of the material; it will also help in planning future events and understanding which topics or formats are most attractive to the audience;
  • monitoring will help to collect data (dataset) that will allow:
    • to investigate and identify the impact of the following factors on academic performance: learning activity (use of chat, "raising hands", use of emojis, etc.); presence in classes with an enabled or disabled camera (tracking the time of connection and disconnection from the conference, turning the camera on and off); emotional state (analyzing the video stream to determine the emotional state of students);
    • to identify problems with students' learning promptly in order to have time to correct the situation (signs of problems can be detected in absence from class for a certain period, presence with the camera disabled, presence with the camera enabled but with negative emotions, and not using chat, "raising hand", and other tools);
    • to plan future online events and understand which topics or formats attract the audience the most.

The developed application runs on the Windows platform. In the future, it would be desirable to develop interfaces that allow users to connect from other platforms. It is also reasonable to quantitatively assess the quality of the verification module in order to formulate clearer requirements for the properties of the student's face in the frame, for example, requirements for its illumination, position, and size. Research on counteracting attempts to deceive the verification module (anti-spoofing) would also be useful.

In general, monitoring participants' activities in online conferences is important for ensuring the high quality and effectiveness of any type of event. The developed application is especially useful in the educational process: it will help the teacher during the lesson and allow statistics to be collected over the semester. The collected monitoring information can be useful for further research on the relationship between students' behavior in lessons and their final level of knowledge. The processed information can help teachers better understand students' needs and adapt their teaching methods and resources. All this will contribute to improving the quality of distance education.

The work on implementing advanced monitoring features for Zoom meetings was conducted by SYTOSS's department of AI research and development. SYTOSS specializes in custom software development solutions. We can help you integrate additional functionality, implement innovations, optimize your current processes, and create a secure and productive software environment. Contact us today to discuss your specific needs!

Acknowledgements

The authors are grateful to SYTOSS s.r.o., Bratislava, Slovakia, represented by the CEO Oleksiy Matikaynen, for the equipment provided for the research, as well as to employees for participating in experiments.

The work is funded by the EU NextGenerationEU through the Recovery and Resilience Plan for Slovakia under project No. 09I03-03-V01-00115.

Authors

Olena YAKOVLEVA, Marián KOVÁČ, Vadim ARDASOV, Ivan YEREMENKO

Department of AI Research and Development, SYTOSS Ltd

Citation

When citing, quoting, or otherwise using any part of this work, please reference: Yakovleva, O., Kovač, M., Ardasov, V. & Yeremenko, I. (2023). Study on adding functionality to the Zoom online conference system for monitoring the participant activities. Public Administration and Regional Development, 19(1), pp. 158–184.
URL: https://www.vsemba.sk/portals/0/Subory/vedecky%20casopis%2001%20-%202023.pdf

References

[1]  Lincényi, M., & Mindár, M. 2022. Impact of distant teaching during Covid-19 pandemic on civic and financial literacy. Entrepreneurship and Sustainability Issues, 10(1), pp. 92-106. ISSN 2345-2082. doi:10.9770/jesi.2022.10.1(5).

[2]  Nebeský, Ľ., & Fabuš, M. 2021. Education in times of SARS-COV-2 pandemic: why we did not close our schools. In: Conference Proceedings of the 2nd Online International Scientific Conference "Economics, Politics and Management in times of change", November 19, 2021. Gödöllö: Hungarian University of Agriculture and Life Sciences. ISBN 978-80-89654-83-3.

[3]  Brandl, R. 2023. The most popular video call conferencing platforms worldwide. Retrieved from https://www.emailtooltester.com/en/blog/video-conferencing-market-share.

[4]  Zoom meeting SDK. 2023. Retrieved from https://developers.zoom.us/docs/meeting-sdk.

[5]  Zoom Video SDK. 2023. Retrieved from https://developers.zoom.us/docs/video-sdk.

[6]   Kovtunenko, A., Yakovleva, O., Liubchenko, V., & Yanholenko, O. 2020. Research of the joint use of mathematical morphology and convolutional neural networks for the solution of the price tag recognition problem. Bulletin of National Technical University "KhPI". Series: System Analysis, Control and Information Technologies, 1 (3), 24-31. doi:10.20998/2079-0023.2020.01.05

[7]  Yakovleva, O., Kovtunenko, A., Liubchenko, V., Honcharenko, V., & Kobylin, O. 2023. Face Detection for Video Surveillance-based Security System (COLINS-2023). In CEUR Workshop Proceedings (Vol. 3403). pp. 69-86.

[8]  Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. 2016. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10), 1499-1503. doi:10.1109/lsp.2016.2603342.

[9]  Zhang, S., Zhu, X., Lei, Z., Shi, H., Wang, X., & Li, S. Z. 2017. FaceBoxes: A CPU real-time face detector with high accuracy. In: 2017 IEEE International Joint Conference on Biometrics (IJCB), pp. 1-9. doi:10.1109/btas.2017.8272675.

[10] Ren, S., He, K., Girshick, R., & Sun, J. 2016. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137-1149. doi:10.1109/tpami.2016.2577031.

[11] Deng, J., Guo, J., Ververas, E., Kotsia, I., & Zafeiriou, S. 2020. RetinaFace: Single-shot multi-level face localization in the wild. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5203-5212. doi:10.1109/cvpr42600.2020.00525.

[12] Xu, Y., Yan, W., Yang, G., Luo, J., Li, T., & He, J. 2020. CenterFace: Joint face detection and alignment using face as point. Scientific Programming, 2020, 1-8. doi:10.1155/2020/7845384.

[13] Guo, J., Deng, J., Lattas, A., & Zafeiriou, S. 2021. Sample and computation redistribution for efficient face detection. Retrieved from https://arxiv.org/abs/2105.04714.

[14] Liu, Y., Liu, R., Wang, S., Yan, D., Peng, B., & Zhang, T. 2022. Video face detection based on improved SSD model and target tracking algorithm. Journal of Web Engineering. Retrieved from https://doi.org/10.13052/jwe1540-9589.21218/.

[15] Taigman, Y., Yang, M., Ranzato, M., & Wolf, L. 2014. Deepface: Closing the gap to human-level performance in face verification. 2014 IEEE Conference on Computer Vision and Pattern Recognition. Retrieved from https://doi.org/10.1109/cvpr.2014.220.

[16] Deepface 0.0.79. 2023. Retrieved from https://pypi.org/project/deepface/.

[17] Labeled Faces in the Wild. 2007. Retrieved from http://vis-www.cs.umass.edu/lfw/.

[18] Serengil, S. 2022. Face recognition with Facebook DeepFace in Keras. Retrieved from https://sefiks.com/2020/02/17/face-recognition-with-facebook-deepface-in-keras/.

[19] Goodfellow, I. J., Erhan, D., Carrier, P. L., Courville, A., Mirza, M., Hamner, B., ... Bengio, Y. 2013. Challenges in representation learning: A report on three machine learning contests. Neural Information Processing, 117-124. doi:10.1007/978-3-642-42051-1_16

[20] Challenges in Representation Learning: Facial Expression Recognition Challenge. (2013). Retrieved from https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data.

[21] EDPS TechDispatch: Facial Emotion Recognition (FER). 2021. Issue 1. pp. 1-5. doi:10.2804/519064. Retrieved from https://edps.europa.eu/system/files/2021-05/21-05-26_techdispatch-facial-emotion-recognition_ref_en.pdf.

[22] Facial Expression Recognition (FER) on FER2013. 2023. Papers with Code: FER2013 benchmark. Retrieved from https://paperswithcode.com/sota/facial-expression-recognition-on-fer2013/.

[23] Serengil, S. 2021. Facial expression recognition with Keras. Retrieved from https://sefiks.com/2018/01/01/facial-expression-recognition-with-keras/.