Implementation details of key features

Tools & Dependencies

A. MediaPipe
MediaPipe is an open-source machine learning framework developed by Google. Our solution makes heavy use of the MediaPipe Hands module to track the positions of hand landmarks, which we use for gesture calculations.
B. Nuitka
Nuitka is a highly efficient Python compiler that can bundle an application and all of its dependencies and libraries into a single executable. We use it so that MotionInput and the intermediary server can run on any Windows machine, even if Python is not installed.
C. ZeroMQ
ZeroMQ (ØMQ) is a high-performance messaging library that provides a socket-like interface for communication between applications. It supports various messaging patterns, is lightweight and easy to use, and has bindings for many programming languages. We use it for communication between our intermediary server and MotionInput.
D. Python WebSockets
The Python websockets library is a module for building WebSocket servers and clients in Python. It provides a simple interface for sending and receiving messages over a WebSocket connection. We use it for communication between our intermediary server and the WebExtension frontend, which can only use web-based technologies.

MotionInput

A. Hand Module and Key Components
In this section, we describe how MotionInput works, focusing on the hand module, as context for the changes and features we implemented for our solution. The general architecture of the implemented HandModule class is as follows:
[Figure: HandModule architecture]
B. HandLandmark Detector
The HandLandmarkDetector class uses the MediaPipe Hands module to gather raw data from each video frame. It tracks the specified number of hands in the frame and extracts biometric landmarks such as the coordinates of the wrist and fingertips. It stores this data in an instance of the RawData class, which is later available both to the HandModule and to the main MotionInput Model class, which orchestrates core functions such as managing and switching events and initializing classes.
C. HandPosition and HandGesture
HandPosition is used for determining whether specified primitives hold. For example, we define the primitive “index_pinched”, whose truth value is calculated from the RawData coordinates, in this case based on the positions of the index_tip, the wrist, and the palm. These primitives allow high-level gesture events to track low-level gestures directly rather than having to define and calculate them manually.
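As a hypothetical sketch of how such a primitive could be calculated (the landmark field names and the ratio below are illustrative, not the exact MotionInput code):

```python
# Hypothetical sketch of an "index_pinched" primitive computed from RawData
# coordinates; field names and the ratio are illustrative assumptions.
import numpy as np

PINCH_RATIO = 0.5  # assumed: fingertip within half a palm-length of the palm centre


def index_pinched(raw_data: dict) -> bool:
    index_tip = np.array(raw_data["index_tip"])
    palm = np.array(raw_data["palm_centre"])
    wrist = np.array(raw_data["wrist"])
    palm_length = np.linalg.norm(palm - wrist)   # scale factor keeps the check depth-invariant
    return np.linalg.norm(index_tip - palm) < PINCH_RATIO * palm_length
```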

The HandGesture class is used for maintaining the state of all gestures related to the hand module.
D. Crowd Detection
Previously in MotionInput, the HandLandmarkDetector would pass the RawData instance the coordinates of only the most recently detected hand of each hand type. For example, if there were two left hands and two right hands in the frame, the RawData instance would only store the most recent right hand and the most recent left hand. This caused major issues for our scenario: a person standing in front of a kiosk could have control stolen by a person in the background who showed a hand of the same type.

Furthermore, since MediaPipe detects hands in an essentially arbitrary order, the “most recent” hand was unpredictable and not reproducible in testing. To alleviate this, we developed a heuristic ranking function in the HandLandmarkDetector class.
It takes the area of the approximate bounding box of the palm, which indicates how close the hand is to the camera, and scales it by the negative log of the distance from the hand to the centre of the camera frame, so that hands closer to the centre are preferred. Now, whenever MediaPipe detects hands in a frame, the HandLandmarkDetector passes them through the heuristic function and ranks them by score. The highest-scoring hand of each type is then passed to the RawData instance, which can be accessed by other classes and modules.
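A minimal sketch of such a scoring heuristic follows; the use of normalised [0, 1] image coordinates and the field names are assumptions rather than the exact implementation.

```python
# Sketch of a hand-ranking heuristic: larger (closer) and more central hands score higher.
import numpy as np


def hand_score(palm_bbox_area: float, palm_centre: np.ndarray) -> float:
    """Score one detected hand from its palm bounding-box area and position."""
    distance_to_centre = np.linalg.norm(palm_centre - np.array([0.5, 0.5]))
    # Negative log of the distance strongly rewards hands near the frame centre;
    # the epsilon avoids log(0) for a perfectly centred hand.
    return palm_bbox_area * -np.log(distance_to_centre + 1e-6)


def best_hand(detections):
    """Keep only the highest-scoring detection of a given hand type."""
    return max(detections, key=lambda d: hand_score(d["palm_area"], np.array(d["palm_centre"])))
```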
E. Swipe Gesture
The next step was to implement the swipe gestures needed to map hand swipe movements to keyboard arrow presses. In MotionInput, all such gestures are defined by subclassing the GestureEvent class. A specific GestureEvent subclass is registered when MotionInput starts, based on config files, and defines the main mode of control. For example, pre-existing subclasses such as NoseScrollEvent and EyeTrackingEvent scroll the mouse based on nose movement and move the cursor based on eye movement, respectively.

After being registered, the subclass’s update method is called on each frame to enact any actions caused by gestures. On each frame, the HandLandmarkDetector, HandPosition, and HandGesture classes perform their respective roles to track hands, calculate primitives, and pass this information to the GestureEvent subclass. Based on the received hand data, the subclass can then implement logic that uses the calculated primitives and coordinates to detect when a specific gesture occurs and to trigger an event such as a keyboard press.

In our case, we created a KioskSwipeEvent subclass that implements the swipe gesture we want. This is how it works:
  • Firstly, we define a threshold derived from the user-set sensitivity. It is a measure of how far the user’s hand needs to be from the centre of the frame to register a swipe.
  • Whenever the update function is called, it calculates the vector between the middle of the frame and the coordinates of the palm.
  • If the norm of this vector is less than the defined threshold, the hand is in the centre; otherwise it is outside the centre.
  • We also calculate the largest component of the vector to find the intended swipe direction.
  • Then, we simply check if the hand was previously in the centre and is now swiping towards a direction. If so, we map this to a keyboard arrow press.
  • We also check whether the index finger is pinched, which acts as an analog for a fist gesture. If so, we trigger an “Enter” key press.
The result is an extremely natural swipe gesture that adapts to the user’s needs. The user can adjust the sensitivity to their preferences and simply has to swipe from the centre towards one of the edges. Our gesture event is also highly performant, since all vector calculations use the optimized, C-based NumPy library. We also use tricks such as mapping the offset vector onto the fourth roots of unity to quickly resolve it into a named direction, avoiding slow constructs such as Python for loops or nested if statements.
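The sketch below illustrates this update logic under the assumptions that palm coordinates are normalised to [0, 1] and that a press_key callback performs the key presses; the threshold formula is a placeholder, not the exact KioskSwipeEvent code.

```python
# Simplified, illustrative sketch of the swipe-detection update step.
import numpy as np

DIRECTIONS = ["right", "down", "left", "up"]   # angles 0, 90, 180, 270 degrees in image coords


class KioskSwipeSketch:
    def __init__(self, sensitivity: float):
        # Sensitivity-derived threshold: distance from the frame centre (normalised
        # coordinates) beyond which a swipe is registered. Formula is a placeholder.
        self.threshold = 0.4 * (1.0 - sensitivity)
        self.was_in_centre = True

    def update(self, palm_xy: np.ndarray, index_pinched: bool, press_key) -> None:
        offset = palm_xy - np.array([0.5, 0.5])            # vector from frame centre to palm
        in_centre = np.linalg.norm(offset) < self.threshold

        if self.was_in_centre and not in_centre:
            # Map the offset angle onto the four cardinal directions
            # (the fourth roots of unity) without loops or nested ifs.
            angle = np.angle(complex(offset[0], offset[1]))
            direction = DIRECTIONS[int(np.round(angle / (np.pi / 2))) % 4]
            press_key(direction)                           # mapped to an arrow-key press

        if index_pinched:
            press_key("enter")                             # analog of a fist gesture

        self.was_in_centre = in_centre
```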
F. Communication with Intermediary Server
Now that we have a KioskSwipeEvent that can track the coordinates of the hand on each frame and detect swipe gestures, we need a way to indicate to the user where their hand is in the frame and how far they need to swipe given the current sensitivity. As mentioned previously, to maintain user privacy, preserve the browser sandbox, and integrate neatly into any pre-existing kiosk UI, we rely on a WebExtension frontend, which is connected to MotionInput via an intermediary server.

To communicate with the server, we implemented a client that uses the ZeroMQ library. It connects to the same localhost port as the intermediary server and runs sender and receiver threads based on the ZeroMQ PAIR pattern. It maintains two message queues to keep track of incoming and outgoing messages and prevent them from being dropped. It also contains logic for sending keepalive messages to maintain the connection, resting when needed to lower CPU usage, and limiting throughput as needed to keep memory use low. All of this detail is abstracted away from other classes and modules, which can use the communication protocol through the following user-facing APIs:

-> The send_to_website method can be called by any class to send JSON-encoded data to the WebExtension frontend. It appends the data to the outgoing queue, which is monitored by the sender thread and relayed to the intermediary server.

-> The receive_from_website method retrieves JSON-encoded data from the incoming queue, which is continually fed data from the receiver thread.

These two methods allow easy bidirectional communication with the intermediary server, which in turn facilitates bidirectional communication with the WebExtension frontend.
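As a concrete illustration, the sketch below shows one way such a client could be structured, assuming a PAIR socket on a fixed localhost port and standard thread-safe queue.Queue instances; the class name and port are placeholders, and the keepalive and throttling logic is omitted.

```python
# Condensed sketch of a ZeroMQ client with sender/receiver threads and two queues.
import json
import queue
import threading
import time

import zmq


class WebsiteClient:
    """Hypothetical sketch of the client used to talk to the intermediary server."""

    def __init__(self, port: int = 5556):
        context = zmq.Context()
        self.socket = context.socket(zmq.PAIR)
        self.socket.connect(f"tcp://localhost:{port}")
        self._lock = threading.Lock()        # ZeroMQ sockets are not thread-safe
        self.outgoing: queue.Queue = queue.Queue()
        self.incoming: queue.Queue = queue.Queue()
        threading.Thread(target=self._sender, daemon=True).start()
        threading.Thread(target=self._receiver, daemon=True).start()

    def send_to_website(self, data: dict) -> None:
        # Queue the JSON-encoded message; the sender thread relays it to the server.
        self.outgoing.put(json.dumps(data))

    def receive_from_website(self):
        # Return the next message from the frontend, or None if nothing is waiting.
        try:
            return json.loads(self.incoming.get_nowait())
        except queue.Empty:
            return None

    def _sender(self) -> None:
        while True:
            message = self.outgoing.get()                  # blocks until a message is queued
            with self._lock:
                self.socket.send_string(message)

    def _receiver(self) -> None:
        while True:
            with self._lock:
                try:
                    self.incoming.put(self.socket.recv_string(flags=zmq.NOBLOCK))
                    continue
                except zmq.Again:
                    pass                                   # nothing waiting yet
            time.sleep(0.01)                               # brief rest keeps CPU usage low
```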
An example of how these methods are used can be seen in our KioskSwipeEvent: in the update method outlined previously, we compose a JSON update message and use the send_to_website method to send it to the intermediary server, which relays it to the WebExtension frontend.
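For illustration, such an update message might look like the following; apart from the “mi_update” type named in our schema, the field names and values are placeholders.

```python
client = WebsiteClient()                  # hypothetical client from the sketch above
client.send_to_website({
    "type": "mi_update",                  # message type from our JSON schema
    "hand": {"x": 0.62, "y": 0.41},       # current palm position in normalised coordinates
    "threshold": 0.25,                    # current sensitivity-derived swipe threshold
    "direction": "right",                 # most recent swipe direction, if any
})
```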

Web Extension

A. Web Extension
Our WebExtension frontend is written in React, a popular JavaScript library for building user interfaces. It communicates directly with the intermediary server over a local WebSocket connection and can send and receive JSON-encoded data, using the useWebSocket React library.

The library is used to attach several listeners to the WebSocket connection that are triggered whenever a new message is received. It uses several of React’s features, such as useEffect hooks to efficiently detect new messages and trigger the appropriate response, and reactive useState hooks to keep track of settings and other configuration options throughout the lifetime of the extension. It can also send data directly using the browser-native WebSocket sending capabilities. The UI for the frontend is built with Chakra UI, a popular React UI framework that provides extensible and responsive UI elements such as sliders, buttons, and headings.

Our program is divided into three main components: PreviewWindow, SensitivitySlider, and StatusInfo.  
B. PreviewWindow
The PreviewWindow component is responsible for rendering a live preview of the user’s hand as well as the detection boundaries. This data is sent by the MotionInput backend, following our JSON schema, and is relayed to the frontend via the intermediary server. The frontend then renders an abstract preview of the user’s hand using the received coordinates: a WebSocket event listener fires whenever a new hand position arrives and triggers the emoji preview to be redrawn.
To render the live preview, a drawEmoji function positions an image of an emoji inside the preview element. The emoji element is targeted using React’s useRef hook, reducing memory use, and is positioned exactly using CSS absolute positioning, with offsets set dynamically from the received coordinates and the image size adjusted to simulate depth changes. Using native HTML DOM elements and altering their properties directly via JavaScript makes this extremely performant and reactive compared to third-party animation libraries or even the native Canvas API.
Similar functions are used for displaying the threshold edges and animating edges based on the swipe direction.
C. SensitivitySlider
The SensitivitySlider component allows the user to adjust the sensitivity of the camera/motion sensor. This is done with a slider input element, which calls the sendSensitivityJson function as an event listener whenever the slider value changes. This function sends a JSON message with the new sensitivity value to the intermediary server, following our JSON schema, which then relays it to the MotionInput backend. The backend can then adjust the sensitivity used by the KioskSwipeEvent accordingly. The component also has a WebSocket event listener that listens for sensitivity changes from the backend, for example when the admin frontend changes the sensitivity, and updates the slider accordingly.
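For illustration, the sensitivity update could be encoded as a message like the one below; “mi_config” is one of the message types in our schema, while the remaining field and value are placeholders.

```json
{ "type": "mi_config", "sensitivity": 0.7 }
```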
D. StatusInfo
Finally, the StatusInfo component displays the status of the WebSocket connection and the MotionInput connection. The WebSocket connection status comes from the built-in readyState value exposed by the useWebSocket library, which changes whenever the state of the WebSocket connection changes. For the MotionInput connection status, a WebSocket event listener monitors status messages from the intermediary server and updates the displayed MotionInput connection state accordingly.
E. Deploying the Extension
All of these components are arranged into a vertical stack using Chakra UI layout components and are passed the same instance of the WebSocket connection using React’s Context API. This allows each component to be developed separately, maintaining separation of concerns, while still accessing the same application-wide WebSocket connection to send messages, receive messages, and attach event listeners.

The overall component structure is then rendered onto a blank HTML page. To turn this into a sidebar extension, we follow Mozilla’s manifest v3 specification and create a manifest.json file specifying that the resulting HTML page should render in the sidebar. Thanks to the responsive design of the page and the vertical layout, the content displays well in a sidebar of any size.
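A minimal sketch of such a manifest.json, assuming the built React page is index.html; the name, version, and file names are placeholders.

```json
{
  "manifest_version": 3,
  "name": "MotionInput Kiosk Controls",
  "version": "1.0",
  "sidebar_action": {
    "default_title": "MotionInput",
    "default_panel": "index.html"
  }
}
```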

Communication Protocol

Enabling bidirectional communication between the WebExtension frontend and the MotionInput backend is an essential feature of our solution. The frontend needs to continually receive updates from the backend about hand positions, gesture sensitivity, swipe direction, hand orientation, and so on, while the backend needs to receive any configuration changes made from the WebExtension frontend.

Since web-based technologies are sandboxed and very restrictive in terms of access to the host operating system, we cannot use standard mechanisms such as lock files, process pipes, and other synchronization primitives. To get around this, as described on our system design page, we developed a very lightweight intermediary server that sits between the frontend and the backend and speaks the messaging protocol suited to each side: ZeroMQ for the backend and WebSockets for the frontend.

The core function of the intermediary server is to act as a relay and pass messages along. The senders on the frontend and backend send JSON-encoded messages to the intermediary server, and the receivers on the other side can do as they please with the data received. The JSON messages follow a stateless design, where each message is interpreted independently of the others; this allows high message throughput without needing to recover state after a dropped message. The JSON schema is simple and defines several message types such as “mi_status”, “mi_update”, and “mi_config”. Upon receiving a message, the receiver on either the frontend or the backend unpacks the data based on the message type and applies any changes accordingly.

In the example below, when the WebExtension frontend receives a message with the type “mi_status”, it updates the connection status element on the frontend.
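For illustration, such a status message might look like the one below; apart from the “mi_status” type named in our schema, the fields are placeholders.

```json
{ "type": "mi_status", "connected": true }
```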

Intermediary Server

Interfacing with ZeroMQ Sockets
The intermediary server provides the following methods for interfacing with ZeroMQ sockets:

-> receive_from_zeromq(): Continually listens for new messages from the MotionInput backend via the ZeroMQ socket and appends them to the mi_to_website_queue message queue.

-> send_to_zeromq(): Continually tries to send messages from the website_to_mi_queue message queue to the MotionInput backend via the ZeroMQ socket.

Both of these functions use a rest() function to sleep briefly to keep CPU usage low when polling sockets. They also limit throughput as needed to lower memory use.
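A condensed sketch of one way these two loops could be structured, assuming a PAIR socket and standard queue.Queue instances; the throughput-limiting logic is omitted, and the lock reflects the fact that ZeroMQ sockets are not thread-safe rather than a detail stated in the text.

```python
# Sketch of the ZeroMQ-facing relay loops of the intermediary server.
import queue
import threading
import time

import zmq

mi_to_website_queue: queue.Queue = queue.Queue()
website_to_mi_queue: queue.Queue = queue.Queue()
zmq_lock = threading.Lock()        # guard the shared socket across the two threads


def rest() -> None:
    time.sleep(0.01)               # brief sleep keeps CPU usage low while polling


def receive_from_zeromq(socket) -> None:
    """Relay messages arriving from MotionInput into the frontend-bound queue."""
    while True:
        with zmq_lock:
            try:
                mi_to_website_queue.put(socket.recv_string(flags=zmq.NOBLOCK))
                continue           # a message arrived, check again immediately
            except zmq.Again:
                pass               # nothing waiting from MotionInput yet
        rest()


def send_to_zeromq(socket) -> None:
    """Relay messages queued by the frontend towards MotionInput."""
    while True:
        try:
            message = website_to_mi_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        with zmq_lock:
            try:
                socket.send_string(message, flags=zmq.NOBLOCK)
                continue
            except zmq.Again:
                website_to_mi_queue.put(message)   # MotionInput not ready; retry later
        rest()
```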
Interfacing with WebSockets
The intermediary server also provides functions for interfacing with WebSockets:

-> The websocket_handler() function is called for each new WebSocket connection and runs the send_to_websocket() and receive_from_websocket() functions concurrently.

-> The send_to_websocket() function is an asynchronous function that listens for new messages in the mi_to_website_queue message queue and sends them to the WebExtension frontend via the WebSocket connection. If there are no messages in the queue for more than 5 seconds, it sends a "keepalive" message to let the frontend know that the intermediary server is alive but that MotionInput is not currently connected.

-> The receive_from_websocket() function is another asynchronous function that listens for new messages from the WebExtension frontend via the WebSocket connection and adds them to the website_to_mi_queue message queue. It can also receive commands from the WebExtension frontend, for example to relaunch MotionInput if it is not running.
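A simplified sketch of these three functions using the Python websockets library and the queues from the sketch above; the keepalive message format and polling interval are placeholders.

```python
# Sketch of the WebSocket-facing side of the intermediary server.
import asyncio
import json
import queue
import time


async def send_to_websocket(websocket) -> None:
    """Forward backend messages to the frontend, with a 5-second keepalive."""
    last_sent = time.monotonic()
    while True:
        try:
            await websocket.send(mi_to_website_queue.get_nowait())
            last_sent = time.monotonic()
        except queue.Empty:
            await asyncio.sleep(0.05)          # brief rest keeps CPU usage low
            if time.monotonic() - last_sent > 5:
                # No backend traffic for 5 s: the relay is alive, but MotionInput
                # may not be connected.
                await websocket.send(json.dumps({"type": "keepalive"}))
                last_sent = time.monotonic()


async def receive_from_websocket(websocket) -> None:
    """Queue every frontend message (config changes, relaunch commands, ...) for MotionInput."""
    async for message in websocket:
        website_to_mi_queue.put(message)


async def websocket_handler(websocket) -> None:
    # Run both directions concurrently for each new WebSocket connection.
    await asyncio.gather(send_to_websocket(websocket),
                         receive_from_websocket(websocket))
```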
The program creates two threads to handle ZeroMQ communication, binding to a specified localhost port, and then starts the local WebSocket server, so that when the program runs, the ZeroMQ communication threads and the serve_websocket() function operate concurrently.
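Continuing the sketches above, the start-up wiring might look like this; the port numbers are placeholders.

```python
# Possible start-up wiring: ZeroMQ relay threads plus the local WebSocket server.
import asyncio
import threading

import websockets
import zmq


async def main() -> None:
    context = zmq.Context()
    zmq_socket = context.socket(zmq.PAIR)
    zmq_socket.bind("tcp://127.0.0.1:5556")       # MotionInput's client connects here

    # ZeroMQ relay loops run in background threads...
    threading.Thread(target=receive_from_zeromq, args=(zmq_socket,), daemon=True).start()
    threading.Thread(target=send_to_zeromq, args=(zmq_socket,), daemon=True).start()

    # ...while the WebSocket server for the WebExtension frontend runs in the event loop.
    async with websockets.serve(websocket_handler, "localhost", 8765):
        await asyncio.Future()                    # serve forever


if __name__ == "__main__":
    asyncio.run(main())
```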

The end result is a lightweight yet powerful relay server that facilitates bidirectional communication between MotionInput and the WebExtension frontend with minimal overhead while preserving the browser sandbox.