Computer Vision: “MediaPipe”

MediaPipe is Google’s open-source framework for media processing. It is cross-platform, which means it runs everywhere: on Android, iOS, the web, and YouTube servers.

What do you think is common in all these pictures?

Take a moment to think about it and make a guess.

“MediaPipe” Computer Vision

If you guessed MediaPipe, you are absolutely right: the MediaPipe module is common to all these images.

Uses of MediaPipe

Every YouTube video we watch is processed by machine learning models using MediaPipe. Google has not hired thousands of employees to watch every video people upload, because even thousands of people could not check the amount of data Google receives daily. Machine learning and deep learning models take over tasks like this, which are hard for humans to complete, and finish them in far less time. They also save the cost of hiring employees.

Yes, Google has machine learning/deep learning models that check whether videos comply with its policies and whether the content has copyright issues.

Basically, MediaPipe is a framework for computer vision and deep learning that is used to build perception pipelines. For now, you just need to know that a perception pipeline takes in audio, video, or time-series data and passes it through a series of processing steps.

Google has been using MediaPipe for a long time, mainly for two tasks.

1. Dataset preparation for Machine learning training

Pose Estimation

Pose estimation means finding the key points of a person or an object. A person’s key points include the elbow, knee, wrist, and so on. MediaPipe can be used to train an ML model to learn these key points and then apply that knowledge to specific tasks; this is useful for action recognition, for example.

Pose Estimation

2. ML inference pipelines

Live Data

ML inference is the process of running live data points through a trained model to get predictions.

Example: we have all used Snapchat and Instagram filters, perhaps while recording videos. Applying those filters to the live camera feed is what ML inference means.

ML inference pipelines

What is possible with MediaPipe?

MediaPipe can be used to solve a number of AI problems. Some of them are listed here:

  1. Object Tracking
  2. Box Tracking
  3. Face Mesh
  4. Hair Segmentation
  5. Live Hand Tracking and many more.

Real-Time Hand Tracking Project

Here I have developed a live hand tracking project using MediaPipe.

Hand tracking uses two modules on the backend:

1. Palm detection

The palm detection module works on the complete image and crops it down to the region that contains the hand.

Palm Detection

2. Hand Landmarks

From the cropped image, the landmark module finds 21 different landmarks on the hand.

Hand Landmarks

Installation of modules

For this specific task, we require three modules: cv2, mediapipe, and time. The time module ships with Python, and cv2 is provided by the opencv-python package.

In a Jupyter Notebook, I installed pyforest to get convenient access to many popular Python libraries at once.

Installing Modules

Once the modules are installed, running this command again reports that the requirements are already satisfied. See the image below.


If mediapipe is still not installed or does not work, install it separately; it is a relatively new module and may not be covered by pyforest. I originally planned to work directly in a Kaggle notebook but found that mediapipe was not working there, so I installed it and worked in Jupyter Notebook instead. A plus point is that Jupyter Notebook does not need an internet connection once everything is installed.

This is how mediapipe is installed in Jupyter Notebook.

Mediapipe installation
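
Before going further, it can help to confirm that the three modules are actually importable. The snippet below is a minimal sketch of such a check, not part of the original notebook; the names REQUIRED_MODULES and missing_packages are made up for illustration. It relies on the fact that cv2 is installed from the opencv-python package on PyPI.

```python
import importlib.util

# Import name -> PyPI package that provides it.
# None marks modules that ship with Python itself.
REQUIRED_MODULES = {"cv2": "opencv-python", "mediapipe": "mediapipe", "time": None}


def missing_packages(required=None):
    """Return the packages to install for import names that cannot be found."""
    required = REQUIRED_MODULES if required is None else required
    missing = []
    for module_name, package in required.items():
        if importlib.util.find_spec(module_name) is None:
            missing.append(package or module_name)
    return missing


to_install = missing_packages()
if to_install:
    # In a notebook cell the install command would be: %pip install <packages>
    print("Missing, install with: pip install " + " ".join(to_install))
else:
    print("cv2, mediapipe, and time are all available.")
```

If anything is reported as missing, run the printed pip command and then restart the notebook kernel so the new packages are picked up.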

Importing modules

Importing libraries

Camera Object

In the code below, I create a camera object just to check that the camera is working properly.

Here is the output.

Camera object
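
A camera check along these lines might look like the sketch below. It is not the exact code from the original screenshot: the function names is_quit_key and preview_camera, the window title, and the choice of the q key to quit are assumptions made for illustration, and it assumes the default webcam sits at index 0.

```python
def is_quit_key(key_code, quit_char="q"):
    """Return True when the code returned by cv2.waitKey matches quit_char."""
    # waitKey returns -1 when no key was pressed; the mask keeps only the
    # low byte because some platforms set extra high bits.
    return key_code != -1 and (key_code & 0xFF) == ord(quit_char)


def preview_camera(camera_index=0):
    """Show live frames from the webcam until the quit key is pressed."""
    import cv2  # imported here so is_quit_key can be used without OpenCV

    cap = cv2.VideoCapture(camera_index)
    if not cap.isOpened():
        raise RuntimeError(f"Could not open camera {camera_index}")
    try:
        while True:
            success, img = cap.read()  # img is a BGR image array
            if not success:
                break
            cv2.imshow("Camera", img)
            if is_quit_key(cv2.waitKey(1)):
                break
    finally:
        # Always free the camera and close the window, even on errors.
        cap.release()
        cv2.destroyAllWindows()
```

Calling preview_camera() should open a window with the live feed; if no window appears or an error is raised, the camera index or camera permissions are the first things to check.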

Creating an object from class Hand

Here I create a hands object from the Hands class. Each frame is converted from BGR to RGB before it is passed to this object, because OpenCV reads frames in BGR while the hands object only accepts RGB.


Extracting Information from the object results

Before extracting further details about the hands, make sure there is something in the results object. A simple first step is to print the results object and see what it holds. On its own, it only shows a MediaPipe solution output object and nothing else, even when a hand is shown.

To check if hand is being detected or not

Update the print statement to print multi_hand_landmarks from the results object, and see whether the camera is detecting hands.

Now that I have updated the print statement, the output is “None” because no hand is shown.

Let's see what information is extracted when hand/ hands are shown.

So you see, when the hand is detected by the camera it gives some values.
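
The steps from creating the hands object to checking for a detection can be sketched as below. This is a hedged reconstruction rather than the original code: the helper names create_hand_detector, process_frame, and count_hands are assumptions, and it assumes a mediapipe version that still ships the mp.solutions API.

```python
def create_hand_detector(max_num_hands=2):
    """Create a MediaPipe Hands object (legacy mp.solutions API)."""
    import mediapipe as mp  # imported here so count_hands works without it

    return mp.solutions.hands.Hands(max_num_hands=max_num_hands)


def process_frame(hands, img_bgr):
    """Convert a BGR frame from OpenCV to RGB and run hand detection on it."""
    import cv2

    img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)  # Hands expects RGB
    return hands.process(img_rgb)


def count_hands(results):
    """Return how many hands a results object holds (0 when none were found)."""
    # multi_hand_landmarks is None, not an empty list, when no hand is visible.
    landmarks = results.multi_hand_landmarks
    return 0 if landmarks is None else len(landmarks)
```

Inside the camera loop, results = process_frame(hands, img) followed by print(results.multi_hand_landmarks) would print None until a hand comes into view, and count_hands(results) gives a quick way to branch on that.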

Detecting Landmarks and Drawing points

In the code below, a drawing object (mp_draw) is created. The if statement then checks whether any landmarks were detected; if so, the for loop runs over each detected hand and draws a point wherever a landmark is detected.

Interesting, right? See the image.

Landmarks are detected and points are drawn

Drawing Connections between landmarks

Connections are drawn by also passing the hand connections (mp_hand.HAND_CONNECTIONS) to the drawing call.
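
Drawing the points and the connections can be combined in one small helper, sketched below. The function name draw_hands and its with_connections flag are assumptions for illustration; the mediapipe calls it uses (mp.solutions.drawing_utils.draw_landmarks and mp.solutions.hands.HAND_CONNECTIONS) come from the legacy mp.solutions API.

```python
def draw_hands(img, results, with_connections=True):
    """Draw landmarks for every detected hand and return the number of hands."""
    if results.multi_hand_landmarks is None:
        return 0  # nothing detected, leave the image untouched

    import mediapipe as mp  # only needed once there is something to draw

    mp_hands = mp.solutions.hands
    mp_draw = mp.solutions.drawing_utils
    # Passing None as connections draws the points only.
    connections = mp_hands.HAND_CONNECTIONS if with_connections else None
    for hand_lms in results.multi_hand_landmarks:
        mp_draw.draw_landmarks(img, hand_lms, connections)
    return len(results.multi_hand_landmarks)
```

The image is modified in place, so it can be passed to cv2.imshow right after the call. Setting with_connections=False reproduces the points-only picture from the previous step.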


Frame Rate

To compute the frame rate (fps), two variables are declared, p_time and c_time (previous and current time). The fps is one divided by the difference between them.
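
The frame rate calculation can be tried without a camera, as in this minimal sketch. The names p_time and c_time follow the text; the helper compute_fps and the short sleep that stands in for reading a frame are assumptions for illustration.

```python
import time


def compute_fps(p_time, c_time):
    """Frames per second implied by the gap between two frame timestamps."""
    elapsed = c_time - p_time
    # Guard against a zero gap, which can happen with a coarse clock.
    return 1.0 / elapsed if elapsed > 0 else 0.0


p_time = time.time()
for _ in range(3):        # stands in for the camera loop
    time.sleep(0.01)      # stands in for reading and processing one frame
    c_time = time.time()
    fps = compute_fps(p_time, c_time)
    p_time = c_time       # the current frame becomes the previous one
    print(f"FPS: {int(fps)}")
```

In the real loop, the value could be written onto the frame instead of printed, for example with cv2.putText(img, str(int(fps)), (10, 70), cv2.FONT_HERSHEY_PLAIN, 3, (255, 0, 255), 3).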

Extracting value of each landmark

This is useful when a specific point needs to be tracked for some purpose.

As we know, there are 21 landmarks in a hand (0 to 20). The landmark information gives the x, y, and z coordinates together with an id, listed in the correct order. The x and y coordinates are given as fractions of the image width and height, so we can use them to find the location of a landmark on the hand.

id and coordinates

Here I first read the height, width, and channels (h, w, c) of the image. The previous code gave decimal values, and I wanted exact integer pixel positions, so I multiplied x and y by the width and height and converted the resulting center values (cx, cy) to integers.
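
The conversion from decimal landmark values to integer pixel positions can be sketched as follows. The helper names landmark_to_pixel and landmark_pixels are assumptions; hand_lms stands for one entry of results.multi_hand_landmarks, whose landmark attribute holds the 21 points in id order.

```python
def landmark_to_pixel(lm, img_width, img_height):
    """Scale a normalized landmark (x, y as fractions) to integer pixels."""
    return int(lm.x * img_width), int(lm.y * img_height)


def landmark_pixels(hand_lms, img_shape):
    """Return a list of (id, cx, cy) for every landmark of one hand."""
    h, w = img_shape[:2]  # img.shape is (height, width, channels)
    pixels = []
    for lm_id, lm in enumerate(hand_lms.landmark):
        cx, cy = landmark_to_pixel(lm, w, h)
        pixels.append((lm_id, cx, cy))
    return pixels
```

Inside the drawing loop, landmark_pixels(hand_lms, img.shape) gives the id and pixel position of each point, which is what is needed to track or mark one specific landmark.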

Drawing circle on a specific landmark

For drawing, I have created a drawing object (mp_draw). I have also added an if condition for point 0, because I wanted a filled circle at landmark 0.

Highlighting fingertips

For fingertips, the landmarks are 4, 8, 12, 16, and 20. See the code in the image below.
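
Marking a chosen set of landmarks, whether the single point 0 or the five fingertips, can be sketched as below. The names FINGERTIP_IDS, select_points, and highlight_points are assumptions, as are the circle radius and color; cv2.circle with cv2.FILLED is the OpenCV call for a filled circle.

```python
FINGERTIP_IDS = (4, 8, 12, 16, 20)  # thumb, index, middle, ring, pinky tips


def select_points(hand_lms, img_shape, wanted_ids=FINGERTIP_IDS):
    """Return integer (cx, cy) positions for the landmarks in wanted_ids."""
    h, w = img_shape[:2]
    return [
        (int(lm.x * w), int(lm.y * h))
        for lm_id, lm in enumerate(hand_lms.landmark)
        if lm_id in wanted_ids
    ]


def highlight_points(img, points, radius=15, color=(255, 0, 255)):
    """Draw a filled circle on img at every (cx, cy) position in points."""
    import cv2  # imported here so select_points works without OpenCV

    for center in points:
        cv2.circle(img, center, radius, color, cv2.FILLED)
    return img
```

For example, highlight_points(img, select_points(hand_lms, img.shape)) marks the fingertips, while passing wanted_ids=(0,) to select_points marks only landmark 0 at the wrist.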

This is how we can use these landmarks for different tasks. I am ending the article here, but this is not the end of the study; there is still a lot to explore.

BSCS graduate from BBSUL Karachi, interested in the field of Artificial Intelligence and eagerly willing to explore the field of AI and DS.