Real-Time Detection Of Social Distancing

Hello guys! Let us look at how we can build a real-time social distance detector using computer vision.

The first and most important task we need to carry out is object detection. To accomplish this, we are going to use a state-of-the-art model named YOLO.

YOLO stands for ‘You Only Look Once’. This algorithm was first proposed by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. In the present scenario of object detection, YOLOv3 is a state-of-the-art model. The basic approach in this algorithm is to pass an image through a single neural network. The network divides the image into different regions and predicts bounding boxes and probabilities for each region.

This algorithm predicts the regions in the image where some common objects are found, but we are only going to consider the regions where humans are localized. Let us now look at the architecture of this network.

YOLOv3 uses a fully convolutional network architecture. There are 75 convolutional layers with skip connections and upsampling layers. No form of pooling is used in the algorithm. Instead, a convolutional layer with stride 2 is used to downsample the feature maps. This prevents the loss of low-level features often attributed to pooling.

You can go through the ‘yolov3.cfg’ file in the GitHub link mentioned at the end to have a look at the complete architecture of the model. (Note: cfg stands for configuration)

YOLOv3 accepts inputs of different sizes, but we will stick to a particular size (the input is resized to the required size). The size of the input is a major factor in determining the rate at which outputs are predicted. As we are going to build a real-time application, we need a model that predicts objects quickly and with good accuracy. Keeping this in mind, researchers have come up with several variants of YOLOv3, one being YOLOv3 320, where the size of the input image is 320 x 320. This variant has an mAP (mean average precision) score of 51.5 and processes 45 frames per second.
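To get a feel for what the input size means for the network, here is a small sketch that computes the size of the grid the image is divided into. It assumes the standard YOLOv3 configuration, which predicts at three scales with strides 32, 16 and 8; the names `YOLOV3_STRIDES` and `grid_sizes` are made up for this illustration.

```python
# Strides of the three detection scales (assumed from the standard yolov3.cfg).
YOLOV3_STRIDES = (32, 16, 8)


def grid_sizes(input_size, strides=YOLOV3_STRIDES):
    """Return the side length of the detection grid at each scale."""
    if any(input_size % s for s in strides):
        raise ValueError("input size should be a multiple of every stride")
    return [input_size // s for s in strides]


print(grid_sizes(320))  # [10, 20, 40]
print(grid_sizes(416))  # [13, 26, 52]
```

A smaller input gives fewer grid cells, which is one reason the 320 x 320 variant is faster.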

Now let us look at how the outputs of this model are interpreted.

This algorithm uses 1 x 1 convolutions to predict the attributes associated with the bounding boxes. The most important thing to notice is that each grid cell can detect exactly one object, i.e. it predicts the bounding box associated with the object whose center lies in that grid cell. Suppose we assign B bounding boxes to be predicted per grid cell. Each bounding box will have 5 values associated with it: the probability of the presence of an object, x, y, w, and h, where x and y are the coordinates of the center pixel of the bounding box, and w and h are its width and height respectively. Now let us assume we have C classes of objects. Hence, the output feature map holds a vector of size (C + 5) x B for each grid cell.
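The arithmetic above can be checked with a tiny sketch. The values C = 80 (the COCO classes) and B = 3 are assumptions taken from the standard YOLOv3 setup, and the helper names are hypothetical.

```python
def output_depth(num_classes, boxes_per_cell):
    """Length of the prediction vector for one grid cell: (C + 5) x B."""
    return (num_classes + 5) * boxes_per_cell


def split_cell_vector(vector, num_classes, boxes_per_cell):
    """Split one grid cell's vector into B chunks of C + 5 values each."""
    step = num_classes + 5
    if len(vector) != step * boxes_per_cell:
        raise ValueError("vector length does not match (C + 5) x B")
    return [vector[i * step:(i + 1) * step] for i in range(boxes_per_cell)]


print(output_depth(80, 3))  # 255
```

So for a 10 x 10 grid, the output feature map would have the shape 10 x 10 x 255 under these assumptions.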

For training purposes, let us consider an image with some objects. First, we divide the image into a certain number of grid cells and set the labels for each grid cell as a vector of size C + 5. Note that the label for each grid cell has exactly one bounding box because only one object is predicted per grid cell as mentioned above. Thus, we have the output labels with the input image. We then decide on a proper loss function and train the weights using backpropagation.
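Here is a minimal sketch of how such a label could be built for one object. The exact encoding is an assumption on my part (center offsets measured inside the cell, width and height as a fraction of the image size, one-hot classes); `encode_label` is a hypothetical helper, not a function from the repository.

```python
def encode_label(box, class_id, image_size, grid, num_classes):
    """Build the C + 5 label for the grid cell that contains the box center.

    box is (cx, cy, w, h) in pixels. Returns (row, col, label).
    """
    cx, cy, w, h = box
    cell = image_size / grid                  # side of one grid cell in pixels
    col = min(int(cx // cell), grid - 1)      # cell that owns the center
    row = min(int(cy // cell), grid - 1)
    x = cx / cell - col                       # center offset inside the cell
    y = cy / cell - row
    one_hot = [0.0] * num_classes
    one_hot[class_id] = 1.0
    label = [1.0, x, y, w / image_size, h / image_size] + one_hot
    return row, col, label


row, col, label = encode_label((168, 88, 64, 128), 0, 320, 10, 3)
print(row, col, label)
```

Every grid cell without an object center would simply get a label whose first value is 0.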

We are not going to train the weights ourselves, as it requires high computation power (access to a GPU). Instead, we load the pretrained weights released with Darknet. You can go through the file ‘’ which builds a model in Keras in accordance with the cfg file which completely describes the architecture of YOLOv3 320.

Every supervised learning model has a loss function that drives the learning of the correct weights. Let us have a look at the loss function associated with this model.

The first two rows are associated with the box loss, the third row with the object loss, the fourth row with the no-object loss, and the last row with the class loss. To know more about the loss function, you can have a look at the research paper by the authors mentioned above.
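For reference, here is the five-row loss as I recall it from the original YOLO paper, which matches the row-by-row description above. Treat it as an assumption about the formula being described. Note that in this formula S x S is the number of grid cells and C_i denotes the confidence score of a box, not the number of classes.

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```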

The network does not output the box coordinates directly. Its predictions are in a transformed space (the width and height are in log space), so we need to decode the output into absolute terms by applying the inverse of the transforms.

The following formulae describe how the output is transformed back into absolute form.

bx, by, bw and bh are the x, y center coordinates, width, and height of the bounding box respectively. tx, ty, tw, th are the values that the network outputs. pw and ph are the anchor dimensions for the box. (Note: The use of anchor boxes is a different concept that allows us to predict multiple objects of different sizes in a single grid cell).

You might have noticed that the center coordinates are passed through the sigmoid function. The reason to do this is that it gives a value between 0 and 1, which is the position of the center relative to the grid cell that predicts the box.
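The decoding step can be sketched in a few lines of Python. This assumes the formulae are the ones from the YOLOv3 paper: bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy, bw = pw * exp(tw), bh = ph * exp(th), where cx and cy (not defined above) are the column and row index of the grid cell. Multiplying by the stride to get pixels, and anchors given in pixels, are my own assumptions for the illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(t, cell, anchor, stride):
    """Turn raw outputs (tx, ty, tw, th) into a box (bx, by, bw, bh) in pixels.

    cell   = (cx, cy): column and row index of the grid cell
    anchor = (pw, ph): anchor width and height in pixels (assumed)
    stride = size of one grid cell in pixels (assumed)
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    bx = (sigmoid(tx) + cx) * stride   # center stays inside its grid cell
    by = (sigmoid(ty) + cy) * stride
    bw = pw * math.exp(tw)             # width and height scale the anchor
    bh = ph * math.exp(th)
    return bx, by, bw, bh


print(decode_box((0, 0, 0, 0), (3, 4), (10, 20), 32))  # (112.0, 144.0, 10.0, 20.0)
```

With all raw outputs at zero, the box sits at the middle of its grid cell and has exactly the size of the anchor.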

There are two more terms associated with this architecture, i.e. the objectness score and the class confidences. Both are predicted for every bounding box in each grid cell. The objectness score is a measure of how likely it is that the box contains an object, and the class confidences are the probabilities of each particular class given that an object is found.

Let us now process the output of the network.

As mentioned above, we specified the number of bounding boxes per grid cell. Now, not every bounding box contains an object, right? So we need to discard the unnecessary bounding boxes. We follow two methods to do this.

1. Thresholding by object confidence i.e. discard the bounding boxes whose objectness score is less than some threshold value.

2. Non-max suppression i.e. among the boxes that overlap, keep the one with the highest score and discard every other box whose IoU with it is greater than some threshold. IoU stands for Intersection over Union. The reason we discard these boxes is that they point to the same object, so we need to keep only one bounding box per object.
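The two filtering steps above can be sketched in plain Python as follows. The box format (x1, y1, x2, y2), the threshold values 0.5 and 0.45, and the function names are assumptions chosen for the illustration, not values taken from the repository.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_boxes(boxes, scores, score_threshold=0.5, iou_threshold=0.45):
    """Return the indices of the boxes that survive both steps."""
    # Step 1: thresholding by object confidence.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    # Step 2: non-max suppression, visiting the best-scoring boxes first.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.2]
print(filter_boxes(boxes, scores))  # [0, 2]
```

Box 3 is dropped by the score threshold, and box 1 is dropped because it overlaps heavily with the better-scoring box 0.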

Now, we are done with the task of detecting the objects in the image. Our next task is to keep only the humans from the list of detected objects. Since we have the coordinates of the center pixel of the bounding box of each detected human, we need to check whether the humans are too close to each other or are at a reasonable distance. To do this, we take the Euclidean distance between the centers of the bounding boxes and compare it with a threshold distance, which in my case I have chosen to be 70 pixels. This threshold decides whether two persons are considered too close or at a safe distance. The right threshold will depend on the position of the camera and the area that the camera captures. We mark the boxes which are closer than the threshold distance as red and those which are at a safe distance as green. OpenCV is used to carry out the task of marking and displaying the bounding boxes.
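The distance check itself is only a few lines. Here is a minimal sketch, assuming the centers are given as (x, y) pixel tuples; `find_violations` is a hypothetical name, and the colors are plain strings here rather than the values you would pass to OpenCV.

```python
import math
from itertools import combinations


def find_violations(centers, threshold=70.0):
    """Return the indices of people closer than the threshold to someone else."""
    too_close = set()
    for (i, a), (j, b) in combinations(enumerate(centers), 2):
        if math.hypot(a[0] - b[0], a[1] - b[1]) < threshold:
            too_close.update((i, j))
    return too_close


centers = [(100, 100), (140, 130), (400, 300)]
violations = find_violations(centers)
colors = ["red" if i in violations else "green" for i in range(len(centers))]
print(colors)  # ['red', 'red', 'green']
```

The first two people are 50 pixels apart, so both boxes turn red, while the third stays green.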

Yup, we are done with a real-time social distance detector.

I would like to suggest that we can also find the actual distance between two objects in the image given some extra information like the focal length of the camera, the area that the camera covers, and the actual distance of at least one point in the image. We can take the help of a perspective warp, which converts the image into a bird's-eye view, compute the relative distance in that view, and then scale it to get the actual distance. You can give some thought to implementing this.
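As a starting point, here is a rough sketch of the bird's-eye idea. It assumes you already have a 3 x 3 perspective matrix H (for example one computed with OpenCV's cv2.getPerspectiveTransform from four reference points on the ground) and a known scale in metres per unit of the warped view. Both function names and the `metres_per_unit` parameter are hypothetical.

```python
import math


def to_birds_eye(H, point):
    """Map an image point (x, y) into the bird's-eye view using matrix H."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this matrix")
    return u / w, v / w


def ground_distance(H, p, q, metres_per_unit=1.0):
    """Distance between two image points, measured in the bird's-eye view."""
    (x1, y1), (x2, y2) = to_birds_eye(H, p), to_birds_eye(H, q)
    return math.hypot(x2 - x1, y2 - y1) * metres_per_unit


# A made-up matrix that only halves both axes, just to show the call.
H = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 1.0]]
print(ground_distance(H, (0, 0), (6, 8)))  # 5.0
```

For a real camera the matrix would also correct the perspective, so that people far from the camera are not reported as closer together than they really are.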

The link to GitHub code :