Color Detection of Objects in an Image

5 min read · Jan 10, 2021


Welcome to this article, where we are going to implement a color detection system.

We need to perform the following.

1. Object detection in the image using the YOLOv3 model (transfer learning)

2. K-means clustering to identify the dominant colors present in each detected object.

Object Detection:

YOLO stands for ‘You Only Look Once’. This algorithm was first proposed by Joseph Redmon, Santosh Divvala, Ross Girshick and Ali Farhadi. In the present scenario of object detection, YOLOv3 is a state-of-the-art model. The basic approach in this algorithm is to pass an image through a single neural network. The network divides the image into different regions, predicting bounding boxes and probabilities for each region.

The YOLOv3 algorithm uses a fully convolutional network architecture. There are 75 convolutional layers with skip connections and upsampling layers. No form of pooling is used in the algorithm; instead, a convolutional layer with stride 2 downsamples the feature maps. This prevents the loss of low-level features often attributed to pooling.

YOLOv3 is invariant to the size of the input, but we will stick to a particular size (i.e. resize the input to the required size). The network downsamples the image by a factor called the stride of the network.

Now let us look at how the outputs are interpreted.

This algorithm uses 1 x 1 convolutions to predict the attributes associated with the bounding boxes. The most important thing to notice is that each grid cell detects exactly one object, i.e. it predicts the bounding box for the object whose center lies in that grid cell. Suppose we assign B bounding boxes to be predicted per grid cell. Each bounding box has 5 values associated with it: the probability that an object is present, and x, y, w, and h, where x and y are the coordinates of the center pixel of the bounding box, and w and h are its width and height respectively. Now let us assume we have C classes of objects. Hence, the output feature map is a vector of size (C + 5) x B for each grid cell.
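To make the shapes concrete, here is a small sketch of the arithmetic above. The specific numbers (a 416 x 416 input, stride 32, B = 3 boxes per cell, C = 80 COCO classes) are common YOLOv3 defaults, not values stated in this article:

```python
# Illustrative YOLOv3-style defaults (assumed, not from the article text).
input_size = 416
stride = 32
B, C = 3, 80

grid = input_size // stride          # cells per side after downsampling
attrs_per_box = 5 + C                # (p, x, y, w, h) + C class scores
depth = B * attrs_per_box            # values predicted per grid cell

print(grid, attrs_per_box, depth)    # 13 85 255
```

So for these defaults, one detection scale produces a 13 x 13 grid with 255 values per cell.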

For training purposes, let us consider an image with some objects. First, we divide the image into a certain number of grid cells and set the labels for each grid cell as a vector of size C + 5. Note that the label for each grid cell has exactly one bounding box because only one object is predicted per grid cell as mentioned above. Thus, we have the output labels with the input image. We then decide a proper loss function and train the weights using backpropagation.

We are not going to train the weights ourselves, as it requires high computational power. Instead, we load the pre-trained weights provided by Darknet.

The loss function looks like this.
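The original figure is not reproduced here; written out following the YOLO paper (so treat this as a sketch of the sum-of-squared-error form, not the article's exact image), it is:

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

Here the indicator 1 is 1 when box j of cell i is responsible for an object (or, for the no-object term, when it is not).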

The first two rows are associated with the box loss, the 3rd row with the object loss, the 4th row with the no-object loss, and the last row with the class loss. To know more about the loss function, you can have a look at the research paper by the authors mentioned above.

The predictions of the network are transformed into the log space for faster computation. Thus we need to decode the output into an absolute term by applying the inverse of the transforms.

The following formulae describe how the output is transformed back into absolute form.
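The figure from the original article is not reproduced here; following the YOLOv3 paper, the transforms are:

```latex
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h}
\end{aligned}
```

Here c_x and c_y are the coordinates of the top-left corner of the grid cell making the prediction.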

bx, by, bw and bh are the x, y center coordinates, width, and height of the bounding box respectively. tx, ty, tw, th are the values that the network outputs. pw and ph are the anchor dimensions for the box. (Note: The use of anchor boxes is a different concept that allows us to predict multiple objects of different sizes in a single grid cell).

You might have noticed that the center coordinates are passed through the sigmoid function. The reason for this is that it squashes the value between 0 and 1, giving the position of the center relative to the grid cell rather than an absolute coordinate.

There are two more terms associated with this architecture, i.e. the objectness score and the class confidences. Both are associated with each grid cell. The objectness score is a measure of how likely it is that the grid cell contains an object, and the class confidences are the probabilities of each particular class given that an object is found.

Let us now process the output of the network.

As mentioned above, we specified the number of bounding boxes per grid cell. However, not every bounding box contains an object, so we need to discard the unnecessary bounding boxes. We follow two methods to do this.

1. Thresholding by object confidence i.e. discard the bounding boxes whose objectness score is less than some threshold value.

2. Non-max Suppression, i.e. discard the boxes whose IoU values with a higher-scoring box are greater than some threshold. IoU stands for Intersection over Union. The reason we discard such boxes is that a high IoU means they point to the same object, and we need to keep only one bounding box per object.
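The two filtering steps above can be sketched in NumPy. The function names, the (x1, y1, x2, y2) box format, and the threshold values are illustrative choices, not the article's actual implementation:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes; boxes are (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def filter_boxes(boxes, scores, conf_thresh=0.5, iou_thresh=0.4):
    """Step 1: drop low-confidence boxes. Step 2: greedy non-max suppression."""
    keep = scores >= conf_thresh
    boxes, scores = boxes[keep], scores[keep]
    order = scores.argsort()[::-1]            # highest score first
    kept = []
    while order.size > 0:
        best = order[0]
        kept.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_thresh]  # drop boxes covering the same object
    return boxes[kept], scores[kept]

# Two overlapping detections of one object, plus one low-confidence box.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.3])
kept_boxes, kept_scores = filter_boxes(boxes, scores)
print(kept_boxes)   # only the highest-scoring box survives
```

The low-confidence box is removed in step 1, and the overlapping duplicate (IoU ≈ 0.68 with the best box) is suppressed in step 2.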

Now as we have the coordinates associated with the bounding box, let us move to detect colors using K-Means Clustering.

Colour Detection:

K-Means Clustering is an unsupervised algorithm that works on the concept of grouping similar points into one cluster. Consider some random points in 3D space.

Here you can see that the points near each other are clustered into one group. We will use a similar idea to find prominent colors in the image. The scikit-learn library implements this algorithm for us. K-Means Clustering returns the coordinates of the cluster centers; we just need to specify the number of clusters.

We know that an image has three channels, i.e. R, G, and B. We are going to treat these RGB values as the x, y, z coordinates for the algorithm. Similar pixel values are clustered together, and each cluster center gives us the mean of all the similar pixel values.
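This step can be sketched with scikit-learn's `KMeans`, treating each pixel's RGB triple as a 3D point. The toy `crop` below stands in for the pixels inside one bounding box (a real crop of shape `(h, w, 3)` would first be flattened with `crop.reshape(-1, 3)`):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy "crop": 50 pure-red pixels and 50 pure-blue pixels (RGB order).
crop = np.vstack([np.tile([255, 0, 0], (50, 1)),
                  np.tile([0, 0, 255], (50, 1))]).astype(float)

# Each pixel is a point in 3-D RGB space; ask for 2 clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(crop)
centers = kmeans.cluster_centers_     # mean RGB value of each cluster
print(centers)                        # one center near red, one near blue
```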

Note that we are only going to look at the pixels inside the bounding box.

In my implementation, I have predefined the RGB values of some default colors. Then I check the Euclidean distance of each cluster center to the default colors; the default color with the minimum Euclidean distance is chosen as the prominent color. This continues until we have gone through all the clusters. Note that we need to use a set to store the colors found in each bounding box, as two cluster centers may map to the same default color. (You can add more default colors to the list as per your wish.)

Yup, we are done with the concepts used for detecting the colors of the objects present in the image.

You can go through the two GitHub code links mentioned below: