Object Detection Using Machine Learning Techniques: A Fast Review

What is object detection?
Predicting the location of an object along with its class is called object detection. So, object detection = locating the object (localization) + classifying the object (classification). In short, classification answers 'what', and localization answers 'where'.
e.g. Suppose you have an image and you have to tell whether it contains a cat or not. This is a classification problem, since we are asking 'what' the image contains. Outlining the region within the image 'where' the cat appears, however, is a localization problem.
e.g. In the image below, classification gives us the class "elephant".



Class: Elephant
After localization, we also get the exact location of the elephant in the image.
In the literature, the object detection problem is handled by two approaches:
1. Object detection as a classification problem.
2. Object detection as a regression problem.
Following are the different methods of object detection using the classification approach. Why do we have so many methods, and what are the salient features of each? Let's have a look:
 
1. Object Detection Using HOG Features: In the history of computer vision, Navneet Dalal and Bill Triggs introduced Histogram of Oriented Gradients (HOG) features in 2005. HOG features are computationally inexpensive and work well for many real-world problems. We run a sliding window over an image pyramid; on each window we calculate HOG features, which are fed to an SVM (Support Vector Machine) to create a classifier.
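As a rough illustration, here is a minimal pure-NumPy sketch of the sliding-window plus descriptor idea. The `hog_like_features` helper is a hypothetical, heavily simplified stand-in for real HOG (which uses cells, blocks and overlapping normalisation), and the SVM stage is omitted:

```python
import numpy as np

def sliding_windows(image, win=(64, 64), stride=32):
    """Yield (x, y, patch) for each window position; the sliding-window
    stage that precedes feature extraction."""
    h, w = image.shape[:2]
    for y in range(0, h - win[1] + 1, stride):
        for x in range(0, w - win[0] + 1, stride):
            yield x, y, image[y:y + win[1], x:x + win[0]]

def hog_like_features(patch, bins=9):
    """Toy HOG-style descriptor: one histogram of gradient orientations
    weighted by gradient magnitude (real HOG is far more elaborate)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180  # unsigned orientation
    hist, _ = np.histogram(ang, bins=bins, range=(0, 180), weights=mag)
    return hist / (np.linalg.norm(hist) + 1e-6)

img = np.random.rand(128, 128)
feats = [hog_like_features(p) for _, _, p in sliding_windows(img)]
print(len(feats), feats[0].shape)  # 9 windows, one 9-bin descriptor each
```

In the real pipeline these descriptors (one per window, per pyramid level) are what the SVM scores.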
 
2. Region-based Convolutional Neural Networks (R-CNN): After the rise of deep learning, HOG-based classifiers were replaced with more accurate convolutional neural network (CNN) based classifiers. However, there was one problem: CNNs were too slow and computationally very expensive, so it was impossible to run a CNN on the many patches generated by a sliding-window detector. R-CNN solves this by using an object-proposal algorithm called Selective Search, which reduces the number of bounding boxes fed to the classifier to roughly 2000 region proposals. Selective Search uses local cues like texture, intensity, color and/or a measure of "insideness" to generate the possible locations of objects.
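To make the pipeline concrete, here is a toy sketch of the warping step: R-CNN resizes every region proposal to a fixed input size before feeding it to the CNN. The boxes, the `warp_proposal` helper and the nearest-neighbour resize are illustrative assumptions; real R-CNN uses anisotropic warping with context padding, and the CNN and SVM stages are omitted here:

```python
import numpy as np

def warp_proposal(image, box, out=(224, 224)):
    """Crop a (x0, y0, x1, y1) region proposal and resize it to a fixed
    size with a simple nearest-neighbour scheme."""
    x0, y0, x1, y1 = box
    patch = image[y0:y1, x0:x1]
    ys = np.linspace(0, patch.shape[0] - 1, out[0]).astype(int)
    xs = np.linspace(0, patch.shape[1] - 1, out[1]).astype(int)
    return patch[np.ix_(ys, xs)]

image = np.random.rand(480, 640)
proposals = [(10, 20, 200, 150), (300, 100, 600, 400)]  # e.g. from Selective Search
warped = [warp_proposal(image, b) for b in proposals]
print([w.shape for w in warped])  # every proposal becomes (224, 224)
```

Each warped patch would then go through the CNN once, which is exactly why ~2000 proposals per image made R-CNN slow.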

3. Spatial Pyramid Pooling (SPP-net): Still, R-CNN was very slow, because running a CNN on the 2000 region proposals generated by Selective Search takes a lot of time. SPP-net solved this problem by computing the convolutional feature map for the entire image only once. It derives the features for each patch generated by Selective Search by pooling over just the section of the last conv layer's feature map that corresponds to the region. There was one more challenge: the fully connected layers of the CNN need a fixed-size input, so SPP-net introduces one more trick. It uses spatial pyramid pooling after the last convolutional layer instead of the traditionally used max-pooling. However, SPP-net had one big drawback: it was not trivial to back-propagate through the spatial pooling layer, so only the fully connected part of the network was fine-tuned.
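The spatial pyramid pooling trick can be sketched in a few lines of NumPy: whatever the spatial size of the cropped feature-map region, max-pooling it on fixed 1×1, 2×2 and 4×4 grids yields a vector of the same length. The pyramid levels here are illustrative:

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a C x H x W feature map on an n x n grid for each pyramid
    level and concatenate, giving a fixed-length vector
    (C * (1 + 4 + 16) values) regardless of H and W."""
    c, h, w = fmap.shape
    out = []
    for n in levels:
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = fmap[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                out.append(cell.max(axis=(1, 2)))
    return np.concatenate(out)

# Two regions of different spatial size map to the same-length vector.
v1 = spatial_pyramid_pool(np.random.rand(8, 13, 17))
v2 = spatial_pyramid_pool(np.random.rand(8, 30, 9))
print(v1.shape, v2.shape)  # both (168,)
```

That fixed length is what lets arbitrary-size regions feed the fully connected layers.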

4. Fast R-CNN: Fast R-CNN combines the ideas from SPP-net and R-CNN and fixes the key problem in SPP-net: it made end-to-end training possible, solving the back-propagation issue. One more thing Fast R-CNN did was add bounding-box regression to the neural network training itself. These two changes reduce the overall training time and increase accuracy compared to SPP-net, because the CNN is learned end to end.
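Fast R-CNN's end-to-end training couples a classification loss with a box-regression loss in one objective. The `multitask_loss` helper below is a hypothetical NumPy sketch of that idea (softmax cross-entropy plus smooth-L1 on the box offsets, with the regression term switched off for background), not the paper's exact implementation:

```python
import numpy as np

def multitask_loss(cls_scores, cls_label, box_pred, box_target, lam=1.0):
    """Joint loss sketch: cross-entropy for the class plus smooth-L1 box
    regression, applied only to non-background RoIs (class 0 = background)."""
    # numerically stable softmax cross-entropy
    z = cls_scores - cls_scores.max()
    log_probs = z - np.log(np.exp(z).sum())
    l_cls = -log_probs[cls_label]
    # smooth L1 over the 4 box offsets (tx, ty, tw, th)
    d = np.abs(box_pred - box_target)
    l_box = np.where(d < 1, 0.5 * d ** 2, d - 0.5).sum()
    return l_cls + lam * (cls_label != 0) * l_box

loss = multitask_loss(np.array([0.1, 2.0, -1.0]), 1,
                      np.array([0.2, -0.1, 0.05, 0.3]),
                      np.array([0.0, 0.0, 0.0, 0.0]))
print(round(float(loss), 3))  # 0.253
```

Because both terms are differentiable, gradients flow back through the whole network, which is exactly the end-to-end property SPP-net lacked.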

5. Faster R-CNN: So, what did Faster R-CNN improve? Well, it's faster. And how does it achieve that? The slowest part of Fast R-CNN was Selective Search (or EdgeBoxes). Faster R-CNN replaces Selective Search with a very small convolutional network, called a Region Proposal Network (RPN), to generate regions of interest. Faster R-CNN is about 10 times faster than Fast R-CNN with similar accuracy.
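The RPN's starting point is a dense grid of anchor boxes, one set per feature-map cell. A minimal sketch, assuming a stride-16 feature map and the commonly cited 3 scales × 3 aspect ratios (k = 9 anchors per cell); the RPN then scores each anchor as object/background and regresses offsets, neither of which is shown here:

```python
import numpy as np

def make_anchors(fmap_h, fmap_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Place k = len(scales) * len(ratios) anchors (cx, cy, w, h) at
    every feature-map cell, centred on the cell in image coordinates."""
    anchors = []
    for y in range(fmap_h):
        for x in range(fmap_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

a = make_anchors(38, 50)  # e.g. a ~600x800 image at stride 16
print(a.shape)            # (38 * 50 * 9, 4) = (17100, 4)
```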

6. Mask R-CNN: Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding-box recognition. It is simple to train, adds only a small overhead to Faster R-CNN, and runs at about 5 fps. Mask R-CNN is the most recent variation that handles object detection and segmentation as a classification problem. The concept is very simple: Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this Mask R-CNN adds a third branch that outputs the object mask, a binary mask indicating the pixels where the object lies inside the bounding box. Mask R-CNN uses a fully convolutional network (FCN) to obtain the finer spatial layout of an object within the bounding box output by Faster R-CNN. To know more about FCNs, refer to the paper "Fully Convolutional Networks for Semantic Segmentation".
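At inference time the mask branch's small fixed-size binary mask is resized into the detected box and pasted onto a full-image canvas. A toy sketch, assuming a 28×28 mask, a nearest-neighbour resize and a 0.5 threshold (`paste_mask` is a hypothetical helper, not the paper's exact procedure):

```python
import numpy as np

def paste_mask(mask28, box, image_shape):
    """Resize a small per-detection mask into its (x0, y0, x1, y1) box
    and paste it into a boolean full-image canvas."""
    x0, y0, x1, y1 = box
    bh, bw = y1 - y0, x1 - x0
    ys = np.linspace(0, mask28.shape[0] - 1, bh).astype(int)
    xs = np.linspace(0, mask28.shape[1] - 1, bw).astype(int)
    canvas = np.zeros(image_shape, dtype=bool)
    canvas[y0:y1, x0:x1] = mask28[np.ix_(ys, xs)] > 0.5
    return canvas

m = paste_mask(np.ones((28, 28)), (10, 20, 110, 80), (240, 320))
print(m.sum())  # 100 x 60 = 6000 pixels inside the box are set
```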


Now we will look at methods that handle object detection as a regression problem.

1. YOLO (You Only Look Once): For YOLO, detection is a simple regression problem: it takes an input image and learns the class probabilities and bounding-box coordinates directly. Sounds simple? YOLO divides each image into an S x S grid, and for each grid cell it predicts N bounding boxes and their confidence scores. The confidence reflects the accuracy of the bounding box and whether it actually contains an object (regardless of class). YOLO also predicts a classification score for each box for every class seen in training; combining the confidence with the class scores gives the probability of each class being present in a predicted box. So, a total of S x S x N boxes are predicted. However, most of these boxes have low confidence scores, and if we set a threshold of, say, 30% confidence, we can remove most of them. Because the image passes through the CNN only once, YOLO is super fast and can run in real time. Another key difference is that YOLO sees the complete image at once, as opposed to only the generated region proposals in the previous methods, so this contextual information helps it avoid false positives. However, one limitation of YOLO is that it predicts only one class per grid cell, so it struggles with very small objects.
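The thresholding step can be sketched as follows, assuming predictions laid out as an S x S x N x (5 + C) tensor with (x, y, w, h, confidence, class scores) per box; the shapes and the 0.3 threshold are illustrative, not YOLO's exact post-processing (which also applies non-maximum suppression):

```python
import numpy as np

def yolo_filter(preds, conf_thresh=0.3):
    """Keep boxes whose confidence * best class probability clears the
    threshold; returns (row, col, box index, score, class index) tuples."""
    S, _, N, _ = preds.shape
    kept = []
    for i in range(S):
        for j in range(S):
            for n in range(N):
                conf = preds[i, j, n, 4]
                cls_probs = preds[i, j, n, 5:]
                score = conf * cls_probs.max()
                if score >= conf_thresh:
                    kept.append((i, j, n, float(score), int(cls_probs.argmax())))
    return kept

rng = np.random.default_rng(0)
preds = rng.random((7, 7, 2, 5 + 20)) * 0.2  # mostly low-confidence noise
preds[3, 4, 0, 4] = 0.9                      # one confident box...
preds[3, 4, 0, 5] = 0.8                      # ...with a strong class-0 score
print(yolo_filter(preds))  # only the confident box survives
```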



2. Single Shot Detector (SSD): SSD achieves a good balance between speed and accuracy. It runs a convolutional network on the input image only once and computes a feature map. A small 3×3 convolutional kernel is then run on this feature map to predict the bounding boxes and classification probabilities. SSD also uses anchor boxes at various aspect ratios, similar to Faster R-CNN, and learns offsets rather than the boxes themselves. To handle scale, SSD predicts bounding boxes after multiple convolutional layers; since each convolutional layer operates at a different scale, it can detect objects of various sizes.
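Learning offsets relative to anchors rather than absolute boxes can be sketched like this; the (tx, ty, tw, th) parameterisation shown is the common one shared by SSD and Faster R-CNN, and the numbers are made up:

```python
import numpy as np

def decode_offsets(anchors, offsets):
    """Turn predicted offsets (tx, ty, tw, th) and anchors (cx, cy, w, h)
    back into absolute boxes: shift the centre by a fraction of the anchor
    size, scale width/height exponentially."""
    cx = anchors[:, 0] + offsets[:, 0] * anchors[:, 2]
    cy = anchors[:, 1] + offsets[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(offsets[:, 2])
    h = anchors[:, 3] * np.exp(offsets[:, 3])
    return np.stack([cx, cy, w, h], axis=1)

anchors = np.array([[50.0, 50.0, 40.0, 20.0]])
offsets = np.array([[0.1, -0.2, 0.0, 0.0]])  # shift the centre, keep the size
print(decode_offsets(anchors, offsets))      # [[54. 46. 40. 20.]]
```

Predicting these small residuals is an easier regression target than raw coordinates, which is part of why anchor-based detectors train stably.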

Conclusion: Which one should you use?
Currently, Mask R-CNN is the choice if you are fanatic about accuracy numbers and need segmentation as well. If you don't need segmentation and only detection accuracy matters to you, the choice will be Faster R-CNN. If you want real-time object detection and accuracy is not too much of a concern, YOLO will be the best choice. Finally, if you want real-time speed with decent accuracy, SSD is a better recommendation.

Note: The post is not complete yet; I will write the remaining part in the coming days.


