February learnings – CNNs – My Research Work and Life Adventures

28-Feb-2021

Training Loss functions used for Image Segmentation: Just like any other machine learning task, there is a loss function used during the training phase for segmentation. There are different types of loss functions. Some are defined discussed below.
- Dice Loss: Like IoU, it measures how well the two sets overlap. See reference 1 for more details
- Focal Loss: Derived from Pixel-wise Cross Entropy loss. See reference 3 for more details.
- Pixel-wise cross entropy loss: Reference 1
CoordConv layer: See reference 2
Paper Reading:

References

An overview of semantic image segmentation. – This explains Semantic Segmentation in a simplified language. The explanation of Dice loss is really good.
(original paper) An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution
Demystifying Focal Loss I: A More Focused Cross Entropy Loss

21-Feb-2021

Object detection vs Segmentation: Object detection in computer vision is a task to detect various objects in a given image. The result is usually given in the form of a bounding box around the detected object, along with the object type and the confidence score. In the segmentation task, each pixel of an image is assigned a label. Segmentation essentially brings is object detection at the pixel level. The output of Segmentation is not a bounding box but a mask instead. See reference 4 and 3 below. Further, during the training of object detection and segmentation tasks, different types of annotations are used. The Object Detection uses bounding box annotations. The Segmentation, on the other hand, uses mask annotations.
Instance Segmentation vs Semantic Segmentation: Semantic Segmentation is segmenting various object types in an image. It results in a same mask for all the objects belonging to one class. Instance segmentation goes a step further where different masks are assigned to objects of the same class. For example, if there are 3 persons in an image. Object segmentation will assign the same mask to all three persons. The Instance Segmentation, however, assigns different masks to 3 different persons. See Figure 2 below.
Popular Object detection and Segmentation neural network models: There are several object detection and segmentation models that can be used either as is or to retrain using custom data. Here are some of the state-of the art deep learning models.
- Object Detection: R-CNN, Fast R-CNN, Faster R-CNN, YOLO
- Segmentation: Mask R-CNN (Segmentation with bounding boxes), SOLOv2 (segmentation without bounding boxes), YOLACT_EDGE (segmentation with bounding boxes)
How to label images for training for object detection? There are several bounding box labelling tools available. The tricky park is having the right kind of annotation format because different types of object detection models have a requirement for different types of annotation format. For example, Detectron2 implementation of Faster R-CNN uses COCO JSON format. Some labelling tools are listed below.
- labelImg: suitable for YOLO and PASCAL VOC
- YOLO_mark: suitable for YOLO
- COCO-annotator: as the name suggest, it labels in COCO format
- VoTT: Exports in various format, namely, Azure Custom Vision Service, Microsoft Cognitive Toolkit (CNTK), TensorFlow (Pascal VOC and TFRecords), VoTT (generic JSON schema), Comma Separated Values (CSV)
Different Annotation formats for object detection: Different Object Detection and Segmentation models use different types of formats for object annotations (labels).
- COCO JSON: A very comprehensive description of the annotation format can be found here. The annotations are in a JSON file. At a very high level, the bounding box in COCO format is <top left x, top left y, width, height>, where width and height are the dimensions of the bounding box. See image below. Also, the category IDs of various classes in the COCO format starts from 1 to num_of_categories.
- .txt for YOLO: It basically uses .txt file to define bounding boxes for objects in each image. Additionally, there is object.names file that contains a list of all the categories. The bounding box in YOLO format is <relative x position of the center of the box, relative y position of the center of the box , relative width of the box, relative height of the box>. See image below. The categories ID for various classes in the YOLO format starts from 0 to num_of_categories -1.

Bounding boxes in COCO format (left) and YOLO format (right)

Tools for converting from one annotation format to another:
- How To Convert YOLO Darknet TXT to COCO JSON: (first 1000 images for free)
- Convert yolo format to coco format: I tried using this, however, it did not work for me as intended. So , I modified the original code. The modified code can be found here. The conversion can be done as follows. See the reference image above.

width_coco = width_yolo * Image width

height_coco = height_yolo * Image height

x_coco = x_yolo* Image width – (width_coco /2)

y_coco = y_yolo* Image height – (height_coco /2)

Conversion from YOLO to COCO format (bounding box)

Fully Convolutional Networks (FCN): Introduced in 2015, these are based on VGG-16 architecture. The final dense layers in the VGG-16 architecture are replaced with by 1×1 convolutions.
U-Net: Introduced in 2015. Used for semantic (object) segmentation.
Feature Pyramid Network (FPN): Introduced in 2017 to extract generic features at multi scale (that is, detect objects of different sizes). See definition below to understand what the feature pyramids are. According to the original paper Reference 2 below, ” A topdown architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN).” See sub-figure d of Figure 1 below. The paper also mentions that FPNs can be used within the RPN which eliminates the need for multi-scale anchor box. A single sized anchor box suffices at each stage when FPNs are used. See the original paper for more details, reference 2 below.

Figure 1: This image is from the original paper (Reference 2)

Figure 2: This image is from Reference 1 and 2 (from the original paper)

Figure 2: Differences between Object detection and Segmentation. This is an image from Reference 4.

References

Review: FPN — Feature Pyramid Network (Object Detection)
Feature Pyramid Networks for Object Detection (original paper)
Deep Learning for Instance-level Object Understanding – By far, the most comprehensive and well explained resource for understanding where the current state-of-the-art for object detection/segmentation
What is the difference between semantic segmentation, object detection and instance segmentation?
COCO annotation format
Converting YOLO to COC annotation format
How does COCO annotation file handle missing annotations
https://www.immersivelimit.com/tutorials/create-coco-annotations-from-scratch
https://towardsdatascience.com/getting-started-with-coco-dataset-82def99fa0b8
How to create custom COCO data set for instance segmentation

14-Feb-2021

Region based CNN (R-CNN):
Fast R-CNN
RoI pooling:
Faster R-CNN: It is a deep learning based object detection framework introduced in 2015. This can be summarized as Fast R-CNN + RPN. The object detection occurs in two steps sequentially. In the first step, region proposals (RoIs) are found by a certain region proposal method. Then, in the second step, these region proposals are fed into a classifier to detect the type of object in the region proposal (that is, its class type for the classification task) . This networks introduces a concept of RPN and anchor boxes which replace the original region proposal method in Fast R-CNN. However, unlike the Fast R-CNN where the two steps were independent and did not share any convolutional layers, Faster RCNN offers a significant improvement because the newly introduced RPN network for finding the region proposal shares the convolution layer with the detection/classification step. Note: This network accepts an input of any size. See Figure 1 below for a high level layout of Faster R-CNN. It can be seen that the Region Proposal Network (RPN) and the classifier rely on the output of a common set of convolutional layers.
- Region Proposal Network (RPN): RPN is a technique that was first introduced in Faster R-CNN paper for finding Regions of Interest (ROIs) in an image – that is, it is a new type of region proposal method. It is a fully convolutional network. RPN takes an image as an input and outputs rectangular regions overlayed on an image. Those rectangular regions represent various region proposals (RoIs). To be precise,
  - Input: RPN does not directly work on an input image. It actually works on the feature map that is produced when the image goes through a set of convolutional layers. See Figure 1.
  - Output: The RPN network produces various region proposals in the form of rectangular boxes. Then for each of the rectangular box, it gives an objectness score and the location of the rectangular box on the feature map. The objectness score for a single rectangular box is vector of size 2. It contains the probability of the region being some object (foreground) and the probability of the region being a background). Additionally, the location of the rectangular box is represented by a vector of size 4 – it contains x, y, width, height. Further, the objectness score and location of the rectangular box is evaluated by two different branches of the RPN network. The objectness score is calculated by the classification layer. This classification layer has only two classes (object and no object/background). The location is calculated by the regression layer. See Figure 2 below.
  - The next question is how does RPN actually work? To calculate the output from the feature maps, RPN uses a concept of anchor boxes. In the Faster RCNN, the authors use 9 anchor boxes. These are simply 9 references boxes of different sizes and aspect ratios. Each location/ point in the feature map acts as an anchor point. The anchor point is found by convolving a feature map with a 3X3 filer.
- - RPN overlays 9 anchor boxes at each of the anchor points on the feature map. “These anchor boxes are centered at the point in the image which is corresponding to the anchor point of the feature map.” [1] So, by now we have a set 9 anchor boxes on every anchor point on the feature map. This gives a total of H x W x 9 anchor boxes for the entire feature map, where H and W represent the size of the feature map.
  - All of these anchor boxes are fed as inputs to two parallel convolutional layers – the classification and the regression layer. These are both 1X1 convolutional layers. The classification layer calculates the objectness score for each anchor box. The regression layer calculates the location/boundary of the anchor box.
    - How does the classification layer work? During the training phase, each anchor box is given a label based on the IoU with the ground truth bounding box. The labels are (1=object is present, meaning the IoU of this anchor box with the ground truth is higher, -1=no object is present, IoU with the ground truth is very low, 0= anchor does not fall under the above two labels.) The anchor boxes with label 0 are dropped and ignored in the training. After the labels are assigned, it creates a minibatch consisting of 256 randomly picked anchors from a single input image. These 256 anchor are split in the 1:1 ratio of anchors with labels 1 and anchors with label -1. The RPN is trained using backpropagation and stochastic gradient descent
    - How does the regression layer work? It works very similar to the classification layer. The only difference is that rather than using the labels, regression layer compares the location vector (x, y, width, height) with the indices of the ground truth bounding box during training. One another difference is that the anchor boxes with only (+1) labels are considered.
  - During the testing phase, we don’t have any ground truth data. Non Maximal Suppression is also applied as a Post Processing step to reduce the number of duplicate anchors. The anchor boxes after the NMS are the final of the RPN and are then fed into the second stage of Faster R-CNN where they are classified into appropriate classes such as dog, cat, person, etc.
  - References 1- 6 provide more details. Please feel free to check them out to understand RPN even better.
- Anchor box: An anchor box is simply a reference box that was first introduced in Faster R-NN paper and is it the basis of the RPN network. As mentioned above, typically more than one box is used at a time. In Faster R-CNN, the authors use 9 different types of anchor boxes. These different sized anchor boxes are used to detect objects at different scale. Using anchor box concept is must faster than other techniques for capturing object detection at multiple scales such as Image pyramids and Filter pyramids. See Reference 5 for more details (around 12 minutes into the video, the speaker explains the power of anchor boxes very well)
Image pyramids: It is a technique to address a common problem in object detection that can occur due to objects being of varying scales. In order to detect objects at multiple scales, the image pyramid technique resizes the image repeatedly. Then at each scale, features are extracted.
Filter pyramids: It is also a technique to detect objects at multiple scales. In this technique, sliding windows of different sizes are moved on an image to extract features.
Mask R-CNN: It is a deep learning algorithm for Image Segmentation that are first introduced in 2017. Mask R-CNN is based on the Faster R-CNN architecture. As seen above, the Faster R-CNN algorithm is used for object detection. It outputs bounding boxes for various objects detected in the image and their classification results. Mask R-CNN augments the Faster R-CNN network by adding another parallel branch at the output of the RPN network. This parallel branch is used for segmentation. This branch outputs a binary mask for all the elements in each region. There are N binary masks available for N classes, however, only the mask corresponding to a classes predicted by the classification branch (Faster R-CNN part) is considered for the final output.
- RoI align layer: In Mask R-CNN, RoI align layer replaces the Roi pooling layer of the Faster R-CNN. It uses bilinear interpolation instead of instead of discretization.
YOLO: It is deep learning based object detection framework first proposed in 2015. It is considered a single shot object detection unlike the Faster R-NN (or Fast R-CNN / R-CNN). This means that the object detection and classification occurs simultaneously. It is fast. For example, YOLOv3 can run at more than 170 frames per second on a modern GPU for an image of size of 256 x 256.
Methods to evaluate Object Detection Algorithms:
- Average Precision (AP) or mean Average Precision (mAP) if there is more than one class. This represents the area under the precision-recall curve. The Precision-Recall curve is drawn by finding precision and recall values at many different thresholds of confidence. AP is a number between 0 and 1. For more details, read Chapter 5: Object Detection Models of the book called Hands On Computer Vision with TensorFlow 2 [9]
  - Precision = TP/(TP+FP)
  - Recall = TP/ (TP+ FN)
- Intersection over Union (IoU): This is also a number between 0 and 1. It indicates how much two boxes overlap. When computing the average precision, we say the two boxes overlap if their IoU is above a certain threshold, such as AP@0.5 (meaning AP calculated at confidence threshold of 50%)
- 12 Metrics for evaluating COCO dataset: See Reference 12 and Figure 4 below.
Method to measure correlation between two vectors (Distance Measure): See Reference 8
- Pearson Coefficient
- Centered Cosine
Collaborative filtering: See Reference 11

**Figure 1: Image taken from Reference 1 below.**

**Figure 2: This image is taken from Reference 1 below.**

**Figure 3: This image is taken from Reference 10 below.**

References

7-Feb-2021

Non Maximum Suppression (NMS): It is a technique that is used in the object detection pipeline to reduce the number of duplicate detections of the same object. In traditional object detection where sliding window approach is used to detect and classify objects, multiple detections are reported for the same object because of how the sliding window algorithm works. This leads to higher false positives. Therefore, NMS is used as a post processing step to obtain final detections. (The goal is to get a single detection per object.) Similarly, for the deep learning based object detectors which are based on the Region Proposal Network (RPN), multiple proposals for a region in an image need to be down-selected before the final output is calculated. NMS is used to reduce the number of region proposals.
- How does NMS work? The key terms to understand before learning how this technique works are the Intersection of Union (IoU), confidence score, threshold. The confidence score is a score assigned by the underlying object detector for all the region proposals as a measure of they being some type of ‘object’. All these region proposals are fed into the NMS algorithm along with their confidence scores. Let’s call a set S which consists of all these proposals. Initially, a proposal with the highest confidence is chosen from set S and added to another set called T (which is originally empty). Its overlap with the remaining proposals in the set S is calculated in the form of IoU. If the IoU is greater than some user defined threshold, then that proposal is completely removed from set S (In other words, all proposals with higher overlap with this proposal that has higher confidence score, are discarded from any further processing in the object detection pipeline). This process is repeated until there are no more proposals left in set S. Finally, we get a set T that has a final candidates of region proposals. These are then processed further in the object detection pipeline where they are classified appropriately.
Soft-NMS: The problem with the traditional NMS is that if the threshold is set too high, it can cause elimination of true candidates. This is a problem especially when two or more similar objects are located next to each other. In these cases, we may end of reporting only one object after applying the NMS. This is where Soft-NMS comes into play. This technique differs from the traditional NMS in that it does not eliminate/discard any region proposals. Instead, all region proposals are assessed. For those proposals whose IoU is higher than some user defined threshold, their confidence score is lowered in proportion their IoU. See reference 1 for more details.
Fast NMS: The NMS and Soft-NMS are both sequential methods because of how the NMS is performed. That is, for each class in the dataset, all detection boxes are evaluated sequentially starting with the one with the highest confidence score. To begin with, all detection boxes are compared against the one with the highest score. The ones with an overlap exceeding some threshold are removed. Then the next highest score box is chosen and process is repeated until all detection boxes have been evaluated. In order to increase the computation time, Fast NMS proposes a new method of performing NMS. See reference 3 below.
- How does NMS work? It takes a relatively relaxed approach when removing the detection boxes. Firstly, (c X n X n) matrix is constructed where c is the number of classes in the dataset, n is the n detection boxes with highest scores for each class. Then, IoUs are calculated for each detection to get a matrix with all the IoUs for top n detections. The algorithm then takes a column wise maximum to find the max overlap with a particular detection. The resulting matrix containing the max IoUs is compared against some threshold. All those detections for which the IoU is greater than this threshold are discarded. This method is faster than the traditional NMS and Soft NMS as it computes all detections to be discarded in one step (in a matrix format). See Figure below for an example to understand the method further.

Matrix NMS: This approach is somewhat of a combination of Soft-NMS where detection scores of the ones with higher IoUs are lowered in proportion to their IoUs instead of removing those detections all together. Additionally, the way this method is implemented has some similarities with Fast NMS (matrix based computations). See reference 2 for more details.
- How does it actually work? The original paper (ref 2) explains the method in terms of masks (as opposed to detections which we have been focusing on so far), however, the terminology does not matter to understand the underlying approach of Matrix NMS. (Mask is a relevant term in Image segmentation. ) Let’s discuss the details of the method. Let a predicted mask to be suppressed be mask j. The suppression of mask j depends on two things. Firstly, the penalty caused by all other masks i that have confidence scores higher than mask j. This can be easily evaluated by calculating IoU(i,j) for all s(i) > s(j) where s stands for the confidence score. Secondly, the probability of mask i itself being suppressed. This makes sense because if there are masks with scores s(k) > s(i), then all those masks for which s(k) > s(i) would also indirectly impact the suppression of mask j since s(i) > s(j). The question is how to quantify the indirect affect of all those masks k for which s(k) > s(i). — we will explain this through an example shortly. So, after amount by which mask j should be suppressed is calculated by finding a decay factor (also explained later in the example below). Finally, the score of mask j is updated (lowered) based on the decay factor. This process is done for all masks j to be suppressed. Finally, for the final outcome, some threshold is chosen which is used to compare the final scores. The scores that are below this threshold are discarded from the final predictions. See steps below.
  - Implementation steps:
    - Create a (n X n) matrix of IoUs (See Figure 1), where n is represents the predictions with higher scores
    - Compute column wise Max to get the predictions with most overlap (as shown in Figure 1)
    - Compute the decay factor for all higher scoring predictions (decay factor is calculated based on equation 4 in the paper and is also explained with an example in Figure 2 below.
    - Compute column wise Min to select the decay factor for each prediction. Also explained in figure 2 for prediction j = 5.
    - Update the scores of each prediction based on their respective decay factors.
    - Finally, select some threshold t and select top m scoring masks as the final predictions.
  - Let’s study an example. Note, the goal is find out by how much mask j should be suppressed (that is, we want to find the decay factor for mask j). Consider the matrix X from figure 1. First, we construct this matrix of IoUs. Then,