Welcome to the second post on the landmark detection models. In this post, I will explain the models we used, present the results we got, and discuss the next steps for our Pepeketua (frog) identification project. For context, read the previous post, which covers the background of the project and how we prepared the data.

Archey’s frog on fern.
Photo by James Reardon.


Before I explain the models we used for the detection of frog landmarks, I will provide some information about the metrics we used to assess the model’s accuracy.

We used two metrics: accuracy of points within a radius of ground truth (APIR) and average distance of points outside the radius of the ground truth (ADPOR). Since a landmark is not defined down to a specific pixel, we want as many predictions as possible to fall within a certain radius of the ground truth, and we want to measure how inaccurate the points outside that radius are. These two metrics capture the most important information needed to evaluate our models.

APIR and ADPOR formulas.
Where N is the number of examples, Np is the number of points per example, Sp is the set of predicted points, Sg is the set of ground truth points, pij is the predicted point j in example i, gij is the ground truth point j in example i, Ωr is the set of all points predicted outside the radius r, and r is the radius.
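As a rough illustration of the two metrics, here is a minimal Python sketch built only from the definitions above. It assumes the radius is given in pixels, that a point exactly on the radius counts as inside, and that ADPOR averages the full distance to the ground truth (not the distance beyond the radius). The function name and the toy coordinates are hypothetical.

```python
import math

def apir_adpor(pred, truth, radius):
    """Return (APIR, ADPOR) for a given radius.

    pred, truth: lists of examples, each a list of (x, y) points
    in matching order. Both names are illustrative.
    """
    distances = [
        math.dist(p, g)
        for pred_pts, true_pts in zip(pred, truth)
        for p, g in zip(pred_pts, true_pts)
    ]
    # Points predicted outside the radius (the set called Omega_r above).
    outside = [d for d in distances if d > radius]
    apir = (len(distances) - len(outside)) / len(distances)
    adpor = sum(outside) / len(outside) if outside else 0.0
    return apir, adpor

# Two toy examples with two landmarks each (made-up coordinates).
truth = [[(10, 10), (50, 50)], [(20, 20), (80, 80)]]
pred = [[(10, 13), (50, 50)], [(20, 20), (80, 90)]]
apir, adpor = apir_adpor(pred, truth, radius=5)
print(apir, adpor)  # 0.75 10.0
```

Three of the four points fall within the radius, and the single point outside it is 10 pixels from its ground truth.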

We also used two radii to estimate the accuracy of the labels: a small radius, which corresponded to 1.9% of the image, and a big radius, which corresponded to 3.9% of the image. The idea behind the two radii is that the small radius is the target, within one standard deviation of double-labelled images, while the big radius is an acceptable upper bound.

Illustrations of the two radii we used to assess the model performance.
On the left, the radius of the small circles around the labels represents 1.9% of the image and on the right, the radius of the big circles around the labels represents 3.9% of the image.

Model V1 – Landmark model

The first model we used consisted of three blocks followed by one Dense layer. Each block consisted of two CNN layers and a Group Normalization layer, the first CNN layer with a stride of 1 and the second with a stride of 2. Both CNN layers in a block had the same number of filters, and each succeeding block had half as many filters as the previous one.

The architecture of model V1.

To limit GPU memory use, images were rescaled to 256×256 pixels, and a stride of 2 was used to downscale the feature maps between blocks. Even so, our hardware could only handle a batch size of 5.
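To make the downscaling concrete, the sketch below tracks the feature-map shape through three stride-2 blocks. The starting filter count of 64, the 'same' padding, and the flattening before the Dense layer are assumptions for illustration only; the post does not state them.

```python
import math

def block_output_shapes(image_size, first_filters, num_blocks=3, stride=2):
    """Spatial size and filter count after each block, assuming 'same' padding."""
    shapes = []
    size, filters = image_size, first_filters
    for _ in range(num_blocks):
        size = math.ceil(size / stride)  # the stride-2 layer halves each axis
        shapes.append((size, size, filters))
        filters //= 2  # each succeeding block has half the filters
    return shapes

# first_filters=64 is a hypothetical value, not taken from the model.
shapes = block_output_shapes(image_size=256, first_filters=64)
print(shapes)  # [(128, 128, 64), (64, 64, 32), (32, 32, 16)]

h, w, c = shapes[-1]
print(h * w * c)  # 16384 values would reach the Dense layer if flattened
```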

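We used Group Normalization rather than Batch Normalization as the regularizing layer because Group Normalization is better suited to small batch sizes: it computes its statistics over groups of channels within each example, so it does not depend on the batch size.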
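The following pure-Python sketch shows the core of Group Normalization on a single example, leaving out the learnable scale and shift. It is meant only to show that the mean and variance come from one example's channel groups, which is why a batch size of 5 is not a problem. The function name and sample values are hypothetical.

```python
import math

def group_norm(sample, num_groups, eps=1e-5):
    """Normalize one example per group of channels.

    sample: list of channels, each a flat list of activations.
    No other example in the batch is involved.
    """
    group_size = len(sample) // num_groups
    out = []
    for g in range(num_groups):
        group = sample[g * group_size:(g + 1) * group_size]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in channel] for channel in group])
    return out

# Four channels split into two groups (made-up activations).
sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
normalized = group_norm(sample, num_groups=2)
print(normalized[0], normalized[1])  # first group, centred on zero
```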

Each training session took about 48 hours, running for up to 250 epochs with Early Stopping. Because of the long training time, the hyperparameter search was slow and manual, and each change was chosen carefully for a high likelihood of improvement.

To account for the rotation of the frogs inside the images, we augmented the data with a rotation sampled uniformly from (-20, 20) degrees for each image in each batch. Interestingly, rotations above 20 degrees to either side worsened the model performance on the validation set.
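When an image is rotated for augmentation, its landmark labels have to be rotated by the same angle. The sketch below shows one way to sample the angle and transform the labels, assuming rotation about the image centre; the image itself would be rotated with an image library, which is not shown. The function names and landmark coordinates are hypothetical, and the visual direction of a positive angle depends on the image coordinate convention.

```python
import math
import random

def sample_rotation(max_deg=20.0, rng=random):
    """Draw a rotation angle uniformly from (-max_deg, max_deg) degrees."""
    return rng.uniform(-max_deg, max_deg)

def rotate_points(points, angle_deg, image_size):
    """Rotate (x, y) landmarks by angle_deg about the image centre."""
    cx = cy = (image_size - 1) / 2
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    rotated = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        rotated.append((cx + dx * cos_a - dy * sin_a,
                        cy + dx * sin_a + dy * cos_a))
    return rotated

rng = random.Random(0)  # seeded only to make the sketch repeatable
angle = sample_rotation(20.0, rng)
landmarks = [(100.0, 128.0), (150.0, 128.0)]  # made-up labels
augmented = rotate_points(landmarks, angle, image_size=256)
print(angle, augmented)
```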

While examining the predictions of the model in the validation and test sets, a clear pattern appeared: the model did not generalize well across the frog images. Images of frogs facing north, south, or east in image coordinates performed worse than images where the frogs were facing west.

[table id=2 /]

Model V1 results on validation and test sets, with different rotation augmentations while training.
There is a clear drop in performance when increasing rotations.

A simple fix to this lack of generalization would have been to increase the rotational augmentation, but after we increased the rotational augmentation the model did not improve, as seen in the table above.

The limited generalization of the model could be attributed to the fact that CNNs are relatively invariant under spatial translation but not under rotation. This means that the CNN layers did not find the rotation-invariant features needed to generalize well to the rotations in the data.

Another potential explanation is that allowing frogs to face any direction increased the variance in the dataset, making generalization much harder.

To evaluate the severity of the problem of the frogs facing different directions, we estimated the percentage of frogs facing different directions.

[table id=3 /]

Number and percentages of images facing each direction by dataset.

From the table above one can see that most of the images faced west, explaining why the model had a hard time generalizing to images of frogs facing different directions.

To address the lack of generalization of the model, we proposed a two-model approach: the first model finds the frog’s rotation, and the second detects the landmark points.

Model V2 – Rotation model

The rotation model for our two-model approach had a similar architecture to the landmark model V1. To achieve more stable results, the evaluation within each epoch averaged the results over three different augmentation instances of the validation set.

For the final evaluation of the validation and test sets, we used 10 different augmentation instances. These augmentation instances ensured a high level of generalization to different frog directions.

To reduce training time and hardware requirements, and since rotation classification did not need as much complexity as landmark detection, images were downscaled to 128×128 pixels. The rotation model’s metrics were similar to those used in the landmark detection models.

We needed a new loss function to account for the non-linear relation of the rotation variable: its values ranged from 0 to 360 degrees, but 350 is closer to 0 than 20 is.
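The wrap-around is easy to see with a small helper that measures the shortest distance between two angles. This is only an illustration of the problem, not the loss function we used; the function name is hypothetical.

```python
def angle_difference(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)

print(angle_difference(350, 0))  # 10.0
print(angle_difference(20, 0))   # 20.0
```

A plain absolute difference would instead report 350 for the first pair, which is what a naive regression loss on the raw angle would penalize.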

A straightforward method would have been to use sines and cosines, which handle the wrap-around of angles automatically. However, the derivative of such a loss function does not behave well: the derivative of sine is cosine, which is 90 degrees out of phase with sine. In other words, for a value difference of 0, the derivative is at its maximum and equals 1. This makes sine and cosine tricky to use, since it is hard to find a combination that gives correct derivatives and easy convergence.

Another problem is that neither function is convex, which creates a further computational obstacle. In view of these two problems, we ended up using a different approach. Our approach aimed to predict the norm vector from the vent to the snout and then evaluate the rotation of the vector found.

The “vent to snout vector” approach allows the loss function to handle angles correctly while still having an informative derivative. Yet this loss function had one drawback: if a very small vector is predicted, a very small change in either the x or y coordinate causes a relatively large change in the angle. To mitigate this drawback, we added another term, which strives to keep the vector size close to the ground truth. By keeping the vector size large, the angle of the vector stays stable under small variations in the x and y coordinates. The final loss function used is described below.
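The drawback with small vectors can be checked numerically. The sketch below converts a vector to an angle with atan2 and compares how far the angle moves when the same small perturbation is applied to a long and a short vector. The function names and the numbers are illustrative only.

```python
import math

def vector_angle(vx, vy):
    """Angle of a 2D vector in degrees, in [0, 360)."""
    return math.degrees(math.atan2(vy, vx)) % 360.0

def angle_shift(vx, vy, eps):
    """Angle change in degrees when the y coordinate is perturbed by eps."""
    diff = abs(vector_angle(vx, vy + eps) - vector_angle(vx, vy)) % 360.0
    return min(diff, 360.0 - diff)

long_shift = angle_shift(1.0, 0.0, 0.01)    # long vector: well under 1 degree
short_shift = angle_shift(0.01, 0.0, 0.01)  # short vector: about 45 degrees
print(long_shift, short_shift)
```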

Loss function we used to account for the non-linearity of rotation angles.
Where g is the ground truth vector, p is the predicted vector, and α is a weight that controls how much attention the loss function gives to the angle versus the size.
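For readers who want something concrete to experiment with, here is one possible loss with the same two ingredients: an angle term and a size term weighted by α. This is an assumed form written for illustration, not necessarily the formula in the figure. In particular, the cosine-similarity angle term, the squared size term, and the way α enters are all assumptions.

```python
import math

def vent_to_snout_loss(p, g, alpha=0.5, eps=1e-8):
    """Illustrative loss: angle term plus alpha-weighted size term.

    p: predicted (x, y) vector, g: ground truth (x, y) vector.
    The exact form is an assumption, not the loss from the figure.
    """
    p_norm = math.hypot(*p)
    g_norm = math.hypot(*g)
    cos_sim = (p[0] * g[0] + p[1] * g[1]) / (p_norm * g_norm + eps)
    angle_term = 1.0 - cos_sim          # near 0 when directions match
    size_term = (p_norm - g_norm) ** 2  # 0 when lengths match
    return angle_term + alpha * size_term

same = vent_to_snout_loss((1.0, 0.0), (1.0, 0.0))
opposite = vent_to_snout_loss((-1.0, 0.0), (1.0, 0.0))
short = vent_to_snout_loss((0.1, 0.0), (1.0, 0.0), alpha=1.0)
print(same, opposite, short)
```

In this sketch a prediction with the right direction but a very short length is still penalized through the size term, which is the behaviour the added term is meant to provide.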

While examining the results of the rotation model (table below), we found that the images whose predicted rotation differed from the ground truth by more than 40 degrees were mislabeled images: one image in the test set and four in the validation set. Within the 30-degree range, almost all images were classified correctly, so we could safely use a 30-degree rotation augmentation for the landmark detection model. In addition, the standard deviation was relatively small, showing the stability of the predictions.

[table id=4 /]

Rotation Model’s results on validation and test sets.

Model V2 – Landmark model

The second version of the landmark model, in our two-model approach, was more efficient and followed a different design from V1. Where the first version used a decreasing number of filters, the second used an increasing number. This reduced the computation required, since the number of times each filter is applied grows linearly with the image size along each axis, so it is cheaper to place most of the filters in the later, downscaled layers. We also reduced training time and hardware requirements by downscaling the images to 224×224 pixels.
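The effect of filter ordering on computation can be estimated with a back-of-the-envelope count of multiply-accumulate operations. The sketch below simplifies each block to a single 3×3 stride-2 convolution and uses hypothetical filter counts (64, 32, 16 versus 16, 32, 64) on the same input size, so it isolates the ordering effect and does not reproduce the real models.

```python
import math

def conv_macs(image_size, in_channels, filters_per_block, kernel=3, stride=2):
    """Multiply-accumulate count for a chain of stride-2 convolutions.

    Assumes 'same' padding and one convolution per block (a simplification).
    """
    total, size, channels = 0, image_size, in_channels
    for filters in filters_per_block:
        size = math.ceil(size / stride)
        # Each output position applies kernel*kernel*channels weights per filter.
        total += size * size * kernel * kernel * channels * filters
        channels = filters
    return total

decreasing = conv_macs(224, 3, [64, 32, 16])  # V1-style ordering
increasing = conv_macs(224, 3, [16, 32, 64])  # V2-style ordering
print(decreasing, increasing)  # 83091456 34320384
```

With these made-up numbers, the increasing ordering needs well under half the operations of the decreasing one.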

The architecture of model V2 landmark detection.

The second version of the model achieved a reduction in parameters thanks to the switch from normal CNN layers to Depthwise Separable Convolution layers.

Depthwise Separable Convolution layers consist of two convolutional layers: Depthwise convolution and Pointwise convolution. Depthwise convolution applies a separate convolution to each channel, and Pointwise convolution is a 1 by 1 convolution across all channels.

The separation reduces the number of parameters from
Df × Df × M × N to Df × Df × M + M × N, where Df is the filter size, M is the number of input channels, and N is the number of output channels, which equals the number of filters. This gives a reduction ratio of 1/N + 1/Df², which is roughly 1/Df² when N is large. The same ratio holds for the number of computations, with an added factor for the number of positions at which the kernel is applied.
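The saving is easy to verify by counting weights directly: the depthwise step has one Df × Df filter per input channel, and the pointwise step has one 1 × 1 weight per pair of input and output channels. The sketch below ignores biases, and the layer sizes are hypothetical.

```python
def standard_conv_params(df, m, n):
    """Weights in a standard convolution: Df x Df x M x N."""
    return df * df * m * n

def separable_conv_params(df, m, n):
    """Weights in a depthwise separable convolution."""
    depthwise = df * df * m  # one Df x Df filter per input channel
    pointwise = m * n        # 1 x 1 convolution mixing the channels
    return depthwise + pointwise

df, m, n = 3, 32, 64  # made-up layer sizes
std = standard_conv_params(df, m, n)   # 18432
sep = separable_conv_params(df, m, n)  # 288 + 2048 = 2336
ratio = sep / std                      # equals 1/n + 1/df**2
print(std, sep, round(ratio, 4))
```

For a 3 × 3 kernel the ratio is a little above 1/9 however many filters are used.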

We also tested a few backbone nets, including: MobileNetV2, MobileNetV3, ResNet50, ResNet101. However, none of the backbones was used in the final model as they did not improve the model’s performance.

To account for the rotation model’s misclassifications, we also used a (-30,30) degree rotation augmentation.

[table id=5 /]

Comparison between Model V1 and Model V2.

It was clear that Model V2 outperformed Model V1 on all metrics (table above and photo examples below). Model V2 achieved an almost perfect score in the 3.9% radius and significantly reduced the misclassification of images where frogs were not facing west.

Example of the frog landmarks predicted by V1 and V2 models.
On the left, frog landmarks predicted by model V1 and on the right, frog landmarks predicted by model V2 on the same photo.

In conclusion, the two-model approach managed to overcome the scarcity of images and the problem of frog orientation. We think that, as this landmark detection model shows, a two-step approach in which the first model corrects the image orientation holds great potential for problems that lack a large number of examples.

In all, this two-step approach achieved the necessary goals of the project and can be used by the upcoming Frog Identification and Size Identification models.

Find our code in:


I would like to thank the whole Wildlife.AI team for doing an amazing job, and especially Elad Carmon for his insightful comments, stimulating discussions and exceptional guidance throughout this project.
