In this post, we explain the different machine learning experiments we performed to create models to automatically identify fish in underwater videos.

An underwater image, from Kapiti Marine Reserve, labelled by the citizen scientists (black rectangles) and our fish identification model (white rectangles).

In the previous post…

If you haven’t already, check out our previous post to learn “how we built a machine learning pipeline for fish identification”. We used this pipeline to run the experiments described in this post.

Data (Update)

We used 1,800 underwater images to create the deep learning models. Each image was annotated multiple times by volunteers (citizen scientists) and their annotations were aggregated. Unfortunately, some of the aggregated annotations were incorrect: they included missed or incorrect classifications, over-classification (a single fish covered by several different boxes), and boxes wider than the fish.

It is important to note that some of the original videos were filmed at different resolutions, leading to images of different sizes.

Model Evaluation

Choosing a method to evaluate the model can be very hard: there are many options and plenty to focus on. To get a thorough understanding of our model’s performance, we used the following evaluation methods:

  • Average IOU (Intersection Over Union) with the correct labels.
  • Average misclassification – averaging the number of predictions identifying the wrong species per image.
  • Average misdetection – averaging the number of fish that don’t have a box around them in each image.
  • mAP (mean average precision) score.

To calculate some of these metrics, we needed to match each ground truth box to a predicted box. We chose to match each ground truth box to the predicted box with which it has the highest IOU. Note that two ground truth boxes can be matched to the same predicted box, which affects the scores. We found this method to be the most suitable for our comparison.
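The matching step and the per-image counts can be sketched in a few lines of Python. This is a minimal illustration, not the code from the pipeline: the `Box` type, the function names, and the choice to count a ground truth box with no overlapping prediction as a misdetection (with an IOU of 0) are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical box type: corner coordinates in pixels plus a species label.
Box = namedtuple("Box", ["x_min", "y_min", "x_max", "y_max", "species"])


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    inter_h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not overlap
    intersection = inter_w * inter_h
    area_a = (a.x_max - a.x_min) * (a.y_max - a.y_min)
    area_b = (b.x_max - b.x_min) * (b.y_max - b.y_min)
    return intersection / (area_a + area_b - intersection)


def match_boxes(ground_truth, predictions):
    """Pair each ground truth box with its highest-IOU prediction.

    Returns (ground_truth_box, prediction_or_None, iou) tuples.
    Several ground truth boxes may share the same prediction.
    """
    matches = []
    for gt in ground_truth:
        best, best_iou = None, 0.0
        for pred in predictions:
            score = iou(gt, pred)
            if score > best_iou:
                best, best_iou = pred, score
        matches.append((gt, best, best_iou))
    return matches


def image_metrics(ground_truth, predictions):
    """Per-image metrics under the assumptions stated above."""
    matches = match_boxes(ground_truth, predictions)
    n = len(matches)
    return {
        # Unmatched ground truth boxes contribute an IOU of 0 (assumption).
        "average_iou": sum(m[2] for m in matches) / n if n else 0.0,
        # Matched prediction names a different species than the label.
        "misclassified": sum(
            1 for gt, pred, _ in matches
            if pred is not None and pred.species != gt.species
        ),
        # Ground truth fish that no prediction overlaps at all.
        "misdetected": sum(1 for _, pred, _ in matches if pred is None),
    }
```

Averaging the three values returned by `image_metrics` over all test images would give dataset-level figures comparable in spirit to the ones reported in this post.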

Now we are ready to dive into the results!

Example of the style transfer technique used to increase the training data

One example of the style transformations we used.

ML Experiments

We used the loss and model predictions to compare different experiments, draw conclusions and plan how to move forward.

Experiment 1 – Augmentation and style transformation

After reading the article ‘STaDA: Style Transfer as Data Augmentation’, we wanted to test whether such transformations could help our model generalize better.

The article suggested using basic augmentations like horizontal flip as well as model-based augmentations like style transformation.

We chose three augmentation types:

  • Benchmark – for the benchmark experiment we used no augmentations.
  • Basic augmentations, with the probability of applying each one to an image – horizontal flip (0.5), vertical flip (0.5), and conversion from color to black and white (0.1).
  • Model-based: style transformation augmentations in addition to the basic augmentations.
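As an illustration of the basic augmentations, the sketch below applies each transformation independently with the probabilities listed above. It is an assumption-laden example rather than the pipeline's code: images are represented as nested lists of RGB tuples, boxes as `(x_min, y_min, x_max, y_max)` tuples in pixel coordinates, and all function names are hypothetical. The key point for object detection is that flipping an image also requires flipping its bounding boxes.

```python
import random


def hflip(image, boxes):
    """Mirror the image left to right and move the boxes accordingly."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    new_boxes = [(width - x2, y1, width - x1, y2) for x1, y1, x2, y2 in boxes]
    return flipped, new_boxes


def vflip(image, boxes):
    """Mirror the image top to bottom and move the boxes accordingly."""
    height = len(image)
    flipped = image[::-1]
    new_boxes = [(x1, height - y2, x2, height - y1) for x1, y1, x2, y2 in boxes]
    return flipped, new_boxes


def to_grayscale(image, boxes):
    """Replace each pixel by its channel mean; boxes are unchanged."""
    gray = [[(sum(pixel) // 3,) * 3 for pixel in row] for row in image]
    return gray, boxes


# Probabilities taken from the list above.
BASIC_AUGMENTATIONS = [(hflip, 0.5), (vflip, 0.5), (to_grayscale, 0.1)]


def augment(image, boxes, rng, augmentations=BASIC_AUGMENTATIONS):
    """Apply each augmentation independently with its own probability."""
    for transform, probability in augmentations:
        if rng.random() < probability:
            image, boxes = transform(image, boxes)
    return image, boxes
```

Passing a seeded `random.Random` instance as `rng` makes the sequence of augmentations reproducible between runs.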

For each augmentation type, we trained the model from scratch with the exact same configuration, changing only the augmentations applied.

Comparing the classifications of the four models on the same image.

Training loss and classification loss of the experiments, from Weights and Biases

We tested the models on the test dataset and got the following results:

Metric                      Benchmark   Basic Augmentations   Style Transfer
Average IOU                 0.33        0.437                 0.573
Average Misclassification   0.735       0.175                 0.225
Average Misdetection        0.15        0.125                 0.005

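The augmentations clearly helped the model generalize better: average IOU rose from 0.33 for the benchmark to 0.437 with basic augmentations and 0.573 with style transfer, and average misdetection dropped from 0.15 to 0.005. Misclassification also fell sharply from the benchmark’s 0.735, although basic augmentations alone (0.175) did slightly better on this metric than style transfer (0.225).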

Experiment 2 – Adding dropout, using one style transformation

After understanding and applying the results of the ‘augmentation and style’ experiment, we noticed that our model was having difficulties in predicting the species, hence we decided to focus on this area.

To increase the accuracy of species identification, we decided to add ‘dropout’ to our model, a method well known to prevent overfitting. In addition, since one of the style transfers drastically changed the images, we tested whether that transfer influenced the classification performance of our model. The results of this ‘One Style’ experiment showed only a very small difference in the overall performance of the model, so we decided to remove these changes and keep the model as it was before.
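For readers unfamiliar with dropout, the sketch below shows the common "inverted dropout" formulation on a plain list of activations. It is a generic illustration, not the layer used in our model: during training each activation is zeroed with probability `drop_probability` and the survivors are rescaled so the expected value stays the same, while at inference time the input passes through unchanged. The function name and signature are hypothetical.

```python
import random


def dropout(activations, drop_probability, rng, training=True):
    """Inverted dropout on a list of activations (illustrative only)."""
    if not 0.0 <= drop_probability < 1.0:
        raise ValueError("drop_probability must be in [0, 1)")
    if not training or drop_probability == 0.0:
        # At inference time dropout is the identity.
        return list(activations)
    keep_probability = 1.0 - drop_probability
    # Keep each value with probability keep_probability and rescale it
    # so the expected activation matches the value without dropout.
    return [
        a / keep_probability if rng.random() < keep_probability else 0.0
        for a in activations
    ]
```

Because a different random subset of activations is removed at every training step, the network cannot rely too heavily on any single unit, which is why the method tends to reduce overfitting.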

Classifier’s validation loss in this experiment.

Experiment 3 – Hyperparameter Tuning

Following the results from the previous experiments, we decided to search for the optimal hyperparameters using our full dataset. In the search, we tuned the following parameters: number of epochs, learning rate of the optimizer, weight decay of the optimizer, learning rate size of the scheduler, and gamma of the scheduler.
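One simple way to run such a search is random search, sketched below. This post does not state which search strategy or value ranges were used, so everything here is an assumption for illustration: the ranges in `SEARCH_SPACE`, the log-uniform sampling for the learning rate, weight decay and gamma, and the `validation_loss` callback, which stands in for a full training run that returns the total validation loss.

```python
import math
import random

# Hypothetical search ranges; the ranges actually searched are not stated.
SEARCH_SPACE = {
    "epochs": ("int", 50, 150),
    "learning_rate": ("log", 1e-6, 1e-3),
    "weight_decay": ("log", 1e-7, 1e-4),
    "learning_rate_size": ("int", 1, 10),
    "gamma": ("log", 1e-3, 1.0),
}


def sample_config(rng):
    """Draw one configuration from SEARCH_SPACE."""
    config = {}
    for name, (kind, low, high) in SEARCH_SPACE.items():
        if kind == "int":
            config[name] = rng.randint(low, high)
        else:
            # Log-uniform: every order of magnitude is equally likely.
            config[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
    return config


def random_search(validation_loss, n_trials, seed=0):
    """Return the sampled configuration with the lowest validation loss."""
    rng = random.Random(seed)
    best_config, best_loss = None, math.inf
    for _ in range(n_trials):
        config = sample_config(rng)
        loss = validation_loss(config)  # stands in for a full training run
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss
```

In practice each call to `validation_loss` is expensive, so the number of trials is usually limited by the available GPU time.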

Having access to better GPUs enabled us to increase the batch size from 8 to 16 images, which reduced the training time.

Influence of the hyperparameters on the total validation loss.

The chosen parameters were:


  • Epochs: 108
  • Gamma: 0.01004
  • Learning rate: 0.00002121
  • Learning rate size: 5
  • Weight decay: 0.000002852
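To give a feel for how gamma and the learning rate size interact, the sketch below assumes a step decay schedule, in which the learning rate is multiplied by gamma once every fixed number of epochs, and reads "learning rate size" as that number of epochs. Both are assumptions: the post does not name the scheduler, and the function below is a hypothetical helper rather than part of the training code.

```python
def step_decay_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate at a given epoch under an assumed step decay schedule."""
    if step_size < 1:
        raise ValueError("step_size must be at least 1")
    # The rate is multiplied by gamma once per completed block of epochs.
    return initial_lr * gamma ** (epoch // step_size)


# Chosen values from the list above, interpreted under the assumption
# that "learning rate size" is the step size of the schedule.
for epoch in (0, 5, 10):
    print(epoch, step_decay_lr(0.00002121, 0.01004, 5, epoch))
```

Under this reading, a gamma well below 1 shrinks the learning rate sharply at each step, so most of the learning would happen in the earliest epochs.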


The model’s performance on the test set was:


  • Average IOU over all test images: 0.603
  • Average misclassification over all test images: 0.227
  • Average misdetection over all test images: 0.06
  • mAP score over all test images: 0.198


During this project, we tried several methods to improve the model’s performance, such as ‘Style transfer’, ‘Image Augmentation’, ‘dropout’ and even changing the classification loss of the prediction. Some of these methods, namely the augmentations and the styled images, improved our model’s performance, which adds support for the effectiveness of such methods in image recognition.

Even with a small dataset, it is possible to train fit-for-purpose fish identification models by adding augmentations and styled images.

Our project, in combination with new features and ongoing development, enables more effective ways to monitor and manage the health of New Zealand marine ecosystems.

Next steps

 Over the next few months, our team will continue working with the New Zealand Department of Conservation to further evaluate the accuracy of our models. Specifically, the team will compare machine-learning-generated classifications with those reported by marine biologists.

Based on the accuracy of the models, the team will develop a “human in the loop” approach to classify new underwater footage. In this “human in the loop” approach, machine learning models, citizen scientists and marine biologists will efficiently work together to quickly process underwater footage of snapper, blue cod and scarlet wrasse inside and outside marine protected areas around the country.

Stay tuned for further developments!


This was an amazing opportunity for us to test and apply the material that we learned during our academic journey to real-life problems.

This project wouldn’t have been possible without the help of Eran Paz, who was always there to help us plan our experiments and analyze their results.

Special thanks to Victor Anton, who helped us from the beginning to understand the meaning of this project and helped us with the technical issues.

We want to thank Gal Hyams and the Hebrew University for presenting us with this opportunity and supporting us along the way.

We would like to thank the NeSI support team for their quick and professional responses to our questions and needs.

Find our code in:
