March 23, 2023

Understanding seals with AI

AI for Seals aimed to improve the SealNet facial recognition model for studying and monitoring marine mammals, enhancing the model's accuracy and streamlining data processing and development workflows.

Why identify seal faces?

Harbour seals are key regulators and indicators of marine ecosystem health, so accurate monitoring of their population and movement patterns is essential to understanding marine ecosystems. They are relatively easy to monitor via photographic analysis: large numbers of seals congregate out of the water at sites like rocky islets to thermoregulate and avoid predation, which makes them easily visible to researchers from afar. Computer vision techniques have the potential to provide researchers with new observation methods that support more systematic and scalable data collection.

How did we do it?

Our high-level Approach

The FruitPunch AI community split into three high-level teams to divide and conquer the challenge goals: one team focused on facial detection, another on facial recognition, and the third on building a GUI for the biologists to interact with the model.

The Face Detection Team

The first priority of the face detection team was to migrate the original dlib-based detection model to a more user-friendly Python and Jupyter notebook-based implementation to enable rapid model development. This also enabled us to leverage best-in-class open-source projects like Ultralytics YOLO and RetinaNet, and industry-standard tooling like Weights & Biases and Roboflow. Following this migration, the team was able to quickly experiment with different models and hyperparameters to build the best model we could in the limited time available.

The Face Recognition Team

The original SealNet application already had facial recognition implemented. The purpose of our team was to build on that work and improve the overall accuracy of the software. Our main aim was to predict the correct labels (individual seal names) of the seals. Given an image of a seal, the recognition app must be able to find the top 5 most similar seals that are already in the database. In addition to providing the prediction score (the similarity score), the app should let the end user inspect photographs of the seals that are deemed similar to the input image. The facial recognition team tested a wide range of frameworks and models. Why test so many? With no clear indication of the most suitable model up front, it was down to trial and error to find out which would perform best. Broadly speaking, two main groups of models were tested, simple and complex, as illustrated in the figure below.

Figure: the simple and complex groups of models tested

The team worked on three main steps in the seal recognition process:

  1. Clustering of the images
  2. Classification of the images
  3. Scoring the similarity of images

Each step used different algorithms. Clustering of the images utilized density-based spatial clustering of applications with noise (DBSCAN). Image classification made use of simple stochastic gradient descent models and more complex models such as VGG and EfficientNetB0. To determine the similarity of the images, image arrays were extracted and used with Support Vector Machine (SVM) models so that the similarity of any new image could be computed. Siamese networks were also tested to compute the seal image similarities. For the Siamese network, an ImageNet pre-trained ResNet50 model was used.
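To give a feel for the clustering step, here is a minimal sketch of DBSCAN in plain Python. It is not the team's implementation: the 2-D points stand in for image feature vectors, and the `eps` and `min_samples` values are made up for the example.

```python
import math

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points (including i itself) within eps of point i.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels

# Hypothetical feature vectors: two tight groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
```

The appeal for this kind of data is that the number of clusters does not have to be chosen up front, and images that fit no group are flagged as noise instead of being forced into one.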

The problem can be tackled in two ways. The first is classification in the closed-set setting: given an image of a seal face (the ‘probe seal’) and a gallery of images of known seals, determine which seal it is. The second is the open-set setting: given two seal images, determine the similarity score between them. The second task can be accomplished with the same model once its classifier head is removed.
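The difference between the two settings can be sketched in a few lines of Python. The embedding vectors, seal names, and function names here are hypothetical; in practice the vectors would come from the trained network without its classifier head, and the distance-to-score mapping is just one possible choice.

```python
import math

def identify(probe, gallery):
    """Closed-set: name the known seal whose embedding lies closest to the probe."""
    return min(gallery, key=lambda name: math.dist(probe, gallery[name]))

def similarity(embedding_a, embedding_b):
    """Open-set: score a pair of images; 1.0 means identical embeddings."""
    return 1.0 / (1.0 + math.dist(embedding_a, embedding_b))

# Made-up 3-dimensional embeddings; real ones have many more dimensions.
gallery = {
    "seal_a": [0.9, 0.1, 0.0],
    "seal_b": [0.1, 0.9, 0.2],
}
probe = [0.8, 0.2, 0.1]

best_match = identify(probe, gallery)
pair_score = similarity(probe, gallery["seal_b"])
```

Note that `identify` always returns one of the gallery names, even for a seal it has never seen, whereas `similarity` makes no such assumption and leaves the decision to whoever reads the score.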

Digging Deeper

The Face Detection Team

The face detection team trained multiple YOLO-based models, as well as an experimental RetinaNet model. On the YOLO front, the team experimented with several different versions (YOLOv5, YOLOv7, and YOLOv8), at different sizes (nano, small, and medium), and trained for different numbers of epochs (15, 30, and 45).

After many training runs, the team selected the model that performed best on the validation and test datasets: a YOLOv8s model trained for 45 epochs. Because we didn’t have many labeled images to train our model on (384 training images of harbour seal faces), we searched online for publicly available datasets of labeled seal faces with the hope that more training examples would increase the resilience and confidence of the model when predicting seal faces. On Roboflow Universe, we found two labeled datasets of fur seal faces: the first dataset contained 2.1k images of fur seal faces and the second was a smaller dataset with 66 labeled images of fur seal faces. At this point, we also relabeled the images of harbour seals collected by Colgate University in accordance with strict labeling guidelines to ensure the model had clear and consistent input to make its predictions.

We then decided to train three models based on our highest-performing model architecture, YOLOv8s trained for 45 epochs:

The first model (colgate-yolov8s-45-epochs) was trained only on the labeled dataset of harbour seal images collected by Colgate University. We didn’t apply any augmentations to the training dataset. 

The second model (unified-yolov8s-45-epochs) was trained on a larger dataset which included the Colgate University harbour seal images as well as the two public Roboflow datasets of labeled fur seal images. We also augmented the training images to increase the resilience and flexibility of the model. These data augmentations were applied randomly and tripled the number of training images available to the model. Specifically, we applied the following data augmentations:

  • Horizontal and vertical flipping
  • Added exposure between -25% and +25%
  • Added blur of up to 5px
  • Added noise of up to 2% of pixels
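As a rough illustration of what such augmentations do to a labeled detection image, here is a plain-Python sketch. It is not the tooling the team used: images are small grayscale grids, exposure is approximated as simple brightness scaling, blur is left out for brevity, and all function names are hypothetical. The point worth noticing is that a flip must also move the bounding box.

```python
import random

def flip_horizontal(image, boxes):
    """Mirror the image left-right and mirror each box's x-centre to match.

    `image` is a list of rows of grayscale values (0-255); `boxes` holds
    YOLO-style (x_center, y_center, width, height) tuples normalised to 0-1.
    """
    return [row[::-1] for row in image], [(1.0 - x, y, w, h) for x, y, w, h in boxes]

def flip_vertical(image, boxes):
    """Mirror the image top-bottom and mirror each box's y-centre to match."""
    return image[::-1], [(x, 1.0 - y, w, h) for x, y, w, h in boxes]

def adjust_exposure(image, rng, max_change=0.25):
    """Scale brightness by a random factor in [1 - max_change, 1 + max_change]."""
    factor = 1.0 + rng.uniform(-max_change, max_change)
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]

def add_noise(image, rng, fraction=0.02):
    """Overwrite up to `fraction` of the pixels with random black or white values."""
    noisy = [row[:] for row in image]
    height, width = len(noisy), len(noisy[0])
    for _ in range(int(fraction * (height * width))):
        noisy[rng.randrange(height)][rng.randrange(width)] = rng.choice((0, 255))
    return noisy

def augment(image, boxes, rng):
    """Apply the augmentations at random to one labelled image."""
    if rng.random() < 0.5:
        image, boxes = flip_horizontal(image, boxes)
    if rng.random() < 0.5:
        image, boxes = flip_vertical(image, boxes)
    return add_noise(adjust_exposure(image, rng), rng), boxes

rng = random.Random(0)
image = [[10 * (r + c) for c in range(10)] for r in range(10)]
boxes = [(0.25, 0.5, 0.2, 0.3)]
# Original plus two augmented copies: three training examples per source image.
dataset = [(image, boxes)] + [augment(image, boxes, rng) for _ in range(2)]
```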

The third model (colgate-v9-yolov8s-45-epochs) was trained only on the harbour seal face dataset collected by Colgate University but with added data augmentations. The data augmentations applied were the same as those applied to the unified-yolov8s-45-epochs model.

On the validation dataset, colgate-yolov8s-45-epochs reached the highest mAP50-95 score after 45 epochs (0.654). It also displayed the best F1-confidence curve, with an F1 score of 0.89 at a confidence level of 0.516.
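For readers less familiar with these metrics, the sketch below shows the pieces they are built from. The F1 score is the harmonic mean of precision and recall, and mAP50-95 averages the mean average precision over ten intersection-over-union (IoU) thresholds from 0.50 to 0.95. The curve values in the example are invented for illustration and are not the team's results.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

# A prediction counts as correct at a threshold t if its IoU with the label is >= t.
# mAP50-95 averages the resulting mean average precision over these ten thresholds.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def best_confidence(curve):
    """Pick the (confidence, precision, recall) point with the highest F1."""
    return max(curve, key=lambda point: f1_score(point[1], point[2]))

# Invented precision/recall values at three confidence cut-offs.
curve = [(0.25, 0.70, 0.95), (0.50, 0.88, 0.90), (0.75, 0.97, 0.60)]
confidence, precision, recall = best_confidence(curve)
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

A low confidence cut-off keeps more detections (high recall, lower precision) and a high one keeps fewer (high precision, lower recall), which is why the F1 curve peaks somewhere in between.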

Comparison of fitness functions of our YOLOv8s models

As you can see below, the F1 curve for colgate-v9-yolov8s-45-epochs shows that it consistently holds a higher F1 score at both lower and higher confidence levels.

Comparison of the  F1 curves for our YOLOv8s models

The final model improved detection performance by over 15%. We modernized the legacy dlib-based seal face detection pipeline by migrating it to YOLOv8 with a custom Weights & Biases integration for collaborative training (which we contributed back to the YOLO project) and by using Roboflow to support distributed dataset management, labeling, and enrichment. This reduced the time and complexity of future improvements by the open-source community from days to minutes and allows for the rapid evaluation and adoption of state-of-the-art visual detection models.

It initially surprised us that the models trained only on Colgate data came out ahead, as we expected that the model with the most training examples of seal faces (unified-yolov8s-45-epochs) would demonstrate the best performance. However, this was not the case: it underperformed all the models we trained purely on the labeled Colgate data. Interestingly (especially given what we will see later), the model trained on the augmented Colgate data scored marginally worse than the model trained on the non-augmented Colgate dataset.

The Face Recognition Team

The face recognition team tested many different models and techniques because we did not know in advance which would perform best. The criteria for choosing these models were:

  • Fast training and retraining of the model
  • Models must be simple to understand and implement
  • They must use few resources (CPU, RAM, model storage)
  • Good feature extraction capability
  • Availability of pretrained models

The breakdown and description of each model is as follows:


HOG + SGD

A key aspect of facial recognition is feature extraction and optimization. The feature extractor in this model was the histogram of oriented gradients (HOG), a feature descriptor that produces a simplified representation of the image containing only its most relevant information. Since only relevant features are kept, this reduces the training time and the size of the model. The optimization technique used in this model is stochastic gradient descent (SGD).

The workflow is as follows. In the training phase, images are converted to HOG images and fed to the SGD model. In the inference phase, a new image is presented and the SGD model predicts its label, along with a prediction score indicating how close the new image is to the images seen during training.
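The two stages of this workflow can be sketched in plain Python. This is a heavily simplified stand-in, not the team's code: the feature extractor computes a single orientation histogram for the whole image (full HOG uses many cells with block normalisation), the classifier is a two-class logistic regression fitted with SGD, and the toy images and function names are invented for the example.

```python
import math
import random

def orientation_histogram(image, bins=9):
    """Simplified HOG for one cell: histogram of gradient directions,
    weighted by gradient magnitude and normalised to sum to 1."""
    hist = [0.0] * bins
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            gx = image[r][c + 1] - image[r][c - 1]
            gy = image[r + 1][c] - image[r - 1][c]
            angle = math.degrees(math.atan2(gy, gx)) % 180  # unsigned orientation
            hist[int(angle / (180 / bins)) % bins] += math.hypot(gx, gy)
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def predict_score(weights, bias, feature):
    """Probability that the feature belongs to class 1."""
    z = bias + sum(w * x for w, x in zip(weights, feature))
    return 1.0 / (1.0 + math.exp(-z))

def train_sgd(features, labels, epochs=200, learning_rate=0.5, seed=0):
    """Fit logistic regression with SGD: one update per example, shuffled each epoch."""
    rng = random.Random(seed)
    weights, bias = [0.0] * len(features[0]), 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            error = predict_score(weights, bias, features[i]) - labels[i]
            weights = [w - learning_rate * error * x for w, x in zip(weights, features[i])]
            bias -= learning_rate * error
    return weights, bias

# Toy training images: class 0 has a vertical edge, class 1 a horizontal edge.
vertical = [[0] * 4 + [255] * 4 for _ in range(8)]
horizontal = [[0] * 8 for _ in range(4)] + [[255] * 8 for _ in range(4)]
features = [orientation_histogram(vertical), orientation_histogram(horizontal)]
weights, bias = train_sgd(features, labels=[0, 1])

# Inference on an unseen image: a vertical edge at a different position.
new_image = [[0] * 2 + [255] * 6 for _ in range(8)]
score = predict_score(weights, bias, orientation_histogram(new_image))
```

Because the histogram records edge directions and not pixel positions, the shifted edge in `new_image` produces the same feature as the training example, which is the kind of simplification the HOG step is meant to provide.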

Image Vector + SVM

This model utilizes an image processing algorithm in which the image is read as an array, flattened, and stored in a pandas data frame, along with the class labels of each image. This data is read by the SVM machine learning model. 

Support vector machines (SVMs) are machine learning algorithms that are designed to perform well with a limited amount of data. After being given a set of labeled training images for each category or class, an SVM model can recognize and categorize new, unseen images. An SVM can be advantageous as it uses memory efficiently. However, it sometimes requires a long training time.

The workflow for the Image Vector + SVM model is the same as HOG+SGD, with the only difference being that image arrays are utilized instead of HOG images.

VGG16 + Cosine

VGG16 is a deep-learning CNN model. In the training phase, a batch of pre-processed images (resized to 224 x 224) is sent to the downloaded VGG model for feature extraction. The extracted features are then sent to a cosine similarity module that computes the similarity between the images, and the computed similarities are stored in a pandas data frame. In the inference phase, a new image is sent to the “retrieve similar image” module. This module performs the same feature extraction as the training phase and then computes the similarity of the new image with the features computed in the training phase.
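The retrieval step can be sketched as follows, assuming the features have already been extracted. The 4-dimensional vectors are invented stand-ins for VGG16 outputs, and `retrieve_similar` is a hypothetical name, not the module from the SealNet code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve_similar(query, stored_features, top_k=5):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, feature))
              for name, feature in stored_features.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Made-up features for seals already in the database.
stored_features = {
    "seal_01": [1.0, 0.0, 0.0, 0.0],
    "seal_02": [0.9, 0.1, 0.0, 0.0],
    "seal_03": [0.0, 1.0, 0.0, 0.0],
    "seal_04": [0.0, 0.0, 1.0, 0.0],
    "seal_05": [0.5, 0.5, 0.0, 0.0],
    "seal_06": [0.0, 0.0, 0.0, 1.0],
    "seal_07": [0.7, 0.0, 0.3, 0.0],
}
query = [1.0, 0.05, 0.0, 0.0]  # features of the newly uploaded image
matches = retrieve_similar(query, stored_features)
```

Cosine similarity compares the direction of two vectors and ignores their length, so two images with the same pattern of features score highly even if one set of activations is uniformly stronger.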


EfficientNetB0

EfficientNetB0 is a CNN-based neural network. The most notable thing about the EfficientNet family is its efficiency: it uses fewer parameters and fewer computational resources. B0 is the smallest version of the EfficientNet family. We used a pre-trained EfficientNet as the backbone, then added a pooling layer and a linear layer on top of it. We froze the whole backbone except the 3 topmost layers.

Siamese Networks

A siamese network is a technique for training neural networks that directly learn distances. Here, ‘distance’ means how similar or dissimilar two images are. Our siamese network is implemented with ResNet50 and RegNet16 with contrastive loss. The main objective is to maximize the distance between images of different seals and minimize the distance between photos of the same individual. In this way, the final trained model can be used for the open-set problem.
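One common form of contrastive loss for a single pair of embeddings is sketched below. The exact formulation and margin used in the challenge are not stated here, so the margin of 1.0 and the function name are assumptions.

```python
import math

def contrastive_loss(embedding_a, embedding_b, same_seal, margin=1.0):
    """Contrastive loss for one pair of embeddings.

    Pairs of the same seal are penalised by their squared distance; pairs of
    different seals are penalised only while they are closer than `margin`.
    """
    distance = math.dist(embedding_a, embedding_b)
    if same_seal:
        return distance ** 2
    return max(0.0, margin - distance) ** 2

# Same seal, far apart: large loss pulls the embeddings together.
pull = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_seal=True)
# Different seals, too close: loss pushes them apart until they clear the margin.
push = contrastive_loss([0.0, 0.0], [0.5, 0.0], same_seal=False)
# Different seals, already far apart: nothing left to learn from this pair.
done = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_seal=False)
```

The margin keeps the network from spending effort on pairs of different seals that are already well separated.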

The best configuration so far is RegNet16 on 224 x 224 images of seals, trained for 51 epochs (approx. 45 minutes of training on a GPU). More parameter tuning could lead to better results, but as the other methods were performing better, we decided to put the siamese network aside.

Performance metrics per model

But, how does all this help biologists?

All this technical speak might sound complicated, so let’s break down what we provide to marine researchers. By implementing our models in the new SealNet app, we provide a fast and accurate seal face recognizer for researchers around the world. By simply uploading their images of seals, they can get insights into where these seals have been spotted in the past. This shows migration patterns and, on a higher level, even population dynamics. Based on these insights, researchers and conservationists can focus their efforts on the places where they have the biggest impact.

Interesting takeaways and learnings

As a result of training our seal face detection model on different datasets, we learned that investments in data quality did far more to improve accuracy and precision than efforts to increase the quantity of training data. For example, models trained on fewer images (384) with consistent and clear labeling that aligned with the desired model outcomes (matching the test dataset) performed better than models trained on more diverse external datasets (1.2k images) of seal faces that weren’t relabelled to adhere to our specific data guidelines.

When testing out the YOLOv8 model, the detection team noticed that it didn’t have Weights & Biases natively integrated for automatic model evaluation telemetry, unlike the YOLOv5 model we had previously been working with. We decided to extend the YOLOv8 model with a callback for Weights & Biases, which would give us the built-in model tracking and evaluation metrics (which we got used to in YOLOv5!). When doing this, we realized how quickly the machine learning OSS community moves: by the time we created the Pull Request, somebody else had already started to solve this issue. This experience also showed us how rewarding contributing to OSS libraries is, whether or not your contribution gets merged into the project. You can check out the PR here and the Gist that you can paste into your YOLOv8 Jupyter notebook to start using Weights & Biases tracking immediately.

Where can you find our work?

  • Here is where you can find our trained face detection models, their validation results, and their testing results.
  • In the same repository, you can find example notebooks on how to run these models as well as instructions on how to run them yourself.
  • Here is our Pull Request that integrates Weights & Biases into YOLOv8 as well as a GitHub Gist that the team used for model training and evaluation.

We want to give our special thanks to Krista Ingram and Ahmet Ay for their role as stakeholders in this Challenge!

And of course we want to thank all the amazing Challenge participants of AI for Seals: Aideen Fay, Dahiru Ibrahim Dahiru, Dali Ploegmakers, Dana Tran, Dani D, Daniel Kritzinger, Evan Guma, Graeme Harris, Jerry Zhu, Katerina Atallah-Yunes, Kevin Han, Lars Toonen, Matthew Shane Van den Berg, Mick van Deemter, Muhammad Hassan Maqsood, Nick Kang, Nicolás Arrieta, Larraza, Roshan Kotian, Timothy Malche, Traun Leyden, Viktor Domazetoski, Yastika Joshi, Aditya Rachman Putra, Chetaly Mawal, Jayanti Lahoti, Mathieu Duteil, Moses Rupenga, Lauren Horstmyer, Redzhep Mehmedov, Jaka Cikac
