
Pushing the Limits of Mobile Object Detection

Computer vision technology has recently become widely available, and object detection in particular has seen a surge in popularity. This is due to a wide range of practical use cases, from vehicle detection, pedestrian detection, and driverless cars to face detection and security systems, which have inspired both academia and industry to create faster and more accurate machine learning models.

Object detection refers to localizing the objects in an image and correctly identifying them. Some of the main challenges in this context are obtaining high accuracy and generalizing across different backgrounds and positions of the objects. Now imagine you want to do all this on mobile and in real time. On top of the previous issues, the new environment adds constraints on the speed and size of the model.

These are all important points we had to consider when we recently set out to create an Android application that keeps an inventory of items in a basket. The main requirements of such an app are that:

  • it has to be fast at inference time - ideally less than 100 ms
  • it has to be very accurate - in order to correctly track the inventory
  • everything has to work offline - no access to cloud servers

A sample training image

This article describes our thought process and learnings, from labeling data and choosing an object detection model, to training and deploying it in an Android app. The goal is to provide an understanding of all the steps involved in the creation of the app, especially our approaches to mitigating false positives and improving the accuracy of the model.

Labeling Data

Undeniably, the biggest pain of training an object detection model is labeling data, and we experienced this first-hand. For industry use cases, the common framework for training object detection models relies on transfer learning: a model is pre-trained on a huge dataset to recognize different common objects, and this knowledge is leveraged through fine-tuning on a smaller target dataset containing the objects relevant to the use case. The alternative would be to train a model from scratch on target data, which requires a huge dataset to be available. Transfer learning thus saves significant time and computational resources.

There are multiple tools available for labeling object detection datasets, such as LabelImg and LabelBox. We used LabelBox, which is open-source and offers a nice, engaging UI. Amazon also offers the Mechanical Turk service for crowdsourcing your labels, which is recommended if you have more than 1000 images (as we had fewer, we did the labeling ourselves, about 7 days of work for one person). To label the data, we defined the labels in the tool (for example “apple”, “banana”, etc.), uploaded the pictures, and started drawing rectangles around these previously defined objects. Once all the objects in the images are labeled, the tool offers three export formats for the annotations, including Pascal VOC XML, in which one XML file containing the coordinates of the bounding boxes is generated per image. The final dataset consists of the images and the XML files matching the names of the image files.
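To make the VOC XML format concrete, the sketch below parses one annotation into a filename and a list of labeled bounding boxes. The XML string is a hypothetical example of a single “apple” box; the field names follow the standard Pascal VOC layout.

```python
# Minimal sketch of reading one Pascal VOC annotation, as exported by the
# labeling tool. SAMPLE_VOC_XML is hypothetical example data, not a real
# annotation from our dataset.
import xml.etree.ElementTree as ET

SAMPLE_VOC_XML = """<annotation>
  <filename>basket_001.jpg</filename>
  <size><width>1280</width><height>720</height><depth>3</depth></size>
  <object>
    <name>apple</name>
    <bndbox>
      <xmin>310</xmin><ymin>150</ymin><xmax>455</xmax><ymax>290</ymax>
    </bndbox>
  </object>
</annotation>"""

def parse_voc(xml_text):
    """Return (filename, [(label, (xmin, ymin, xmax, ymax)), ...])."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bb = obj.find("bndbox")
        coords = tuple(
            int(bb.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((label, coords))
    return filename, boxes

filename, boxes = parse_voc(SAMPLE_VOC_XML)
print(filename, boxes)
```

In practice you would loop this over the exported XML directory, matching each file to its image by name, as described above.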

Labeling in the tool

There are several rules of thumb when it comes to labeling data for object detection, but nothing is set in stone. Most resources and GitHub issues on how to label data will say that it depends on the case and is a trial-and-error process. For the creation of our training dataset we aimed for between 1500 and 2000 labels per object, with the objects in all positions (vertical, horizontal, sideways, stacked, etc.). In general, it is recommended to have between 500 and 1000 labels per object, but because of the many different views and angles of our objects, as well as their high similarity, we wanted to make sure the model learned the shapes and the discriminative features between the objects, so we labeled double that amount. This adds up to approximately 600 images, split into 400 training images and 200 test images.

One dilemma we faced was how to label objects that overlap, with only a small part of them visible. For these situations we defined the following rule of thumb: since the goal of the model is to learn the shape of an object, imagine the object in isolation and ask whether you could still identify it without subjectively “inferring” from the context. If the object makes sense in isolation, label it; otherwise, ignore it. A related rule we used is that when two objects are close to each other or even overlapping, we labeled only the visible parts (if they make sense in isolation) rather than where the object should be. Another idea to consider (which we didn’t have time to experiment with ourselves) is to take pictures of the objects against varying backgrounds. We took images of the objects in the context in which they will be found in production, but photographing them in different environments with different backgrounds would likely help the model generalize better.
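When boxes overlap, it can also help to quantify how much. The standard measure in object detection is intersection over union (IoU); the helper below is a generic sketch of it, not code from the app itself.

```python
# Intersection over union (IoU) of two axis-aligned boxes, the standard
# object-detection overlap measure. Boxes are (xmin, ymin, xmax, ymax),
# matching the VOC coordinate convention.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # partial overlap
```

The same measure is what evaluation metrics like mAP use to decide whether a predicted box counts as a match for a labeled one.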

Object Detection Models

When it comes to object detection models, there are several computer vision architectures which are popular and come up again and again in discussions on this topic. The current state of the art includes models like YOLOv3, SSD, Fast R-CNN, and Faster R-CNN. It’s all fun and games until you reach the accuracy-speed trade-off: the more accurate your model is, the slower it will be at inference time. Depending on the use case, you will have to pick what is more valuable to your end user.
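With a hard budget like 100 ms, the speed side of the trade-off has to be measured, not guessed. The sketch below shows the shape of such a measurement; `run_model` is a hypothetical stand-in for a real detector's inference call (on device this would be, e.g., a TensorFlow Lite interpreter invocation).

```python
# Measuring inference latency against a time budget. run_model is a
# hypothetical stand-in doing trivial busy work instead of running a
# real network; only the timing harness is the point here.
import time

def run_model(image):
    return sum(image) % 255  # stand-in "inference"

def median_latency_ms(model_fn, image, warmup=3, runs=11):
    """Time model_fn over several runs and report the median in ms."""
    for _ in range(warmup):  # warm-up runs are discarded
        model_fn(image)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        model_fn(image)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return timings[len(timings) // 2]

fake_image = list(range(1000))
latency = median_latency_ms(run_model, fake_image)
print(f"median latency: {latency:.3f} ms")
```

The median over several runs, after warm-up, is more robust than a single timing, since the first invocations often pay one-off initialization costs.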


  • Andrada Pumnea
    Data Scientist
    Jaakko Kangasharju
    Service Technologist