Dan Davis

My 827K Dog Image Dataset

I had a client approach me with an interesting problem.

The client ran a dog daycare/boarding facility, and the SaaS platform they used for managing customers and scheduling had an interesting image upload and tagging feature. They would take a bunch of photos of the dogs each day, upload them to the platform, and then manually tag the individual dogs in the photos by selecting the correct dog(s) from a drop-down menu next to each photo. Once all photos were tagged they could use the platform to send each customer an email or text message containing a link to the photos their dog was tagged in, along with any notes from the daycare staff that day. The customers loved this feature and came to expect it.

However, as the business grew from 20-30 dogs per day to 100-150, manually tagging all of those photos (2,000+ per day) was becoming too costly. The client estimated that on a typical weekday they were spending 6-8 labor-hours just tagging photos, and they wanted to know if there was any way to automate the process. I knew full automation was impossible, but I thought I could cut the time to under an hour per day with a semi-automated system.

The Dataset

I knew I was going to use an ML/CV model to assign labels/tags to the images, but that would require a large labeled dataset to train on. Luckily the client had already been manually labeling images for 2 years, and I could get at that data easily.

Their SaaS platform created a profile for every dog, and on the dog's profile page there were links to every image (hosted on Google Cloud Storage) that the dog had been tagged in. Just from counting those links I could see that I would be dealing with hundreds of thousands of images.

Although it's possible to store millions of files in a single directory, doing so breaks most command line tools like ls. The obvious solution is to make a directory for each dog, but since multiple dogs can be tagged in the same image that would mean storing duplicates, which I wanted to avoid. Google Cloud Storage already prepended a hash to each image file, so I just used that

https://storage.googleapis.com/gingr-app-user-uploads//2021/08/03/1fb96889-201d-4a26-ae1d-bb4121b04a47-IMG_9086.jpeg

to create subdirectories based on the first 2 characters of the hash. The image above would be stored at:

1/f/1fb96889-201d-4a26-ae1d-bb4121b04a47-IMG_9086.jpeg

Nesting 2 levels deep gives 256 directories (16*16), so I would have to download 2.56 million images (16*16*10000) before averaging 10K images per directory.
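The URL-to-path mapping is simple enough to sketch; the `local_path` helper and the `images/` root directory here are my own naming, not from the original system:

```python
import os
import pathlib
from urllib.parse import urlparse

def local_path(url: str, root: str = "images") -> pathlib.Path:
    """Map an image URL to a sharded local path using the first two
    characters of the hash Google prepends to the uploaded filename."""
    name = os.path.basename(urlparse(url).path)
    return pathlib.Path(root, name[0], name[1], name)
```

Running it on the example URL above yields `images/1/f/1fb96889-201d-4a26-ae1d-bb4121b04a47-IMG_9086.jpeg`.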

I decided to store each dog's metadata and the label information in a Postgres database using the following schema:

create table image (
    image_id int generated always as identity primary key,
    url text unique not null,
    date_taken date not null  -- it's in the url and is nice to have
);
create index image_date_taken_index on image(date_taken);

create table dog (
    dog_id int primary key, -- Use same id as gingr uses
    first_name text not null,
    last_name text not null,
    breed text not null,
    birthday date not null
);

/* Intersection table because of Many-to-Many relationship between dogs
   and images. A dog is tagged in many images and an image can be tagged
   with many dogs. Also store a flag for whether or not the tag is correct
   so I can correct mis-tagged images and have an idea of how well the
   human taggers performed as a benchmark. */
create table dog_image (
    image_id int,
    foreign key (image_id) references image(image_id)
        on delete cascade,
    dog_id int,
    foreign key (dog_id) references dog(dog_id)
        on delete cascade,
    correct boolean default True,
    primary key (image_id, dog_id)
);
create index correct_flag_index on dog_image(correct);

Most of the images are iPhone quality (4032x3024), which is much larger than I needed, so I used PIL's reduce function to shrink them by a factor of 4 (or 2 if both dimensions were under 2000px) before saving to disk. I could always re-download the full-size originals if needed. Downloading everything took about 2 days and came out to about 90GB of images.
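The downscaling step is a one-liner with Pillow's Image.reduce; a sketch, with the cutoff logic reconstructed from the description above:

```python
from PIL import Image

def shrink(img: Image.Image) -> Image.Image:
    """Reduce a photo 4x (full-size iPhone shots) or 2x (anything
    already smaller than 2000x2000) before saving to disk."""
    factor = 2 if img.width < 2000 and img.height < 2000 else 4
    return img.reduce(factor)
```

Image.reduce does a box-filter downscale by an integer factor, which is cheap and good enough for training data.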

Analysis

Totals

| Images | Tags   | Total Dogs | Dogs w/ > 50 images |
|--------|--------|------------|---------------------|
| 827017 | 977467 | 1499       | 1201                |

Images/Dog

| Min | Max  | Mean | Median |
|-----|------|------|--------|
| 1   | 8946 | 652  | 258    |

Almost 9K images for the top dog! Turns out that dog had been going to daycare 6 days/week for almost 2 years.

Also people really like doodles (poodle mix). Almost 20% of the dogs are some form of doodle.

select count(*) as doodles from dog where breed like '%doodle%';

doodles
-------
278

The top 20 percent of dogs ranked by number of images per dog contains ~81% of all of the images. Pretty cool that the 80-20 principle holds here too.

with top20percent as (
    select
      dog_id,
      count(image_id) as images
    from dog_image
    group by dog_id
    order by images desc
    limit 300  -- ~20% of the 1499 dogs
) select round(
    (select sum(images) from top20percent) /
    (select count(*) * 1.0 from image), 2) as top20;

top20
-----
0.81

The problem is that 80% of the dogs don't have very many images. I was confident that I could train a model with solid performance given 9K examples per dog, but I worried that the model would struggle on dogs with only a few examples. Looking through the data, I also saw some other potential problems that I'd need to deal with.

Designing the System

Disclaimer: I'm a backend guy with only limited data science experience. If I could have paid someone smarter than me to come up with a better system, I would have.

I personally subscribe to Andrew Ng's data-centric approach (data quality is more important than model architecture), so I knew I would just use an off-the-shelf pretrained model. ResNets and YOLO models are good enough for just about any vision task, and dogs are among the classes they are pretrained on, which was a plus. Since I didn't have a dataset with bounding box labels, and I didn't want to draw bounding boxes on 100K+ images myself, that left just the ResNet models.

By pulling the schedule information from the SaaS platform I could get an almost accurate (walk-ins complicated things) list of the dogs that would be present today and tomorrow. I could leverage this to eliminate most of the possible tags (don't tag a dog if it's not there). There was also the problem of new dogs constantly being introduced (1-2/week) which meant that one big model trained on all possible dogs would quickly become out of date.

The best solution I could come up with was the binary relevance method: train a separate binary classifier for each dog, essentially asking "Is this dog in the image?" for every dog, for every image. This proved to work fairly well and offered the flexibility I needed to handle new dogs and walk-ins.

The large gap between the dogs with thousands of examples and those with only a few dozen meant that I couldn't use the same method for constructing the training sets for every dog. With thousands of positive examples it was easy to just throw in 10K-20K randomly chosen negative examples and get really good performance. When I only had a few dozen examples though I had to carefully select my negative examples and rely on oversampling and extra augmentations just to get acceptable performance. I ended up finding some heuristics that worked well in most cases, but it was never as robust as I would have liked.
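To make that asymmetry concrete, here's a sketch of how a per-dog training set might be assembled. The function name, the 10K negative target, and the 500-positive floor are illustrative values, not the exact heuristics I used:

```python
import random

def build_training_set(positives, negative_pool,
                       n_negatives=10_000, min_positives=500):
    """Assemble (image_path, label) pairs for one dog's binary
    classifier. For rare dogs, oversample the positives so the class
    imbalance stays manageable; augmentation then keeps the repeats
    from being literal duplicates at training time."""
    negatives = random.sample(negative_pool,
                              min(n_negatives, len(negative_pool)))
    pos = list(positives)
    # Oversample until the positive class reaches the floor (or the
    # size of the negative set, whichever is smaller).
    while len(pos) < min(min_positives, len(negatives)):
        pos.append(random.choice(positives))
    dataset = [(p, 1) for p in pos] + [(n, 0) for n in negatives]
    random.shuffle(dataset)
    return dataset
```

For a dog with thousands of examples the oversampling loop never runs; for a dog with a few dozen it does most of the work, which is exactly why those models were the fragile ones.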

The system I ended up with looked roughly like this:

CUTOFF = 8PM  # after this time, switch from today's schedule to tomorrow's
UNTAGGED_IMAGE_THRESHOLD = 64  # wait until we have at least a full batch

while True:
    schedule = poll_schedule(CUTOFF)  # returns today's or tomorrow's dogs
    for dog in schedule:
        if dog does not have a trained model:
            train_model(dog)
    untagged_images = get_untagged_images()
    if len(untagged_images) >= UNTAGGED_IMAGE_THRESHOLD:
        predictions = []
        for dog in schedule:
            model = get_trained_model(dog)
            predictions.append(model.predict(untagged_images))
        aggregate_predictions(predictions)
        POST predictions to SaaS platform
    sleep(5 minutes)

Now I needed to think about the confusion matrix of possible predictions and the costs of incorrect predictions.

A false positive is relatively easy to correct using the existing platform tools: just delete the tag. A false negative has no remedy, so I'd rather err on the side of being slightly oversensitive (higher recall) than overly specific (higher precision).

So basically the idea was to push all of the tagging labor into the review stage with the hope that after the client's staff eliminated the false positives there would be enough true positives remaining that they wouldn't have to do manual tagging.
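In code terms, the aggregation step might look like this. The 0.35 threshold is illustrative; the point is that it sits below the usual 0.5 to trade precision for recall:

```python
def aggregate_predictions(per_dog_probs, threshold=0.35):
    """per_dog_probs maps dog_id -> {image_id: probability}.
    Returns image_id -> list of dog_ids to tag. A deliberately low
    threshold means more false positives (cheap to delete in review)
    and fewer false negatives (impossible to recover)."""
    tags = {}
    for dog_id, probs in per_dog_probs.items():
        for image_id, p in probs.items():
            if p >= threshold:
                tags.setdefault(image_id, []).append(dog_id)
    return tags
```

Everything at or above the threshold gets posted as a proposed tag, and the review stage handles the rest.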

Did it Work?

Kind of. It wasn't a fully automated system, but it did significantly cut the labor-hours spent tagging, from 6-8hrs to ~1hr. The system was fragile, though. If the schedule was not accurate (and it almost never was), performance would drop. If the human reviewers were not perfect, the predictions would become progressively worse in a bad-data feedback loop. In addition, some of you reading this are probably thinking it's insane to train 100-150 models every single day, and you'd be correct! Training in the cloud was out of the question cost-wise, so my GPU server would be thrashing every night for 4-8hrs, which is obviously not ideal. Ultimately the maintenance burden was too high for the payout to justify continuing, and I had to abandon the project after a few months.