Image credit: BYU Photo.

Comparing the Top Five Computer Vision APIs

Gaurav Oberoi
Jul 11, 2016 · 10 min read


Over the last six months, Google, Microsoft, and IBM have all announced a suite of “intelligent APIs” that offer various types of image, video, speech, and text recognition. One can, for instance, pass in a photograph of a day at the park, and receive tags that tell you it includes a dog, frisbee, and trees. How good are these out-of-the-box models for general-purpose tasks, and can I use them to build interesting new products or features?

In this post, I compare the image labeling capabilities of five vendors, with observations and suggestions for product builders. I also provide an open-source tool, Cloudy Vision, so you can test your own images and pick the best vendor for you (spoiler: Google was not the clear winner for my corpus). Finally, I share labeling results for 40 example images.

Note that my aim is to get a qualitative feel for these APIs; this is not a rigorous accuracy study. I also focus on image labeling, not face detection or OCR [1].

Approach

I started by taking a few photos and running them through the web-based testing tools provided by some vendors. My plan was to manually capture results in a spreadsheet. Alas, this process was so tedious that I found myself fretting over which small set of images I should try out. It was also incomplete, because not all vendors have such testing tools (ahem, Google).

I quickly realized that to see side-by-side comparisons of lots of images, my best bet would be to build a little tool. I wrote a Python script to iterate over a directory of images, call the various vendor APIs, cache the responses, and render a web page to see the bits I’m interested in.

Cloudy Vision is an open source tool to generate results like this for your set of images.

The resulting tool, Cloudy Vision, presents image labeling results from Microsoft, Google, IBM, Clarifai, and Cloud Sight, but is easy to extend to support more vendors (please send me a pull request). If you have a corpus of images and want to explore labeling, this is a good starting point for qualitative assessment, as well as for more rigorous accuracy testing (e.g., compare computed labels with your own training set).
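To give a feel for the structure, here is a minimal sketch of the kind of loop such a tool runs. It is a simplification, not Cloudy Vision’s actual code: the vendor-specific REST call is hidden behind a hypothetical vendor.label() method, and responses are cached as JSON files next to the images so reruns are cheap.

    import json
    from pathlib import Path

    def label_images(image_dir, vendors):
        """Run every image in a directory through each vendor's labeling API."""
        results = {}
        for image_path in sorted(Path(image_dir).glob("*.jpg")):
            results[image_path.name] = {}
            for vendor in vendors:
                # Cache responses on disk so repeated runs don't re-bill you.
                cache_file = image_path.with_suffix(f".{vendor.name}.json")
                if cache_file.exists():
                    response = json.loads(cache_file.read_text())
                else:
                    # Hypothetical interface: wraps the vendor's REST endpoint.
                    response = vendor.label(image_path.read_bytes())
                    cache_file.write_text(json.dumps(response))
                results[image_path.name][vendor.name] = response
        return results  # feed this into an HTML template for side-by-side viewing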

Observations

I ran about forty images through five vendors. I recommend viewing the results page now to form your own impressions before going further. Here are some of the things I noticed:

  • Most images are labeled with a correct high level category.
    The results were good enough to understand the gist of an image, e.g., is it food, a room, a musical instrument, or an animal? That’s pretty impressive considering the wide range of things you can throw at it.
All vendors figured out that this is a bike. Huzzah!
  • Labels with more specificity are not as reliable.
    If you want details, such as what kind of food, or exactly which animal is in the photo, the results have huge variance. Some are impressive, like recognizing my tablas (classical Indian drums), but many images don’t get more specific labels.
Most vendors recognized drums, but Google knew them to be tablas. Groovy.
  • Outright errors are rare, but do happen.
    In most cases, when the algorithm is confused, it tends to produce labels with low confidence scores, instead of just wrong guesses — so you can safely ignore these. But even in my corpus, 5–10% of images were just incorrectly labeled, sometimes with hilarious results.
Almost no vendor recognized these as binoculars. Hmm…
  • There is significant variance between vendors.
    Because of its public stance and recent achievements in AI, I went into this expecting Google’s results to be superior to other vendors’, but that’s not necessarily true. In my small corpus, I would rate Clarifai at the top, followed by Google and Microsoft at approximately the same quality, and then IBM, well below them. Note that Clarifai is a private company solely focused on providing image recognition APIs, and it has been at this since 2014, well before the giants entered this space.
Clarifai does well, as do Google and Microsoft, while IBM is only OK. Cloud Sight?! Well…
  • But one vendor, Cloud Sight, seems too good to be true…
    You may have noticed in the examples above that Cloud Sight’s captions are amazingly good (e.g., for the image below, Cloud Sight says “white black and brown long coated dog catching a yellow freesbie on a green grass field during daytime”). If you visit their site, you’ll notice that they talk about their “image recognition technology”, so you might presume that it’s all driven by an impressive machine learning model. However, it appears that they use a lot more human tagging than their marketing lets on.
    For one, the latency for most vendor APIs is sub-second, whereas Cloud Sight’s docs ask you to wait 6–15 seconds (and in practice I saw something like 5–30 seconds). Second, I found this buried deep in their privacy docs: “Our service employs a proprietary technology that utilizes both computer vision and crowdsourcing”. Third, some more searching turned up this Reddit thread on how they work, and this comment from a human tagger. And finally, I sent in the same image with different filenames and got different captions (the second time I got “white brown rough collie”), suggesting that human labeling is involved (tip for Cloud Sight: consider using a content hash like MD5 to detect duplicates, not just the filename). Oh, and the first label had a spelling error… humans!
    So while Cloud Sight’s results are truly impressive, I’m not sure it’s fair to compare them against these other vendors from a purely machine driven vision point of view. Furthermore, their cost is significantly higher at scale, supporting the notion that there’s more manual tagging going on behind the scenes than they let on.
Microsoft’s caption is awesome, but Cloud Sight’s is too good to be true. Likely, it was labeled by a human.
I sent the same image, but with a different filename, to all vendors. All returned the same tags except Cloud Sight. Human labeling?
  • Incorrect rotation can affect accuracy.
    I passed in a rotated version of the collie image shown above, and lo and behold, I got different results. That’s important to be aware of, especially because rotation information in JPEGs is stored in EXIF data and can be handled in non-standard ways by different applications (gory details here; I used a nifty tool, jhead, to fix rotation issues). One way to normalize orientation in code is shown in the sketch after this list.
Beware rotation! Note that the labels are different from the correct orientation, and generally poorer in quality.
  • Results are better if you can zoom in to areas of interest.
    I wanted to see if I could get my security camera to tell me that there’s a package at the door. It didn’t work well until I cropped the image to the region I was interested in. If your use case allows users to identify regions of interest, consider this approach (the same sketch after this list shows a simple crop).
Could my video camera tell me there’s a package at the door? Apparently not…
But if I zoom into where interesting items may lie, then I can tell there’s a package at the door.
  • It’s affordable at scale.
    Let’s say you have 100,000 images to label. Google will cost $500, Clarifai $220, IBM $200, and Microsoft’s pricing is not yet announced. Not too shabby, considering there are more discounts to be had at scale, or for offline processing (vs. realtime). Note that Cloud Sight, the probably-doing-a-lot-of-human-tagging vendor, will cost $3,000, an order of magnitude more than the vendors sticking solely to machine learning.
    Let’s take a hypothetical example: if I’m Dropbox and I want to offer image tagging to my customers, and each customer uploads 500 images a month, that will cost me roughly $2 per customer per month, or about $24 a year. With a $100 annual plan, that would hurt Dropbox’s margins significantly, but in theory, a company at that scale would get much better discounts, or move to building this in house after proving its value to the market.
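A few of the observations above translate directly into preprocessing you can do before calling any vendor: hash image content to detect duplicates, normalize EXIF rotation, and crop to a region of interest. Here is a minimal sketch using Pillow; the crop coordinates are hypothetical and would come from your own application.

    import hashlib
    from PIL import Image, ImageOps

    def preprocess(image_path, crop_box=None):
        """Normalize an image before sending it to a vision API."""
        # Content hash for duplicate detection (filenames are not reliable).
        with open(image_path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()

        img = Image.open(image_path)
        # Apply the EXIF orientation flag so every vendor sees the same
        # upright pixels, no matter how the camera stored them.
        img = ImageOps.exif_transpose(img)

        # Optionally crop to a region of interest, e.g. the doorstep area of
        # a security-camera frame; box is (left, top, right, bottom) pixels.
        if crop_box:
            img = img.crop(crop_box)
        return digest, img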

Conclusions

I’m impressed with the quality of general classification — it’s good enough to get the gist of an image, is fast, and relatively low cost at scale.

That said, it’s hard to evaluate a solution without a real problem. For instance, what use is general classification (e.g., “this is food”) if what I need is details for my use case (e.g., “my recipe site needs to know if this is a slab of meat or a vegetable”)? With that in mind, let me posit a few hypotheses on when you may find these APIs worth your time.

Areas where these APIs may be immediately useful:

  • General image organization and analysis.
    If your product is a consumer photo sharing site, stock photography site, or just has lots of diverse images that you want to group and search, you may find that the general labeling is good enough to add value. This is also true if you’re trying to get the gist of what’s in a large corpus of images (e.g., “what kinds of things are shown on movie posters and do they differ by genre?”).
  • Discoverability for e-commerce.
    E-commerce sites typically end up with many SKUs for which they have little product information, but may have an image. Without text keywords, these products are largely undiscoverable. Even basic labeling, like “t-shirt” or “summer dress”, may boost sales by making these previously hidden products available via search, browsing, recommendations, etc.
  • Novel uses of camera data.
    There are lots of security/pet/baby cameras out there that could do a lot more with the image data that they are collecting. While many applications will need custom models (e.g., baby monitor Nanit), there may be some that could use these general APIs, e.g. getting a daily log from my security camera telling me when my mailman came, family members entered/exited, neighbor’s dog strolled by, and so on.
  • Understanding user behavior for engagement, retention, and upselling.
    In many cases, understanding more about the customer’s use case helps you deliver a better product experience, or gives you a chance to upsell. If you have a product where users upload their own images, you may be able to see what they are doing and offer ways to automatically improve their images, sell them a better version, or simply tag the user with a label that helps you market to them better down the road.

Spaces where these APIs won’t be enough:

  • Highly specific, or non-standard tagging.
    If you have a recipe site and need to know whether an image has nigiri, sashimi, or maki in it, then you will probably want to use a custom model. Some vendors already offer specific models for different domains, e.g., Clarifai has custom models for weddings, travel, or food.
    More likely, if you are able to produce a sizable annotated corpus, you will want to train your own classifier. Even then, you don’t necessarily have to start from scratch: various vendors also provide simple APIs to train classifiers with positive examples (e.g., “this is sashimi”) and negative examples (e.g., “this almost looks like sashimi but isn’t”). See IBM’s documentation for how their custom classifier APIs work; a rough sketch of that flow appears after this list. I haven’t tried these, but am eager to hear if others have found them to work well.
  • When accuracy is critical.
    The accuracy of general tagging depends on the granularity your use case needs. I did not perform a quantitative accuracy assessment, but my small qualitative dataset suggests you can be in the 80–90% ballpark (a wide range). If you need highly accurate tagging, these models may be a good start, but you will want to augment them with some sort of human review. In these cases, Cloud Sight may be a great solution, as may good old Mechanical Turk, or newcomers focused on this problem like Spare5 (these are also great ways to build a training corpus for your own classifier).
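To make the custom-classifier option concrete, here is a rough sketch of what creating one looks like with IBM’s Watson Visual Recognition API. Treat the URL, version date, and field names as assumptions to verify against IBM’s docs; the API key and the zip archives of example images are placeholders.

    import requests

    API_KEY = "your-watson-api-key"  # placeholder
    URL = "https://gateway-a.watsonplatform.net/visual-recognition/api/v3/classifiers"

    # The positive-examples zip holds images of the class you want to learn;
    # the negative-examples zip holds near-misses (things that almost look
    # like sashimi but aren't).
    with open("sashimi_positive.zip", "rb") as pos, \
         open("not_sashimi.zip", "rb") as neg:
        response = requests.post(
            URL,
            params={"api_key": API_KEY, "version": "2016-05-20"},
            files={
                "sashimi_positive_examples": pos,
                "negative_examples": neg,
            },
            data={"name": "sushi-types"},
        )
    # On success, the response includes a classifier_id you can pass to the
    # classify endpoint to tag new images.
    print(response.json())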

Parting thoughts:

  • Are these image labeling APIs here to stay?
    I found it hard to get an idea of how many people are actually using these APIs in production. There are some signals, such as threads like this on Hacker News, or the somewhat active discussion on Azure forums, but I haven’t personally heard from happy users. My gut says that these are valuable for the sorts of cases I described above, and I would therefore bet that we’ll see private and public companies continue to provide these services.
  • Is generic image labeling a big enough space for Google, MS, etc.?
    I’m not sure about generic image labeling, but what I do know is that these are great gateway products for companies that will eventually grow and want to invest more in building custom models. At that point, you can start to take advantage of Google’s TensorFlow or Microsoft’s various Azure products to help build, train, and deploy your own models; and this is where the real long-term strategic bets lie. So while generic image labeling may not be the end goal for the behemoths, it’s a worthy way to capture future customers.
  • I’d love to hear from real users.
    I’m curious to hear from people who need tagging solutions and have tried any of these vendors. What was your use case? Does it lend itself to these labeling APIs in the first place, or do you need a much more custom solution? Did you decide to go with these vendors or not? If not, what did you go with instead, and how well is it working out for you? If you have the inclination to share, I’d love to hear more, and possibly write a follow-up to this post.

Thanks for reading this far. As a reward, I present you with one of the funnier labeling results: my own profile photo. I know I’m not the tallest guy around, but Google, did you really have to label me as a jockey? Come on…

Notes

[1] Why focus only on image labeling? Because unlike face detection and OCR, which have reasonably accurate and easy-to-use open-source solutions (OpenCV, Tesseract), generalized label detection is not something you can cobble together with established open-source libraries (or have I just not found it? Please tell me).


I’ve been a product manager, engineer, and founder for over a decade in Seattle and Silicon Valley. Currently exploring new ideas at the Allen Institute for AI.