Friday, April 29, 2016

A Tiny but Brave New Reverse Image Search Engine

A recent post describes how to use the Reverse Image Search offered by the major search engines. Basically, you upload an entire image and discover where else it can be found. Our new reverse image search engine, piclookup.com, introduces a new twist: it works even if you have only a tiny piece of the original image. After all, with text search, a quoted "piece of text" reveals documents containing that phrase. So why not expect the same thing with images?




Finding the original Flickr picture from a small portion of the left area. This is the Flickr license.

This article briefly describes how our little site came into existence and how it works. But first, here's an example use case: catching "catfishing". Let's say a stranger presents an image as their own selfie. Is it really them, or is it a tiny clipping copied from someone else's online class photo? Unless the photo is famous and heavily visited, today's big search engines are unlikely to find the original from a small part, so it's hard to catch the faker. Below is a snapshot of our examples page, with image portions taken from Flickr. Just copy one of these images to your clipboard, and paste it in using the "paste" button on the home page. Beyond catfishing, our search engine can find any sort of clipping--not just faces, but tree branches, coffee cups or whatever, regardless of rotation. In fact, the fourth image has been rotated.



PicLookup got started when one of us was developing a program for computer vision. The program chopped an image into lots of little pieces and memorized them for recall. Later, when exposed to only a portion of the original image, the program could still identify the image. By memorizing the tiny pieces, it could recall the original from a few pieces. And it could do this for a great number of images. Realizing this, we stopped working on vision and began work on the image search site instead.
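To make the "memorize the pieces" idea concrete, here is a minimal Java sketch. It is not PicLookup's actual code: the class name TileIndex, the 4-pixel tile size, and the plain hash of raw pixel values are all assumptions chosen for illustration. It stores a hash of every grid-aligned tile of each image, then slides a window over a query fragment and lets the matching tiles vote for their source image. Unlike the real engine, this toy version only matches exact pixel values and does not handle rotation or scaling.

```java
import java.util.*;

/** Toy "memorize the pieces" index. All names here are hypothetical. */
class TileIndex {
    static final int TILE = 4; // tile edge in pixels (an assumed value)
    private final Map<Integer, Set<String>> tileToImages = new HashMap<>();

    /** Hash the TILE x TILE block whose top-left corner is (row, col). */
    static int hashTile(int[][] px, int row, int col) {
        int h = 17;
        for (int r = row; r < row + TILE; r++)
            for (int c = col; c < col + TILE; c++)
                h = 31 * h + px[r][c];
        return h;
    }

    /** Storage: memorize every grid-aligned tile of the image. */
    void add(String imageId, int[][] px) {
        for (int r = 0; r + TILE <= px.length; r += TILE)
            for (int c = 0; c + TILE <= px[0].length; c += TILE)
                tileToImages.computeIfAbsent(hashTile(px, r, c), k -> new HashSet<>()).add(imageId);
    }

    /** Recall: slide a window over the fragment and count tile hits per image. */
    Map<String, Integer> votes(int[][] fragment) {
        Map<String, Integer> votes = new HashMap<>();
        for (int r = 0; r + TILE <= fragment.length; r++)
            for (int c = 0; c + TILE <= fragment[0].length; c++) {
                Set<String> ids = tileToImages.get(hashTile(fragment, r, c));
                if (ids != null) for (String id : ids) votes.merge(id, 1, Integer::sum);
            }
        return votes;
    }

    /** The image with the most matching tiles, or null if nothing matched. */
    String bestMatch(int[][] fragment) {
        String best = null;
        int bestVotes = 0;
        for (Map.Entry<String, Integer> e : votes(fragment).entrySet())
            if (e.getValue() > bestVotes) { best = e.getKey(); bestVotes = e.getValue(); }
        return best;
    }

    /** Deterministic pseudo-random grayscale image, for demos only. */
    static int[][] randomImage(long seed, int h, int w) {
        Random rnd = new Random(seed);
        int[][] px = new int[h][w];
        for (int[] row : px)
            for (int c = 0; c < w; c++) row[c] = rnd.nextInt(256);
        return px;
    }

    /** Copy an h x w region starting at (row, col). */
    static int[][] crop(int[][] px, int row, int col, int h, int w) {
        int[][] out = new int[h][w];
        for (int r = 0; r < h; r++) System.arraycopy(px[row + r], col, out[r], 0, w);
        return out;
    }

    public static void main(String[] args) {
        TileIndex index = new TileIndex();
        int[][] a = randomImage(1L, 32, 32);
        index.add("A", a);
        index.add("B", randomImage(2L, 32, 32));
        // Query with a small, off-grid clipping of image A.
        System.out.println(index.bestMatch(crop(a, 3, 5, 12, 12)));
    }
}
```

Because the query slides over every offset, a clipping that is at least 2 x TILE - 1 pixels on each side always contains at least one grid-aligned tile of the original, which is why a small portion is enough to find the whole picture in this sketch.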

We hurried to build the web site, the search engine, web crawler-robots to scour the web for pictures, and a big database to hold data derived from all those pictures. The next few paragraphs offer some details about how we got these pieces to work, starting with the heart of our site, the search engine itself.

The engine is a Java program, serving as the backend of the web site. It leverages the amazing, free image-processing library called OpenCV (see the book, "Learning OpenCV"). The engine leads every image through the processing steps, lets OpenCV do the gritty work of finding and extracting pieces, and then sends the final chunks of information to the database. That's for storage. For recall, it does exactly the same thing, except this time, instead of inserting data into the database, it queries the database for a match.

The database simply holds all the chunks of information about each image, including where the image is located. We are using MySQL, a well-established standard database, supported by Oracle. High-speed database performance is essential for an online search engine. Today, developers use a trick called caching to give lightning-fast recall. Since RAM (your computer's memory chips) works thousands of times faster than a disk drive, the trick is to load your data into RAM, your cache. Alas, RAM is expensive. A middle ground is to use solid state drives instead of disk drives whenever possible. We use all three media. For an enormous database, one that is too big for a single machine to contain, there is another essential technique called "sharding". The idea is to distribute the data across a number of separate machines, called "shards". A very informative book, "High Performance MySQL", has been an essential reference for us.
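For readers who have not met these two ideas before, here is a small Java sketch of both working together. It is an illustration under stated assumptions, not our production setup: the class ShardedStore is hypothetical, plain in-memory maps stand in for the separate MySQL machines, shards are picked by a simple modulo of the key (one common scheme among several), and the RAM cache is a least-recently-used cache built on Java's LinkedHashMap.

```java
import java.util.*;

/** Toy sharded key-value store with a small LRU cache in front. Hypothetical names. */
class ShardedStore {
    // Each map stands in for a database on a separate machine (an assumption).
    private final List<Map<Long, String>> shards = new ArrayList<>();
    private final Map<Long, String> cache; // the "RAM" layer
    int cacheHits = 0;

    ShardedStore(int shardCount, final int cacheCapacity) {
        for (int i = 0; i < shardCount; i++) shards.add(new HashMap<>());
        // accessOrder = true makes iteration order least-recently-used first.
        cache = new LinkedHashMap<Long, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, String> eldest) {
                return size() > cacheCapacity; // evict once over capacity
            }
        };
    }

    /** Pick a shard by key modulo shard count; floorMod keeps negatives in range. */
    int shardFor(long key) {
        return (int) Math.floorMod(key, (long) shards.size());
    }

    /** Write to the owning shard and drop any stale cached copy. */
    void put(long key, String imageLocation) {
        shards.get(shardFor(key)).put(key, imageLocation);
        cache.remove(key);
    }

    /** Read through the cache; on a miss, ask the owning shard and remember the answer. */
    String get(long key) {
        String cached = cache.get(key);
        if (cached != null) {
            cacheHits++;
            return cached;
        }
        String value = shards.get(shardFor(key)).get(key);
        if (value != null) cache.put(key, value);
        return value;
    }

    public static void main(String[] args) {
        ShardedStore store = new ShardedStore(3, 2);
        store.put(7L, "http://example.com/a.jpg");
        store.get(7L); // miss: served by shard 1, then cached
        store.get(7L); // hit: served from the cache
        System.out.println("shard=" + store.shardFor(7L) + " hits=" + store.cacheHits);
    }
}
```

One known weakness of modulo sharding is that changing the number of shards moves most keys to a different machine, which is why larger systems often use a lookup table or consistent hashing instead.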

The data is collected by our web crawlers, which were written in Java. We combined some online examples and added our own custom code, creating two separate crawlers. The first scans the web looking for images, recording their locations. The second crawler loads the images, extracts image data and stores it in the database. This second robot shares image-processing code with the search engine, since they both treat images the same way, capturing information about each tiny piece of the image, and where the image was found.
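The two-crawler split can be sketched in a few lines of Java. This is a simplified illustration, not our crawler: the class TwoStageCrawler is hypothetical, the regular expression is a stand-in for a proper HTML parser, and the loader and extractor are passed in as functions so the example runs without any network access. A real crawler would also need to respect robots.txt, rate limits, and relative URLs, none of which appear here.

```java
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.function.Function;
import java.util.function.ToIntFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy two-stage crawler: stage 1 records image locations, stage 2 processes them. */
class TwoStageCrawler {
    // Simplistic pattern for <img ... src="...">; an assumption, not a full HTML parser.
    private static final Pattern IMG_SRC =
        Pattern.compile("<img[^>]*\\ssrc=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    private final Deque<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    /** Stage 1: scan a page's HTML and queue image URLs not seen before. */
    int discover(String html) {
        int added = 0;
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            String url = m.group(1);
            if (seen.add(url)) {
                pending.add(url);
                added++;
            }
        }
        return added;
    }

    /**
     * Stage 2: load each queued image and run the feature extractor on it.
     * Returns the extractor's result per URL, in the order the URLs were found.
     */
    Map<String, Integer> processAll(Function<String, byte[]> loader,
                                    ToIntFunction<byte[]> extractor) {
        Map<String, Integer> results = new LinkedHashMap<>();
        while (!pending.isEmpty()) {
            String url = pending.poll();
            byte[] data = loader.apply(url);
            if (data != null) results.put(url, extractor.applyAsInt(data)); // skip failed loads
        }
        return results;
    }

    public static void main(String[] args) {
        TwoStageCrawler crawler = new TwoStageCrawler();
        crawler.discover("<p><img src=\"http://example.com/a.jpg\"></p>");
        // Fake loader and extractor: "download" the URL text and count its bytes.
        System.out.println(crawler.processAll(
            url -> url.getBytes(StandardCharsets.UTF_8), bytes -> bytes.length));
    }
}
```

Keeping the stages separate means the slow part, loading and analyzing images, can fall behind or be retried without stopping the discovery of new image locations.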

We wrote most of the web page HTML ourselves, but eventually we were helped by consultants for the finishing touches, like improving the CSS and enhancing the page layout. Technical web issues we overcame included using AJAX to upload the image, perform the search and return results. This involves JavaScript, PHP, and Java, a fairly standard "stack", which enables us to solve problems using plenty of online advice from sources like StackOverflow.com.


Conclusion: our image upload is remarkably easy for users. Search for an image by pasting from the clipboard or using file upload. (Confession: we still have work to do for mobile platforms.) So far, our robots have harvested over one million images. However, the web has untold billions of images, and we're wrestling with the resource problem--we need many more servers. We hope to grow by gaining "traction" and investing in more hardware. Thanks for letting us share our experience. We very much appreciate any and all feedback. If you're curious, please try out our site or watch a two-minute whirlwind demo video on YouTube, which dashes through some "reverse image lookups".