There are billions of images on the Internet today. An average image search query consists of about 2.2 terms; it’s up to the search engine to determine the "user intent" (what the user is looking for) and retrieve an appropriate subset of images from this vast collection.
In this article, I will go over some approaches to web image search. In part 1 of the article, we will see some of the traditional approaches and issues related to indexing images, and in part 2, I will discuss a couple of Google papers on image ranking.
Images were traditionally indexed based only on the surrounding keywords in a web page. A "score" is applied to each image based on the surrounding text in the page, with different "score weights" applied for words present in the image's alt text, hyperlink, file name, page title, etc. Based on this score, images are sorted and presented in the image results.
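As a rough illustration of the weighted scoring described above, here is a minimal Python sketch. The field names, the weight values, and the helper functions (`tokenize`, `keyword_score`, `rank_images`) are all assumptions made for this example; real search engines use far more elaborate, unpublished scoring functions.

```python
import re

# Hypothetical weights: text closer to the image (alt text, file name)
# counts more than general page text. The values are illustrative only.
FIELD_WEIGHTS = {
    "alt_text": 3.0,
    "file_name": 2.5,
    "link_text": 2.0,
    "page_title": 1.5,
    "body_text": 1.0,
}


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def keyword_score(query, image_fields, weights=FIELD_WEIGHTS):
    """Sum, over all text fields, weight * number of distinct query terms found."""
    query_terms = set(tokenize(query))
    score = 0.0
    for field, text in image_fields.items():
        matches = len(query_terms & set(tokenize(text)))
        score += weights.get(field, 0.0) * matches
    return score


def rank_images(query, images):
    """Return the images sorted by descending keyword score."""
    return sorted(
        images,
        key=lambda img: keyword_score(query, img["fields"]),
        reverse=True,
    )
```

With this sketch, an image whose alt text and file name both contain the query terms outranks a decorative image that merely shares the page title with it.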
A keyword-based approach is problematic: a single page could contain multiple images, and the surrounding text may not indicate the semantic content of any of them. The notion of "surrounding text" itself is loosely defined. Web search engines usually assign all keywords in a page to all of its images, and then modify each image's score based on its file name, alt text, etc. This means that an image could be indexed with a completely unrelated keyword. Search engines that rely on keywords are also susceptible to spamming techniques like keyword stuffing.
Besides the main content, a page also contains graphics such as menus, banners, buttons, and divider lines. These are usually irrelevant to the search query (unless the user is specifically searching for images of web page graphics). The keyword indexing approach could wrongly attach keywords to these images.
Thus, keyword-based search alone is not sufficient. Search engines should also look into the properties and "content" of an image.
Image file properties
Image properties such as dimensions (width and height) in pixels can be used to categorize images by size: icon, small, medium, large, wallpaper, etc. For example, the query term "wallpaper" can be used to retrieve images with standard wallpaper dimensions (1024x768, 1440x900, etc.), and the term "icon" can be used to retrieve icon images (16x16, 32x32, etc.).
Image dimensions could also be used to weed out insignificant images; for example, an image that is 1px wide and 1px high is unlikely to be of any value as a search result.
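The size-based categorization and filtering above could look something like the following sketch. The wallpaper and icon dimensions are the examples from the text; the area thresholds for "small", "medium", and "large" and the `min_side` cutoff are assumptions chosen only for illustration.

```python
# Example dimensions taken from the text; real lists would be much longer.
WALLPAPER_SIZES = {(1024, 768), (1440, 900)}
ICON_SIZES = {(16, 16), (32, 32)}


def is_insignificant(width, height, min_side=8):
    """Flag images too small to be useful results (e.g. 1x1 tracking pixels).

    The min_side threshold is an assumed value, not a known standard.
    """
    return width < min_side or height < min_side


def size_category(width, height):
    """Map pixel dimensions to a coarse size label."""
    if (width, height) in ICON_SIZES:
        return "icon"
    if (width, height) in WALLPAPER_SIZES:
        return "wallpaper"
    area = width * height
    # Assumed area thresholds, for illustration only.
    if area < 200 * 200:
        return "small"
    if area < 800 * 600:
        return "medium"
    return "large"
```

An indexer could call `is_insignificant` first to discard spacer and tracking images, and then store the label from `size_category` alongside the image so that a query containing "icon" or "wallpaper" can filter on it.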
Content-based image analysis is a challenging problem. Computers cannot understand images like humans do; they cannot "see" what the image contains. At the basic level, an image processing program can tell you the color of a particular point (pixel) in the image.
This prompted search engines to quickly add color-based search to images. Techniques like color histogram analysis were used to classify images based on their color content. Using color information, we are able to find images by prominent color (the most common color in an image), classify images as black and white images (grayscale), clip art, etc.
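To make the idea of color histogram analysis concrete, here is a small pure-Python sketch. It assumes the image has already been decoded into a list of `(r, g, b)` tuples with values from 0 to 255; the function names, the number of bins, and the grayscale tolerance are assumptions made for this example, not a description of what any particular search engine does.

```python
from collections import Counter


def color_histogram(pixels, bins_per_channel=4):
    """Count pixels per quantized color bin.

    Each channel (0-255) is divided into bins_per_channel equal ranges,
    so similar shades fall into the same bin.
    """
    step = 256 // bins_per_channel
    return Counter((r // step, g // step, b // step) for r, g, b in pixels)


def prominent_color(pixels, bins_per_channel=4):
    """Return the center RGB value of the most populated histogram bin."""
    if not pixels:
        raise ValueError("image has no pixels")
    step = 256 // bins_per_channel
    hist = color_histogram(pixels, bins_per_channel)
    (r_bin, g_bin, b_bin), _count = hist.most_common(1)[0]
    half = step // 2
    return (r_bin * step + half, g_bin * step + half, b_bin * step + half)


def is_grayscale(pixels, tolerance=8):
    """Treat an image as grayscale if R, G and B are nearly equal everywhere.

    The tolerance is an assumed value that allows for compression noise.
    """
    return all(max(p) - min(p) <= tolerance for p in pixels)
```

Quantizing colors into a few bins per channel keeps the histogram small and makes "mostly red" images comparable even when their exact shades differ.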
Over time, image processing and computer vision algorithms became smarter, applying machine learning to train computers to detect features in images. Today, after many years of research and with huge amounts of training data, computers can detect faces in images with a high degree of accuracy. Face detection is a common feature in image search engines today, since a lot of work has already been done in this area.
If we can detect faces, why not other objects? In general, training computers to detect an arbitrary object (ball, apple, phone, etc.) is a time-consuming process. Supervised learning requires a huge volume of manually tagged training data to teach the computer how to recognize a particular object. Using current technology and approaches, it is not feasible, at the scale of the web, to train computers to detect hundreds of thousands of popular object categories. Also, an image could contain many objects (a scene) in different orientations, some of them partially occluded. It is hard to extract all of these objects to define the scene accurately.
Ranking of images
We have seen some techniques and issues involved with indexing images. With these known limitations in “understanding” images, what would be a better way to find the most suitable result image(s) for a user query? What will appear on page 1 of the results? How do we handle duplicates or near-duplicates of images? "PageRank for image search" is one solution from Google. I will focus on this solution in part 2 of this article.