Vito Chin

There are many insights that we can gain by considering colors. For example, a search for Manchester football teams will yield more meaningful results if colors close to either primary red or blue are specified, and that certainly makes all the difference.

Searching for colors

We can also use information implicit in colors to fulfill our search requirements without additional search support structure. If we want to search for the Victoria line on London's useful tube map for example, we could search for the light blue color RGB(24, 132, 188) instead of having to set up the vector structures or other means to enable highlighting of the line. With Gmagick, we can highlight search results as we shall see further on in this section.

More specificity in terms of colors and positions certainly helps too. Consider a situation where we are looking for a red roof in a collection of paintings. Knowing the opacity and location of red occurrences helps in narrowing down the scope of search.

The unavoidable challenge with increasing specificity is one latent in all human endeavors: resource constraints. A typical image contains millions of pixels. Storing and indexing a large number of images is resource intensive, depending on the quantization level. While the computing resources of individual servers have increased and will continue to increase dramatically, the cloud has presented itself as a practical and available solution for such resource-intensive tasks.

We will use Amazon's Elastic MapReduce (EMR) and PHP to add color searching capability to the PictureMe application that we looked at in an earlier tutorial, PHP and the Cloud. The application will allow flexible definition of search specificity by adjusting the size of a search "cell", the pixel size of a segment within any image that can be searched. The application can be scaled to support finer cell sizes and larger sets of images by riding on the elasticity of the Amazon MapReduce cloud.

Amazon's EMR offering encapsulates many of the details involved in creating and maintaining a Hadoop cluster, presenting a minimal interface to get you running quickly. Coupled with PHP's strength in this area, the pair combine well as a feasible and accessible platform for large-scale data processing without the enormous time and hardware cost.

Requirements

There are a few things that you need to follow this tutorial. Most of these were described in the earlier techPortal article, PHP and the Cloud.

  1. Amazon AWS account with access to S3, CloudFront, EC2 and Elastic MapReduce
  2. Don Schonknecht’s Amazon S3 PHP class
  3. cURL
  4. S3Fox
  5. Amazon EC2 API Tools
  6. The Gmagick PHP extension
  7. The TokyoTyrant PHP extension

To install the PHP extensions, you'll first have to install their dependencies. Instructions for doing so are available on their respective project pages at http://pecl.php.net/gmagick and http://pecl.php.net/tokyotyrant. Optionally, you'll find that a VM with Hadoop installed is useful as a sandbox for trying out MapReduce locally before running it on Amazon. We'll look at this later on.

The color indexing architecture

IMG_EMR_architecture

Here's how the color indexing and searching ecosystem works. We add a _putColorGrid() method to PictureManager that utilises the getFuzzyColorGrid() method of Gmagick_Fuzzy (an extended Gmagick class) and some other core functions of Gmagick to generate a color grid representation of an image. For every picture image that is uploaded to the BUCKET, a file containing the color grid representation of the picture is uploaded to a COLOR_GRID_BUCKET. Color grid files stored within this bucket will be used as input to mapper.php.

The MapReduce side of things is responsible for producing an indexed table of all available colors and their corresponding locations within the S3 storage. This indexed data will be stored in a COLOR_INDEX_BUCKET that is accessed by SearchManager when an index update is performed. TokyoCabinet/TokyoTyrant is used as fast local storage for the color index. We provide a precise color picker to the user, and also allow the user to specify a search proximity. The getAxesNeighbor() method in Gmagick_FriendlyPixel returns all colors that are within the specified 3D proximity of the chosen color pixel.

The color grid

"One picture is worth a thousand words", so says Fred R. Barnard. The phrase is literally true when it comes to the pixel representation of images, but this depends on the complexity of the image in terms of colors. In the field of computer graphics, RGB, one of the most popular colorspaces, can represent about 16.7 million colors at 8 bits per channel (256 levels each of red, green and blue), allowing visualisation of precise light intensity, tones and shadows that would have made Vermeer proud. We shall make use of this immense colorspace to construct the color grid of our color search engine.

Color searching at the pixel level can be an enormous task in a reasonably sized image repository. A typical picture will have at least 1 million pixels. Here's where the color grid helps. Instead of considering every pixel within an image, we divide the image into rows and columns of cells, each cell being of a fixed size. The colors in each cell are also quantized to obtain a single representative color for the cell. The following illustration shows what the color grid of an image looks like:

IMG_EMR_grid
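To make the idea concrete without any image library, here is a minimal sketch that builds a color grid from a plain two-dimensional array of pixels. The averageColorGrid() helper is hypothetical, and it averages the pixels in each cell as a simple stand-in for the Gmagick quantization that the article actually uses. The grid is keyed by the left and top offsets of each cell.

```php
<?php
// Hypothetical helper: builds a color grid from a plain 2D array of pixels.
// $pixels[$y][$x] is array($r, $g, $b). Averaging stands in for the
// quantization step that the article performs with Gmagick.
function averageColorGrid(array $pixels, $cellSize)
{
    $height = count($pixels);
    $width  = $height > 0 ? count($pixels[0]) : 0;
    $grid   = array();

    for ($top = 0; $top < $height; $top += $cellSize) {
        for ($left = 0; $left < $width; $left += $cellSize) {
            $sum   = array(0, 0, 0);
            $count = 0;
            // Walk every pixel in the cell, clipping at the image edges.
            for ($y = $top; $y < min($top + $cellSize, $height); $y++) {
                for ($x = $left; $x < min($left + $cellSize, $width); $x++) {
                    foreach ($pixels[$y][$x] as $channel => $value) {
                        $sum[$channel] += $value;
                    }
                    $count++;
                }
            }
            // One representative color per cell.
            $grid[$left][$top] = array(
                intdiv($sum[0], $count),
                intdiv($sum[1], $count),
                intdiv($sum[2], $count),
            );
        }
    }
    return $grid;
}

// A tiny image of 2 rows by 4 columns, split into 2x2 cells, yields two cells.
$red    = array(200, 0, 0);
$dark   = array(100, 0, 0);
$blue   = array(0, 0, 250);
$pixels = array(
    array($red, $dark, $blue, $blue),
    array($dark, $red, $blue, $blue),
);
$grid = averageColorGrid($pixels, 2);
// $grid[0][0] is array(150, 0, 0) and $grid[2][0] is array(0, 0, 250).
```

With a cell size of 2, eight pixels collapse into two searchable colors. A larger cell size shrinks the index further at the cost of precision.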

The PictureManager contains a method that creates this color grid using an extended version of Gmagick. Besides creating this color grid, the PictureManager also places a file that is representative of this color grid on a defined color grid bucket on S3. Let's take a look:

    protected function _putColorGrid($pictureName, $pictureFile)
    {
        try {
            $fgm              = new Gmagick_Fuzzy($pictureFile);
            $colorGrid        = $fgm->getFuzzyColorGrid(CELL_SIZE);
            $pictureColorGrid = '';
            foreach ($colorGrid as $rowKey => $colorRow) {
                foreach ($colorRow as $colKey => $colorColumn) {
                    $pictureColorGrid .= $pictureName.'['.$rowKey.','.$colKey.
                    ']'.chr(9).substr($colorColumn, 3).PHP_EOL;
                }
            }
            if ($this->_storage->putObject($pictureColorGrid, COLOR_GRID_BUCKET, $pictureName,
                S3::ACL_PUBLIC_READ, array(), 'text/plain')) {
                return true;
            } else {
                return false;
            }
        } catch (Exception $e) {
            return false;
        }
    }

A $fgm Gmagick_Fuzzy object is instantiated, passing the $pictureFile to the constructor as we would with the normal Gmagick class. Next, on the $fgm object, we call getFuzzyColorGrid(), passing in the desired CELL_SIZE, which is the size of the square cell in pixels. The getFuzzyColorGrid() method is detailed in the following snippet:

    public function getFuzzyColorGrid($gridSize)
    {
        $colorGrid = array();
        for ($i = 0; $i < $this->getImageWidth(); $i += $gridSize) {
            for ($j = 0; $j < $this->getImageHeight(); $j += $gridSize) {
                $cropped = clone $this;
                $histogram = $cropped->cropImage($gridSize, $gridSize, $i, $j)
                  ->quantizeImage(1, Gmagick::COLORSPACE_RGB, 0, false, false )
                  ->getImageHistogram();
                $colorGrid[$i][$j] = $histogram[0]->getColor();
            }
        }
        return $colorGrid;
    }

getFuzzyColorGrid() is an extended Gmagick method that chops the image into small sections, quantizes each of these sections to a single color and returns an array of the colors of all the sections.

Back in the _putColorGrid() method in PictureManager, we subsequently create a color grid string. We format the color grid information in pairs of location and color. Each line will denote a pair of location and color, starting with the top left of the image to the bottom right. This color grid string is then placed in a COLOR_GRID_BUCKET on S3. Files in this bucket will be used as input to our mapper function, which we'll look at next.
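The line format described above can be tried in isolation. The following is a sketch only: formatColorGrid() is a hypothetical stand-alone version of the loop in _putColorGrid(), and it assumes the grid holds color strings of the form rgb(72,71,85), which is what the substr() call in that method implies.

```php
<?php
// Hypothetical stand-alone version of the formatting loop in _putColorGrid().
// $colorGrid maps [first key][second key] to strings such as "rgb(72,71,85)".
function formatColorGrid($pictureName, array $colorGrid)
{
    $lines = '';
    foreach ($colorGrid as $rowKey => $colorRow) {
        foreach ($colorRow as $colKey => $color) {
            // substr($color, 3) drops the leading "rgb", leaving "(72,71,85)".
            $lines .= $pictureName . '[' . $rowKey . ',' . $colKey . ']'
                    . "\t" . substr($color, 3) . PHP_EOL;
        }
    }
    return $lines;
}

// Prints one "location<TAB>color" line per cell.
echo formatColorGrid('Image001.jpg', array(
    0 => array(0 => 'rgb(72,71,85)', 10 => 'rgb(76,65,79)'),
));
```

Each line is a tab-separated key/value pair, which is the shape that Hadoop streaming expects on standard input.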

The Mapper

Our map function receives input allocated to it by Hadoop and performs some reorganising and formatting before passing it on to the reducer.

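#!/usr/bin/php
<?php
 
// Maps each color to a '|'-separated list of the locations where it occurs
$colorArray = array();
 
while (($line = fgets(STDIN)) !== false) {
    $line = trim($line);
    if (strpos($line, chr(9)) === false) {
        continue; // skip blank or malformed lines
    }
    list($location, $color) = explode(chr(9), $line);
    if (empty($colorArray["{$color}"])) {
        $colorArray["{$color}"] = $location;
    } else {
        $colorArray["{$color}"] = implode('|', array($colorArray["{$color}"], $location));
    }
}
 
// Emit color<TAB>locations, with the color as the intermediate key
foreach ($colorArray as $color => $locations) {
    echo $color, chr(9), $locations.PHP_EOL;
}
 
?>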

Inputs sent to the map function are always key/value pairs, with the location as the key and the color at that location as the value. Here's a sample snippet of input lines that are typically received by the mapper:

Image001.jpg[0,0]	(72,71,85)
Image001.jpg[0,10]	(76,65,79)
Image001.jpg[0,20]	(75,69,82)
Image001.jpg[0,30]	(68,62,72)
Image001.jpg[0,40]	(64,62,77)
Image001.jpg[0,50]	(66,60,77)
Image001.jpg[0,60]	(72,56,74)
Image001.jpg[0,70]	(68,53,72)
Image001.jpg[0,80]	(68,53,72)
...

Hadoop splits the input and allocates the lines to mapper tasks. The mapper takes each line and explodes it to obtain the color and location data. A colorArray is used to store the list of locations that have a particular color, with the color as the key. The mapper adds to this colorArray for each line it encounters, building up the list of locations for each color. After looking at all its input, the mapper outputs each item in the colorArray, giving each color and its corresponding locations as key/value pairs. Hadoop sorts this intermediate output by key before passing it to the reducer, so it arrives in the following order:

(64,62,77)	Image001.jpg[0,40]
(66,60,77)	Image001.jpg[0,50]
(68,53,72)	Image001.jpg[0,70]|Image001.jpg[0,80]
(68,62,72)	Image001.jpg[0,30]
(72,56,74)	Image001.jpg[0,60]
(72,71,85)	Image001.jpg[0,0]
(75,69,82)	Image001.jpg[0,20]
(76,65,79)	Image001.jpg[0,10]
...

This intermediate output will then be used by the reducer to build a storage-wide index of color locations.

The Reducer

The reduce function merges all locations from the intermediate output with the same intermediate color key together.

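#!/usr/bin/php
<?php
 
// Maps each color to a '|'-separated list of locations across all inputs
$wholeColorArray = array();
 
while (($line = fgets(STDIN)) !== false) {
    $line = trim($line);
    if (strpos($line, chr(9)) === false) {
        continue; // skip blank or malformed lines
    }
    list($color, $locations) = explode(chr(9), $line);
    if (empty($wholeColorArray["{$color}"])) {
        $wholeColorArray["{$color}"] = $locations;
    } else {
        $wholeColorArray["{$color}"] = implode('|', array($wholeColorArray["{$color}"], $locations));
    }
}
 
// Emit the merged color<TAB>locations pairs as the final output
foreach ($wholeColorArray as $color => $locations) {
    echo $color, chr(9), $locations.PHP_EOL;
}
 
?>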

Similarly, a $wholeColorArray is used to store a list of color/locations pairs, but this time the array stores color/locations that are representative of the whole PictureMe S3 storage, not just a chunk of partitioned inputs. As with the mapper, we output all the colors and locations, this time to the final output bucket of the MapReduce process.

Hint: Notice at the beginning of the map and reduce code how we indicate to Hadoop streaming the executable that we wish to use to execute the script (#!/usr/bin/php).
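Before involving Hadoop at all, the map, sort and reduce sequence can be imitated in a single PHP process. This is a rough sketch, not part of PictureMe: groupByKey() and simulateJob() are hypothetical names, and a plain string sort stands in for Hadoop's shuffle phase.

```php
<?php
// Mirrors the grouping logic used by both mapper.php and reducer.php:
// each input line is "key<TAB>value"; values sharing a key are joined with "|".
// When $swap is true the two fields are exchanged first, as the mapper does.
function groupByKey(array $lines, $swap)
{
    $grouped = array();
    foreach ($lines as $line) {
        $parts = explode("\t", trim($line));
        if (count($parts) !== 2) {
            continue; // skip blank or malformed lines
        }
        list($key, $value) = $swap ? array($parts[1], $parts[0]) : $parts;
        $grouped[$key] = isset($grouped[$key]) ? $grouped[$key] . '|' . $value : $value;
    }
    $out = array();
    foreach ($grouped as $key => $value) {
        $out[] = $key . "\t" . $value;
    }
    return $out;
}

// Hypothetical driver: one "mapper" call per input file, a sort to imitate
// the shuffle, then a single "reducer" call over all intermediate lines.
function simulateJob(array $inputFiles)
{
    $intermediate = array();
    foreach ($inputFiles as $lines) {
        $intermediate = array_merge($intermediate, groupByKey($lines, true));
    }
    sort($intermediate, SORT_STRING);
    return groupByKey($intermediate, false);
}

$index = simulateJob(array(
    array("Image001.jpg[0,70]\t(68,53,72)", "Image001.jpg[0,80]\t(68,53,72)"),
    array("Image002.jpg[0,0]\t(68,53,72)", "Image002.jpg[0,10]\t(72,71,85)"),
));
print_r($index);
```

The color (68,53,72) occurs in both input files, so its locations from the two mapper runs end up merged on a single line of the final index.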

Sandboxing on a VM with Apache Hadoop

A virtual machine with Apache Hadoop is useful as a test environment for your MapReduce application before deploying it on the cloud. It provides a safe space for you to experiment in and, most importantly, to do so at no cost. We shall go through the process of running the PictureMe indexing MapReduce application to illustrate the use of such a tool. Here, we use Yahoo's Hadoop Virtual Machine. Make sure you run apt-get to install PHP with cURL.

First, we transfer the necessary files to the VM using Secure Copy:

	#scp * hadoop-user@:/home/hadoop-user/Color

Make sure you have mapper.php, reducer.php and some sample input files (color grids) uploaded as well. The source code contains two such input files.

Next, on the VM, we'll need to load the input onto Hadoop's distributed filesystem. Move the input files to a directory, say Color/colorsource within the hadoop-user home directory, then copy the whole directory into the distributed filesystem (HDFS):

	#hadoop fs -copyFromLocal /home/hadoop-user/Color/colorsource/ colorsource

Then make sure Hadoop can read and execute the mapper and reducer files:

	#chmod -R 555 Color

With that, you're ready to start:

	#hadoop jar hadoop/contrib/streaming/hadoop-0.18.0-streaming.jar -mapper /home/hadoop-user/Color/mapper.php -reducer /home/hadoop-user/Color/reducer.php -input colorsource/* -output color-output

Make sure the output directory does not already exist within the distributed filesystem. Once the job has completed, you can retrieve the output:

    #hadoop fs -copyToLocal color-output ./

hadoop fs is the tool to use to interact with the distributed filesystem. For a list of available commands, run #hadoop fs -help.

Creating the JobFlow on Elastic MapReduce

Once the MapReduce application has been tested and verified on the sandbox VM, we can move it to the cloud. Since we already have our input color grid files in the COLOR_GRID_BUCKET on S3, doing this is really just a matter of moving the mapper and reducer to a bucket on S3 (not the bucket that stores the input files, though!).

We will use Amazon's AWS Management Console (http://aws.amazon.com/console) to launch the indexing job flow as well as to monitor its progress. A job flow comprises the map and reduce functions together with the input and output involved. We will create a Streaming job flow, which uses Hadoop's streaming utility to run our PHP-based mapper and reducer. Once you're logged in to the Elastic MapReduce tab on the AWS Management Console, click on the Create New Job Flow button. Specify a name for your job flow, say "Color Indexing", choose Streaming, then Continue. Next, enter the following values for the specified parameters:

Input Location: s3n://
Output Location: s3n://
Mapper: s3n:///mapper.php
Reducer: s3n:///reducer.php

The output location has to be unused. If you want to use a bucket name that already exists, make sure you delete the bucket before starting your job flow; otherwise it will fail.

Continue to the next section, and you'll be asked to specify the number and type of instances to run. The important part of this section is Advanced Options, where you can specify an S3 log path. This log is really useful for debugging, development and maintenance, so do specify a bucket in which to place this log information.

Once all that is specified and reviewed, start the job and a new row will be created on the Elastic MapReduce console main page. When the state changes to COMPLETED, you will be able to extract the output of the job flow from the output bucket. Make sure that the output location used in the job flow is specified as the COLOR_INDEX_BUCKET in config.inc so that the SearchManager knows that it has to use this bucket. The index will need to be updated at a suitable interval to account for newly stored pictures and other possible changes within the color grid files.

Updating the color index to local storage

We will use TokyoCabinet/TokyoTyrant as a local key-value store for our color locations. We need a storage mechanism that is fast and can handle a very large number of items, so TokyoCabinet/TokyoTyrant is ideal.

Make sure you have the right output bucket name in your COLOR_INDEX_BUCKET defined within config.inc and your TokyoCabinet/TokyoTyrant server instance started:

    #ttserver local.tch  (where local.tch is the database name)

We retrieve a list of all output indexing files from the output bucket:

    $indexFiles = $this->_storage->getBucket(COLOR_INDEX_BUCKET);

Then, we retrieve each of these files and place the key-value data embedded in each line into our local store:

    foreach ($indexFiles as $name => $info) {
        $index               = $this->_storage->getObject(COLOR_INDEX_BUCKET, $name);
        $colorLocationsArray = explode(PHP_EOL, $index->body);
        $storedColors        = array();
 
        foreach ($colorLocationsArray as $colorLocations) {
            $colorLocations = trim($colorLocations);
            if (strpos($colorLocations, chr(9)) === false) {
                continue; // skip blank or malformed lines
            }
            list($color, $locations) = explode(chr(9), $colorLocations);
 
            // Only store complete color/locations pairs
            if (!empty($color) && !empty($locations)) {
                $storedColors["{$color}"] = $locations;
            }
        }
        if (!empty($storedColors)) {
            $this->_localStore->put($storedColors);
        }
    }

Since search requests are made with colors, we can quickly retrieve location information for each color requested by using color as the key in our store.
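The parsing step can be exercised on its own, without S3 or a running TokyoTyrant server. The sketch below uses a hypothetical parseIndexBody() helper and splits on any line ending with preg_split() rather than on PHP_EOL, which is a design choice of this sketch and not how PictureMe does it. It guards against index files whose line endings differ from those of the machine reading them.

```php
<?php
// Hypothetical helper: turns the body of one index file into an array of
// color => locations, ready to be handed to a key-value store.
function parseIndexBody($body)
{
    $stored = array();
    // \R matches \n, \r\n and \r, so the platform's PHP_EOL does not matter.
    foreach (preg_split('/\R/', $body) as $line) {
        $parts = explode("\t", trim($line));
        if (count($parts) === 2 && $parts[0] !== '' && $parts[1] !== '') {
            $stored[$parts[0]] = $parts[1];
        }
    }
    return $stored;
}

$body = "(68,53,72)\tImage001.jpg[0,70]|Image001.jpg[0,80]\n"
      . "(72,71,85)\tImage001.jpg[0,0]\n";
print_r(parseIndexBody($body));
```

The resulting array has one entry per color, so looking up a color during a search is a single key fetch.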

Handling search requests

Search requests on the front-end are handled by the searchColor() method of SearchManager. This method first filters the RGB color values before instantiating a Gmagick_FriendlyPixel.

    // Double quotes are required so that $r, $g and $b are interpolated
    $pixel = new Gmagick_FriendlyPixel("rgb({$r},{$g},{$b})");

This extended version of GmagickPixel has a method called getAxesNeighbor() that returns all neighbors of the pixel that are within the specified $proximity in a 3D colorspace.

	$neighbors = $pixel->getAxesNeighbor($proximity);

The $proximity helps in allowing search users to control the scope of the search while conserving color relatedness.
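The source of getAxesNeighbor() is not shown here, so the following is only one plausible interpretation and should not be read as the actual implementation. The hypothetical cubeNeighbors() function treats the proximity as the half-width of a cube around the chosen color in RGB space and clamps each channel to the 0 to 255 range.

```php
<?php
// Hypothetical stand-in for Gmagick_FriendlyPixel::getAxesNeighbor(), whose
// source is not shown in the article. "Proximity" is treated as the
// half-width of a cube around the color; channels are clamped to 0..255.
function cubeNeighbors(array $rgb, $proximity)
{
    $ranges = array();
    foreach ($rgb as $value) {
        $ranges[] = range(max(0, $value - $proximity), min(255, $value + $proximity));
    }
    $neighbors = array();
    foreach ($ranges[0] as $r) {
        foreach ($ranges[1] as $g) {
            foreach ($ranges[2] as $b) {
                $neighbors[] = array($r, $g, $b);
            }
        }
    }
    return $neighbors;
}

// A proximity of 1 around an interior color gives a 3x3x3 cube.
$neighbors = cubeNeighbors(array(179, 22, 33), 1);
echo count($neighbors), PHP_EOL; // 27
```

Under this interpretation the number of candidate colors grows with the cube of the proximity: a proximity of 10 around RGB (179, 22, 33) already yields 9,261 lookup keys. That is why a fast key-value store matters for the search step.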

The next step is to retrieve the locations stored in our local storage for each of the colors in the $neighbors array. Each location consists of the filename and the position (coordinates) at which the color is found. For each location retrieved, we extract the filename and the coordinates of the position within the file. We then use this information to build a file-specific mapping from each coordinate position to the searched color that was found there. An example of such a mapping:

    [vermeer.jpg|0.60767500 1253551071] =>
    Array(
                [[70,160]]  => (165,171,167)
                [[620,590]] => (167,168,154)
                [[380,460]] => (167,168,160)
                [[260,20]]  => (168,167,161)
                [[540,540]] => (168,168,156)
                [[50,60]]   => (168,184,173)
                [[490,550]] => (169,172,154)
                [[0,20]]    => (169,186,171)
                [[590,350]] => (170,170,162)
                [[550,540]] => (170,172,159)
                [[500,550]] => (170,173,154)
                [[560,560]] => (170,173,154)
         )

We keep a copy of this mapping for each file in our local store so that we can quickly retrieve it to draw indicators on the image where colors were found. We'll discuss this in the next section. Note that the local store is used for search result caching as well as for color indexing. At this point, we index search results with $file.'|'.$searchTime so that we can be specific when retrieving these search results at a later time:

    $searchTime     = microtime();
    $locationColors = array();
    foreach ($neighbors as $neighbor) {
        $color     = '('.implode(',', $neighbor).')';
        $locations = $this->_localStore->get($color);
        if ($locations !== NULL) {
            foreach (explode('|', $locations) as $location) {
                $file     = substr($location, 0, strpos($location, '['));
                $position = strstr($location, '[');
                // Key by file and search time so that positions found in
                // different files are kept in separate mappings
                $locationColors["{$file}|{$searchTime}"]["{$position}"] = $color;
            }
        }
    }
    // Cache one serialized position => color mapping per file
    $fileColors = array();
    foreach ($locationColors as $key => $colorPosition) {
        $fileColors[$key] = serialize($colorPosition);
    }
    if (!empty($fileColors)) {
        $this->_localStore->put($fileColors);
    }
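The grouping of index entries into per-file mappings can be tried without a local store. In this sketch groupMatchesByFile() is a hypothetical helper and the input array plays the role of the values fetched from the store for each neighboring color. It assumes file names do not themselves contain a "[" character.

```php
<?php
// Hypothetical helper: $matches maps a color key such as "(68,53,72)" to the
// "|"-separated location string stored in the index for that color.
function groupMatchesByFile(array $matches)
{
    $byFile = array();
    foreach ($matches as $color => $locations) {
        foreach (explode('|', $locations) as $location) {
            $bracket = strpos($location, '[');
            if ($bracket === false) {
                continue; // not a "file[x,y]" location
            }
            $file     = substr($location, 0, $bracket);
            $position = substr($location, $bracket);
            // Each file gets its own position => color mapping.
            $byFile[$file][$position] = $color;
        }
    }
    return $byFile;
}

print_r(groupMatchesByFile(array(
    '(68,53,72)' => 'Image001.jpg[0,70]|Image001.jpg[0,80]|Image002.jpg[0,0]',
    '(72,71,85)' => 'Image002.jpg[0,10]',
)));
```

Both sample files end up with two matching positions each, and no position from one file leaks into the mapping of the other.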

Finally, before returning the result to the user, we sort it by size as a straightforward way of ranking the search result. This way, images with the most cell positions found for colors in the $neighbors array are ranked highest:

    uasort($locationColors, array('SearchManager', 'locationCompare'));
 
    static function locationCompare($a, $b)
    {
       if (sizeof($a) == sizeof($b)) {
           return 0;
       }
       return (sizeof($a) > sizeof($b)) ? -1 : +1;
    }
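To see the ranking in action, here is a small self-contained sketch with made-up file names. It uses a closure and the spaceship operator (PHP 7 or later) in place of the static locationCompare() method, but orders the results in the same way: largest number of matching cells first.

```php
<?php
// Sample per-file mappings of position => color; the file names are made up.
$locationColors = array(
    'sparse.jpg' => array('[0,0]' => '(170,173,154)'),
    'dense.jpg'  => array(
        '[0,0]'  => '(170,173,154)',
        '[10,0]' => '(168,168,156)',
        '[20,0]' => '(169,172,154)',
    ),
    'medium.jpg' => array('[0,0]' => '(170,173,154)', '[10,0]' => '(168,168,156)'),
);

// Descending by number of matching cells; uasort() keeps the file-name keys.
uasort($locationColors, function ($a, $b) {
    return count($b) <=> count($a);
});

echo implode(', ', array_keys($locationColors)), PHP_EOL;
// dense.jpg, medium.jpg, sparse.jpg
```

Using uasort() rather than usort() matters here, because the array keys carry the file name and search time needed to fetch the cached result later.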

Mapping matching positions

Remember the microtime-stamped color positions for specific search results that we cached in the local TokyoCabinet/TokyoTyrant storage? We'll use them to draw a series of rectangles around the cell positions where search matches are found before returning a result image to the user. First, we retrieve them:

    $positions = unserialize($this->_localStore->get($pictureSearchResult));
    //$pictureSearchResult consists of the file name and the search time
    //$pictureSearchResult = "{$fileName}|{$searchTime}"

Next, we instantiate a GmagickDraw drawing wand object and use it to add a rectangle for each position retrieved:

    $positionOverlay = new GMagickDraw();
    $positionOverlay->setFillColor('transparent')->setStrokeColor('yellow')
                    ->setStrokeWidth('2');
    foreach ($positions as $position => $color) {
        list($x, $y) = explode(',', rtrim(ltrim($position, '['), ']'));
        $positionOverlay->rectangle($x, $y, $x+CELL_SIZE, $y+CELL_SIZE);
    }
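The coordinate arithmetic behind each rectangle can be checked without Gmagick. This is a minimal sketch: cellRectangle() is a hypothetical helper that converts a position key into the two corner points of the cell, given a cell size.

```php
<?php
// Hypothetical helper: converts a position key such as "[70,160]" into the
// top-left and bottom-right corner coordinates of the matching cell.
function cellRectangle($position, $cellSize)
{
    // trim() with a character list strips both brackets in one call.
    list($x, $y) = array_map('intval', explode(',', trim($position, '[]')));
    return array($x, $y, $x + $cellSize, $y + $cellSize);
}

print_r(cellRectangle('[70,160]', 10));
// Array ( [0] => 70 [1] => 160 [2] => 80 [3] => 170 )
```

Casting the coordinates to integers before adding the cell size keeps the arithmetic explicit instead of relying on PHP's numeric string conversion.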

We then instantiate a Gmagick object, passing the URL of the image to its constructor.

    list($fileName, $searchTime) = explode('|', $pictureSearchResult);
    $gmPicture = new Gmagick(CF_RESOURCE_URL.'/'.$fileName);

This allows us to manipulate the image in all sorts of ways. Of most interest to us is the ability to draw on the image with the GmagickDraw drawing wand created earlier.

    $gmPicture->drawImage($positionOverlay);

Finally, we write the image to a temporary location and return the URL to this location to the front-end.

    $gmPicture->write('tmp_img/'.$pictureSearchResult); // Ensure write permissions on tmp_img
    $picture['url'] = './tmp_img/'.$pictureSearchResult;
    $picture['name'] = $pictureSearchResult;
    return $picture;

The result of this sequence of code, all wrapped in getSearchMatches() of PictureManager, can be seen in these figures:

IMG_emr_result_10

Result of searching RGB (179, 22, 33) with proximity: 10

IMG_emr_result_30

Result of searching RGB (179, 22, 33) with proximity: 30

Conclusion

One of the most interesting outcomes of the availability of public clouds like Amazon AWS is how they make computing cost more granular. We took advantage of this granularity in PictureMe by making the search functionality flexible in terms of the level of precision we would like the color grid to have and the proximity of colors in which to search. We might have been put off by the cost of incorporating such flexibility if we had not been presented with the variable cost model of the public cloud. But with the freedom implicit in cloud pricing, developers are able to provide greater choice to the users of cloud-based applications in terms of power, time and cost.

Cloud computing has the potential to change many aspects of enterprises and software design. Amazon is just one of several popular cloud providers, albeit a very important and innovative one. Ivo Jansch, the author of php|architect's Guide to Enterprise PHP Development, and I will be releasing a book exploring the impact of the cloud from a PHP perspective, covering how enterprises can pair PHP with cloud computing. The book will be out in the first quarter of 2010. In the meantime, if you're hungry for more reading on the cloud, do check out these slides by Ivo.

The complete source code for this article can be downloaded here. PictureMe.v2.tar


