AXE is an image repository application where you can search for images by a text query. It will return all images related to the given query. Here is how I created the project:
I found a useful image dataset on Kaggle. It's a collection of ~30K images from Flickr with descriptions of what's in each picture.
I wrote a few scripts to extract and clean the data for my project's purposes. These scripts can be found in bootstrap/ in the repo. Here's a short description of each one:
- create_compressed_images.py: This script reads all of the image files in the dataset and compresses them. These compressed images will later come in handy when I want to return search results. Having compressed versions means I don't incur a lot of network cost on a client that just wants to see small thumbnails in search results.
- populate_images_db.py: This script creates a database with descriptions about an image, its size on disk and its dimensions. Even though I didn't implement a search using image dimensions in my project, the concept is very similar to other aspects of my search.
- populate_posting_list_db.py: This creates a table called posting_list which maps terms to images they're found in. I use this later to respond to queries efficiently!
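To make the posting list idea concrete, here is a minimal in-memory sketch. It is not the code from populate_posting_list_db.py: the tokenizer, the function names, and the sample descriptions are all assumptions made for illustration, and the real script writes to a database table rather than a dict.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a description and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_posting_list(descriptions):
    """Map each term to the sorted list of image ids whose description contains it."""
    postings = defaultdict(set)
    for image_id, description in descriptions.items():
        for term in tokenize(description):
            postings[term].add(image_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical sample data; the real dataset has ~30K Flickr images.
descriptions = {
    "img_001.jpg": "A man walking a dog on the beach",
    "img_002.jpg": "Two dogs playing in the snow",
    "img_003.jpg": "A man riding a bike",
}
posting_list = build_posting_list(descriptions)
```

With this structure, answering a one-word query is a single lookup instead of a scan over every description. Note that this naive tokenizer treats "dog" and "dogs" as different terms; stemming would be a separate step.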
I decided I was going to work on implementing a "search" function. The search function includes the following:
The project is a web-based application implemented using Django. It provides a few simple pages: 1) a home page where users can input their search terms and criteria, 2) a results page where users interact with search results, 3) a queries page that shows a quick history of query processing times.
Users can provide a text query and a minimum and maximum size (in KB), and AXE will return a (potentially long) list of results. I return results in pages of 16 images so as not to overwhelm the user with too many search results. These 16 items are only the compressed versions of the images, which allows the page to load faster. Users can choose to load and look at the full-quality version of each image result. I also cache the results of each query so the response will be faster the next time someone searches for a similar query.
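The search flow described above (term lookup, size filter, paging, caching) can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the in-memory dicts stand in for the database tables, matching every term (AND semantics) is my assumption, and functools.lru_cache is just one simple way to cache repeated queries.

```python
from functools import lru_cache

PAGE_SIZE = 16  # results per page, as described above

# Hypothetical in-memory stand-ins for the posting_list and images tables.
POSTING_LIST = {
    "dog": ["a.jpg", "b.jpg", "c.jpg"],
    "beach": ["a.jpg", "c.jpg"],
}
IMAGE_SIZE_KB = {"a.jpg": 120, "b.jpg": 800, "c.jpg": 450}


@lru_cache(maxsize=256)
def search(query, min_kb=0, max_kb=float("inf")):
    """Return ids of images that match every query term and fall in the size range."""
    terms = query.lower().split()
    if not terms:
        return ()
    # Intersect the posting lists of all terms (assumed AND semantics).
    matches = set(POSTING_LIST.get(terms[0], []))
    for term in terms[1:]:
        matches &= set(POSTING_LIST.get(term, []))
    # Apply the size filter and return an immutable, sorted result.
    return tuple(sorted(i for i in matches if min_kb <= IMAGE_SIZE_KB[i] <= max_kb))


def get_page(results, page):
    """Return the 1-indexed page of results, PAGE_SIZE items at a time."""
    start = (page - 1) * PAGE_SIZE
    return results[start:start + PAGE_SIZE]
```

Returning a tuple keeps the cached value immutable, so one caller cannot accidentally modify the results that a later caller receives from the cache.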
To provide some visibility into how my implementation improves query response time, I implemented a very simple page that shows a history of query processing times.
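One simple way to collect such a history is to wrap the search function in a timing decorator. The sketch below is an assumption about how this could be done, not a description of how the project does it; QUERY_LOG, timed, and run_query are hypothetical names, and a real application would likely persist the history rather than keep it in a list.

```python
import time
from functools import wraps

QUERY_LOG = []  # hypothetical in-memory history of processed queries


def timed(func):
    """Record how long each call takes, keyed by the query string."""
    @wraps(func)
    def wrapper(query, *args, **kwargs):
        start = time.perf_counter()
        result = func(query, *args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000
        QUERY_LOG.append({"query": query, "elapsed_ms": elapsed_ms})
        return result
    return wrapper


@timed
def run_query(query):
    """Placeholder for the real search; here it only splits the query into terms."""
    return query.lower().split()
```

A queries page would then only need to render the entries in the log, newest first.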
Users can also select any number of their results and download those selected images in the form of a zip archive file.
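Building the archive can be done entirely in memory with Python's standard zipfile module. The helper below is a hedged sketch rather than the project's code; build_zip and the sample file names are hypothetical.

```python
import io
import zipfile


def build_zip(files):
    """Pack a mapping of file name -> bytes into an in-memory zip archive."""
    buffer = io.BytesIO()
    # The default ZIP_STORED mode adds no compression, which is usually fine
    # for JPEG images because they are already compressed.
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


# Hypothetical usage with placeholder bytes instead of real image data.
archive_bytes = build_zip({"a.jpg": b"first image", "b.jpg": b"second image"})
```

In a Django view, bytes like these could be returned in an HttpResponse with content_type="application/zip" and a Content-Disposition header so the browser treats the response as a download.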
We can talk about these items if we have time.
Get the terms related to each image and calculate the similarity between two images based on their descriptions (using a similarity metric like cosine similarity). This would allow the user to search for images similar to a given image.
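As a rough illustration of the idea, cosine similarity over raw term counts could look like the sketch below. This feature is not implemented in the project; the function names and the plain bag-of-words weighting (no TF-IDF, no stemming) are assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter


def term_counts(description):
    """Count how often each lowercase alphanumeric term appears in a description."""
    return Counter(re.findall(r"[a-z0-9]+", description.lower()))


def cosine_similarity(desc_a, desc_b):
    """Cosine of the angle between the term-count vectors of two descriptions."""
    a, b = term_counts(desc_a), term_counts(desc_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Identical descriptions score 1.0 and descriptions with no terms in common score 0.0, so ranking other images by this score against a chosen image would surface the most similar ones first.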
The dataset I used didn't have the location information needed to implement this, but if it did, it would open up a lot of possibilities: search on a map, or search using a location name ("man with a dog in Asia").
Refactor (extract functions, classes), use Django's features to better organize the code (e.g. Django forms, input validation), add error handling, and write tests (?).