Cross-modal retrieval across audio and images

This project aims to retrieve relevant samples across images and audio: given a query in one modality, return the matching samples in the other.

Dataset

Wikipedia Dataset [1]
The collected documents are selected sections from Wikipedia's featured articles collection. This is a continuously growing collection that, at the time of collection (October 2009), had 2,669 articles spread over 29 categories. Some of the categories are very scarce, so only the 10 most populated ones were considered. The articles generally have multiple sections and pictures; they were split into sections based on section headings, and each image was assigned to the section in which the author(s) placed it. The dataset was then pruned to keep only sections that contain a single image and at least 70 words.
The final corpus contains 2,866 multimedia documents. The median text length is 200 words.
You can download the Wikipedia dataset from here.

PASCAL sentences Dataset [2]
This dataset contains 1,000 images from 20 categories, and each image has 5 corresponding sentences that describe it exactly.
The original dataset is posted on this website.
You can download this dataset with rupy's Python program.
Note, however, that rupy's program targets Python 2. The modified program used here targets Python 3 and is run the same way as described for rupy's original.

IAPR TC-12 Dataset [3]
The image collection of the IAPR TC-12 Benchmark consists of 20,000 still natural images taken from locations around the world, comprising an assorted cross-section of contemporary life: pictures of different sports and actions, photographs of people, animals, cities, landscapes, and many other subjects.
Each image is associated with a text caption in up to three languages (English, German, and Spanish). These annotations are stored in a database managed by a benchmark administration system, which allows parameters to be specified so that different subsets of the image collection can be generated.
You can download the IAPR TC-12 dataset from here.

Text-To-Speech

All three datasets contain only images and the corresponding texts. To perform cross-modal retrieval across images and audio, I need to turn the texts into audio. I use Balabolka for this step. Balabolka is a free TTS program that supports batch processing, which is useful for processing huge datasets.
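
Balabolka also ships with a command-line companion, balcon, which makes the batch conversion scriptable. The sketch below shows how such a batch run could look from Python 3; the folder names and the voice name are placeholders, and the balcon flags used here (-f, -w, -n) should be verified against the installed version.

    import subprocess
    from pathlib import Path

    TEXT_DIR = Path("texts")    # placeholder input folder of .txt files
    AUDIO_DIR = Path("audio")   # placeholder output folder
    AUDIO_DIR.mkdir(exist_ok=True)

    for txt in sorted(TEXT_DIR.glob("*.txt")):
        wav = AUDIO_DIR / (txt.stem + ".wav")
        # balcon is the console utility distributed with Balabolka:
        # -f reads text from a file, -w writes a WAV file, -n selects a voice.
        subprocess.run(
            ["balcon", "-f", str(txt), "-w", str(wav), "-n", "Microsoft Zira"],
            check=True,
        )

Naming each WAV after its source TXT keeps the text/audio/image correspondence intact, which matters for the retrieval step later.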

Data Pre-processing

You can download the Python program for pre-processing from here.

Wikipedia Dataset
I extract the first sentence of every article and save it in a TXT file with the same name as the corresponding image.
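
A minimal sketch of this step, assuming the section texts are the XML files listed under /wikipedia_dataset/texts below and that each file shares its base name with its image. No particular tag layout is assumed; all character data is simply concatenated before the split.

    import re
    import xml.etree.ElementTree as ET
    from pathlib import Path

    XML_DIR = Path("wikipedia_dataset/texts")
    OUT_DIR = Path("wikipedia_dataset/audio_text")
    OUT_DIR.mkdir(parents=True, exist_ok=True)

    for xml_file in sorted(XML_DIR.glob("*.xml")):
        # Concatenate all character data regardless of the tag layout.
        text = " ".join(ET.parse(xml_file).getroot().itertext())
        text = re.sub(r"\s+", " ", text).strip()
        # Naive first-sentence split: cut at the first period followed by whitespace.
        first_sentence = re.split(r"(?<=\.)\s", text, maxsplit=1)[0]
        # The TXT reuses the document's base name, which matches the image name.
        (OUT_DIR / (xml_file.stem + ".txt")).write_text(first_sentence, encoding="utf-8")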

PASCAL sentences Dataset
Texts in the PASCAL sentences dataset do not need to be preprocessed, since they are already in .txt files. Balabolka can turn them into audio directly (the batch sketch above applies as-is).

IAPR TC-12 Dataset
I only use the English texts for this project. These are stored in files with the extension .ENG. The ENG file type is primarily associated with dictionary software, but these files can be treated as XML, since they contain tags just like an XML file. I use Python 3 to read each ENG file, parse it as XML, and save the description of the corresponding image in a TXT file with the same name as the ENG file.
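
A minimal sketch of this step. The annotations directory, the lowercase file extension, and the Latin-1 encoding are assumptions about the unpacked benchmark, and DESCRIPTION is the tag expected to hold the English caption; adjust these to match the actual files.

    import re
    import xml.etree.ElementTree as ET
    from pathlib import Path

    ENG_DIR = Path("iaprtc12/annotations")   # assumed location of the ENG files
    OUT_DIR = Path("iaprtc12/text")
    OUT_DIR.mkdir(parents=True, exist_ok=True)

    for eng_file in sorted(ENG_DIR.rglob("*.eng")):
        raw = eng_file.read_text(encoding="latin-1")   # encoding is an assumption
        # Escape bare ampersands, which otherwise break strict XML parsing.
        root = ET.fromstring(re.sub(r"&(?![a-zA-Z]+;|#\d+;)", "&amp;", raw))
        desc = (root.findtext(".//DESCRIPTION") or "").strip()
        (OUT_DIR / (eng_file.stem + ".txt")).write_text(desc, encoding="utf-8")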

Dataset after pre-processing and TTS

The following datasets contain audio, texts, and images. You can download them from Google Drive.

Wikipedia Dataset
Text (XML) location: /wikipedia_dataset/texts
Text (TXT) location: /wikipedia_dataset/audio_text
Audio location: /wikipedia_dataset/audio
Image location: /wikipedia_dataset/images

PASCAL sentences Dataset
Text location: /PascalSentenceDataset/sentence/(each category)
Audio location: /PascalSentenceDataset/sentence/(each category)/(each directory)
Image location: /PascalSentenceDataset/dataset

IAPR TC-12 Dataset
Text location: /iaprtc12/text
Audio location: /iaprtc12/audio
Image location: /iaprtc12/images
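
After TTS, a quick sanity check is to intersect the base names in the three folders and confirm that every image has a matching text and audio file. The sketch below uses the IAPR TC-12 layout above; the file extensions are assumptions about what the download and TTS steps produced.

    from pathlib import Path

    ROOT = Path("iaprtc12")   # the same check applies to the other two datasets

    images = {p.stem for p in (ROOT / "images").rglob("*.jpg")}
    texts = {p.stem for p in (ROOT / "text").glob("*.txt")}
    audio = {p.stem for p in (ROOT / "audio").glob("*.wav")}

    complete = images & texts & audio
    print(len(complete), "complete image/text/audio triples")
    print(len(images - complete), "images missing a text or audio counterpart")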

References:

[1] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. G. Lanckriet, R. Levy, and N. Vasconcelos, "A New Approach to Cross-Modal Multimedia Retrieval," in ACM International Conference on Multimedia (MM), 2010.
[2] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, "Collecting Image Annotations Using Amazon's Mechanical Turk," in NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, 2010.
[3] M. Grubinger, P. Clough, H. Müller, and T. Deselaers, "The IAPR TC-12 Benchmark: A New Evaluation Resource for Visual Information Systems," in International Conference on Language Resources and Evaluation (LREC), Genoa, Italy, 2006.