Automatic Extraction of Keywords for a Multimedia Search Engine Using the Chi-Square Test

We present a method for automatically acquiring a set of keywords that characterise a large multimedia collection. Our method compares the captions associated with pictures in the collection against a model of general English. Words that deviate from the model are highly specific to the captions and thus make appropriate keywords. Professional annotators evaluated our results and concluded that more than 97% of our top 2,000 one-word keywords were truly descriptive of the collection. We also mined the collection’s query logs and extracted keywords that reflect the most important indexing terms from the users’ perspective. Our method offers a strategy for selecting the keywords that make up the indices of multimedia search engines.
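
As an illustration of the comparison described above, the following is a minimal sketch of chi-square keyword scoring. It assumes a Pearson chi-square statistic computed on a 2×2 contingency table per word (occurrences of the word versus all other tokens, in the captions versus a reference corpus standing in for the general-English model). The function name `chi_square_keywords`, its parameters, and the toy token lists are hypothetical and chosen for illustration only; the exact formulation used in this work may differ.

```python
from collections import Counter


def chi_square_keywords(caption_tokens, reference_tokens, top_n=10):
    """Rank words over-represented in captions by a 2x2 chi-square score.

    Assumption: Pearson chi-square on a per-word contingency table; this is
    an illustrative sketch, not necessarily the paper's exact procedure.
    """
    cap = Counter(caption_tokens)
    ref = Counter(reference_tokens)
    n_cap = sum(cap.values())   # total tokens in the captions
    n_ref = sum(ref.values())   # total tokens in the reference corpus
    n = n_cap + n_ref

    scores = {}
    for word, a in cap.items():
        b = ref.get(word, 0)    # occurrences of the word in the reference
        c = n_cap - a           # other tokens in the captions
        d = n_ref - b           # other tokens in the reference
        # Keep only words that are relatively more frequent in the captions.
        if a * n_ref <= b * n_cap:
            continue
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom == 0:
            continue
        scores[word] = n * (a * d - b * c) ** 2 / denom

    # Highest-scoring words deviate most from the reference model.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Toy data (hypothetical): captions versus a tiny "general English" sample.
captions = "castle tower castle bridge the a castle tower".split()
reference = "the a of the a and the of a the".split()
ranked = chi_square_keywords(captions, reference)
```

In this toy run, content words such as "castle" rank highest, while function words such as "the" are discarded because they are no more frequent in the captions than in the reference sample. With realistic corpora, a minimum-frequency threshold would typically be needed, since the chi-square approximation is unreliable for very rare words.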

