OntoImage 2006
Workshop on Language Resources for Content-based Image Retrieval
held during LREC 2006

Final Programme
Monday 22 May 2006
Magazzini del Cotone Conference Center, Genoa, Italy

14:30 – 14:45  Introduction (G. Grefenstette)

14:45 – 16:30  First oral session (20 min per speaker + 5 min for questions)
14:45 – 15:10  Allan Hanbury: Analysis of Keywords used in Image Understanding Tasks
15:10 – 15:35  Katerina Pastra: Image-Language Association: are we looking at the right features?
15:35 – 16:00  Christophe Millet, Gregory Grefenstette, Isabelle Bloch, Pierre-Alain Moellic, Patrick Hede: Automatically populating an image ontology and semantic color filtering
16:00 – 16:25  Mark Sanderson, Jian Tian, Paul Clough: Testing an automatic organisation of retrieved images into a hierarchy

16:30 – 17:00  Tea / coffee break

17:00 – 18:40  Second oral session (20 min per speaker + 5 min for questions)
17:00 – 17:25  Thierry Declerck, Manuel Alcantara: Semantic Analysis of Text Regions Surrounding Images in Web Documents
17:25 – 17:50  Diego Burgos, Leo Wanner: Using CBIR for Multilingual Terminology Glossary Compilation and Cross-Language Image Indexing
17:50 – 18:15  Michael Grubinger, Paul Clough, Henning Müller, Thomas Deselaers: The IAPR TC-12 Benchmark: A New Evaluation Resource for Visual Information Systems
18:15 – 18:40  Judith L. Klavans: CLiMB: Computational Linguistics for Metadata Building

18:40 – 19:00  Closing discussion (G. Grefenstette)

Workshop Organisers and Program Committee
Gregory Grefenstette, CEA LIST, France
Mark Sanderson, University of Sheffield, UK
Françoise Preteux, INT, France

Table of Contents
Diego Burgos, Leo Wanner: "Using CBIR for Multilingual Terminology Glossary Compilation and Cross-Language Image Indexing" — 5
Thierry Declerck, Manuel Alcantara: "Semantic Analysis of Text Regions Surrounding Images in Web Documents" — 9
Michael Grubinger, Paul Clough, Henning Müller, Thomas Deselaers: "The IAPR TC-12 Benchmark: A New Evaluation Resource for Visual Information Systems" — 13
Allan Hanbury: "Analysis of Keywords used in Image Understanding Tasks" — 24
Judith L. Klavans: "CLiMB: Computational Linguistics for Metadata Building" (final manuscript not received in time for publication; extended abstract) — 51
Christophe Millet, Gregory Grefenstette, Isabelle Bloch, Pierre-Alain Moellic, Patrick Hede: "Automatically populating an image ontology and semantic color filtering" — 34
Katerina Pastra: "Image-Language Association: are we looking at the right features?" — 40
Mark Sanderson, Jian Tian, Paul Clough: "Testing an automatic organisation of retrieved images into a hierarchy" — 44
Using CBIR for Multilingual Terminological Glossary Compilation and Cross-Language Image Indexing

Diego Burgos¹, Leo Wanner²

¹ Iulaterm Group, Institut Universitari de Lingüística Aplicada (IULA), Universitat Pompeu Fabra, La Rambla 30-32, 08002 Barcelona
[email protected]

² Institució Catalana de Recerca i Estudis Avançats (ICREA), Departament de Tecnologia, Universitat Pompeu Fabra, Passeig de Circumval·lació, 8, 08003 Barcelona
[email protected]

Abstract
In this paper, several strategies for cross-language image indexing and terminological glossary compilation are presented. The process starts from a source language indexed image. CBIR is proposed as a means to find similar images in target language documents on the web. The text surrounding the matched target image is chunked, and the chunks are classified into concrete and abstract nouns by means of a discriminant analysis. The number of images retrieved by each chunk and the edit distance between each chunk and each image file name are taken as differentiating variables; a 74.4% rate of correctly classified labeled examples shows the adequacy of these variables. Nouns classified as concrete are used to retrieve images from the web, and each retrieved image is compared with the image in the target document. When a positive matching occurs, the chunk used to retrieve the matched image is assigned as the index for the image in the target document and as the target language equivalent for the source image index. As the experiments are carried out in specialized domains, a systematic and recursive application of the approach is used to build terminological glossaries by storing images with their respective cross-language indices.

1. Introduction
Images (and, therefore, also Content-Based Image Retrieval, CBIR) play a primary role in specialized discourse. However, for an integral application of CBIR, comprehensive indexed image DBs and, as a consequence, comprehensive lists of suitable index terms are required. The availability of such lists and of the material to index is language dependent. For instance, considerably more resources are available for English than for Spanish. A study carried out by Burgos (forthcoming) with bilingual Spanish-English terminological dictionaries revealed that the average number of Spanish documents retrieved per term from the web was dramatically lower (7,860) than the average number of retrieved English documents (246,575). Obviously, one explanation is that the web search space for English is much larger than the search space for Spanish. However, another explanation is that Spanish terms found in traditional terminological dictionaries are not suitable for indexing, since they occur with a low frequency in the corpus. More suitable index terms must be looked for!

In the present work, CBIR is proposed as a means for multilingual terminology retrieval from the web for the purpose of compiling a multilingual glossary and building up an image index. All experiments have so far been done for English and Spanish.

2. Related Research
One of the major goals of CBIR is image indexing, which aims at providing images with indices that describe objects clearly differentiated in the images; cf., for instance, parts of an engine. Some relevant work in this area has been done with respect to the segmentation of image regions that roughly correspond to objects (Carson et al., 2002; Barnard et al., 2003). Segmentation helps reduce the semantic gap (Chen et al., 2003; Tsai, 2003). Approaches that apply image retrieval directly to the web (Chang et al., 1997; Chen et al., 1999; Shen et al., 2000) are especially interesting to us, since the present study is also carried out on the web. Moreover, these approaches propose HTML code as an anchor to capture the semantics of images, which could be used to build additional variables to improve the performance of the classification method proposed here. Indexing strategies have mainly been applied to general image collections. Yeh et al. (2004) report on a proposal to retrieve images of tourist sites from the web with a mobile phone whose end goal is similar to ours. The use of terminology for indexing specialized domain images in a bilingual or multilingual setting has not been discussed in previous literature.
3. BC Hypothesis
We assume a language independent bimodal co-occurrence (BC) of images and their index terms in the corpus. This implies that (i) if a well chosen image index term occurs in a document of our corpus, it is likely that the corresponding image will also be available in the same document, and, vice versa, (ii) if an image occurs in a document of the corpus, the corresponding index term will also occur; see Figure 1.

Figure 1: Representation of the BC-hypothesis (a shared referent is linked, in two languages L1 and L2, to an image and its respective index terms, Index 1 and Index 2).
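The BC-hypothesis lends itself to a simple automatic check: a web page confirms it for a term if the term occurs in the page text and the page also embeds at least one image. The following is a minimal sketch of such a check (ours, not the authors' tool; the parsing library and filtering rules are assumptions):

# Minimal sketch of a BC-hypothesis check: does a page contain both the
# index term and at least one image? Case folding and the "any image
# counts" rule are our assumptions, not the paper's procedure.
import urllib.request
from bs4 import BeautifulSoup

def bc_cooccurrence(url: str, term: str) -> bool:
    html = urllib.request.urlopen(url, timeout=10).read()
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ").lower()
    return term.lower() in text and len(soup.find_all("img")) > 0

# Usage (hypothetical URL): counting pages where a term co-occurs with
# an image would reproduce a study like the 19-out-of-20-terms test below.
# bc_cooccurrence("http://example.com/alternators.html", "drive pulley")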
Preliminary empirical studies (carried out initially for English) buttress this BC-assumption. A total of 20 terms designating concrete entities by noun phrases (see Quirk et al., 1985: 247, or Bosque, 1999: 8-28, 45-51, with respect to the interpretation of the concept 'concrete noun') from the automotive engineering field were extracted from a recent issue of the Tech Briefs section of the Automotive Engineering International Online journal (http://www.sae.org/automag/, state January 2006) and used to retrieve documents from the web. The 20 terms included in this first sample were multiword expressions (MWEs), basically noun phrases (NPs) of at least two tokens, whose referents were physical objects, i.e., spare parts or concrete devices belonging or related to the automotive engineering domain. The concrete nature of the terms' referents was confirmed by the definition or by the target language equivalent of each NP provided by a specialized dictionary. When the complete NP was not documented, the last modifier was removed and the remaining NP was searched again, and so on until it was found in the dictionary. For example, supercharger drive pulley was not found as is in the Routledge English Technical Dictionary, but drive pulley was. The intuition and knowledge of a Spanish native speaker (as in the case of the first author) was considered enough to determine that polea conductora designates an object. When the intuition did not suffice, the definition prevailed over the equivalent:

Pulley: A wheel-shaped, belt-driven device used to drive engine accessories. (Definition taken from http://www.autoglossary.com/.)

The BC-hypothesis was confirmed for 19 terms in 223 visited web sites, i.e., each of the 19 terms co-occurred in a document along with its respective image. The remaining term – volume production engine – did not confirm our hypothesis, since it did not designate a concrete entity, as it initially seemed, but a general concept referring to the mass production of an engine. Certainly, the head noun engine would have confirmed the BC, but as its premodification makes it that general, it did not co-occur with any image. It was also observed that certain nouns that designate a group of constituents must be excluded from the study – although they could be considered concrete nouns. The entities designated by words such as engine or system tend to be so general that their boundaries often cannot be clearly determined or their appearance cannot be accurately predicted.

Furthermore, in a bilingual setting, we assume that if, in the source language corpus, an image of an object is available along with its index term, an image of the same object along with a term that denotes it (and that can thus serve as its index term) will be available as well in the target language. That is, in order to identify in the target language corpus the equivalent translation term of the source index term, we must (1) recognize that the two images represent the same object; (2) retrieve the term denoting this object in the target language corpus; see again Figure 1.

In order to prove the bilingual BC-hypothesis, a number of comparable (i.e., from the same domain) English and Spanish web sites in the automotive engineering field were collected, and index terms and images were manually matched. Table 1 shows an example of two manually matched images taken from websites in two different languages, which also serves to illustrate how cross-language equivalences between index terms can be established.

        | Source (English)         | Target (Spanish)
Image   | [photo of the part]      | [photo of the same part]
Index   | Slip-Ring FD 3G 26.9 mm  | Colector Ford 3G

Table 1: The BC-hypothesis for indexing in a bilingual setting.

We prototypically implemented the above proposal and ran some preliminary experiments, described below.

4. CBIR-Based Image Indexing
For CBIR-based image indexing, we start from a source language indexed image. An internet segment in the target language is delimited as a corpus (= search space), and the images in this corpus are compared with the source language image using Imatch, a commercial software package with an embedded CBIR-module (an evaluation version can be downloaded from http://www.photools.com/). When a positive image matching occurs according to a given threshold, the target language document containing the matched image is marked as a potential target index term location. Given that more noise results from a large search space, the size of the image database is usually one of the major concerns in CBIR applications. In our work, we observed that the first problem to tackle is the appropriate definition of the web segment that will constitute the search space. The size and quality of the image DB will depend on this definition. Uniformity is more likely, for example, within the photographs of the same site than between the images of two or more sites. Likewise, there will be greater variance of image characteristics between the images of two different domains than within the images of the same domain, and so on.

Our proposal relies to a great extent upon the performance of CBIR techniques for image matching automation. As a dedicated CBIR-module has not yet been developed for this study, the CBIR-module settings of Imatch had to be adjusted in order to obtain good results. The most complex Imatch algorithm performs image matching based on color, texture and shape information contained in images. Current results were achieved using this algorithm. The observations made so far with respect to the matching of images on the web suggest that other CBIR alternatives must be considered, but also that positive matches in rather homogeneous search spaces provided enough target index term locations to pursue index candidate selection.
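Imatch's matching algorithm is proprietary; purely as a hedged stand-in for the thresholded matching step just described, the sketch below compares two images by normalized color-histogram intersection (Pillow, the 64x64 downscaling and the 0.7 threshold are our assumptions, not the paper's settings):

# Sketch of thresholded image matching, standing in for the commercial
# CBIR module: normalized RGB-histogram intersection on downscaled images.
from PIL import Image

def histogram(path: str) -> list:
    img = Image.open(path).convert("RGB").resize((64, 64))
    counts = img.histogram()            # 3 x 256 bins (R, G, B)
    total = sum(counts)
    return [c / total for c in counts]

def match(path_a: str, path_b: str, threshold: float = 0.7) -> bool:
    ha, hb = histogram(path_a), histogram(path_b)
    # Histogram intersection: 1.0 for identical color distributions.
    similarity = sum(min(a, b) for a, b in zip(ha, hb))
    return similarity >= threshold

# A "positive image matching" would mark the target document as a
# potential index term location, as described above.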
4.1. Index Candidate Selection
Once the indexing context (monolingual or bilingual) has been determined and the document has been located, index candidate selection is carried out, ignoring abstract nouns in the text surrounding the image. Certainly, there are some ideal web layouts where the only surrounding text within reasonable boundaries is the name of the image's object, that is, the index. In this case, a rather simple algorithm could extract the index. However, often considerable amounts of text must be parsed, and concrete and abstract nouns must be disambiguated. In our study, this issue is addressed as a classification problem where a set of NPs must be classified as concrete or abstract (the experiments in this stage have so far been done for English). NPs classified as concrete make up the list of potential indices for the relevant image, from which an index will be chosen by the index-image alignment process described below. The process of index candidate selection from the surrounding text consists of four phases: 1) surrounding text chunking, 2) chunk cleaning, 3) definition of variables for classification, and 4) classification.

To distinguish NPs from VPs and other phrases, a chunker is used. Once all NPs have been chunked and extracted, some cleaning is done in order to prevent problems in the next phase of variable definition. The cleaning consists, first of all, in removing determiners at the beginning of the phrase; lemmatization (if appropriate); discarding NPs whose head noun (HN) is an acronym (NPs with acronyms as HN are not included at this stage of the work, since they often do not reveal whether they designate concrete or abstract entities – which could hinder further validation); splitting Saxon possessives; and deleting proper nouns and numbers. Consider two examples:

three development objectives ⇒ development objective
FSE's single direct injector ⇒ single direct injector

Obviously, some of the elements removed in the cleaning phase could be important for other purposes. However, for image indexing, their removal proved to be beneficial.

Since concrete nouns do not present significant syntactic differences in comparison with abstract nouns, it is difficult to find linguistic variables that would be discriminatory enough to distinguish the two types in the output provided by the chunker. Two alternative variables were analyzed: a) the number of images retrieved from the web by each NP, and b) the edit distance between the NP and the image file name (see below). It would be of great relevance to know whether the property of being concrete or abstract could statistically differentiate the association or proximity of an NP to images in an index like the one maintained by Google. The first evaluations showed that sometimes even concrete nouns retrieved very general images, and abstract nouns retrieved a good number of images too! As a consequence, a second variable was measured, with the underlying assumption that if the text surrounding an image does not contain the NP that led to the retrieval of the image, it is the image file name that should most closely designate the image's object and, therefore, serve as an indicator of a concrete noun – provided that this name is not a simple number. Then, if the image file name matches the NP, this increases the probability that the NP designates a concrete entity.

For the statistical analysis, 100 concrete nouns and 100 abstract nouns were selected according to the criteria mentioned in Section 3. For each of the selected NPs, one modifier was kept in order to (i) avoid outliers in the values of the retrieved image frequency, (ii) assure a minimum of domain specificity in the image search, and (iii) be coherent with the assumed average length of image file names. Thus, in the case of, e.g., the NP powder-metal connecting rod, instead of searching for images with the full NP (which would lead to the retrieval of 5 images), the search is performed with the shortened NP connecting rod (i.e., the first modifier powder-metal is removed). This leads to the retrieval of 5,940 images, instead of the 923,000 images retrieved with the bare head of the NP, rod.

To measure the string distance between an NP and an image file name, the Levenshtein edit distance was used. The edit distance can be described as the minimum number of steps (substitutions, insertions or deletions) necessary to convert one word into another. The edit distance is 1 when transformations are needed and 0 when no transformations are necessary. To analyze continuous values for this variable, the relative edit distance (RD = number of transformation steps / possible maximum number of transformations) was used to obtain values between 0 and 1. Negative values are assigned when the image file name is longer than the NP.
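The RD formula leaves the denominator ("possible maximum transformations") implicit; taking it as the length of the longer string reproduces the stanley rear axle row of Table 2 below, so this hedged sketch assumes that reading (the other rows involve the paper's additional substring/character-overlap scoring rules and are not reproduced exactly):

# Sketch of the relative edit distance (RD). The paper defines
# RD = transformation steps / possible maximum transformations and flips
# the sign when the file name is longer than the NP; reading the
# denominator as the longer string's length matches Table 2's
# "stanley rear axle" row (8 / 17 = 0.4706), so we assume that here.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def relative_edit_distance(np_chunk: str, file_name: str) -> float:
    steps = levenshtein(np_chunk, file_name)
    rd = steps / max(len(np_chunk), len(file_name))
    return -rd if len(file_name) > len(np_chunk) else rd

print(relative_edit_distance("rear axle", "stanley rear axle"))  # -0.4705...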
NP              | Image file name    | Relative edit distance
rear axle       | rear axle          | 0
ignition coil   | sparky             | 1
rear axle       | stanley rear axle  | -0.470588235
throttle valve  | throttle           | 0
oil pan         | oilpan             | 0.166666667
selector lever  | image              | 0.8

Table 2: Some examples of the relative edit distance.

Table 2 shows examples of the relative edit distance for some specific cases. Image file names were also cleaned, so that underscores, numbers or symbols did not interfere in the measurement. As can be noticed, spaces also count. If the file name is a substring of the NP, it is marked as a positive matching; if the file name contains at least one of the NP's characters, a positive score, although not the lowest, is also given. Each NP was compared with a maximum of 20 image file names, and a mean relative distance was established for each NP.

The tests of equality of group means proved a significant difference between the two measured variables, that is, image frequency and relative edit distance. 74.4% of the originally grouped cases were correctly classified. A detailed analysis of the results shows that there is a bigger variance within the values of concrete nouns than within those of abstract nouns. This suggests that another variable, maybe a linguistic one, might help improve the percentage of correctly classified cases of concrete nouns. When a concrete noun is detected, it recovers the modifiers that had been removed for the image retrieval phase, and the complete NP is used in the index-image alignment stage.
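The classification itself is a two-variable discriminant analysis; a hedged sketch with scikit-learn's LinearDiscriminantAnalysis (our tooling choice, with toy feature values standing in for the 100 + 100 labeled NPs) looks as follows:

# Sketch of the concrete/abstract discriminant analysis over the two
# variables described above: web image frequency and mean relative edit
# distance. scikit-learn and the toy data are our assumptions; the paper
# reports 74.4% correctly classified cases on its 200 labeled NPs.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Rows: [retrieved image count, mean relative edit distance]
X = np.array([[5940, 0.10], [3200, 0.25], [923000, 0.85], [410000, 0.90]])
y = np.array([1, 1, 0, 0])          # 1 = concrete, 0 = abstract

clf = LinearDiscriminantAnalysis().fit(X, y)
print(clf.predict([[4800, 0.15]]))  # -> [1]: classified as concrete
print(clf.score(X, y))              # training accuracy on the toy data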
4.2. Index-Image Alignment
In the previous section, a rather simplistic strategy was described to detect concrete nouns in the text surrounding an image in order to use them as index candidates for the image. The indexing process can be simplified if the image file name matches one of the detected concrete nouns. For cases where such a match does not occur, the following procedure is proposed.

For target image indexing, i.e., image-index association, each NP classified as concrete is used to query Google for images. Each of the first 20 retrieved images is compared with the image to be indexed. When a positive image matching occurs, the original image is indexed with the NP that was used to retrieve the matching image from the web. Table 3 illustrates this procedure with an example. In the example, the images retrieved by steering wheel and air filter did not match the original image, but one of the images retrieved by cylinder head did. Therefore the original image is indexed as cylinder head.

NP             | Google images               | Matching (+/–) | New index
steering wheel | [image 1], [image 2]        | –, –           | –
cylinder head  | [image 3], [image 4], [image 5] | –, +, –    | cylinder head
air filter     | [image 6], [image 7]        | –, –           | –

Table 3: Monolingual image-index alignment procedure (in the printed table, thumbnails of the retrieved images are compared against the original image; the '+' marks the positive matching).

The technique shows that image indices can be assigned taking usage, specificity and geographical variants into account. Indexing the image with a term retrieved from its context assures that the index term is actually in use. Moreover, this technique tries to retrieve the appropriate degree of specificity that the index of a specific domain image is expected to present – which is often determined by the number of HN-modifiers of MWEs. Likewise, even for specialized discourse, indices should respond to geographical variants. This aspect can be controlled by specifying country domains.
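Put together, the alignment stage is a loop over candidate NPs and retrieved images. In this hedged sketch, image_search is a hypothetical placeholder for a web image query (the paper uses Google), and match is the thresholded comparison sketched in Section 4 above:

# Sketch of the index-image alignment loop. image_search() is a
# hypothetical stand-in for querying a web image index; match() is the
# thresholded CBIR comparison sketched earlier.
def image_search(query: str, limit: int = 20) -> list:
    """Return up to `limit` local paths of downloaded result images."""
    raise NotImplementedError  # stand-in for a real image search API

def align_index(original_image: str, concrete_nps: list):
    for np_chunk in concrete_nps:
        for candidate in image_search(np_chunk, limit=20):
            if match(original_image, candidate):
                return np_chunk          # becomes the new index
    return None                          # no positive matching found

# align_index("original.jpg", ["steering wheel", "cylinder head", "air filter"])
# -> "cylinder head" in the example of Table 3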
5. Future Work
Given that not all process stages of the proposal presented in this paper have been completely integrated and automated, an overall evaluation has not been possible so far. Future work aims at implementing specific CBIR algorithms to be applied in specialized domains and integrated into modules for index candidate selection and index-image alignment. The goal is to be able to compile multilingual specialized glossaries after systematic and recursive exploration of well delimited web segments and storage of images with their respective cross-language indices. Likewise, further variables to improve the discrimination between concrete and abstract nouns will be researched. Even if group-specific linguistic features are hard to find, they are not completely discarded. Finally, further experiments will be carried out in domains other than automotive engineering.

6. Acknowledgements
This study is part of a wider research work being carried out by the first author within the framework of his PhD thesis at the IULA, Universitat Pompeu Fabra, Barcelona, Spain. It is supported by a grant from the Government of Catalonia according to resolution UNI/772/2003 of the Departament d'Universitats, Recerca i Societat de la Informació, dated March 10th, 2003.

7. References
Barnard, K., Duygulu, P., Forsyth, D., de Freitas, N., Blei, D., Jordan, M. (2003). Matching Words and Pictures. Journal of Machine Learning Research, 3, pp. 1107-1135.
Bosque, I. (1999). El nombre común. In Bosque, I., Demonte, V. (eds.), Gramática descriptiva de la lengua castellana. Madrid: Espasa Calpe, pp. 3-75.
Burgos, D. (forthcoming). Concept and Usage-Based Approach for Highly Specialized Technical Term Translation. In Gotti, M., Sarcevic, S. (eds.), Insights into Specialized Translation. Bern: Peter Lang.
Carson, C., Belongie, S., Greenspan, H., Malik, J. (2002). Blobworld: Image Segmentation Using Expectation-Maximisation and its Application to Image Querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8), pp. 1026-1038.
Chang, S., Smith, J. R., Beigi, M., Benitez, A. (1997). Visual Information Retrieval from Large Distributed Online Repositories. Communications of the ACM, 40(12), pp. 63-71.
Chen, F., Gargi, U., Niles, L., Schutze, H. (1999). Multi-Modal Browsing of Images in Web Documents. In Document Recognition and Retrieval VI, Proceedings of SPIE 3651, pp. 122-133.
Chen, Y., Wang, J., Krovetz, R. (2003). CLUE: Cluster-Based Retrieval of Images by Unsupervised Learning. IEEE Transactions on Image Processing, 14(8), pp. 1187-1201.
Quirk, R., Greenbaum, S., Leech, G., Svartvik, J. (1985). A Comprehensive Grammar of the English Language. London: Longman.
Routledge English Technical Dictionary (1998). Copenhagen: Routledge.
Shen, H.T., Ooi, B.C., Tan, K.L. (2000). Giving Meanings to WWW Images. In Proceedings of the 8th ACM International Conference on Multimedia, 30 October - 3 November 2000, Los Angeles, pp. 39-48.
Tsai, C. (2003). Stacked Generalisation: a Novel Solution to Bridge the Semantic Gap for Content-Based Image Retrieval. Online Information Review, 27(6), pp. 442-445.
Yeh, T., Tollmar, K., Darrell, T. (2004). Searching the Web with Mobile Images for Location Recognition. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'04), Vol. 2, pp. 76-81.


Semantic Analysis of Text Regions Surrounding Images in Web Documents

Thierry Declerck, Manuel Alcantara

DFKI GmbH, Language Technology Lab, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, Germany
[email protected]

Abstract
In this paper we present some ongoing work and ideas on how to relate text-based semantics to images in web documents. We suggest applying different levels of Natural Language Processing (NLP) to textual documents and speech transcripts associated with images, in order to provide structured linguistic information that can be merged with available domain knowledge and thus generate additional semantic metadata for the images. An issue to be specifically addressed in the near future concerns automating the detection of text/speech transcripts relevant for a certain image (or video sequence). Beyond the time code approach, with its shortcomings, we expect the discussion at this workshop on lexical characteristics of the language that can or should be used to describe image content to improve the approaches we are dealing with at present.

1. Introduction
We started our work within a past European project, Esperonto. The Esperonto project dealt with annotation services for bridging the gap between the actual (HTML-based) Web and the emerging Semantic Web. A smaller task of the project was dedicated to investigating how to automatically provide semantic annotation for images present in a web page. A possible strategy we investigated was to provide ontology-driven semantic annotation of the text surrounding an image in a web page.

This work is being continued and extended within a recently started Network of Excellence, called K-Space (Knowledge Space of Semantic Inference for Automatic Annotation and Retrieval of Multimedia Content, http://kspace.qmul.net/), in which some labs are specifically dedicated to the contribution of Human Language Technology (HLT) to the semantic indexing (and possibly retrieval) of multimedia content. K-Space, which will be described in more detail in this paper, offers a more integrated approach to multimedia semantics, aiming at a formal integration of low-level features extracted from multimedia material on the basis of state-of-the-art audio-video analysis, and high-level features resulting from text analysis coupled with Semantic Web technologies.

2. Background: Multimedia Semantics
The topic of Multimedia Semantics has gained a lot of interest in recent years, and large funding agencies have issued calls for R&D proposals on these topics. For example, a recent call of the European Commission, the 4th call of the 6th Framework Programme, was dedicated to merging the results gained from R&D projects on knowledge representation and cross-media content, the goal being to make the (semantic) descriptions of multimedia content re-usable on the basis of a higher interoperability of media resources, which have so far been described mainly at the level of XML syntax, as can be seen with the MPEG-7 standard for encoding and describing multimedia content.

In line with the recent developments in the field of Semantic Web technologies, one approach consists in looking at ways of encoding so-called low-level features, as they can be extracted from audio-video material, into a high-level feature organization such as one typically finds in a (domain) ontology. The EU co-funded project aceMedia offers a very good example of such an approach. In this project, ontologies, which typically describe knowledge as expressed in words, are extended in order to include the low-level visual features resulting from state-of-the-art audio-video analysis systems. For the description of low-level features, the project uses the MPEG-7 standard as its background, and proposes links from the MPEG-7 descriptors to high-level (domain) ontologies (see Athanasiadis et al., 2005). So in a sense no full integration is proposed here, but a linkage between the MPEG-7 description scheme and ontologies represented in a Semantic Web language, and interoperability of descriptions of audio-video material is indirectly realized.

Another closely related approach (see the papers by Jane Hunter) proposes a reformulation of the semantic metadata of MPEG-7 descriptors in a machine-understandable language (the MPEG-7 Description Schema being only a machine-readable language), using RDFS or OWL. This step ensures a better interoperability of semantic multimedia descriptions. But here the cross-media aspect is missing, since no textual analysis and/or speech transcripts are taken into account.
A new initiative, the K-Space European Network of Excellence, has started recently. This project deals with semantic inference for semi-automatic annotation and retrieval of multimedia content. The aim is to narrow the gap between the content descriptors that can be computed automatically by current machines and algorithms, and the richness and subjectivity of semantics in high-level human interpretations of audiovisual media: the so-called Semantic Gap.

The project deals with a real integration of knowledge structures in ontologies and low-level descriptors for audio-video content, also taking into account knowledge that can be extracted from sources that are complementary to the audio/video stream, mainly speech transcripts and text surrounding images, or textual metadata describing a video or images. The integration takes place at two levels: the level of knowledge representation, where features associated with the various modalities (image, text/speech transcripts, audio) should be interrelated within conceptual classes in ontologies (from domain-specific to general purpose ontologies), and the level of processing, where high-level semantic features should be integrated to guide (and so possibly improve) the automatic analysis of audio-video material and the corresponding extraction of semantic features.

As such, the K-Space activities are mostly dedicated to the analysis of multimedia and cross-media data and to feature extraction from such data. Navigation, search and retrieval in the field of semantic cross-media archives are not primarily addressed.

An interesting project with respect to K-Space is MESH, which seems to build an application scenario on the basis of the multimedia and cross-media knowledge structures discussed and proposed by K-Space and aceMedia.
The domain of application is the news domain. The project will deal with the ontology-driven semantic integration of content features extracted from video, images, speech transcripts and text. Multi- and cross-media reasoning is an important issue here, ensuring consistency and non-redundancy of the integrated cross-media features. A major issue will consist in proposing an appropriate syndication of the semantically encoded material for distribution to distinct (mobile) end-user hardware, also taking personalization aspects into consideration, thus supporting the distribution of relevant multi- and cross-media content.

The 2006 edition of TRECVid offers an interesting development, since one of its tasks addresses searching within a multimedia database, where interaction with the user is also foreseen. We can expect here that the user will input his/her queries in natural language, and that the use of certain lexical items should guide the intelligent search in large archives containing cross-media material.

3. An Integrative Approach in the K-Space Network of Excellence
The projects mentioned above (and some others, not listed here for reasons of space) give us important information about methodologies and technologies for the "ontologization" of low-level audio-video features extracted from multimedia content. Here we describe in some more detail the K-Space project and the activities related to the use and analysis of sources complementary to audio-video material. First we describe the foreseen ontology infrastructure, which will provide the basis for the integration of low-, mid- and high-level features extracted from audio-video material and associated text/speech transcripts.

3.1. Development of a Multimedia Ontology Infrastructure
The multimedia ontology infrastructure of K-Space will contain qualitative attributes of the semantic objects that can be detected in the multimedia material (e.g. color homogeneity), of the multimedia processing methods (e.g. color clustering), and of the numerical data or low-level features (e.g. color models). The ontology infrastructure will also contain the representation of the top-level structure of multimedia documents in order to facilitate a full-scale annotation of multimedia documents. R&D work will be dedicated to the specification and development of a multimedia content ontology supporting the representation of the structure of the content of multimedia documents. Work will also be dedicated to research on ontologies for low-level visual features, concentrating on a model of the concepts and properties that describe visual features of objects, especially the visualizations of still images and videos in terms of low-level features and media structure descriptions. Also, a prototype knowledge base will be designed to enable automatic object recognition in images and video sequences. Prototype instances will be assigned to classes and properties of the domain-specific ontologies, containing the low-level features required for object identification.

Partners of K-Space dealing with textual analysis will integrate into this ontology infrastructure the typical features for text analysis, also proposing ontology classes at a higher level that support the modeling of interrelated cross-media features (multimedia and text). We will base our work on the proposal made by Buitelaar et al. (2005).

3.2. Use of Textual Information and Knowledge Bases for Semantic Feature Extraction from the Audio Signal
In K-Space, some work will be dedicated to extending state-of-the-art processing and analysis algorithms to handle high-level, conceptual representations of knowledge embedded in audio content, based on reference ontologies and semantically annotated associated text (including speech transcripts, when the quality of the transcripts allows it).

K-Space will consider all types of audio sources, ranging from speech to complex polyphonic music signals. The description schemes of the MPEG-7 standard define how audio signals can be described at different abstraction levels: from the lowest level primitives, such as temporal or audio spectrum centroids, spectrum flatness, spectrum spread, inharmonicity, etc., to the highest level, related to semantic information. Semantic information is related to textual information about the audio, such as titles of songs, singers' names, composers' names, the duration of a music excerpt, etc.

This textual information is often encoded using the text annotation tool of the Linguistic Description Scheme (LDS) of MPEG-7. An example of such a (manual) annotation related to a video sequence is given below:

<VideoSegment id="shot1_13">
  <MediaTime>
    <MediaTimePoint>T00:01:40:11008F30000</MediaTimePoint>
    <MediaDuration>PT10S26326N30000F</MediaDuration>
  </MediaTime>
  <TextAnnotation confidence="0.500000">
    <FreeTextAnnotation>
      TRACKS STOPPED ROLLING NOSE AND FORMALLY FILED A HIGHWAY
      WITH EIGHT DAILY NEW YORK NEWSPAPERS WHERE THE VOID OF
      NEWSPAPERS THE VOID OF CUSTOMERS
    </FreeTextAnnotation>
  </TextAnnotation>
</VideoSegment>

It is interesting to note here that the media time is also given, so that it can be used as a way to look for an alignment of the low-level features with the high-level features that can be extracted from the text.

Our work here will consist in proposing a linguistic and semantic analysis of all the available free text annotations used in the semantic representation of the audio signal, and in mapping this onto either the structured annotation scheme of the LDS (specifying the "who", the "what", the "why", the "when", etc. in an explicit way), or onto an ontology-based semantic annotation (in terms of instances of ontology classes).

We will also use TRECVid data, with aligned speech transcripts and video shots, and look for ways to extract high-level semantics from the transcripts (which are attached to the audio-video stream, also using the LD scheme). Admittedly, the quality of transcripts is often bad, and here we will use robust NLP methods and limit ourselves to the detection of basic textual chunks. To improve the alignment of text/transcripts with the audio (or video) signal, we try to identify typical lexical items that link such text/transcripts directly to the signal ("here you can see", etc.).
3.3. Analysis of Complementary Textual Sources for Adding Semantic Metadata to Multimedia Content
The human understanding of multimedia resources is often facilitated by the usage of complementary sources. In order to simulate this, K-Space will implement mining methods and tools for such complementary resources in order to reduce the semantic gap by deriving annotations from those sources, and so to reach a more complete annotation of (sequences of) images. The project will address mining and analysis for semantic feature extraction within two different types of resources:

• Mining and analysing primary resources: analysis of the primary resources that are attached to the multimedia data, e.g. texts around pictures, subtitles of movies, etc.
• Mining and analysis of secondary and tertiary resources: analysis of data and text related to the multimedia data under consideration, e.g. a programme guide for a TV broadcaster or a web site displaying similar pictures.

4. Linguistic Analysis of Relevant Text Regions
We report on a first experiment made within the Esperonto project, where a small ontology on artworks was also made available to the project partners. In this ontology, typical terms are associated with every class (so, for example, the terms "surrealism" and "cubism" are associated with the class "artistic_movement").

In the Esperonto scenario, we first defined the possibly relevant text regions for the semantic annotation of the image (see Figure 1 for an example of such an image, in a web page dedicated to the painter Miro; the first image is the base for our indexing prototype tool). We identified the following text regions (in both the text and the HTML code):

• Title of the document
• Caption text: "Click on the image to enlarge" (a non-relevant item, to be filtered by the tools, also on the basis of lexical properties of the words)
• Content of the HTML "alt" tag: 'VEGETABLE GARDEN WITH DONKEY'
• Content of the HTML "src" tag: http://www.spanisharts.com/reinasofia/miro/burro_lt.jpg
• Abstract text
• Running text

Figure 1: Example of a web page with images of paintings. The various text regions offer different kinds of "metadata" for the image.

On the basis of this, we wrote a tool that supports the manual selection of such textual regions and sends them to a linguistic processing engine. The linguistic processing engine has been augmented with metadata specifying the type of text to be processed (we expect, for example, the title and the "alt" text to consist mostly of phrases).
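The paper does not show the internals of this selection tool; purely as an illustration of the region inventory above, the following hedged sketch pulls the document title, the img alt/src values, and caption-like text out of a page with BeautifulSoup (treating the image's parent element as its "caption" is our simplification):

# Illustrative extraction of the text regions listed above (document
# title, image alt/src, nearby caption text). The caption heuristic
# is an assumption, not the Esperonto tool's behaviour.
from bs4 import BeautifulSoup

def text_regions(html: str) -> list:
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    regions = []
    for img in soup.find_all("img"):
        caption = img.parent.get_text(" ", strip=True) if img.parent else ""
        regions.append({
            "doc_title": title,
            "alt": img.get("alt", ""),
            "src": img.get("src", ""),
            "caption": caption,
        })
    return regions

page = ('<html><head><title>Miro</title></head><body>'
        '<p><img src="burro_lt.jpg" alt="VEGETABLE GARDEN WITH DONKEY">'
        'Click on the image to enlarge</p></body></html>')
print(text_regions(page))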
5. The Linguistic Analysis of the Various Text Regions
In the following, we show some of the (partial) results of the linguistic analysis, as applied to the various text segments. Our tools deliver a dependency annotation:

• "Alt" text: 'VEGETABLE GARDEN WITH DONKEY'
<NP HEAD="garden" PRE_MOD="vegetable" <POST_MOD CAT="PP" HEAD="with" NP_COMP_HEAD="donkey"</POST_MOD></NP>

• Abstract/running text: "…This picture depicts the rural landscape of Montroig …"
<SENT SUBJ="This picture" PRED="depicts" OBJ="the rural landscape of Montroig"</SENT>

• Detailed annotation of the direct object:
<NP HEAD="landscape" PRE_MOD="rural" <POST_MOD CAT="PP" HEAD="of" NP_COMP_HEAD="Montroig"</POST_MOD></NP>

6. The Semantic Annotation
On the basis of a mapping between the linguistic dependencies and the terms associated with the classes of the ontology (where we accommodated the classes of the ontology to be associated with patterns, for coping, for example, with date expressions), we could provide a semantic annotation of the texts associated with the picture.

6.1. The (Toy) Art Ontology (Schematized)
• Object > Artwork > Painting [has_creator, has_name, has_subject, has_dimension, has_material, has_genre, has_date, ...]
• Person > Artist > Painter [has_name, has_birth_date, part_of_artistic_movement, ...]

6.2. The Instantiation of Classes
• Title: Vegetable garden with donkey
• Creator: Miro
• Date: 1918
• Genre: naïve (if correctly extracted by some reasoning on the linguistically and semantically annotated text)
• Subject: rural landscape of Montroig + garden and donkey (if the association between the title and the explanation given by the art expert can be grouped)
• Dimension: 65x71
• Material: Oil on canvas
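As an illustration of how such an instantiation can be derived from the dependency output, the hedged sketch below maps the "This picture depicts X" pattern onto the has_subject slot of the Painting class; the pattern table and record layout are ours, not Esperonto's:

# Illustrative mapping from a dependency analysis to ontology slots.
# The patterns and record structure are our assumptions; the described
# system associates terms and patterns with ontology classes.
import re

PATTERNS = {
    # verb cueing the picture's subject -> ontology property
    r"\b(depicts|shows|represents)\b": "has_subject",
}

def instantiate(subj: str, pred: str, obj: str) -> dict:
    instance = {"class": "Painting"}
    if subj.lower().startswith("this picture"):
        for pattern, slot in PATTERNS.items():
            if re.search(pattern, pred):
                instance[slot] = obj
    return instance

print(instantiate("This picture", "depicts",
                  "the rural landscape of Montroig"))
# -> {'class': 'Painting', 'has_subject': 'the rural landscape of Montroig'}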
6.3. Some Remarks
This result was possible due to various facts. First, the system "knew" that the text was about art, and we assumed that the text was related to the picture. Second, we had an ad-hoc relation of terms to the concepts of the ontology (for example "Oil"). Third, we had defined typical patterns realising some concepts (date, material, etc.). But our focus was more on syntactic analysis (in fact, dependency analysis). So the subject of the sentence, "This picture", together with the typical verb "depicts" and its direct object, allowed us here to map the whole direct object onto the "subject" of the picture (what the picture is about). The dependency analysis of the direct object allows us to further specify the topic of the picture: it is a rural (modifier) landscape (head) of Montroig (postnominal modifier), thus introducing quite a fine granularity into the indexing of the image.

The missing point here: there is no principled relation between the terms in the ontology and the results of the image analysis (in terms of low-level features). We think that a domain ontology taking into account the specific features of the multi-modal analysis components could help in establishing this relationship, not only at the lexical level but maybe also at the syntactic level (the dependency relations in linguistic fragments of texts referring to images could give some hints about the distribution of objects in the picture). But clearly one has to think first of a specific classification of lexical items in terms of possible indices of multimedia content, before looking at syntactic properties of text related to images.

7. Conclusions
We have described some approaches that take advantage of so-called complementary sources (text/transcripts) for automatically adding semantic metadata to image material. Until now we have concentrated on the linguistic processing aspect, with a very small lexical base. Lexical considerations would allow us to extend our approach and to really evaluate it. More principled lexical information would also support the automatic detection of text parts that refer directly to the content of the image under consideration, as opposed to metadata related to this image (in which museum the picture is, who made it, etc.) or to topics not related to the image at all.

We will also have to think of principled ways of integrating the lexical knowledge into the multimedia infrastructure. At the beginning we would follow an approach similar to the one proposed for the integration of lexical information into domain-specific ontologies in the SmartWeb project.

8. Acknowledgments
The research programme described in this paper is supported by the European Commission, contract FP6-027026, Knowledge Space of semantic inference for automatic annotation and retrieval of multimedia content – K-Space.

9. References
Athanasiadis, T., Tzouvaras, V., Petridis, K., Precioso, F., Avrithis, Y., Kompatsiaris, Y. (2005). Using a Multimedia Ontology Infrastructure for Semantic Annotation of Multimedia Content. In Proceedings of the ISWC 2005 Workshop "SemAnnot".
Buitelaar, P., Sintek, M., Kiesel, M. (2005). Feature Representation for Cross-Lingual, Cross-Media Semantic Web Applications. In Proceedings of the ISWC 2005 Workshop "SemAnnot".
Hunter, J. (2001). Adding Multimedia to the Semantic Web: Building an MPEG-7 Ontology. SWWS 2001, pp. 261-283.
Hunter, J. (2003). Enhancing the Semantic Interoperability of Multimedia through a Core Ontology. IEEE Transactions on Circuits and Systems for Video Technology, 13(1), pp. 49-58.
aceMedia project: http://www.acemedia.org/aceMedia
BUSMAN project: http://busman.elec.qmul.ac.uk/
Esperonto project: http://www.esperonto.net
K-Space project: http://kspace.qmul.net
SmartWeb project: http://www.smartweb-projekt.de
TRECVid: http://www-nlpir.nist.gov/projects/trecvid/
The IAPR TC-12 Benchmark: A New Evaluation Resource for Visual Information Systems

Michael Grubinger¹, Paul Clough², Henning Müller³ and Thomas Deselaers⁴

¹ School of Computer Science and Mathematics, Victoria University of Technology, PO Box 14428, Melbourne VIC 8001, Australia
[email protected]

² Department of Information Studies, Sheffield University, Western Bank, Sheffield, S1 4DP, UK
[email protected]

³ Medical Informatics, University and Hospitals of Geneva, 24, rue Micheli-du-Crest, 1211 Geneva 14, Switzerland
[email protected]

⁴ Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany
[email protected]

Abstract
In this paper, we describe an image collection created for the CLEF cross-language image retrieval track (ImageCLEF). This image retrieval benchmark (referred to as the IAPR TC-12 Benchmark) has developed from an initiative started by the Technical Committee 12 (TC-12) of the International Association for Pattern Recognition (IAPR). The collection consists of 20,000 images from a private photographic image collection. The construction and composition of the IAPR TC-12 Benchmark are described, including its associated text captions, which are expressed in multiple languages, making the collection well-suited for evaluating the effectiveness of both text-based and visual retrieval methods. We also discuss the current and expected uses of the collection, including its use to benchmark and compare different image retrieval systems in ImageCLEF 2006.

1. Introduction
Standard datasets are vital for benchmarking the performance of information retrieval systems and allowing the comparison between different approaches or methods (Over et al., 2004; Müller et al., 2001; Narasimhalu et al., 1997; Smith, 1998). For example, initiatives such as TREC (Text REtrieval Conference, http://trec.nist.gov/; Harman, 1996) and CLEF (Cross-Language Evaluation Forum, http://www.clef-campaign.org/; Braschler & Peters, 2004) have provided the necessary resources to enable the comparative evaluation of Information Retrieval (IR) systems. These initiatives have motivated and encouraged research and have clearly contributed to the advancement of information retrieval systems over the past years.

A core component of any benchmark is a set of documents (e.g. texts, images, sounds or videos) that are representative of a particular domain (Markkula et al., 2001). However, finding such resources for general use is often difficult, not least because of copyright issues which restrict the distribution and future accessibility of data. This is especially true of visual resources, which are often more valuable than written texts and are therefore subject to limited availability and access for the research community. For example, consider the Corbis Image Database (http://pro.corbis.com/) or Getty Images (http://www.gettyimages.com/): large collections of images which, being commercial datasets, are generally inaccessible for research purposes. To evaluate aspects of visual information systems (e.g. automatic annotation, retrieval or pattern recognition), collections of visual objects that can be made available to the research community are required; cf. the effort described in (Jörgensen, 2001) to create annotated databases for evaluation, though the outcome of such efforts is still sparse.

1.1. Collections Available for Evaluation
For a long time, the de-facto standard for image retrieval evaluation was the Corel Photo CDs. However, they are problematic: the CDs are expensive to obtain, are protected by copyright and legal restrictions on use and therefore difficult to distribute for large-scale evaluation; they have limited written metadata, which makes them less suitable for evaluating methods of text-based image retrieval; and the CDs are currently unavailable to buy and therefore not available to researchers. It was also shown that subsets of this database can easily be tailored to show improvements (Müller, Marchand-Maillet & Pun, 2002).

An alternative database that is free of charge, not restricted by copyright, and previously used for evaluation is the collection built by the University of Washington (http://www.cs.washington.edu/research/imagedatabase). It contains approximately 1,000 images, clustered by the location where the images were taken. Other databases are available for computer vision research (http://homepages.inf.ed.ac.uk/rbf/CVonline/CVentry.htm), but they are rarely used for image retrieval because they do not represent realistic retrieval data. The Benchathlon (http://www.benchathlon.net/) created an evaluation resource, but without search tasks or ground truth. ALOI (Amsterdam Library of Object Images, http://staff.science.uva.nl/~aloi/) and LTU (LookThatUp) Technologies (http://www.ltutech.com/) have created large databases with colour images of small objects with varied viewing (and illumination) angles, but these are primarily designed for pure pattern recognition evaluation and less for information retrieval.
The Benchathlon7 Database3 or Getty Images4, large collections of images, created an evaluation resource, but without search tasks or but because of being commercial datasets they are ground truth. ALOI8 (Amsterdam Library of Object generally inaccessible for research purposes. To evaluate Images) and LTU (LookThatUp) Technologies9 have aspects of visual information systems (e.g. automatic created large databases with colour images of small annotation, retrieval or pattern recognition), collections of 5 http://www.cs.washington.edu/research/imagedatabase 1 6 http://trec.nist.gov/ http://homepages.inf.ed.ac.uk/rbf/CVonline/CVentry.htm 2 7 http://www.clef-campaign.org/ http://www.benchathlon.net/ 3 8 http://pro.corbis.com/ http://staff.science.uva.nl/~aloi/ 4 9 http://www.gettyimages.com/ http://www.ltutech.com/ OntoImage'2006 May 22, 2006 Genoa, Italy page 13 of 55. objects with varied viewing (and illumination) angles, but for multimedia retrieval and began an effort to create a primarily designed for pure pattern recognition evaluation freely available database of images with associated and less for information retrieval. There are a few royalty- annotations. This started by developing a set of free databases available in specialised domains like recommendations and specifications of an image Casimage10 and IRMA11 for medical imaging, or the St. benchmark (Leung & Ip, 2000). Based on this criteria, a Andrews collection12 that is copyrighted but was made first version of a benchmark consisting of 1,000 multi- available for retrieval evaluation of historic (mainly black object colour images, 25 search requests (or queries), and and white) photographs. Many web pages actually make a collection of performance measures was set up in 2002. images available in large quantities and with copyright Developing a benchmark is an incremental and notices attached such as FlickR13 or Morguefile14. ongoing process. The IAPR TC-12 Benchmark was Although many of these images are available without refined, improved and extended to 5,000 images in 2004, many copyright restrictions for simple use, it is often not using a benchmark administration system (Grubinger & allowed to redistribute them particularly not combined in Leung, 2003). At the end of that year, an independent large numbers. Intellectual property rights with respect to travel organisation (viventura18) provided access to around digital content (and particularly images) are currently not 10,000 of their images including multilingual annotations always clear. of varying quality in three languages (English, German, The TRECVID (TREC video retrieval track, Smeaton Spanish). This increased the total number of images in the et al., 2004) image collections have increasingly been benchmark to 15,000. Of course, a benchmark is not used for image retrieval in the last two years as well. The beneficial unless actually used by the research key frames can indeed be used for image retrieval and community. Therefore in 2005, discussions began for object recognition, and the tasks created correspond well involving the IAPR TC-12 Benchmark as part of an image to simple journalists search tasks. As the videos also retrieval task in CLEF. ImageCLEF has begun using the contain the speech of the video, multimodal retrieval collection and is expected to continue using it for future evaluation is possible on these datasets as well. tasks (see Section 4). 
Developing a benchmark is an incremental and ongoing process. The IAPR TC-12 Benchmark was refined, improved and extended to 5,000 images in 2004, using a benchmark administration system (Grubinger & Leung, 2003). At the end of that year, an independent travel organisation, viventura (http://www.viventura.de/), provided access to around 10,000 of their images, including multilingual annotations of varying quality in three languages (English, German, Spanish). This increased the total number of images in the benchmark to 15,000. With 10,000 additional images from the travel organisation, the total number of available images rose to 25,000 (Grubinger, Leung & Clough, 2005), but was soon reduced to 20,000 images annotated in three languages.

Of course, a benchmark is not beneficial unless it is actually used by the research community. Therefore, in 2005, discussions began on involving the IAPR TC-12 Benchmark as part of an image retrieval task in CLEF. ImageCLEF has begun using the collection and is expected to continue using it for future tasks (see Section 4).

2.2. Origin and Selection of Images
The majority of the images are provided by viventura, an independent travel company that organizes adventure and language trips to South America. At least one travel guide accompanies each tour, and the guides maintain a daily online diary to record the adventures and places visited by the tourists (including at least one corresponding photo). Furthermore, the guides provide general photographs of each location, accommodation facilities and ongoing social projects. Not all of the images provided are suitable for a benchmark, and they must undergo a selection process (Grubinger & Leung, 2003). In total, 20,000 images were selected and added to the IAPR TC-12 Benchmark.

2.3. Example Images
The image collection includes pictures of a range of sports (Fig. 1) and actions (Fig. 2), photographs of people (Fig. 3), animals (Fig. 4), cities (Fig. 5), landscapes (Fig. 6) and many other aspects of contemporary life.

Figure 1: Examples of sports photos (tennis, motorcycling, snowboarding)
Figure 2: Examples of action pictures (pushing, celebrating, drinking)
Figure 3: Examples of people shots (Peruvian children, Korean guards, Russian singers)
Figure 4: Examples of animal photos (humpback whale, kangaroos, Galapagos giant turtle)
Figure 5: Examples of city pictures (Sydney Opera House, the Eiffel Tower, Las Vegas Strip)
Figure 6: Examples of landscape shots (Grand Canyon, Montañita Beach, Volcano Licancabur)

2.4. Diversity of the Image Collection
The IAPR TC-12 photographic collection contains many different images of similar visual content, but with varying illumination, viewing angle and background. This is because most of the tours offered by the travel company are repeated on a regular basis and have fixed itineraries. Thus, the tours always visit the same tourist destinations, where the guides usually take photos of tourists in varying poses (see Fig. 7) and/or of tourist attractions from varying viewing angles (Fig. 8), in different weather conditions (Fig. 9) or at different times of the day (Fig. 10). Hence, this makes the benchmark also well-suited for content-based retrieval tasks, as it allows a range of prototypical searches to explore retrieval effectiveness under these varying settings.

Figure 7: Tourists from three different tour groups at the Salt Lake of Uyuni in Bolivia
Figure 8: The Cathedral of Cuzco, Peru, from different viewing angles (right, left and front)
Figure 9: The Inca ruins of Machu Picchu in bright sunshine, on an overcast day, and in foggy and rainy conditions
Figure 10: A cyclist riding a racing bike at night, in the morning and during the day
Figure 2: Examples of action pictures (Pushing, Celebrating, Drinking)

Figure 3: Examples of people shots (Peruvian Children, Korean Guards, Russian Singers)

Figure 4: Examples of animal photos (Humpback Whale, Kangaroos, Galapagos Giant Turtle)

Figure 5: Examples of city pictures (Sydney Opera House, The Eiffel Tower, Las Vegas Strip)

Figure 6: Examples of landscape shots (Grand Canyon, Montañita Beach, Volcano Licancabur)

2.4. Diversity of the Image Collection

The IAPR TC-12 photographic collection contains many different images of similar visual content, but with varying illumination, viewing angle and background. This is because most of the tours offered by the travel company are repeated on a regular basis and have fixed itineraries. Thus, the tours always visit the same tourist destinations, where the guides usually take photos of tourists in varying poses (see Fig. 7) and/or of tourist attractions from varying viewing angles (Fig. 8), in different weather conditions (Fig. 9) or at different times of the day (Fig. 10). This makes the benchmark well suited for content-based retrieval tasks, as it allows a range of prototypical searches that explore retrieval effectiveness under these varying settings.

Figure 7: Tourists from three different tour groups at the Salt Lake of Uyuni in Bolivia

Figure 8: The Cathedral of Cuzco, Peru, from different viewing angles (right, left and front)

Figure 9: The Inca ruins of Machu Picchu in bright sunshine, on an overcast day and in foggy and rainy conditions

Figure 10: A cyclist riding a racing bike at night, in the morning and during the day

2.5. Image Statistics

This section provides information on a range of attributes which characterise the image collection (e.g. the size of images, image formats, and the temporal and geographical extent of the collection).

2.5.1. Sizes of Images and the Collection

The photographs provided by the travel organisation differ according to the technology used to capture them: photographs taken with digital cameras have a 4:3 ratio of width to height (96x72 pixels for thumbnails; 480x360 pixels for the larger versions), while photographs taken with a non-digital (traditional) camera and subsequently scanned have a 3:2 ratio of width to height (96x64 pixels for thumbnails; 480x320 pixels for the larger versions). Thumbnails require between 2 and 10 KB each (an average file size of 5.69 KB); the larger versions range from 20 to 200 KB (an average size of 85.25 KB), depending on their content and colour composition. The total size of the image collection is 1.66 GB (plus 111 MB for the corresponding thumbnails). All images are stored in the JPEG image format.
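As a quick cross-check, the arithmetic relating the reported average file sizes to the reported collection totals can be reproduced in a few lines of Python (our own illustration; the constants are the values given above):

N = 20_000                                # images in the collection
avg_image_kb, avg_thumb_kb = 85.25, 5.69  # average file sizes reported above

total_images_mb = N * avg_image_kb / 1024   # KB -> MB (binary units)
total_thumbs_mb = N * avg_thumb_kb / 1024

print(f"{total_images_mb:.0f} MB")  # ~1665 MB, i.e. the reported 1.66 GB
print(f"{total_thumbs_mb:.0f} MB")  # ~111 MB, matching the reported value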
2.5.2. Temporal Range

Most photographs have been taken since 2001; Fig. 11 shows the temporal distribution of the images between 2001 and 2005. The earliest photo in the collection dates back to 2000; the most recent was taken in July 2005. The mean date is June 2003, the standard deviation is 1.12 years and the median is January 2004.

Figure 11: Temporal range (bar chart of the percentage of images per year, 2001-2005)

2.5.3. Geographical Range

The IAPR TC-12 collection is spatially diverse, with pictures taken in more than 30 countries worldwide, including Argentina, Australia, Austria, Bolivia, Brazil, Chile, Colombia, Ecuador, France, Germany, Greece, Guyana, Korea, Peru, Russia, Spain, Switzerland, Taiwan, Trinidad & Tobago, Uruguay, the USA and Venezuela. Fig. 12 shows the proportion of images taken in these countries (represented by their international three-letter codes19). Most of the images originate from Peru (28.4 %), followed by Australia (21.3 %) and Ecuador (11.6 %), reflecting the geographic location of the contributors. Eleven countries each contribute more than 1 % of the collection, and 14 countries contribute at least 100 images, or 0.5 % of the collection.

Figure 12: Variation across countries with more than 100 images (bar chart of the percentage of images per country: PER, AUS, ECU, BOL, BRA, CHI, ARG, USA, KOR, COL, RUS, VEN, ROC, FRA, MISC)

19 Abbreviations of the International Olympic Committee

3. Image Annotations

3.1. Original Annotations

Tour guides are supposed to add a short caption to each image they include with their diaries. These captions comprise a title for the image, a short description, a location and the date of creation. Most annotations are written in German, as the travel company viventura targets the German-speaking market. In some cases, however, guides also use Spanish, Portuguese or English.

Title: Praia do Flamengo
Description: Der Praia do Flamengo gilt als einer der schönsten Strände Brasiliens!
Location: Salvador, Brasilien
Date: 2. Oktober 2004

Figure 13: Example of an original annotation

Fig. 13 shows an example image with a mixed-language original annotation in Portuguese and German. The Portuguese title states briefly what the image is about (in this case the name of the beach, "Flamingo Beach"); the description of the image is in German and provides further detail ("Flamingo Beach is considered one of the most beautiful beaches of Brazil!"). Both the location ("Salvador, Brazil") and the date ("October 2nd, 2004") are expressed in German language and form. Since most of the tour guides are local employees from South America and therefore native Spanish or Portuguese speakers, the quality of the annotations (and also their level of detail) varies tremendously.

3.2. Revised Annotations

In order to provide a consistent set of annotations for benchmarking, the original annotations of the images selected for inclusion in the IAPR TC-12 Benchmark have been manually checked, corrected and completed in compliance with slightly modified image annotation rules (Grubinger & Leung, 2003). These rules specify the use of the right terminology, annotation precision, cardinality, image settings and the number of annotation sentences, and they also restrict the level of subjective interpretation.

Figure 14: Benchmark Administration System

Fig. 14 shows a screenshot of the custom-built Benchmark Administration System used to carry out the revision process (see (Grubinger & Leung, 2004) for details of its specification, architecture and implementation). In particular, the information provided about the location was checked, and the image description was divided into two separate fields: one part describes the information visible in the image; the other provides additional notes which are not part of the visual content visible within the image. The original (German) annotations were corrected, missing text and notes for the images were completed, and all annotations were translated into English and Spanish.
3.3. Finalised Annotations

The final set of images and consistent data for the Benchmark associates each photograph with a semi-structured text caption consisting of the following seven fields:
- a unique identifier,
- a title,
- a free-text description of the semantic (and visual) contents of the image,
- notes for additional information,
- the name of the photographer,
- a field describing where the photograph was taken,
- a field describing when the photograph was taken.

These annotations are stored in a MySQL database and managed by the Benchmark Administration System. Fig. 15 shows a complete annotation for an example image.

Figure 15: Complete annotation for image 16019

The information on the screen is divided into two parts: the left half (see Fig. 16) displays the image, its unique identifier (see Section 3.3.1) and part of the image meta-data: the photographer, the location (see Section 3.3.5) and the date (Section 3.3.6).

Figure 16: The left half of the annotation: image meta-data

The right half of the screen (see Fig. 17) contains multilingual free-text annotations: the title (Section 3.3.2), the image description (Section 3.3.3) and the notes (Section 3.3.4).

Figure 17: The right half of the annotation: multilingual free-text annotations in English, German and Spanish

These free-text annotations (and also the location and date information) are currently available in three languages, with the German and English versions in release status and the Spanish version currently being verified. The German version uses Austrian vocabulary and spelling because the annotation creator is Austrian. Australian vocabulary and spelling (almost equivalent to British English) is used for the English version because the annotation process was carried out in Melbourne, Australia. In cases of doubt, the author asked local native speakers for translations or vocabulary.

3.3.1. Unique Image Identifiers

Each image is assigned a unique identifier. For instance, the unique identifier of the example in Figure 15 is "16019", which determines the filename of the image ("16019.jpg") and of the annotation files ("16019.eng" for English, "16019.ger" for German and "16019.spa" for Spanish).
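Since this convention is purely mechanical, all file names can be derived from the identifier. The following Python sketch is a hypothetical illustration of the stated scheme (the helper itself is ours, not part of the Benchmark Administration System):

LANG_EXT = {"English": "eng", "German": "ger", "Spanish": "spa"}

def filenames(identifier):
    """Map a unique image identifier to its image and annotation file names."""
    files = {"image": f"{identifier}.jpg"}
    for language, ext in LANG_EXT.items():
        files[language] = f"{identifier}.{ext}"
    return files

print(filenames(16019))
# {'image': '16019.jpg', 'English': '16019.eng',
#  'German': '16019.ger', 'Spanish': '16019.spa'}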
3.3.2. Title

The title field contains a short statement describing what the image is about. This can include proper names like "Flamingo Beach", general noun phrases like "cyclist at night", or a combination of both, such as "llamas at Machu Picchu". The title can also be a short sentence, such as "Max is surfing in Torquay". This title field is equivalent to the descriptive annotations found in many personal photographic collections (i.e. annotations that typical users might add to their own photographs), and in most cases it is not very different from the original annotation. The average length of the English title field is 5.35 words, with a standard deviation of 2.37 words. The shortest title consists of one word, the longest of 17 words. Table 1 displays statistics for the different language versions of the titles.

Number of Words      German   English   Spanish
average              4.85     5.35      5.97
standard deviation   2.10     2.37      2.68
minimum              1        1         1
median               5        5         6
maximum              14       17        19

Table 1: Word statistics for the title field.

German titles are on average shorter (and Spanish titles longer) than the English titles. This does not necessarily mean that the Spanish titles are more complex than the German ones; it is more likely due to the fact that composite nouns expressed in one word in German (e.g. "Flamingostrand") often require two words in English ("Flamingo Beach") and three in Spanish ("Playa del Flamenco").

3.3.3. Description

The description field contains a semantic description of the image contents; in other words, it describes in short sentences and noun phrases (terminated by semicolons) what can be recognised in an image without any prior information or extra knowledge. Keywords alone are not used, as they are not very precise due to their lack of syntax (Tam & Leung, 2001), and studies show that users tend to create short narratives to describe images when unconstrained by a retrieval task (Jörgensen, 1996; O'Connor, O'Connor & Abbas, 1999).

Number of Words      English   German   Spanish
average              23.06     18.92    N/A
standard deviation   10.35     8.48     N/A
minimum              2         2        N/A
median               22        18       N/A
maximum              85        74       N/A

Table 2: Word statistics for the description field.

The average length of the description field is 23.06 words (with a standard deviation of 10.35 words). The shortest description comprises two words; the longest is 85 words, with a median of 22 words (see Table 2). Again, the German descriptions use fewer words than the English versions (see Section 3.3.2).

Figure 18: The description field of image 16019

Number of Annotation Sentences. Obviously, there is no limit to how semantically rich one could make the description of an image. Most of the annotations have between one and five more or less complex annotation sentences (Fig. 18, for instance, has four). In many annotations, two or more of these sentences are conjoined (and); hence, a statistical evaluation of the number of sentences is not representative of the annotations.

Sentence Order. The semantic descriptions of the image follow a certain priority pattern: the first sentence(s) describe(s) the most obvious semantic information (like "a photo of a brown sandy beach"). The later sentences describe the surroundings or settings of an image, like smaller objects or background information ("a blue sky with clouds on the horizon in the background").

Linguistic Patterns. Many of these annotation sentences or noun phrases follow one of the main linguistic patterns P (or a combination based on these) shown in Table 3.

Pattern P       Example
S               a red rose
S-V             a boy is singing
S-TA            a boy at night
S-PA            a boy in a garden
S-PA-TA         a boy in a garden at night
S-V-TA          a boy is singing at night
S-V-PA          a boy is singing in a garden
S-V-PA-TA       a boy is singing in a garden at night
S-V-O           a girl is kissing a boy
S-V-O-TA        a girl is kissing a boy at night
S-V-O-PA        a girl is kissing a boy in a garden
S-V-O-PA-TA     a girl is kissing a boy in a garden at night

Table 3: Linguistic patterns of the descriptions.

Any of the patterns P in Table 3 can also be used for background and foreground information, and can be further specified as to where they lie within the image (see Table 4):

Pattern      Example
P-PA         P on the left
P-BG         P in the background
P-FG         P in the foreground
P-BG-PA      P in the background on the right
P-FG-PA      P in the foreground on the left

Table 4: Linguistic patterns for positioning descriptions within the image.

Table 5 provides an overview and a description of the symbols used in Tables 3 and 4.

Symbol   Description
S        subject(s) (with or without adjectives)
V        verb(s) (with or without adverbs)
O        object(s) (with or without adjectives)
PA       place adjunct(s) with place preposition
TA       time adjunct(s) with time preposition
P        any pattern or combination of patterns described in Table 3
FG       in the foreground
BG       in the background

Table 5: Symbols used in Tables 3 and 4.
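Because every component except the subject is optional, all of the patterns in Tables 3 and 4 arise from a single composition rule. The following Python sketch is our own illustration of that grammar; the example fillers are invented:

def compose(s, v=None, o=None, pa=None, ta=None, depth=None, pos=None):
    """Build one annotation phrase from the optional components
    S - V - O - PA - TA of Table 3, optionally refined by the
    foreground/background and position adjuncts of Table 4."""
    parts = [p for p in (s, v, o, pa, ta, depth, pos) if p]
    return " ".join(parts)

# S-V-O-PA-TA, the richest pattern of Table 3:
print(compose("a girl", "is kissing", "a boy", "in a garden", "at night"))
# P-BG-PA of Table 4, applied to the simple pattern S:
print(compose("a red rose", depth="in the background", pos="on the right"))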
Appropriate Tense. Annotations describe actions or situations in images at certain times. The grammatically correct tenses, therefore, are the present continuous in English, the Präsens in German and estar + gerundio in Spanish. The auxiliary verbs for English (be) and Spanish (estar) are omitted in some annotations.

Adjectives. As with the number of annotation sentences, there is obviously no limit to how much detail adjectives could add to the description of each object. In general, the fewer objects there are in the image, the more adjectives are used to describe each object, and vice versa (Fig. 19).

Figure 19: Examples of the use of adjectives

Use of Colour Attributes. Most of the annotation nouns received at least one colour attribute if the pattern was not too complicated. However, the use of colour attributes for nouns in image annotations is not as trivial as it might seem. The colour value of a pixel is usually stored using 24 bits in the RGB colour space, which means that there are more than 16 million possible colour values for each pixel. Although human perception allows a much lower level of granularity for the visual differentiation of colour, there exists an immense number of colour names for ever so slightly different shades, saturations or intensities of colours (see Coloria20 for a very impressive list and representation of many colour names in several languages). Consequently, the more colour names are used in annotations, the smaller the differences between them and, therefore, the harder it is to achieve a consistent use of colour attributes across all the annotations. This is made even more difficult by the fact that one and the same colour can appear different in different images due to the surrounding colours.

It is also known (Berlin & Kay, 1969) that significant differences exist between the naming of colours in different languages and cultures. For example, a kind of sea green called "aoi" in Japanese is in English generally regarded as a shade of "green", while in Japanese what an English speaker would identify as "green" can be regarded as a different shade of this kind of "sea green". The study by Berlin and Kay (1969) has, however, shown that there are substantial regularities in the naming of colours across many languages; it identified the following set of basic colour terms: black, grey, white, pink, red, orange, yellow, green, blue, purple and brown. All other colours can be considered variants of these basic colours.

For these reasons, the colour attributes are restricted to the aforementioned eleven basic colour terms. Variations in intensity are expressed by adding the labels light and dark (like "a dark green palm tree"). The suffix -ish is used if the colour is similar to one of the base colours ("a greenish palm tree"). Objects with a colour between two basic colour terms are described with a combination of the two (like "a yellowish-orange drink").

20 http://www.coloria.net/bonus/colornames.htm
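As this colour vocabulary is closed, it can be enumerated exhaustively. The sketch below is our own illustration of the stated rules; it assumes a naive "-ish" formation, whereas the actual annotations may use forms such as "reddish":

BASIC = ["black", "grey", "white", "pink", "red", "orange",
         "yellow", "green", "blue", "purple", "brown"]

def ish(term):
    # naive "-ish" formation; see the caveat above
    return term + "ish"

vocabulary = set(BASIC)
vocabulary |= {f"light {t}" for t in BASIC}   # intensity variations
vocabulary |= {f"dark {t}" for t in BASIC}
vocabulary |= {ish(t) for t in BASIC}         # similarity to a base colour
vocabulary |= {f"{ish(a)}-{b}" for a in BASIC for b in BASIC if a != b}

print(len(vocabulary))                        # size of the closed set
print("dark green" in vocabulary, "yellowish-orange" in vocabulary)  # True True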
3.3.4. Notes

This field contains additional free-text information about an image, such as background information; it does not follow any underlying patterns or annotation rules.

Figure 20: The notes field of image 16019

The notes can include information like original names in other languages (Fig. 20), historical information, eventual results of sports events (Fig. 21) or any other description that is not visible in the image and requires prior or deeper knowledge of the image contents.

Figure 21: Examples of notes on historical facts and sports events

Not all images have notes fields. In fact, just 10.3 % of the images hold additional, non-visible information, with an average length of 11.88 words per notes field and a standard deviation of 7.99 words. The longest notes field contains 55 words, the shortest just one, with a median of eleven words (see Table 6).

Number of Words      English   German   Spanish
average              11.88     10.84    N/A
standard deviation   7.99      7.26     N/A
minimum              1         1        N/A
median               11        9        N/A
maximum              53        59       N/A

Table 6: Word statistics for the notes field.

3.3.5. Locations

The location field describes the place where the image was taken and is divided into two parts: (1) the exact location (e.g. Salvador) and (2) the country to which this location belongs (e.g. Brazil). Some images (2.35 %) have only country information, in cases where the exact location within that country could not be verified.

Location names are stored in three languages. The question of whether place names should be translated is a special challenge in itself, as there is no general answer. While most countries have their own version in each of the three languages, like "Brazil" (English), "Brasilien" (German) and "Brasil" (Spanish), there is no pattern as to whether, for example, city names are translated. In many cases it is true that the lesser known a place, the less likely it is to be translated into a foreign language. However, this rule of thumb is not always applicable. Consider Rome and Buenos Aires, for example, both big and famous cities: the Argentine capital is the same in all three languages ("Buenos Aires"), whereas the Italian capital has a different version in each of them: "Rome" in English, "Rom" in German and "Roma" in Spanish. Hence, since there is no general rule, each location had to be checked individually for an official translation, no matter how big or famous the place.

3.3.6. Dates

The date field contains the date when the image was taken, with each of the languages having its own version and format: German (e.g. "2. Oktober 2004"), English (e.g. "2 October, 2004") and Spanish (e.g. "2 de octubre de 2004"). There are three different levels of time granularity: 50.73 % of the images have a complete date (day, month and year), 36.63 % contain month and year, and 12.65 % of the annotations state just the year (see Fig. 22).

Figure 22: Percentages of the time granularity levels (day/month/year: 50.73 %; month/year: 36.63 %; year only: 12.65 %)

3.4. Generated Annotations

The annotations are stored in a database which is managed by the Benchmark Administration System; the system allows the specification of parameters according to which different subsets of the image collection can be generated. Fig. 23 shows an example of an annotation format generated for ImageCLEF.

<DOC>
<DOCNO>annotations/16/16019.eng</DOCNO>
<TITLE>Flamingo Beach</TITLE>
<DESCRIPTION> a photo of a brown sandy beach; the dark blue sea with small breaking waves behind it; a dark green palm tree in the foreground on the left; a blue sky with clouds on the horizon in the background; </DESCRIPTION>
<NOTES> Original name in Portuguese: "Praia do Flamengo"; Flamingo Beach is considered as one of the most beautiful beaches of Brazil; </NOTES>
<LOCATION>Salvador, Brazil</LOCATION>
<DATE>2 October 2002</DATE>
<IMAGE>images/16/16019.jpg</IMAGE>
<THUMBNAIL>thumbnails/16/16019.jpg</THUMBNAIL>
</DOC>

Figure 23: The generated English annotation file
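Files in this format are straightforward to process. The following Python sketch (our own illustration, using only the field names visible in Fig. 23) extracts the tagged fields of one record:

import re

FIELDS = ["DOCNO", "TITLE", "DESCRIPTION", "NOTES",
          "LOCATION", "DATE", "IMAGE", "THUMBNAIL"]

def parse_annotation(text):
    """Extract the tagged fields of one <DOC>...</DOC> record."""
    record = {}
    for field in FIELDS:
        match = re.search(rf"<{field}>(.*?)</{field}>", text, re.DOTALL)
        if match:
            # collapse line breaks and runs of spaces inside a field
            record[field] = " ".join(match.group(1).split())
    return record

sample = """<DOC>
<DOCNO>annotations/16/16019.eng</DOCNO>
<TITLE>Flamingo Beach</TITLE>
<LOCATION>Salvador, Brazil</LOCATION>
<DATE>2 October 2002</DATE>
</DOC>"""
print(parse_annotation(sample)["TITLE"])   # -> Flamingo Beach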
"2 de octubre de 2004"); OntoImage'2006 May 22, 2006 Genoa, Italy page 20 of 55. English document collection with queries in a variety of <DOC> languages. ImageCLEF 2004 added a visual retrieval task <DOCNO>annotations/16/16019.ger</DOCNO> on a medical image collection and increased the <TITLE>Der Flamingostrand</TITLE> participation from the visual retrieval community. <DESCRIPTION> ein Photo eines braunen ImageCLEF 2005 (Clough et al, 2005) provided tasks for Sandstrands; das dunkelblaue Meer mit kleinen system-centred evaluation of retrieval systems in two brechenden Wellen dahinter; eine dunkelgrüne domains: historic photographs and medical images. These Palme im Vordergrund links; ein blauer Himmel domains offer realistic scenarios in which to test the mit Wolken am Horizont im Hintergrund; performance of image retrieval systems and offer various </DESCRIPTION> challenges and problems to participants. One purely visual <NOTES> Originalname auf portugiesisch: task was offered on the automatic annotation of medical "Praia do Flamengo"; Der Flamingostrand gilt images. An interactive image retrieval tasks was also als einer der schönsten Strände Brasiliens; offered. </NOTES> The ImageCLEF benchmark aims to evaluate image <LOCATION>Salvador, Brasilien</LOCATION> retrieval from multilingual document collections and a <DATE>2 Oktober 2002</DATE> major goal is to investigate the effectiveness of <IMAGE>images/16/16019.jpg</IMAGE> multimodal retrieval (visual image features and textual <THUMBNAIL>thumbnails/16/16019.jpg</THUMBNAIL> description combined). ImageCLEF has already seen </DOC> participation from both academic and commercial research groups worldwide from communities including the following: Cross-Language Information Retrieval Figure 24: The generated German annotation file (CLIR), Content-Based Image Retrieval (CBIR), medical information retrieval and user interaction. Campaigns such as CLEF and TREC have proven invaluable in providing <DOC> standardised resources for comparative evaluation for a <DOCNO>annotations/16/16019.eng</DOCNO> wide range of retrieval tasks and ImageCLEF aims to <TITLE>La Playa del Flamenco</TITLE> provide the research community with similar resources for <DESCRIPTION> una foto de una playa marrón; image retrieval. el mar azul oscuro con pequeñas olas que están quebrando detrás; una palmera de color verde 4.2. ImageCLEF 2006 oscuro en primer plano a la izquierda; un ImageCLEF has been provided with a subset of the cielo azul con nubes en el horizonte al fondo; IAPR TC-12 Benchmark for its upcoming evaluation </DESCRIPTION> event (ImageCLEF 200621) for a task concerning the ad- <NOTES>Nombre original en portugués: "Praia do hoc retrieval of images from photographic image Flamengo"; La Playa del Flamenco es collections (called ImageCLEFphoto). Participants are considerado una de las playas más bonitas de provided with the full collection of 20,000 images; Brasil; </NOTES> however they will not receive the complete set of <LOCATION>Salvador, Brasil</LOCATION> annotations, but a range from complete annotations to no <DATE>2 de octubre de 2002</DATE> annotation at all. Data will be provided in English and <IMAGE>images/16/16019.jpg</IMAGE> German in order to enable the evaluation of multilingual <THUMBNAIL>thumbnails/16/16019.jpg</THUMBNAIL> text-based retrieval systems. 
4. IAPR TC-12 Benchmark at ImageCLEF

The IAPR TC-12 Benchmark will be used for an ad-hoc image retrieval task at ImageCLEF, the text- and/or content-based image retrieval track of CLEF, from 2006 onwards.

4.1. Introduction to ImageCLEF

ImageCLEF conducts evaluations of cross-language image retrieval and is run as part of the CLEF campaign. The ImageCLEF retrieval benchmark first ran in 2003 with the aim of evaluating image retrieval from an English document collection with queries in a variety of languages. ImageCLEF 2004 added a visual retrieval task on a medical image collection and increased the participation from the visual retrieval community. ImageCLEF 2005 (Clough et al., 2006) provided tasks for the system-centred evaluation of retrieval systems in two domains: historic photographs and medical images. These domains offer realistic scenarios in which to test the performance of image retrieval systems, and they pose various challenges and problems to participants. One purely visual task was offered on the automatic annotation of medical images. An interactive image retrieval task was also offered.

The ImageCLEF benchmark aims to evaluate image retrieval from multilingual document collections, and a major goal is to investigate the effectiveness of multimodal retrieval (visual image features and textual descriptions combined). ImageCLEF has already seen participation from academic and commercial research groups worldwide, from communities including Cross-Language Information Retrieval (CLIR), Content-Based Image Retrieval (CBIR), medical information retrieval and user interaction. Campaigns such as CLEF and TREC have proven invaluable in providing standardised resources for the comparative evaluation of a wide range of retrieval tasks, and ImageCLEF aims to provide the research community with similar resources for image retrieval.

4.2. ImageCLEF 2006

ImageCLEF has been provided with a subset of the IAPR TC-12 Benchmark for its upcoming evaluation event (ImageCLEF 200621) for a task concerning the ad-hoc retrieval of images from photographic image collections (called ImageCLEFphoto). Participants are provided with the full collection of 20,000 images; however, they will not receive the complete set of annotations, but a range from complete annotations to no annotation at all. Data will be provided in English and German in order to enable the evaluation of multilingual text-based retrieval systems. In addition to the existing text and/or content-based cross-language image retrieval task, ImageCLEF will also use the IAPR TC-12 Benchmark in an extra task for content-based image retrieval. Other tasks offered in ImageCLEF 2006 include:
- an interactive retrieval evaluation using a database provided by FlickR;
- a medical image retrieval task with a database in three languages and varied annotations;
- a medical automatic annotation task (or image classification);
- a non-medical image annotation task (object recognition).

4.3. ImageCLEF 2007 and onwards

ImageCLEF has also expressed interest in having just one text annotation file, in a randomly selected language, for each image for ImageCLEF 2007, making full use of the benchmark's parametric nature. Based on the discussions at the ImageCLEF workshop, the exact format of the benchmark will then be decided, as the most important goal is to include the research community in the task development process.

21 http://ir.shef.ac.uk/imageclef/2006/
5. Conclusion

Publicly available benchmarking efforts are an important part of maturing research fields. Their goal is to ease the effort researchers must spend on evaluating their algorithms and to provide a platform for information exchange and discussion among researchers. Sometimes these efforts are even carried out at a national level (ImagEval22, France) to supply active researchers with a common evaluation structure for their algorithms. If benchmarks are well made according to the needs of researchers, participation will follow.

An important part of any benchmark is the dataset, and this is certainly no exception in the case of visual information systems. The benefits of the collection described in this paper are:
- high-quality colour photographs;
- pictures from a range of subjects and settings;
- high-quality multilingual text annotations, which together make the collection suitable for evaluating a range of tasks;
- no copyright restrictions, enabling the collection to be used freely by the research community.

It is recognised that benchmarks are not static, as the field of visual information search might (and will) develop, mature and/or even change. Consequently, benchmarks will have to evolve and be augmented with additional features or characteristics depending on researchers' needs, and the IAPR TC-12 Benchmark will be no exception. Apart from the planned completion of the annotations in Spanish and a possible extension to other annotation languages like French, Italian or Portuguese, viable additions include several different annotation formats following a structured annotation defined in MPEG-7, an ontology-based keyword annotation (Hanbury, 2006), or even non-text annotations such as audio annotations.

The methods for generating various types of visual information might produce data with different characteristics in the future, and databases might have to be searched in different ways accordingly. Hence, benchmarks with several different component sets geared to different requirements will be necessary, and the parametric IAPR TC-12 Benchmark has taken a significant step towards that goal.

The IAPR TC-12 collection also targets an important market: that of personal picture collections. While desktop search for text is becoming a common utility, search in private picture collections is still awaiting easy-to-use tools. With the large majority of pictures now taken in digital form, this is a field that is very likely to develop, creating a need for well-performing tools. ImageCLEFphoto can be a first test in which such algorithms prove their performance for real-world use.

22 http://www.imageval.org/
6. References

Berlin, B. & Kay, P. (1969). Basic Color Terms: Their Universality and Evolution. University of California Press.
Braschler, M. & Peters, C. (2004). Cross-Language Evaluation Forum: Objectives, Results, Achievements. Information Retrieval, 7(1-2), pp. 7-31.
Clough, P., Müller, H., Deselaers, T., Grubinger, M., Lehmann, T., Jensen, J. & Hersh, W. (2006). The CLEF 2005 Cross-Language Image Retrieval Track. In Proceedings of the Cross Language Evaluation Forum 2005, Springer Lecture Notes in Computer Science - to appear.
Grubinger, M. & Leung, C. (2003). A Benchmark for Performance Calibration in Visual Information Search. In Proceedings of the 2003 International Conference on Visual Information Systems (VIS 2003), Miami, FL, USA, pp. 414-419.
Grubinger, M. & Leung, C. (2004). Incremental Benchmark Development and Administration. In Proceedings of the Tenth International Conference on Distributed Multimedia Systems (DMS'2004), Workshop on Visual Information Systems (VIS 2004), San Francisco, CA, USA, pp. 328-333.
Grubinger, M., Leung, C. & Clough, P. (2005). The IAPR Benchmark for Assessing Image Retrieval Performance in Cross Language Evaluation Tasks. In Proceedings of the MUSCLE/ImageCLEF Workshop on Image and Video Retrieval Evaluation, Vienna, Austria, pp. 33-50.
Hanbury, A. (2006). Analysis of Keywords used in Image Understanding Tasks. In Proceedings of the OntoImage Workshop at the International Conference on Language Resources and Evaluation (LREC) - to appear.
Harman, D. (1996). Overview of the Fourth Text REtrieval Conference (TREC-4). In Proceedings of the Fourth Text REtrieval Conference (TREC-4), Gaithersburg, MD, USA.
Jörgensen, C. (1996). The applicability of existing classification systems to image attributes: A selected review. Knowledge Organisation and Change, 5, pp. 189-197.
Jörgensen, C. (2001). Towards an image test bed for benchmarking image indexing and retrieval systems. In Proceedings of the International Workshop on Multimedia Content-Based Indexing and Retrieval, Rocquencourt, France.
Leung, C. & Ip, H. (2000). Benchmarking for Content-Based Visual Information Search. In Proceedings of the Fourth International Conference on Visual Information Systems (VISUAL'2000), Lyon, France: Springer Verlag, pp. 442-456.
Markkula, M., Tico, M., Sepponen, B., Nirkkonen, K. & Sormunen, E. (2001). A Test Collection for the Evaluation of Content-Based Image Retrieval Algorithms - A User and Task-Based Approach. Information Retrieval, 4(3-4), pp. 275-293.
Müller, H., Müller, W., Squire, D.M., Marchand-Maillet, S. & Pun, T. (2001). Performance Evaluation in Content-Based Image Retrieval: Overview and Proposals. Pattern Recognition Letters (Special Issue on Image and Video Indexing, H. Bunke and X. Jiang, Eds.), 22(5), pp. 593-601.
Müller, H., Marchand-Maillet, S. & Pun, T. (2002). The truth about Corel - evaluation in image retrieval. In Proceedings of the International Conference on the Challenge of Image and Video Retrieval (CIVR 2002), Springer Lecture Notes in Computer Science (LNCS 2383), London, England, pp. 38-49.
Narasimhalu, A.D., Kankanhalli, M.S. & Wu, J. (1997). Benchmarking Multimedia Databases. Multimedia Tools and Applications, 4, pp. 423-429.
O'Connor, B., O'Connor, M. & Abbas, J. (1999). User Reactions as Access Mechanism: An Exploration Based on Captions for Images. Journal of the American Society for Information Science, 50(8), pp. 681-697.
Over, P., Leung, C., Ip, H. & Grubinger, M. (2004). Multimedia Retrieval Benchmarks. IEEE Multimedia (Digital Multimedia on Demand), April-June 2004, pp. 80-84.
Smeaton, A.F., Kraaij, W. & Over, P. (2004). The TREC VIDeo Retrieval Evaluation (TRECVID): A Case Study and Status Report. In Proceedings of RIAO 2004.
Smith, J.R. (1998). Image Retrieval Evaluation. In IEEE Workshop on Content-Based Access of Image and Video Libraries, Santa Barbara, CA, USA, pp. 112-113.
Tam, A. & Leung, C. (2001). Structured Natural-Language Descriptions for Semantic Content Retrieval of Visual Materials. Journal of the American Society for Information Science and Technology, 52(11), pp. 930-937.

Analysis of Keywords used in Image Understanding Tasks

Allan Hanbury
Pattern Recognition and Image Processing Group (PRIP)
Institute of Computer-Aided Automation
Favoritenstraße 9/1832, A-1040 Vienna, Austria
[email protected]

Abstract
In the field of computer vision, automated image annotation and object recognition are currently important research topics. It is hoped that these will lead to improved general image understanding, which can be usefully applied in Content-Based Image Retrieval. In this paper, an analysis of the keywords that have been used in automated image and video annotation research and evaluation campaigns is presented. The outcome of this analysis is a list of 525 keywords divided into 15 categories. Given that this list is collected from existing image annotations, it could be used to check the applicability of ontologies describing entities which are portrayable in images.

1. Introduction

The usual reason to annotate data (i.e. to add metadata to it) is to simplify access to it. This is particularly important for the semantic web: the metadata added to documents or images allows for more effective searches. The problem with adding metadata manually is that it is an extremely labour-intensive and time-consuming task. In the field of computer vision, automated image annotation and object recognition are therefore currently important research topics (Barnard et al., 2003; Carbonetto et al., 2004; Csurka et al., 2004; Li and Wang, 2003; Winn et al., 2005). This automatic generation of image metadata should allow image searches and Content-Based Image Retrieval (CBIR) to be more effective. For example, an image database could be annotated offline by running a keyword annotation algorithm. Every image containing a cup would then have the keyword "cup" associated with it. If a user wishes to find images of a specific cup in this database, he/she would select a region containing the target cup from an image. An object recognition algorithm could then categorise the selected region as a cup, and a text search could be carried out to find all images in the database with the associated keyword "cup". This would significantly reduce the number of images in which it would be necessary to attempt to recognise the specific cup selected by the user.

To measure progress towards successfully carrying out this task, evaluation of the algorithms which can automatically extract this sort of metadata is required. For successful evaluation of these algorithms, reliable ground truth is necessary. This ground truth should be a semantically rich description of the objects in an image (Leung and Ip, 2000). There is obviously almost no limit to how semantically rich one could make the description of an image. Indeed, for the manual annotation of documents destined to aid in online searching, semantic richness is an advantage. For images, one can create complex ontologies allowing the specification of objects and actions. For example, Schreiber et al. (2001) create such an ontology for annotating photographs of apes: one can specify the type of ape, how old it is and what it is doing. Nevertheless, it should be borne in mind that the automated content description and annotation algorithms being developed cannot yet be expected to perform at the same level as a human annotator. The current state of the art in automated annotation tends to operate at an extremely low level; for example, there is still no algorithm that can make an error-free distinction between images of cities and images of landscapes, or which can make an error-free decision as to the presence or absence of human faces in an image.

Evaluating the abilities of current algorithms therefore requires a rather low level of annotation. Even though different modalities of annotation exist, such as description using keywords, annotations based on ontologies and free-text description, the majority of annotations are created by assigning keywords to images. For object recognition tasks, controlled vocabularies are often used, with the vocabulary being defined by the capabilities of the object recognition algorithm used (Winn et al., 2005). In applications which aim at more general image labelling using a larger number of keywords, the vocabulary is often uncontrolled, as in (Li and Wang, 2003). For example, the TRECVID 2005 high-level feature detection task tested the automatic detection of only 10 concepts. The IBM MARVEL Multimedia Search Engine1 extracts only six concepts in the online image retrieval demo version2 (face, human, indoor, outdoor, sky, nature). Carbonetto et al. (2004) use a vocabulary of at most 55 keywords. The largest number of keywords has been used by Li and Wang (2003), who assigned 433.

A good way of collecting keywords which would be useful in an ontology describing images is to analyse the vocabularies used in the ground truth of image annotation and object recognition tasks. In this way, one can find out which words are important in applications and which words correspond to objects that can be detected using current state-of-the-art image understanding algorithms. After an overview of some approaches to collecting manual image annotations (Section 2.), we analyse the annotations which have been used in image and video understanding publications and evaluation campaigns in Section 3. The list of collected keywords is at the end of the paper, in Section 6.

1 http://www.research.ibm.com/marvel
2 http://www.alphaworks.ibm.com/tech/marvel
2. Manual annotation collection methods

The manual annotation of images is a very labour-intensive and time-consuming task. Various systems have been set up to simplify the collection of image annotations or to receive input from a large number of people.

An interesting experiment is taking place on the Gimp-Savvy Community-Indexed Photo Archive website3. This archive contains more than 27 000 free photos and images, and the users of the site are requested to annotate the images using keywords which they are free to choose (tips on choosing keywords are made available4). That this "free annotation by all" approach has not been totally successful can be seen from the extremely large number of "junk" keywords on the master list5, as well as from the over-annotation (assignment of too many keywords) of many of the images.

On the Flickr6 photo archive, people who upload photos may also assign keywords to them. These are then used to search for images. Other users may add comments to the images. There is no standardised keyword list, so this database represents a good example of the annotation practice of amateur photographers on their own images.

An innovative approach to collecting annotations of images by keywords has been developed by Ahn and Dabbish (2004). In their ESP game7, they aim to make the annotation of images enjoyable. Players access the ESP game server and are paired randomly; they have no way of communicating with each other. Pairs of players are shown 15 images during a game, the aim being for both players to type in the same keyword for an image so as to advance to the next one. This is an intelligent way of avoiding the problem of "junk" keywords, as the pairs of players verify the keywords. Keywords which are typed often for an image are added to a "taboo" list shown with that image and can no longer be entered as keywords by the players. The keywords entered correspond to the whole image, although the authors have discussed implementing, for example, a "shooting game" in which the players have to click on the requested object. The Peekaboom game8 from the same research group is of this type. An image search engine based on the keywords collected by the ESP game for about 30 000 images is accessible on the web9.

An online annotation application aimed at collecting keywords for image regions is the LabelMe tool10. Here the user clicks the vertices of a polygon around an object and then enters a keyword describing the object. As the vocabulary is not controlled, multiple keywords and misspelled keywords often occur, as can be seen by examining the keyword statistics on the webpage11. This problem is solved by a verification step performed by the database administrators. At present12, there are 101 verified keywords, the majority of which are shown in Table 2. The incentive to annotate the images is that the annotator may then download the latest annotations.

3 http://gimp-savvy.com/PHOTO-ARCHIVE/
4 http://gimp-savvy.com/PHOTO-ARCHIVE/tips_on_indexing.html
5 http://gimp-savvy.com/cgi-bin/masterkeys.cgi
6 http://www.flickr.com
7 http://www.espgame.org
8 http://www.peekaboom.org/
9 http://www.captcha.net/esp-search.html
10 http://people.csail.mit.edu/brussell/research/LabelMe/intro.html
11 400 keywords on the 29th of July 2005.
12 27 July 2005
3. Analysis of Keywords used in Annotation Experiments

In this section we analyse the keywords that have been used in image annotation, categorisation and object recognition experiments and evaluation campaigns. To begin, a brief discussion of the difference between annotation and categorisation is presented in Section 3.1. (Some methods currently used for collecting manual annotations of images were listed in Section 2.) We then present an analysis of the keywords that have been used in image annotation experiments. The analysis was carried out in two steps: the first step consisted of creating a list combining all the keywords used in the experiments, datasets and evaluations considered, and removing the unsuitable words (Section 3.2.); the second step was the categorisation of the keywords (Section 3.3.). From a practical point of view, it is useful if the keywords are sorted into categories: when one is annotating images, this simplifies the choice of a word from the keyword list, as one can select the category that the image belongs to in order to reduce the choice of keywords. The result of this analysis is a list of 525 keywords assembled from various sources and divided into 15 categories.

3.1. Annotation and Categorization

There are two approaches to associating textual information with images described in the literature: annotation and categorisation. In annotation, keywords or detailed text descriptions are associated with an image, whereas in categorisation, each image is assigned to one of a number of predefined categories (Chen and Wang, 2004). This can range from a more general two-category classification, such as indoor/outdoor (Szummer and Picard, 1998) or city/landscape (Vailaya et al., 2001), to more specific categories such as African people and villages, Dinosaurs, Fashion and Battle ships (Chen and Wang, 2004). Categorisation can be used as an initial step in image understanding in order to guide further processing of the image. For example, in (Wang et al., 2001) a categorisation into textured/non-textured and graph/photograph classes is performed as a pre-processing step.

Recognition is concerned with the identification of particular object instances. Recognition would distinguish between images of two structurally distinct cups, while categorisation would place them in the same class (Csurka et al., 2004). Recognition also has its uses in annotation, for example in the recognition of family members in the automatic annotation of family photos.

Categorisation can be considered as annotation in which one must choose from a fixed number of keywords (the categories) and one is limited to assigning one keyword to each image. The discussion of annotation and categorisation is therefore combined in this section.
3.2. Overview of Visual Keywords

We present a collection of groups of keywords which have already been used for testing automated image annotation algorithms or in automated image and video annotation evaluation campaigns.

The 10 features which were tested in the TRECVID 2005 high-level feature detection task are described in Table 1. All 40 news concepts defined for TRECVID 2005 are available for download13 (they are part of the LSCOM creation task (Hauptmann, 2004)).

Keywords                  Segment contains video of ...
People walking/running    more than one person walking or running
Explosion or fire         an explosion or fire
Map                       a map
US flag                   a US flag
Building exterior         the exterior of a building
Waterscape/waterfront     a waterscape or waterfront
Mountain                  a mountain or mountain range with slope(s) visible
Prisoner                  a captive person, e.g., imprisoned, behind bars, in jail, in handcuffs, etc.
Sports                    any sport in action
Car                       an automobile

Table 1: The 10 features which were tested in the TRECVID 2005 high-level feature detection task.

Two categorisation tasks are part of the ImagEVAL14 campaign: for the general image description task, the hierarchically organised global image categories shown in Figure 1 will be tested. There is also an object detection task, although the list of objects to be tested has not been finalised yet; the examples given are car, tree, chair, Eiffel Tower and American Flag.

Figure 1: The hierarchy of keywords used in the global image characteristics task of ImagEVAL. The top level distinguishes Black & White Photo, Colourised Black & White Photo, Colour Photo and Artistic Reproduction; photographs are further divided into Indoor and Outdoor, outdoor images into Day and Night, and each of these into Urban Scene and Natural Scene.

The PASCAL Visual Object Classes Challenge 2005 consisted of classification and detection tasks for four objects: motorbikes, bicycles, people and cars. However, in the database collection set up as part of this challenge15, five databases are provided with standardised ground-truth object annotations. The keyword list arising from this standardisation is shown in Table 2.

aeroplaneSide apple background bicycle bicycleSide bookshelf bookshelfFrontal bookshelfPart bookshelfSide bookshelfWhole bottle building buildingPart buildingRegion buildingWhole can car carFrontal carPart carRear carSide cd chair chairPart chairWhole coffeemachine coffeemachinePart coffeemachineWhole cog cow cowSide cpu desk deskFrontal deskPark deskPart deskWhole donotenterSign door doorFrontal doorSide face filecabinet firehydrant freezer frontalWindow head keyboard keyboardPart keyboardRotated light motorbike motorbikeSide mouse mousepad mug onewaySign paperCup parkingMeter person personSitting personStanding personWalking poster posterClutter pot printer projector roadRegion screen screenFrontal screenPart screenWhole shelves sink sky skyRegion sofa sofaPart sofaWhole speaker steps stopSign street streetSign streetlight tableLamp telephone torso trafficlight trafficlightSide trash trashWhole tree treePart treeRegion treeWhole walksideRegion wallClock watercooler window

Table 2: The keywords in the PASCAL Object Recognition Database Collection (the prefix "PAS" has been removed from each keyword).

As part of the EU LAVA project16, a database consisting of 10 categories of images was made available17. These categories are: bikes, boats, books, cars, chairs, flowers, phones, roadsigns, shoes and soft toys.

Chen and Wang (2004) classified images into 20 categories: African people and villages, Beach, Historical buildings, Buses, Dinosaurs, Elephants, Flowers, Horses, Mountains and glaciers, Food, Dogs, Lizards, Fashion, Sunsets, Cars, Waterfalls, Antiques, Battle ships, Skiing and Deserts.

Two databases have been released by Microsoft Research in Cambridge18. The "Database of thousands of weakly labelled, high-res images" contains images divided into the following 23 categories: aeroplanes, cows, sheep, benches and chairs, bicycles, birds, buildings, cars, chimneys, clouds, doors, flowers, forks, knives, spoons, leaves, countryside scenes, office scenes, urban scenes, signs, trees, windows and miscellaneous. Some of these are divided into sub-classes, such as different views of cars. The "Pixel-wise labelled image database" contains 591 images in which regions are manually labelled using the following 23 labels: building, grass, tree, cow, horse, sheep, sky, mountain, aeroplane, water, face, car, bicycle, flower, sign, bird, book, chair, road, cat, dog, body and boat. The majority of the images are roughly segmented, although accurate segmentations of some of the images are available.

It is, of course, possible to greatly extend the number of categories if one is recognising specific objects, as in the Caltech 101 category database19 (Fei-Fei et al., 2004), which contains images of objects in the categories shown in Table 3.

Faces Faces easy Leopards Motorbikes accordion airplanes anchor ant barrel bass beaver binocular bonsai brain brontosaurus buddha butterfly camera cannon car side ceiling fan cellphone chair chandelier cougar body cougar face crab crayfish crocodile crocodile head cup dalmatian dollar bill dolphin dragonfly electric guitar elephant emu euphonium ewer ferry flamingo flamingo head garfield gerenuk gramophone grand piano hawksbill headphone hedgehog helicopter ibis inline skate joshua tree kangaroo ketch lamp laptop llama lobster lotus mandolin mayfly menorah metronome minaret nautilus octopus okapi pagoda panda pigeon pizza platypus pyramid revolver rhino rooster saxophone schooner scissors scorpion seahorse snoopy soccer ball stapler starfish stegosaurus stop sign strawberry sunflower tick trilobite umbrella watch water lilly wheelchair wildcat windsor chair wrench yin yang

Table 3: The 101 categories used by Fei-Fei et al. (2004).

If one restricts oneself to such specific categories, it is obviously possible to create many thousands. A set of 16 broader categories has been defined for the 15 200 images in the CEA-CLIC database (Moëllic et al., 2005); these are shown in Table 4.

Category              Description
Food                  Images of food and meals.
Architecture          Images of architecture, architectural details, castles, churches, Asian temples.
Arts                  Paintings, sculptures, stained glass, engravings.
Botanic               Various plants, trees, flowers.
Linguistic            Images containing text areas.
Mathematics           Fractals.
Music                 Images of musical instruments.
Objects               Images representing everyday objects such as coins, scissors, etc.
Nature & Landscapes   Landscapes, valleys, hills, deserts, etc.
Society               Images with people.
Sports & Games        Stadiums, items from games and sports.
Symbols               Iconic symbols, roadsigns, national flags (real and synthetic images).
Technical             Images involving transportation, robotics, computer science.
Textures              Rock, sky, grass, wall, sand, etc.
City                  Buildings, roads, streets, etc.
Zoology               Images of animals (mammals, reptiles, birds, fish).

Table 4: The 16 categories in the CEA-CLIC image database and their descriptions (Moëllic et al., 2005).

A number of papers on automatic image or image-region annotation have also been published. The following three all use parts of the Corel image database, along with keywords usually extracted from the annotations accompanying the Corel images. The 55 keywords used by Carbonetto et al. (2004) are given in Table 5.

airplane astronaut atm bear beluga bill bird boat building cheetah church cloud coin coral cow crab dolphin earth elephant fish flag flowers fox goat grass ground hand horse house lion log map mountain mountains person pilot polarbear rabbit road rock sand sheep shuttle sky snow space tiger tracks train trees trunk water whale wolf zebra

Table 5: The 55 keywords used by Carbonetto et al. (2004).

Li and Wang (2003) used the largest number of keywords. They defined 600 categories of images and assigned on average 3.6 keywords to each category. Each of the 100 images in each category was then given the same keywords associated with the category. For example, all images in the "Paris/France" category were assigned the keywords "Paris, European, historical building, beach, landscape, water", the images in the "Lion" category were assigned the keywords "lion, animal, wildlife, grass" and the images in the "eagle" category were assigned the keywords "wildlife, eagle, sky, bird". Barnard et al. (2003) used 323 keywords. These lists are not reproduced in this paper due to lack of space, but can be seen in (Hanbury, 2006).

13 http://www-nlpir.nist.gov/projects/tv2005/LSCOMlite_NKKCSOH.pdf
14 http://www.imageval.org
15 http://www.pascal-network.org/challenges/VOC/
16 http://www.l-a-v-a.org
17 ftp://ftp.xrce.xerox.com/pub/ftp-ipc/
18 Downloadable here: http://www.research.microsoft.com/vision/cambridge/recognition/default.htm. Version 1 of the pixel-wise labelled image database has been ignored here, as it forms a subset of version 2.
19 http://www.vision.caltech.edu/Image_Datasets/Caltech101/Caltech101.html
3.3. Analysis of Visual Keywords

The aim of this analysis is to create a list of keywords which reflects the current interest in automated image annotation with keywords. These keywords could then serve as an initial controlled vocabulary for re-annotating the image collections used in previous experiments and for annotating new image collections.

3.3.1. Creation of a combined keyword list

The first step of the analysis consisted of creating a list combining all the keywords and categories used in the experiments, datasets and evaluations covered in Section 3.2. We then removed words which were considered to be unsuitable. These include place names, such as "Australia", "Boston" and "New Zealand", which, even for a human, are very difficult to assign to images for which one has no supplementary information. Confusing keywords, such as "history" and "north", and keywords requiring too high a level of a priori semantic information, such as "landmark" and "rare animal", were also removed. We have not yet collected statistics on how often a single keyword appears in different lists.
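Expressed in code, this first step is a simple set union followed by a subtraction. The Python sketch below is our own illustration; the source lists are abbreviated and partly invented, and only the removal list reflects the examples named above:

SOURCES = {
    "trecvid2005": {"map", "US flag", "car", "sports"},
    "carbonetto":  {"airplane", "bear", "grass", "map"},
    "li_wang":     {"history", "wildlife", "grass"},   # invented subset
}
UNSUITABLE = {"history", "north", "landmark", "rare animal",
              "australia", "boston", "new zealand"}

combined = set()
for keywords in SOURCES.values():
    combined |= {k.lower() for k in keywords}  # a keyword used twice appears once
combined -= UNSUITABLE                         # drop the unsuitable words

print(sorted(combined))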
3.3.2. Categorisation of keywords

From a practical point of view, it is useful if the keywords are sorted into categories. When one is annotating images, this simplifies the choice of a word from the keyword list: one can select the category that the image belongs to in order to reduce the choice of keywords. The 16 categories of the CEA-CLIC database (Moëllic et al., 2005) turn out, with some minor changes, to be well suited to grouping the combined list of keywords. The changes are:

• the fusion of the "Architecture" and "City" categories to form an "Architecture / City" category. This was done as it is often difficult for an annotator to decide between these two categories;
• the addition of an "Abstract / Global" category to contain words such as "female" and "exterior";
• the removal of the "Mathematics" category, which has no members in the list of keywords collected;
• the removal of the "Linguistic" category, as this is an image category and not a keyword category;
• the addition of the "Anatomy and Medicine" category, which at present includes one keyword, but can be expanded later.

The list of categories and their descriptions is given in Table 6.

#    Category               Description
0    Abstract / Global      Words which describe the whole image or which are applicable to more than one class of objects.
1    Food                   Food and meals.
2    Architecture / City    Architecture, architectural details, castles, churches, Asian temples, buildings, roads, streets, etc.
3    Arts                   Paintings, sculptures, stained glass, engravings.
4    Botanic                Plants, trees, flowers.
5    Objects                Everyday objects such as coins, scissors, etc.
6    Nature & Landscapes    Landscapes, valleys, hills, deserts, etc.
7    Society                People, groups of people, activities undertaken by society (celebrations, parades, war, etc.).
8    Sports & Games         Stadiums, items from games and sports.
9    Symbols                Iconic symbols, roadsigns, national flags.
10   Technical              Transportation, robotics, computer science.
11   Textures               Words which describe a texture.
12   Zoology                Animals (mammals, reptiles, birds, fish).
13   Anatomy and Medicine   Biological organs, anatomical diagrams, etc.
14   Music                  Musical instruments.

Table 6: The 15 categories of the combined keyword list and their descriptions. The first column contains the category number.

We assigned each of the keywords in the combined list to at least one category. A few keywords were assigned to two categories; for example, "grass" appears in both the "Textures" and the "Nature & Landscapes" categories. A table showing the keywords assigned to each category is given in Section 6. A histogram of the number of keywords per category is shown in Figure 2.

Figure 2: The number of keywords in each category (histogram over the 15 categories of Table 6).

One can see from this histogram that the categories "Objects", "Nature & Landscapes" and "Zoology" contain the most keywords, which could be an indicator that these categories have received the most attention in past research on automated image annotation and categorisation. This could be because of the image databases used: the Corel databases, for example, appear to contain a high proportion of nature and animal images. Man-made objects appear to be more prevalent in the databases designed for object categorisation experiments.
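The resulting two-level structure, and the per-category counts plotted in Figure 2, can be represented as in the following Python sketch (the category numbers follow Table 6, but the keyword assignments shown are a tiny illustrative subset, not the full 525-keyword list):

from collections import Counter

CATEGORIES = {0: "Abstract / Global", 2: "Architecture / City",
              6: "Nature & Landscapes", 11: "Textures", 12: "Zoology"}

# a keyword may belong to more than one category, e.g. "grass"
ASSIGNMENTS = {
    "grass":    [6, 11],
    "building": [2],
    "exterior": [0],
    "elephant": [12],
}

histogram = Counter(c for cats in ASSIGNMENTS.values() for c in cats)
for number, name in sorted(CATEGORIES.items()):
    print(f"{number:2d}  {name}: {histogram.get(number, 0)} keyword(s)")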
4. Conclusion

We have analysed the keywords which have been used to annotate images in a number of image retrieval publications and evaluation campaigns. A significant contribution is the creation of a combined keyword list based on these keywords. From this analysis one can see that the main automated annotation effort has been directed at images of everyday objects, nature and landscapes, and animals (zoology). As these keywords were extracted from the annotations of existing image datasets, they should be well suited to a more precise re-annotation of these same datasets. For the same reason, they are also suited to verifying the applicability of newly developed image ontologies intended to represent portrayable entities and objects.

A disadvantage is that, while the keywords in this list certainly correspond well to the images used in image annotation experiments so far, there is no guarantee that these images are representative of all possible electronic images. It would therefore be useful to compare this collection of keywords to an ontology constructed in a more rigorous way, such as the ontology of portrayable objects based on WordNet (Zinger et al., 2005). This should provide a useful link between possible portrayable objects and those that are often found in images, or that are of interest to image understanding researchers.
Arts: Paintings, sculptures, stained glass, engravings.
Botanic: Various plants, trees, flowers.
Linguistic: Images containing text areas.
Mathematics: Fractals.
Music: Images of musical instruments.
Objects: Images representing everyday objects such as coins, scissors, etc.
Nature & Landscapes: Landscapes, valleys, hills, deserts, etc.
Society: Images with people.
Sports & Games: Stadiums, items from games and sports.
Symbols: Iconic symbols, roadsigns, national flags (real and synthetic images).
Technical: Images involving transportation, robotics, computer science.
Textures: Rock, sky, grass, wall, sand, etc.
City: Buildings, roads, streets, etc.
Zoology: Images of animals (mammals, reptiles, birds, fish).

Table 4: The 16 categories in the CEA-CLIC image database and their descriptions (Moëllic et al., 2005).

airplane astronaut atm bear beluga bill bird boat building cheetah church cloud coin coral cow crab dolphin earth elephant fish flag flowers fox goat grass ground hand horse house lion log map mountain mountains person pilot polarbear rabbit road rock sand sheep shuttle sky snow space tiger tracks train trees trunk water whale wolf zebra

Table 5: The 55 keywords used by Carbonetto et al. (Carbonetto et al., 2004).

0 Abstract / Global: Words which describe the whole image or which are applicable to more than one class of objects.
1 Food: Food and meals.
2 Architecture / City: Architecture, architectural details, castles, churches, Asian temples, buildings, roads, streets, etc.
3 Arts: Paintings, sculptures, stained glass, engravings.
4 Botanic: Plants, trees, flowers.
5 Objects: Everyday objects such as coins, scissors, etc.
6 Nature & Landscapes: Landscapes, valleys, hills, deserts, etc.
7 Society: People, groups of people, activities undertaken by society (celebrations, parades, war, etc.).
8 Sports & Games: Stadiums, items from games and sports.
9 Symbols: Iconic symbols, roadsigns, national flags.
10 Technical: Transportation, robotics, computer science.
11 Textures: Words which describe a texture.
12 Zoology: Animals (mammals, reptiles, birds, fish).
13 Anatomy and Medicine: Biological organs, anatomical diagrams, etc.
14 Music: Musical instruments.

Table 6: The 15 categories of the combined keyword list and their descriptions. The first column contains the category number.

[Figure 2: The number of keywords in each category (y-axis: number of keywords, 0-100; x-axis: the 15 categories of Table 6).]

5. References
Luis von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In Proc. ACM CHI, pages 319–326.
Kobus Barnard, Pinar Duygulu, Nando de Freitas, David Forsyth, David Blei, and Michael I. Jordan. 2003. Matching words and pictures. Journal of Machine Learning Research, 3:1107–1135.
Peter Carbonetto, Nando de Freitas, and Kobus Barnard. 2004. A statistical model for general contextual object recognition. In Proceedings of ECCV 2004, Part I, pages 350–362.
Yixin Chen and James Z. Wang. 2004. Image categorization by learning and reasoning with regions. Journal of Machine Learning Research, 5:913–939.
Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, and Cedric Bray. 2004. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision (at ECCV).
L. Fei-Fei, R. Fergus, and P. Perona. 2004. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In Proceedings of the Workshop on Generative-Model Based Vision, June.
Allan Hanbury. 2006. Review of image annotation for the evaluation of computer vision algorithms. Technical Report PRIP-TR-102, PRIP, TU Wien, January.
Alexander G. Hauptmann. 2004. Towards a large scale concept ontology for broadcast video. In Proceedings of the Third Intl. Conf. on Image and Video Retrieval, pages 674–675.
Clement H. C. Leung and Horace Ho-Shing Ip. 2000. Benchmarking for content-based visual information search. In Proceedings of the 4th International Conference on Advances in Visual Information Systems, pages 442–456.
Jia Li and James Z. Wang. 2003. Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1075–1088.
Pierre-Alain Moëllic, Patrick Hède, Gregory Grefenstette, and Christophe Millet. 2005. Evaluating content based image retrieval techniques with the one million images CLIC testbed. In Proceedings of the Second World Enformatika Congress, WEC'05, pages 171–174.
A. Th. (Guus) Schreiber, Barbara Dubbeldam, Jan Wielemaker, and Bob Wielinga. 2001. Ontology-based photo annotation. IEEE Intelligent Systems, 16(3):66–74.
M. Szummer and R. W. Picard. 1998. Indoor-outdoor image classification. In Proc. IEEE International Workshop on Content-based Access of Image and Video Databases, pages 42–51.
A. Vailaya, M. A. T. Figueiredo, A. K. Jain, and H.-J. Zhang. 2001. Image classification for content-based indexing. IEEE Transactions on Image Processing, 10(1):117–130.
James Z. Wang, Jia Li, and Gio Wiederhold. 2001. SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9):947–963.
J. Winn, A. Criminisi, and T. Minka. 2005. Object categorization by learned universal visual dictionary. In Proceedings of the International Conference on Computer Vision (ICCV).
S. Zinger, C. Millet, B. Mathieu, G. Grefenstette, P. Hède, and P.-A. Moëllic. 2005. Extracting an ontology of portrayable objects from WordNet. In Proceedings of the MUSCLE/ImageCLEF Workshop on Image and Video Retrieval Evaluation, pages 17–23, Vienna, Austria, September.

6. Combined Keyword List
The following table lists the combined keyword list. It is a simple two-level hierarchy, with 15 headings at the top level (in bold). Note that some words are repeated under more than one heading.
Abstract / Global
background black black and white blue color exterior female fractal green group indoor interior male nature orange outdoor pattern red shadow yellow

Food
apple cuisine dessert drink feast food fruit grapes herb spice orange pizza pumpkin strawberry vegetable wine

Architecture / City
arch architecture building castle chimney church city college column courtyard dock fountain harbor historical building hotel house hut industry kitchen market minaret monument mosque museum office pagoda palace park pillar restaurant roof ruin shop skyline stairs statue street studio temple tower town village window

Art Objects
art carving decoration design drawing graffiti mosaic mural painting photo poster sculpture statue still life

Botanic
apple bonsai botany branch bush cactus flower foliage fungus grapes leaf lichen log moss mushroom orchid palm perenial petal plant pumpkin rose seed strawberry sunflower tree tulip water lily

Objects (man-made everyday)
anchor antique atm balloon barbecue barrel bath bead bench bicycle binoculars book bookshelf bottle camera can candy card cd cellphone chair clock cloth coffee machine cog coin cup currency decoration desk dish dogsled doll door dress Easter egg fabric fan fence file cabinet fire hydrant firearm firework flag floor freezer furniture glass gun hat headphones horn jewelry keyboard lamp light map marble mask medicine money mousepad mug paper paper cup parking meter pill pot printer projector relic scissors screen shelves shoe sink sofa speaker sponge stamp stapler table telephone textile tool toy traffic light trash umbrella wall watch watercooler wheelchair wood wrench

Nature and Landscapes
agriculture autumn barnyard bay beach canyon cave cliff cloud coast coral crop crystal dawn desert dune dusk earth farm field flowerbed forest frost frozen garden gem glacier grass ground hill ice iceberg island lake landscape maritime meadow mountain night ocean pastoral path peak plain planet polar pyramid rapids reef reflection river road rock ruin runway rural sail sand shell shore shrine sky smoke snow space spring star steam stone sub sea summer sun sunset surf tree tropical tundra valley vegetation vineyard volcano wall water waterfall wave wind winter woodland

Society
astronaut baby ballet barbecue battle builder business child Christmas costume couple diver face fashion festival fight glamour graffiti guard hand head holiday home hunter leisure man model occupation parade person pilot pomp and pageantry religion royal sacred science travel tribal war woman work worship youth

Sports and Games
fitness football game golf kungfu play polo race rafting recreation rodeo ski sport tennis wind surfer
Symbols
public sign road sign sign do not enter sign stop sign oneway sign yield

Technical
aeroplane aviation balloon battle ship boat bridge bus cannon canoe car communication engine ferry helicopter highway jet lighthouse locomotive machine military molecule motorcycle pathology railroad road runway sailboat ship space shuttle street tallship train transportation vehicle

Textures
fabric fire glass grass ground ice marble sand skin stone textile texture wood

Zoology
anemone angelfish animal ant antelope antlers bear beaver beetle bird bobcat bull butterfly camel caribou cat caterpillar cheetah coral cougar cow coyote crab crayfish crocodile cub deer dinosaur dog dolphin dragonfly eagle elephant elk feline fish flamingo foal fowl fox giraffe goat hawk hedgehog herd hippopotamus horn horse iguana insect jaguar kangaroo kitten leopard lion lizard llama lobster lynx mammal moth mouse nest ocean animal octopus owl panda penguin pet pigeon polar bear predator primate rabbit reptile rhinoceros rodent rooster scorpion seahorse seal sheep skin snake sponge squirrel starfish tiger turtle whale wildcat wildlife wolf young animal zebra

Anatomy and Medicine
brain

Musical Instruments
accordion cello double bass electric guitar guitar horn mandolin piano piano grand saxophone trombone trumpet tuba viola violin
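Machine-readable, the combined list is simply a two-level mapping from category headings to keyword sets. The following minimal Python sketch (using a small illustrative excerpt of the list above, not the full data) shows one such encoding, together with a check for the keywords that, as noted in Section 6, appear under more than one heading:

```python
# A sketch of the two-level keyword hierarchy: category -> set of keywords.
# Only an excerpt of the full combined list is shown here.
KEYWORD_HIERARCHY = {
    "Food": {"apple", "cuisine", "orange", "pizza", "strawberry"},
    "Botanic": {"apple", "bonsai", "strawberry", "sunflower", "tree"},
    "Textures": {"fabric", "grass", "marble", "sand", "wood"},
}

def multi_category_keywords(hierarchy):
    """Return the keywords assigned to more than one category."""
    assignments = {}
    for category, keywords in hierarchy.items():
        for kw in keywords:
            assignments.setdefault(kw, []).append(category)
    return {kw: cats for kw, cats in assignments.items() if len(cats) > 1}

print(multi_category_keywords(KEYWORD_HIERARCHY))
# e.g. {'apple': ['Food', 'Botanic'], 'strawberry': ['Food', 'Botanic']}
```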
[email protected]Abstract In this paper, we propose to improve our previous work on automatically filling an image ontology via clustering using images from the web. This work showed how we can automatically create and populate an image ontology using the WordNet textual ontology as a basis, pruning it to keep only portrayable objects, and clustering to get representative image clusters for each object. The improvements are of two kinds: first we are trying to automatically locate the objects in images so that the image features become independent of the context. The second improvement is a new method to semantically sort clusters using colors: the most probable colors for an object are learnt automatically using textual web queries, and then the clusters are sorted according to these colors. The results show that the segmentation improves the quality of the clusters, and that meaningful colors are often guessed, thus displaying pertinent clusters on top, and bad clusters at the bottom. 1. Introduction the context, and cluster them into coherent groups. In order Since available annotated image databases or ontologies are to reduce the noise in images returned from the Web using still only a few and are far from representing every object textual queries, we can refine the query adding the category in the world, we are working on automatically constructing of the desired object. This will be described in Section 2. an image ontology using a textual ontology on the one Then, we would like to sort the obtained clusters in order to hand, and the Internet as a huge but incompletely and try to have the most relevant images first, and optionally to inaccurately annotated image database on the other hand. eliminate clusters that do not contain the expected object. Such approaches have been first proposed by (Cai et al., We propose to apply a semantic color filtering. The idea 2004) and (Wang et al., 2004). is to give more importance to the images containing the (Wang et al., 2004) developed a method to automatically probable colors of an object. For example, if we are use web images for image retrieval. An attention map is querying for images of bananas, we are expecting to see used to find the object in an image, and the text surrounding yellow images first. A list of possible colors of an object the image is matched to the region level instead of the is retrieved automatically from the web. We have also image level. Then, regions are clustered and each cluster developed a matching between the name of colors and is annotated using the text-region matching. Results are the HSV values of a pixel allowing us to compare the promising and can be improved with query expansion. (Cai colors contained in an image with the possible colors of et al., 2004) proposed to cluster images from the web using the object it is supposed to be depicting. This is explained three kinds of representation: textual information extracted in Section 3. from the text and links appearing around the image in the Eventually, we will discuss our results in Section 4. web pages, visual features, and a graph linking the regions 2. Obtaining image clusters from the Web of the image. The application given is to show web image search results grouped into clusters instead of giving a list 2.1. Pruning Wordnet that mixes different topics. However, no work has been The objects we are interested in are picturable objects. done to try to semantically sort clusters by relevance. 
2.2. Using the right set of keywords
Now that we have the skeleton of our ontology, we would like to populate it with images from the web. In order to retrieve images from the web, we use text queries on image search engines such as Google or Yahoo!, where the names of the pictures and the text surrounding the pictures in the web pages have been used as textual indexing. For some requests, we notice that the amount of noise can be quite important; furthermore, we would like to disambiguate the query to obtain images representing only one object: asking for jaguar on an image search engine returns a mix of animals and cars, because the word jaguar is polysemic. Here, the ontological information extracted from WordNet helps to obtain more accurate images. Adding an upper node of the ontology to the text queries allows disambiguating the query, and gives better results even for words that are not ambiguous. For the jaguar example, we will have two separate queries: jaguar car and jaguar animal. The precision is increased, but the recall is decreased: Google Image Search returns 3 750 images for jaguar animal and 40 100 images for jaguar car, which is to be compared with the 553 000 images returned for jaguar, most of which are either animals or cars: we only obtain a tenth of the images.

[Figure 1: Automatic segmentation of a car image. The top left image is the original image. The top right image is the result of the segmentation into 20 regions. After removing the regions touching the edges of the image and merging the other regions, we obtain the bottom image. There are two connected regions in this image, and we keep the largest one, corresponding to the red car. Here, the second connected region is also a car, but it is often noise.]

2.3. Segmentation
Since we want to construct an ontology that can be used for learning, we are interested in images where the object we are looking for is big enough for image processing (the more pixels the better), but small enough to be entirely contained in the image: we do not want to add parts of objects to the ontology, we want to add pictures of the whole object. Furthermore, we would like to index only the object of interest, without taking the context into account: a blue car on green grass and a blue car on a gray road should be recognized as the same object. We make the following three hypotheses on the images:
• there is only one object in the image,
• the object is centered,
• its surface is greater than 5% of the image surface.
The method proposed here is to automatically segment the image and keep only the central object, as sketched below. The following steps are accomplished: the image is segmented into 20 regions using a waterfall segmentation algorithm (Marcotegui and Beucher, 2005), the regions touching the edges of the image are discarded, and the other regions are merged together. The largest connected region is considered as the object and used for further processing. Only the images where the found object is larger than 5% of the image surface are kept.
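The post-segmentation steps can be rendered as the following sketch. It assumes the waterfall algorithm has already produced an integer label map (e.g. 20 regions); numpy/scipy here merely stand in for the authors' actual implementation.

```python
# A minimal sketch of the post-processing after segmentation: discard
# edge regions, merge the rest, keep the largest connected component,
# and apply the 5% surface criterion.
import numpy as np
from scipy import ndimage

def extract_central_object(labels, min_area_ratio=0.05):
    # Discard every region that touches an edge of the image.
    border = np.zeros_like(labels, dtype=bool)
    border[0, :] = border[-1, :] = border[:, 0] = border[:, -1] = True
    edge_labels = np.unique(labels[border])
    mask = ~np.isin(labels, edge_labels)  # merge all remaining regions

    # Keep only the largest connected component of the merged mask.
    comps, n = ndimage.label(mask)
    if n == 0:
        return None
    sizes = ndimage.sum(mask, comps, index=range(1, n + 1))
    largest = (comps == (np.argmax(sizes) + 1))

    # Reject images whose central object covers less than 5% of the image.
    if largest.sum() < min_area_ratio * labels.size:
        return None
    return largest  # boolean mask of the object used for indexing
```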
2.4. Clustering
These segmented images are then clustered with the shared nearest neighbor method (Ertz et al., 2001), using texture and color features (Cheng and Chen, 2003). The shared nearest neighbor clustering algorithm is an unsupervised algorithm, mostly used in text processing, which tries to group together the images that have the same nearest neighbors. The texture features used are a 512-bin local edge pattern histogram, and the color features are a 64-bin color histogram. Clusters containing fewer than 8 images are discarded.
We have noticed that the segmentation step improves the quality of the clusters for most queries, mostly because it makes them independent of the context. We will show some examples in Section 4.

3. Color sorting

3.1. Obtaining the colors of images
The HSV (hue, saturation, value) color space is used, as it is more semantic than RGB and therefore makes it easier to deduce the color name of a pixel. Each component of the HSV space has been scaled between 0 and 255. A negative hue is assigned to pixels with a low saturation (S < 20), meaning that the pixel is achromatic. Since we are computing statistics over a whole image, the definition of the color does not need to be accurate for each pixel. Being accurate would mean using fuzzy logic where there is a frontier between two colors. The correspondence between HSV values and color names presented here has been designed to be simple and fast to compute. Only 11 colors are considered: black, blue, brown, green, grey, orange, pink, purple, red, white, yellow. More complicated and accurate methods can be designed, but our simple method proved to be sufficient for our purpose.
The main criteria used to name the color of a pixel from its HSV values are given in Table 1. Brown and orange (14 < hue < 29) are the hardest colors to distinguish. We propose the following rule: given the two points B: (S = 184, V = 65) and O: (S = 255, V = 125) and the L1 distance in the (S, V) plane, a pixel whose hue is in the range 14-29 is considered orange if it is closer to O than to B, and brown otherwise. These thresholds were chosen experimentally from the observation of many images. The rule works well when the color of a pixel is obvious, that is, when everybody would agree on the same color for that pixel. We do not deal with the frontiers between colors, where the name of the color is subjective and can vary between observers.

Hue: Color
< 0 (achromatic): black / grey / white
0-14: red
14-29: orange / brown
29-60: yellow / green / brown
60-113: green
113-205: blue
205-235: purple
235-242: pink
242-255: red

For Hue < 0, by Value: 0-82: black; 82-179: grey; 179-255: white.
For 29 < Hue < 60: S > 80 and V ≥ 110: yellow; S > 80 and V < 110: green; S ≤ 80: brown.

Table 1: Getting the color from the HSV space.
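Read as code, Table 1 amounts to a handful of threshold tests plus the L1 rule for the orange/brown band. The sketch below assumes H, S and V are already scaled to 0-255 as in the paper, and reads the achromatic black/grey/white split as a split on Value:

```python
# A minimal sketch of the HSV-to-colour-name rules of Table 1.
def pixel_color_name(h, s, v):
    if s < 20:  # achromatic pixel ("negative hue" in the paper)
        if v <= 82:
            return "black"
        return "grey" if v <= 179 else "white"
    if h < 14:
        return "red"
    if h < 29:  # hardest case: orange vs. brown, L1 distance in (S, V)
        d_brown = abs(s - 184) + abs(v - 65)    # distance to point B
        d_orange = abs(s - 255) + abs(v - 125)  # distance to point O
        return "orange" if d_orange < d_brown else "brown"
    if h < 60:  # yellow / green / brown band
        if s <= 80:
            return "brown"
        return "yellow" if v >= 110 else "green"
    if h < 113:
        return "green"
    if h < 205:
        return "blue"
    if h < 235:
        return "purple"
    if h < 242:
        return "pink"
    return "red"
```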
3.2. Obtaining the colors of objects
The colors of objects can be obtained from a huge text corpus, and we propose to use the web to do so. The idea is to study whether an object and a color often appear together in the corpus. We have experimented with two methods to get the color of an object. For example, let us imagine that we want to get the color of a banana. The first one is to ask "yellow banana" on a web text query, where yellow can be any color, and get the number of pages returned. The second way is to ask "banana is yellow". Then again, the category of the object can be used to reduce the noise, so instead of the examples given above, we can ask "yellow banana" fruit and "banana is yellow" fruit.
We use 14 color words for web querying: black, blue, brown, gray, green, grey, orange, pink, purple, red, rose, tan, white, yellow. This is more than the 11 colors used in image color description, but some colors are merged together: gray and grey are synonyms, and brown/tan and rose/pink are also considered synonyms. For these colors, the corresponding numbers of results are summed up, giving the number of occurrences N(C|object) of color C for a given object.
In Tables 2 and 3, we show the top five colors returned for banana using Google Search, with the number of results in parentheses. Yellow and green (in that order) are the two main colors we expect to get, and this is what is returned by method 2.

"color banana": blue (201000), green (140000), orange (134000), yellow (109000), red (66200)
"color banana" fruit: orange (72300), green (35300), yellow (26600), red (21900), blue (11500)

Table 2: Colors returned for "banana" using Google Search and method 1.

"banana is color": yellow (594), green (217), purple (107), black (94), white (93)
"banana is color" fruit: yellow (288), green (51), black (24), brown (21), blue (16)

Table 3: Colors returned for "banana" using Google Search and method 2.

The banana example is representative of what we observed in general for other objects: method 2 provides more accurate results, but fewer answers, than method 1. Moreover, method 1 can be disturbed by proper nouns: for example, "blue banana" is the name of several websites, and "white house" will return a lot of results. Set phrases have the same influence: "blue whale" will give whales as mostly blue, and "white chocolate" will have more hits than "black chocolate" or "brown chocolate". Also, in the specific example of banana, in "orange banana", orange can be a noun (the fruit) instead of an adjective (the color). These three issues do not arise using method 2. However, sometimes method 2 does not return any color, as for example with the word "passerine" (a type of bird), and in that case method 1 can be of help.

3.3. Giving a score to the cluster
The probable colors for an object and the color histogram H_img of each image img in a cluster are compared in order to assign a score to the cluster. For each image img, the score of the image is the sum over all colors C of the number of pixels H_img(C) that have the color C in img, multiplied by the number of occurrences N(C|object) of the color C for the studied object; this score is normalized by the number of pixels of the image. For each cluster clust, the score S_cl of the cluster is the mean score of the images it contains:

S_{cl} = \frac{1}{\mathrm{size}(clust)} \sum_{img \in clust} \sum_{C} \frac{H_{img}(C) \cdot N(C \mid object)}{\mathrm{surface}(img)}
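A minimal sketch of this cluster score, assuming the per-image colour histograms of Section 3.1 and the web counts N(C|object) of Section 3.2 have already been computed (the container names are illustrative):

```python
# Cluster score S_cl of Section 3.3. `histograms` maps each image id to
# its colour histogram (colour name -> pixel count), `surface` to its
# total pixel count, and `color_counts` holds N(C|object) from the web.
def cluster_score(cluster, histograms, surface, color_counts):
    total = 0.0
    for img in cluster:
        h = histograms[img]
        img_score = sum(h.get(c, 0) * n for c, n in color_counts.items())
        total += img_score / surface[img]  # normalize by image size
    return total / len(cluster)  # mean image score over the cluster
```

Sorting the clusters by this score in decreasing order yields the presentation order used in the next section.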
4. Results

4.1. Segmentation improves cluster quality
Figures 2 and 3 show an example of the differences we can observe between clusters without and with segmentation. The query was made using the word "porsche" and the category "car" on Google Image Search. We downloaded 800 images. For the experiment without segmentation, about 460 were clustered into 14 clusters. For the second experiment, about 700 images were left after segmentation, 500 of which were clustered into 16 clusters. Here, we show 3 of these clusters for each experiment (only 8 images per cluster are displayed) to illustrate the advantage of using the segmentation.

[Figure 2: Results without segmentation. The first, second and third clusters contain 8, 94 and 21 images respectively.]

[Figure 3: Results with segmentation. The first, second and third clusters contain 45, 10 and 21 images respectively. The full images are shown so that we can see that the result is independent of the context.]

In Figure 2, the red car cluster depends on the context. The second cluster mixes several car types, and the third one is composed of objects that are not entirely contained in the image. In Figure 3, we see the improvement on the red car cluster, which becomes independent of the context and thus contains more images. At the same time, the grey car cluster has been split and is now more consistent. The yellow car cluster is new, and could not be formed without the segmentation, again because of the context. Another cluster has disappeared, because the segmentation removes the images which do not contain a centered object.
4.2. Sorted clusters
We present here results of sorted clusters, using the automatic segmentation and the second method for guessing the color of an object. Since up to 500 images can be clustered for a query, we cannot show all the clusters for each query; we therefore decided to show only the first five clusters for several queries.
Our aim here was the following: given the name of an object, we want to obtain images of that object which could be further used to build a database for learning. Sorting clusters allows us to decide which are the good clusters to keep, and which are the bad clusters to discard. In this application, having good precision means having relevant images in the first clusters. Having good recall means not discarding good images, that is, not having good images in the last clusters. We do not want to have as many images as possible, but we do want to keep only relevant images. Thus, what is important is the precision, regardless of the recall.
In Figure 4, the first five clusters obtained for the query banana fruit are displayed. The first three clusters contain mostly bananas, some of which have been badly segmented. The other two clusters are not as good, so only the first three clusters should be included in our database for learning, which gives 64 images of bananas.

[Figure 4: The first 5 clusters out of 17 for the query banana fruit. The top ranked clusters are the ones containing the most yellow images. The second color for banana is green, which justifies the presence of cluster 5 in that position.]

This is really a low number if we consider that we asked for 1000 images on the Internet (Google and Yahoo! image search engines do not allow retrieving more than 1000 images). After downloading (some links are broken) and segmenting (segmentation discards some images), we still had 566 images, 403 of which were clustered.
At least two ways could be used to get more images; both consist of asking more queries of the web image search engine. The first would be to ask the query in multiple languages, using automatic translation and then grouping the clusters together. This method can multiply the number of images by the number of considered languages. The other way is to use more accurate queries, which here would be the different species of bananas. For the precise example of banana, we would have to use Latin: bananas are Musa, and subspecies are for example Musa acuminata and Musa balbisiana (tens of species of bananas exist and are listed on Wikipedia). The obtained images are then considered as bananas, since we do not want to be that accurate in our database, and since too accurate queries will return fewer answers, and the clustering does not work with too few images. This second method may only double the number of images.
The algorithm works well in general with objects that have mostly one color, such as swan animal (Figure 5). Disambiguation works well, as can be seen for jaguar in Figures 6 (jaguar car) and 7 (jaguar animal): animals and cars are not mixed in clusters. The jaguar car query also shows that the cluster sorting will work for objects that can be of any color, as man-made objects are in general. But we will lose some possible colors of objects: for example, jaguar cars can be blue, but the first blue jaguar car cluster is in 14th position. Thus, for man-made objects, we should explicitly ask for a certain color when retrieving images of an object: since the object can take many colors, people tend to specify the color in their annotations, contrary to objects that have mainly one color, such as fruits or animals.
Some objects are textured with many colors, and for these objects the algorithm will not perform well. This happens for example with the jaguar (Figure 7), often described as orange-yellow colored and rosette textured. Since black jaguars also exist, they alter the results. The probable colors found for jaguar animal are black (10), orange (4), blue (3), tan (3) and yellow (3): the black cluster appears first. A cluster with orange-yellow jaguars appears in fourth position.

[Figure 5: The first 5 clusters for the query swan animal. The clusters that do not contain many white objects are not about swans and do not appear here.]

[Figure 6: The first 3 clusters out of 16 for the query jaguar car. The 3 most probable colors are, in that order: green (cluster 1 shows green F1 cars), red (cluster 2) and black.]

[Figure 7: The first 5 clusters out of 9 for the query jaguar animal. The black clusters appear first. The 4th cluster looks better, but the segmentation is not satisfactory.]

5. Conclusion
In this paper, we have designed a system which, given the name of an object, is able to download images from the web that are likely to illustrate that object. It then automatically segments the images in order to isolate the object from its context. Since results from the web are very noisy, clustering is used to group similar images together and to reject isolated images. Not all clusters are relevant; we therefore proposed a method to semantically sort these clusters: the probable colors for an object are guessed automatically from the Web, and the clusters are sorted according to these colors. Further work will look at whether we can find a threshold on cluster scores to separate good clusters from bad clusters. It would also be interesting to test this automatically generated database in real applications such as object recognition, and to measure its performance.

6. References
Deng Cai, Xiaofei He, Zhiwei Li, Wei-Ying Ma, and Ji-Rong Wen. 2004. Hierarchical clustering of WWW image search results using visual, textual and link information. In Proceedings of the 12th Annual ACM International Conference on Multimedia, pages 952–959, New York, USA.
Ya-Chun Cheng and Shu-Yuan Chen. 2003. Image classification using color, texture and regions. Image and Vision Computing, 21(9):759–776.
Levent Ertz, Michael Steinbach, and Vipin Kumar. 2001. Finding topics in collections of documents: A shared nearest neighbor approach. In Text Mine '01, Workshop on Text Mining, First SIAM International Conference on Data Mining, Chicago, Illinois.
Beatriz Marcotegui and Serge Beucher. 2005. Fast implementation of waterfall based on graphs. In C. Ronse, L. Najman, and E. Decencière, editors, Mathematical Morphology: 40 Years On, volume 30 of Computational Imaging and Vision, pages 177–186. Springer-Verlag, Dordrecht.
Xin-Jing Wang, Wei-Ying Ma, and Xing Li. 2004. Data-driven approach for bridging the cognitive gap in image retrieval. In Proceedings of the 2004 IEEE International Conference on Multimedia and Expo (ICME 2004), pages 2231–2234, Taipei, Taiwan, June.
Svetlana Zinger, Christophe Millet, Benoit Mathieu, Gregory Grefenstette, Patrick Hède, and Pierre-Alain Moëllic. 2006. Clustering and semantically filtering web images to create a large-scale image ontology. In Proceedings of the IS&T/SPIE 18th Symposium on Electronic Imaging 2006, San Jose, California, USA, January.

Image-Language Association: are we looking at the right features?

Katerina Pastra
Institute for Language and Speech Processing
Artemidos 6 and Epidavrou, Maroussi, 151-25, Greece
[email protected]Abstract The ever growing popularity and availability of multimedia information has rendered automatic image-language association essential in a number of multimedia integration applications. Bridging the gap between the two media requires an appropriate feature-set for describing their common reference; one that will be both distinctive of the entities referred too and feasible to extract automatically from visual media. In this paper, we suggest an alternative –to current approaches- feature set, which has been used in OntoVis, a domain model for a prototype that describes three-dimensional (3D) indoor scenes. We argue that it is worth employing this feature-set in a larger scale for image-language association and investigating the feasibility of doing so and of detecting such features automatically even beyond 3D visual data, in 2D images. some point, mature enough for being embedded in 1. Introduction multimedia prototypes and mainly in indexing and retrieval prototypes. The approaches are either Internet Protocol Television, image and video-blogs and probabilistic (Barnard 2003, Wachsmuth et al. 2003) or image and video-search engines are just a few of the latest logic-based (Dasiopoulou et al. 2004, Pastra 2005, ch. 5.). technology-trends which become more and more popular, Learning approaches require properly annotated training rendering digital multimedia content pervasive. Within corpora (Lin et al. 2003, Everingham et al. 2005) for such a context, the need for intelligent tools for efficient learning the associations between images/image regions access to multimedia content has boosted research efforts represented in feature-value vectors and corresponding and interest in automatic image-language association. The textual labels, while symbolic logic approaches rely on research issue is not new, of course; it spans a number of feature-augmented ontologies (Dasiopoulou et al. 2004, decades and a wide range of application areas, from Simou et al. 2005). Srikanth et al. (2005) report also on Winograd’s SHRDLU system in 1972, which verbalized the use of both training corpora and ontologies for visual changes in a 2D blocks scene, to medium achieving automatic image annotation. translation systems (e.g. automatic sports commentators) In all these cases, the features used for describing and to conversational robots of the new millennium (cf. a image content are low-level ones, such as shape, colour, review in Pastra and Wilks 2004 and Pastra 2005, ch.3). texture, position (2D coordinates of image region), size Association in these multimedia prototypes took (portion of image covered by image region), i.e. features mainly the form of correlating visual information and used by image analysis components for automatic object accompanying text/speech or translating one modality into detection. Justification of this choice is obvious: the need another (i.e. verbalizing visual information or visualizing for relying on such features for automating object linguistic information). In most cases, the systems made detection within image-language association tasks. use of a priori known vision-language associations or However, is it a coincidence that all these approaches are used simple inference-mechanisms on small-scale exemplified in mini-worlds (e.g. 
3. The OntoVis suggestion
OntoVis is a domain model (a domain ontology with a corresponding knowledge-base) for interior scenes. It stemmed out of OntoCrime, a domain ontology for indoor and outdoor scenes (Pastra et al., 2003), built through priming with the Common Data Model of the UK Police Information Technology Organisation (PITO), the latter being an attempt to standardize the wording used in all tasks that involve police forces. OntoVis includes the part of OntoCrime which refers to indoor scenes, augmented with properties for a number of entities that one can find in sitting-rooms in particular. The ontology is implemented in the form of a directed acyclic graph (DAG) through the use of the XI Knowledge Representation Language (Gaizauskas and Humphreys, 1996).
The same Prolog-based representation language is used for the OntoVis knowledge-base. The object-property assertions in the latter form a kind of "object profile" at the "basic level" of categorization (Rosch 1978, Lakoff 1987), which covers for each object all of the following types of features/properties:
• physical structure: the number of parts into which an object is expected to be decomposed in different dimensions; e.g. a sofa is always decomposed into more than one part along its X dimension (each one corresponding to a seat), as opposed to a chair;
• visually verifiable functionality: visual characteristics an object may have which are related to its function, e.g. whether an object has a surface on which things can be placed/fixed; and
• interrelations: these refer mainly to (allowable) spatial configurations of objects and object parts (e.g. whether an object could be on the floor or not), the dimension according to which size comparisons would be meaningful, etc.
Here is an example of the property profiles of two quite similar objects, both of which belong to the same class, that of "furniture":

props(sofa(X),[has_xclusters_moreThan(X,1)]).
props(sofa(X),[has_yclusters_equalMoreThan(X,2)]).
props(sofa(X),[has_yclusters_equalLessThan(X,4)]).
props(sofa(X),[has_zclusters_equalMoreThan(X,2)]).
props(sofa(X),[has_zclusters_equalLessThan(X,3)]).
props(sofa(X),[on_floor(X,yes)]).
props(sofa(X),[has_surface(X,yes)]).
props(sofa(X),[size(X,XCLUSTERS)]).

Table 1: part of the "sofa" object profile.

props(chair(X),[has_xclusters(X,1)]).
props(chair(X),[has_yclusters_equalMoreThan(X,2)]).
props(chair(X),[has_yclusters_equalLessThan(X,4)]).
props(chair(X),[has_zclusters_equalMoreThan(X,2)]).
props(chair(X),[has_zclusters_equalLessThan(X,3)]).
props(chair(X),[on_floor(X,yes)]).
props(chair(X),[has_surface(X,yes)]).
props(chair(X),[size(X,XCLUSTER_YValue,TableYDIM_UpperConstraint)]).

Table 2: part of the "chair" object profile.
Looking at Tables 1 and 2, one realises that the two objects (sofas and chairs) are similar in most of their properties: both of them intersect with the floor (the floor, like other room parts/walls, is in its turn defined as the surface with the lowest Y-values in an indoor scene); they have a surface part on which other objects may be placed; and they can structurally be decomposed into 2 or 3 parts in their Z dimension (these being the back, the seat-plus-legs part that touches the floor, and optionally the arms, if there are any). Similarly, they can be decomposed into 2-4 parts along their Y dimension (back, seat, arms and legs, the last two optionally present). However, they differ in their decomposition along the X dimension: a sofa always has more than one X-part (more than one seat), while a chair may have only one seat. Size is a variable (changeable) property for sofas, and it is actually determined by the number of seats that the object has, while size for chairs normally makes sense only in terms of the height of the chair (e.g. short chairs for children, tall stool-like chairs, etc.).
An "armchair" has the same object profile as a chair, apart from the fact that it will always have three or four Y-clusters (back, seat, arms and optionally legs), and always three Z-clusters (back, seat, arms), i.e., the arms are not optional, they must be present. Furthermore, it does not make sense to express an armchair's relative size in terms of its height; it does for chairs, because chairs are expected to "co-locate" with tables/bars (the height of which may vary considerably), and a chair's height is constrained by a table's height. Table 3 presents part of the object profile of "tables":

props(table(X),[has_xclusters(X,1)]).
props(table(X),[has_yclusters(X,2)]).
props(table(X),[has_zclusters(X,1)]).
props(table(X),[on_floor(X,yes)]).
props(table(X),[has_surface(X,yes)]).
props(table(X),[size(X,YDIM,XDIM,Relative_to_Room_YXDIM)]).

Table 3: part of the "table" object profile.

A table has a surface (the table-top) which can be identified along its X dimension; it has two Y-clusters (table-top and legs) and one Z-cluster (the whole table). Its length and height are relative to the corresponding dimensions of the room it is found in.
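To illustrate how such profiles can drive object naming, here is a minimal sketch in Python rather than in the XI/Prolog formalism used by OntoVis; the profile encoding, the bounds and the field names are our own illustrative rendering of Tables 1-3:

```python
# Profiles as (min, max) bounds on cluster counts per dimension, plus
# the two visually verifiable properties; None means "no upper bound".
PROFILES = {
    "sofa":  {"x": (2, None), "y": (2, 4), "z": (2, 3), "on_floor": True, "surface": True},
    "chair": {"x": (1, 1),    "y": (2, 4), "z": (2, 3), "on_floor": True, "surface": True},
    "table": {"x": (1, 1),    "y": (2, 2), "z": (1, 1), "on_floor": True, "surface": True},
}

def in_range(value, bounds):
    lo, hi = bounds
    return value >= lo and (hi is None or value <= hi)

def name_candidates(candidate):
    """Return the names whose profile the extracted candidate satisfies."""
    return [name for name, p in PROFILES.items()
            if in_range(candidate["x"], p["x"])
            and in_range(candidate["y"], p["y"])
            and in_range(candidate["z"], p["z"])
            and candidate["on_floor"] == p["on_floor"]
            and candidate["surface"] == p["surface"]]

# A candidate with one X-cluster, three Y-clusters and two Z-clusters,
# on the floor and with a surface, matches "chair" but neither "sofa"
# (which needs more than one X-cluster) nor "table".
print(name_candidates({"x": 1, "y": 3, "z": 2, "on_floor": True, "surface": True}))
```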
As seen in the above examples, there is a whole network of interrelations between objects in OntoVis, the detection and identification of each of which contributes to the detection and identification of the others. The profiles also include assertions regarding the object parts that objects are formed of (not included in the above tables due to space restrictions). For example, a sofa consists of a back, more than one seat, and optionally legs and arms; these object parts are themselves defined in a similar way, using the property types suggested above.
There are many arguments in the literature in favour of the suggested feature selection for object naming. In particular, the need for defining objects through their physical structure and their functionality/purpose has been argued by many researchers, such as Minsky (1986). Structural properties were described by Minsky as ones which do not change "capriciously", while functional ones capture intentional aspects of the objects; both are important when defining visual objects. On the other hand, Landau and Jackendoff have explicitly argued that spatial representations imply properties of the objects involved (1993); for example, an "on" relation between two objects requires that the reference object is one with a surface or line boundary on which the figure object is located.
While not a panacea, the suggested feature set could assist in scaling image-language associations beyond mini-worlds, and actually allow for:
• going beyond differences in the appearance of similar objects (e.g. different styles of sofas), naming these objects in the same way, and
• generalizing over viewpoint differences, e.g. identifying a sofa as such even when seen from the side (rather than en face).
These are generalizations that current image-language association algorithms cannot make easily (or at all). Identifying objects which differ in appearance as ones of the same type is something that cannot be achieved even with a very large amount of training data (cf. e.g. the first visual ontology by Zinger et al. 2005), if a similar example is not present in the training data. Similarly, current approaches cannot deal with viewpoint differences in the appearance of an object, and there is an almost infinite number of different images of the same object which may result from differences in the viewpoint (viewing angle and distance) from which the object is seen in a complex scene.
4. Using OntoVis
While the effectiveness of each feature type individually has been argued in the literature, their use in conjunction and their incorporation in a domain model had not been attempted before. Actually, in the case of OntoVis, the feature set has been determined by the visual data itself and by the need to perform automatic object naming within an application scenario that goes from vision to language. OntoVis was created for the development of VLEMA, a system that attempts to test the extent to which one may currently "emancipate" a vision-language integration prototype, in order for the prototype to work with real visual scenes, to analyze its visual data automatically, and to have inference mechanisms for scalable vision-language association abilities.
The VLEMA prototype works with automatically reconstructed 3D images of sitting rooms. It includes a module that performs object segmentation in 3D space by extracting physical structure-related information (clusters of faces forming part of an object in each dimension) to detect objects and/or object parts. An object-naming module refines these detection results by either naming a candidate object and/or suggesting the clustering of candidate object parts into one object, which it also names. The module relies on OntoVis for drawing inferences for object naming; the inference mechanisms take advantage of the rich visual information that can be extracted in 3D space (i.e., the 3D coordinates of the candidate objects, relative information on their spatial interrelations, size, etc., as well as the absence of occlusion, registration and viewpoint problems) to check whether the property assertions in the OntoVis object profiles actually hold (cf. ch. 5 in Pastra 2005 and Pastra 2006). This means that the specific feature set suggested in the previous section stemmed out of a prototype that worked on 3D visual data, and it actually includes features that can be more easily identified in 3D space.
In Computer Vision, research on the automatic 3D reconstruction of real indoor and outdoor visual scenes, as well as on the automatic transformation of 2D images into 3D worlds, points to optimistic prospects of taking advantage of the rich information one could extract in 3D space in real-world application scenarios, rather than merely in manually built virtual worlds. While OntoVis was used in such a real-world setting and was applied to visual data that had been reconstructed in 3D automatically, these reconstruction mechanisms, and the ones that transform 2D into 3D, are still immature. The question then becomes whether the OntoVis suggestion could be applied to 2D images, on which the vast majority of state-of-the-art vision-language association mechanisms run.
While this is an issue that should be thoroughly explored with computer vision experts, there is some evidence that automatic techniques for detecting such (or a reduced version of such) visual information in 2D images exist. For example, there are methods for detecting spatial relations between objects in 2D images (cf. e.g. the work by Regier and Carlson, 2001), and there is also research on identifying object structure/parts in 2D images and associating them with textual labels (cf. Wachsmuth et al. 2003).
5. Future Plans
Currently, OntoVis includes "object profiles" for twenty basic-level objects (with their corresponding parts); our plans for the immediate future are to extend this resource to concrete objects of indoor and outdoor scenes, and to test their discriminative power in a corpus of manually-constructed virtual reality scenes. Mechanisms for detecting the specified object features in these scenes automatically, for object naming purposes, will also be applied, as an extension to the work done in the VLEMA prototype. Given the advantages that could be gained, we believe that it is also worth investigating the possibility of using the suggested feature set in state-of-the-art image-language association mechanisms for 2D images; it is in this direction that we intend to head our research efforts.

6. Conclusions
In this paper we presented a feature-set for the representation of real-world objects and scenes, within tasks that attempt to bridge the gap between low-level visual information and high-level (conceptual) linguistic descriptions of entities. The suggestion has been implemented in OntoVis, a domain model for building-interior scenes; the suggested features have been detected automatically in 3D visual data and have been used for the verbalization of this data. We argue that this feature set could be an alternative or a complement to the feature sets used in state-of-the-art image-language association mechanisms, and we would like to invite cooperation and collaboration in this direction of research.

7. References
Barnard K., Duygulu P., Forsyth D., de Freitas N., Blei D., Jordan M. (2003), "Matching words and pictures", Journal of Machine Learning Research, 3:1107-1135.
Dasiopoulou S., Papastathis V., Mezaris V., Kompatsiaris I., Strintzis M. (2004), "An ontology framework for knowledge-assisted semantic video analysis and annotation", in Proceedings of the International Workshop on Knowledge Markup and Semantic Annotation, International Semantic Web Conference.
Everingham M., Van Gool L., Williams C. and A. Zisserman (2005), "PASCAL Visual Object Classes Challenge Results", Technical Report, PASCAL Network of Excellence, http://www.pascal-network.org/challenges/VOC/voc/results_050405.pdf
Gaizauskas, R. and K. Humphreys (1996), "XI: A Simple Prolog-based Language for Cross-Classification and Inheritance", in Proceedings of the 7th International Conference in Artificial Intelligence: Methodology, Systems, Applications, pages 86-95.
Jackendoff, R. (1987), "On beyond Zebra: the relation of linguistic and visual information", Cognition, 20:89-114.
Lakoff, G. (1987), Women, Fire, and Dangerous Things, The University of Chicago Press.
Landau, B. and R. Jackendoff (1993), ""What" and "Where" in spatial language and cognition", Behavioural and Brain Sciences, 16:217-265.
Lin C., Tseng B. and J. Smith (2003), "Video Collaborative Annotation Forum: Establishing Ground-Truth Labels on Large Multimedia Datasets", in online Proceedings of TRECVID 2003.
Minsky, M. (1986), The Society of Mind, Simon and Schuster Inc.
Pastra K. (2006), "An alternative suggestion for vision-language integration in intelligent agents", in Proceedings of the International Hellenic Artificial Intelligence Conference, Athens, Greece.
Pastra K. (2005), "Vision-Language Integration: a Double-Grounding Case", PhD thesis, Department of Computer Science, University of Sheffield.
Pastra K. and Y. Wilks (2004), "Vision-Language Integration in AI: a reality check", in Proceedings of the 16th European Conference on Artificial Intelligence (ECAI), pp. 937-941, Valencia, Spain.
Pastra, K., H. Saggion, and Y. Wilks (2003), "Intelligent indexing of crime-scene photographs", IEEE Intelligent Systems, 18(1):55-61.
Regier, T. and L. Carlson (2001), "Grounding spatial language in perception: An empirical and computational investigation", Journal of Experimental Psychology, 130(2):273-298.
Simou N., Tzouvaras V., Avrithis Y., Stamou G., Kollias S. (2005), "A visual descriptor ontology for multimedia reasoning", in Proceedings of the Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS).
Srikanth M., Varner J., Bowden M., Moldovan D. (2005), "Exploiting ontologies for automatic image annotation", in Proceedings of SIGIR.
Wachsmuth S., Stevenson S., Dickinson S. (2003), "Towards a framework for learning structured shape models from text-annotated images", in Proceedings of the HLT-NAACL Workshop on Learning Word Meaning from Non-Linguistic Data.
Zinger S., Millet C., Mathieu B., Grefenstette G., Hède P., Moëllic P. (2005), "Extracting an ontology of portrayable objects from WordNet", in Proceedings of the MUSCLE/ImageCLEF Workshop on Image and Video Retrieval Evaluation.

Testing an automatic organisation of retrieved images into a hierarchy

Mark Sanderson, Jian Tian, Paul Clough
Department of Information Studies, University of Sheffield, Regent Court, 211 Portobello St, Sheffield, S1 4DP, UK
(m.sanderson|p.d.clough)@shef.ac.uk

Abstract
Image retrieval is of growing interest to both search engines and academic researchers, with increased focus on both content-based and caption-based approaches. Image search, however, is different from document retrieval: users often examine a broader set of retrieved images than the set of returned web pages they would examine in a search engine. In this paper, we focus on a concept hierarchy generation approach developed by Sanderson and Croft in 1999, which was used to organise retrieved images in a hierarchy automatically generated from image captions. Thirty participants were recruited for the study, each of whom conducted two different kinds of searching task within the system. Results indicated that user retrieval performance with the two interfaces of the system is similar. However, the majority of users preferred to use the concept hierarchy to complete their searching tasks, and they were satisfied with using the hierarchical menu to organize retrieved results, because the menu appeared to provide a useful summary that helped them look through the image results.

1. Introduction
One process that users must perform when information seeking is to examine and interpret the search results. In most Information Retrieval (IR) systems, results are ranked in order of relevance to the query. However, if many search results are returned, it can be difficult for the user to examine them all. In addition, reliably providing an intuitive summary of the search results is an obvious benefit to any user of an IR system. Hearst (1999) discusses various interface techniques for summarising results to make the document set more understandable to the user. These include: visualising the relationship of documents to the query, providing collection overviews, and highlighting potential relationships between documents.
There results to make the document set more understandable to are many instances of when images are associated with the user. These include: visualising the relationship of some kind of text semantically related to the image (i.e. documents to the query, providing collection overviews metadata or captions). For example, collections such as and highlighting potential relationships between historic or stock-photographic archives, medical documents. databases, art/history collections, personal photographs A variety of clustering techniques have been (e.g. Flickr.com) and the Web (e.g. Yahoo! Images). developed in IR to group documents. This can help users Retrieval from these collections is typically supported by to browse through the search results, obtain an overview text-based searching which has shown to be an effective of their main topics/themes and help to limit the number method of searching images (Markkula & Sormunen, of documents searched or browsed in order to find 2000). To enhance such systems, various approaches have relevant documents (i.e. limit exploration to only those been explored to organize search results based on either clusters likely to contain relevant documents). Two textual and visual features (or a combination of both). A common variations are: (1) to group documents by summary of related work is provided in section 2. In associated terms (i.e. a set of words or phrases define a practice, given the proliferation of textual metadata, cluster and membership is based on its containing a investigating methods to exploit this text (e.g. for sufficient fraction of a cluster’s terms), and (2) to assign organizing results) is beneficial. documents to pre-defined thematic categories (manually The paper is ordered as follows: in section 3 we or automatically). Scatter/Gather (Cutting et al, 1992) and describe how we used concept hierarchies as a method for the Vivisimo1 metasearch engine are an example of the presenting image search results by displaying extracted former and Yahoo! Categories an example of the latter. concepts within a hierarchical structure. We describe the Organizing a set of documents automatically based methodology and results of two user experiments to test upon a set of categories (or concepts) derived from the the system and finally conclude. documents themselves is an obviously appealing goal for IR systems: it requires little or no manual intervention 2. Related Work (e.g. deciding on thematic categories) and like For image retrieval, clustering methods have been used unsupervised classification, depends on natural divisions to organize search results by grouping the top n ranked in the data rather than pre-assigned categories (i.e. images into similar and dissimilar classes. Typically this is requiring no training data). In this paper we make use of based on visual similarity and the cluster closest to the such an approach for organizing search results called query or a representative image from each cluster can then concept hierarchies (Sanderson & Croft, 1999; Sanderson be used to present the user with very different images & Lawrie, 2000). This simple method of automatically enabling more effective user feedback. For example Park et al. (2005) take the top 120 images and cluster these 1 http://vivisimo.com using hierarchical agglomerative clustering methods OntoImage'2006 May 22, 2006 Genoa, Italy page 44 of 55. Figure 1: Example fragment from generated menu for the query “church” (HACM). 
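To make the shape of this post-retrieval clustering concrete, the following is a minimal sketch, not Park et al.'s implementation: the visual feature vectors, the average-linkage choice, and the cluster count are assumptions made purely for illustration.

```python
# Minimal sketch of post-retrieval clustering in the style of Park et al.
# (2005): cluster the top-ranked images on visual-feature similarity, then
# rank each cluster by its centroid's distance from the query. The feature
# extraction, linkage method and cluster count are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_top_images(features, query_vec, n_top=120, n_clusters=8):
    """features: (N, d) visual feature vectors of the ranked images;
    query_vec: (d,) feature vector representing the query."""
    top = np.asarray(features)[:n_top]
    # Hierarchical agglomerative clustering over the visual features.
    tree = linkage(top, method="average", metric="euclidean")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    clusters = []
    for label in np.unique(labels):
        members = np.where(labels == label)[0]          # indices into top-n
        centroid = top[members].mean(axis=0)
        dist = float(np.linalg.norm(centroid - query_vec))
        clusters.append((dist, members.tolist()))
    # Clusters nearest the query are shown first.
    return sorted(clusters, key=lambda pair: pair[0])
```

A representative image from each returned cluster can then be displayed, as in the systems surveyed above.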
Other approaches have combined both visual and textual information to cluster sets of images into multiple topics. For example, Cai et al. (2004) use visual, textual and link information to cluster Web image search results into different types of semantic clusters. Barnard and Forsyth (2001) organize image collections using a statistical model which integrates semantic information provided by associated text with visual features provided by image features. During a training phase, they train a generative hierarchical model to learn semantic relationships between low-level visual features and words. The resulting hierarchical model associates segments of an image (known as blobs) with words and clusters these into groups, which can then be used to browse the image collection.
Approaches using only semantic information derived from associated text have also been used to organize search results and to aid browsing. For example, Yee et al. (2003) describe Flamenco, a text-based image retrieval system in which users are able to drill down into results along conceptual dimensions provided by hierarchically faceted metadata. Categories are automatically derived from WordNet synsets based on texts associated with the images, but assignment of those categories to the images is then manual. Finally, Rodden et al. (2001) performed usability studies to determine whether organization by visual similarity is actually useful. Interestingly, their results suggest that images organized by category/subject labels were more understandable to users than those grouped by visual features.

3. Building Concept Hierarchies
The approach to building a concept hierarchy proposed by Sanderson and Croft (1999) aims to automatically produce, from a set of documents, a concept hierarchy similar to manually created hierarchies such as the Yahoo! categories. The main difference is that the concepts are in fact words and phrases (referred to as terms) found within the given set of documents, not categories defined manually. In their method, words and noun phrases (called concepts) are extracted from retrieved documents and used to generate a hierarchy. Concepts are associated based on the sets of documents indexed by the two concepts: the more documents two terms share, the more similar they are. However, concept hierarchies go beyond simple grouping of terms by discovering whether concepts are also related hierarchically. Document frequency and a statistical relation called subsumption are used to generate a hierarchy by detecting whether a parent term refers to a related, but more general, concept than its children (i.e. whether the parent's concept subsumes the child's). Using document frequency (DF) to determine the semantic specificity of concepts parallels its common use for weighting terms in IR via Inverse Document Frequency (IDF).
With subsumption, concept Ci is said to subsume concept Cj when the set of documents in which Cj occurs is (approximately) a subset of the documents in which Ci occurs; more formally, when the following conditions hold: P(Ci|Cj) >= 0.8 and P(Cj|Ci) < 1. The assumption is that Ci is likely to be more general than Cj because, first, the former appears more frequently than the latter (Spärck Jones, 1972) and, second, the former subsumes a large part of Cj's document set. The two concepts are also likely to be related, since they co-occur frequently within documents.
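Because the test reduces to set overlap between the two concepts' document sets, it can be stated directly in a few lines of code. The sketch below is ours, not the authors'; only the 0.8 threshold and the two conditions come from the description above.

```python
# Minimal sketch of the subsumption test of Sanderson and Croft (1999).
# Each concept is represented by the set of documents (caption IDs) it
# occurs in, so the conditional probabilities reduce to set overlaps.
def subsumes(docs_parent, docs_child, threshold=0.8):
    """True if the parent concept subsumes the child concept, i.e.
    P(parent | child) >= threshold and P(child | parent) < 1."""
    if not docs_parent or not docs_child:
        return False
    both = len(docs_parent & docs_child)
    p_parent_given_child = both / len(docs_child)
    p_child_given_parent = both / len(docs_parent)
    return p_parent_given_child >= threshold and p_child_given_parent < 1.0

# Toy example: "dog" should subsume "poodle" if the poodle captions are
# (nearly) a subset of the dog captions.
dog = {1, 2, 3, 4, 5}
poodle = {2, 3}
assert subsumes(dog, poodle)        # dog is the more general concept
assert not subsumes(poodle, dog)    # the reverse direction fails
```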
The results can be visualised using cascading menus, where more general terms are placed at a higher level, followed by related but more specific terms (Figure 1).

Figure 1: Example fragment from generated menu for the query "church" (HACM).

Sanderson and Croft analysed a random sample of parent-child relations and found that approximately 50% of the subsumption relationships within the concept hierarchies were of interest and that the parent was judged to be more general than the child. In particular, 49% of children were judged to reflect an aspect of the parent (a holonymic relation), e.g. an actor is an aspect (or part) of a movie; 23% were judged a type of the parent (a hypernymic relation), e.g. a poodle is a type of dog; 8% were judged the same as the parent; 1% opposite to the parent; and 19% an unknown relation. We discuss relations commonly found using image captions in section 5. In summary, to generate a concept hierarchy for image browsing, the following steps are followed after an initial retrieval (see the sketch after this list):
1. Extract concepts (words and noun phrases) from up to the top n image captions.
2. Compare each concept with every other concept and test for subsumption relationships.
3. Order concepts hierarchically based on DF scores (general to specific) and the subsumption relation (concepts with no parent, i.e. ones that no other concept subsumes, are top-level concepts).
4. Randomly select an image from the cluster to represent the cluster visually and create the menu.

For our image retrieval prototype, we used a version of the CiQuest system created to investigate user interaction with a standard textual document collection (Barnard & Forsyth, 2001). The system uses a probabilistic retrieval model based on the BM25 weighting function (Robertson et al., 1995) to perform the initial retrieval. A DHTML menu representing the concept hierarchy is generated dynamically, enabling users to interact with and browse the search results (Figure 1); the number in parentheses is the document frequency. A number of parameters can be adjusted in the prototype, including:
1. menu_depth: maximum depth of the menu;
2. menu_height: maximum height of the menu;
3. top_n: number of documents to extract concepts from.
4. Experimental methodology
The current study is primarily concerned with evaluating the utility of the concept hierarchy menus for organising retrieved results and observing user interaction with the concept hierarchy menu in user-oriented tasks. To elaborate, the aims were to:
• evaluate the usability of concept hierarchy menus used in image retrieval from a user's perspective;
• obtain participants' perceptions of using concept hierarchy menus to group image retrieval results;
• gather participants' general impressions of the menu interface (see Figure 2) compared with the traditional list interface (see Figure 3); and
• analyse participants' searching behaviour with the concept hierarchy menu in an image retrieval system.

Figure 2: Example of the menu interface
Figure 3: Example of the list interface

The study was conducted by means of usability tests, including task records, observation notes, pre- and post-session questionnaires, and post-search interviews, in order to capture the perceptions of the participants. In the user test, each participant was presented with two different versions of the CiQuest interface and asked to perform two user tasks on each.

4.1. Test Image Collection
The dataset used consisted of 28,133 historic photographs from the library at St Andrews University (http://specialcollections.st-and.ac.uk/). All images are accompanied by a caption consisting of 8 distinct fields (short title, long title, description, location, date, photographer, notes and topic categories) which can be used individually or collectively to facilitate image retrieval. The 28,133 captions consist of 44,085 terms and 1,348,474 word occurrences; the maximum caption length is 316 words, and the average is 48 words. All captions are written in British English and contain colloquial expressions and historical terms. Approximately 81% of captions contain text in all fields; the rest generally lack the description field. In most cases the image description is a grammatical sentence of around 15 words. The majority of images (82%) are black and white, although colour images are also present. The dataset has been used for previous image retrieval experiments, most notably the ImageCLEF evaluation campaign for cross-language image retrieval (http://ir.shef.ac.uk/imageclef/); see Clough, Mueller, and Sanderson (2005).

4.2. Participants
A total of 30 participants were recruited for the user test. The majority of the participants (23) were graduate students of the Department of Information Studies, University of Sheffield, and the rest were from other departments of the University. They consisted of 14 females and 16 males. The ages of the participants ranged from 20 to 31, with an average of 25. All participated in the study as volunteers.

4.3. Experimental Tasks
Task one was designed as a real-life retrieval task: participants were required to search for images about a pre-specified topic using the CiQuest system with its different interfaces. In task two, participants were shown three photos taken from the St Andrews historic photographic collection and were required to find them using the CiQuest system, one interface at a time. In real life this task can be described as users trying to search for a specific image they have in mind; however, they do not know the exact keywords to find it, so they need to describe the image themselves. This task could be used to measure the usability of the experimental system, focusing on effectiveness and efficiency. In order to minimize order effects, users were shown either the menu interface first, or the list.
5. Results and Analysis
The results and analysis of the current study are presented as follows.

5.1. Task One
In task one, each participant needed to work with both interfaces. Participants were asked to find 15 photos using CiQuest that were relevant to pre-designed topics. Based on their actual searching performance, participants were required to answer questions evaluating the two different interfaces of the system. The participants were asked to work through 5 queries each. Results are presented in Table 1.

Mean score for task one                    Menu   List
Av. number of pages user browsed             5      8
Av. number of queries typed into system     1.6     3

Table 1: Mean scores over the five topics

As can be seen, with the list interface users browsed more pages and entered more queries than when using the menu system. When participants used the list interface to search for photographs, they typed the initial query into the system and then examined at least one page of returned results to judge whether or not they needed to reformulate their initial query. Based on the authors' observations during the test, the majority of participants browsed at least two pages of results before they changed their query; if they changed queries frequently, they therefore had to spend a lot of time viewing results. In general, then, the number of queries is proportional to the number of result pages. When using the menu interface, the majority of participants spent time with the terms chosen for the menu as opposed to submitting a new query or viewing results page by page. They typically browsed the first page of results retrieved in response to their initial query; then, if they could not find the relevant images they required, they preferred to view the concept menu before going to the next page. They tried to find appropriate terms on the menu to narrow their initial retrieved results, and then clicked a term to browse the associated results. If they could not find the photos, they went back to the concept menu and tried other terms.

5.1.1. Questionnaire
Participants' general impressions of the two interfaces were gathered.
Participants indicated how easy or hard it was to find relevant images and how confident they were when locating images. The average time spent completing this task is also shown in Table 2. As the table shows, participants using the list interface spent more time searching than those using the menu interface, a probable consequence of needing to enter more queries to complete their task. From observation of participants' interaction with the concept hierarchy menu, we found that the automatically generated menu really helped users to narrow down their result set.

Task 1                              Menu   List
Av. time to complete task (min.)    10.2   12.4
How easy to judge relevance          4.0    3.2
How confident in judgements          4.1    3.8
Satisfied with the results           4.1    3.8

Table 2: Mean questionnaire scores over the five topics

Also according to the table, the majority of participants thought it was easier to judge relevant images using the menu interface. The next question in the table was designed to evaluate how confident participants were in their choices of relevant images: the mean score for the menu interface was 4.1, slightly higher than the mean score of 3.8 for the list interface. With the information gained from the results of the task one experiments, we moved on to the second task.

5.2. Task two
In task two, each participant again tested both the list and menu interfaces, with the aim of locating a "known item" image in the collection. All participants were asked to locate 3 images: half searched the menu interface first (referred to here as the menu group) and the other half used the list interface first (the list group). Results of the experiment are shown in Table 3.

Task 2                                        Menu   List
Av. time taken to find image (min.)            3.0    4.0
Av. number of result pages user browsed
before finding the image                       9.7   13.3
Av. number of queries                          3.7    7.0
Successful retrieval rate                      91%    78%

Table 3: Mean scores for task two

As can be seen, as with task one, the average number of pages viewed and queries entered was smaller for the hierarchy interface than for the list, and (as before) the time users took to find the image with the menu system was shorter. What is more striking is the success rate of users in locating their known image: users were noticeably more successful in finding their target image with the menu system than with the list system. This result indicates that the concept hierarchy menu provides useful clues that help participants find images, and that it can improve retrieval effectiveness.

5.2.1. User behaviours
According to notes taken while observing users, the majority of participants in the menu group spent a lot of time browsing the menu. They seemed to prefer to view all parts of the menu in order to find similar images, and they were particularly pleased when the required image was found with this strategy. Participants appeared to prefer searching through the menu to re-formulating their query. It would appear that building a simple term hierarchy, coupled with presenting that hierarchy in a quickly browsable form, is liked by users.

6. Study findings
We analyzed the qualitative and quantitative results for the experimental system; combining all the results, several findings emerge from this study. The overall research aim of the study was to establish whether image retrieval results organized by an automatically generated concept hierarchy menu are usable from the user's perspective.
According to the task one results, image retrieval performance using the menu interface was slightly better than using the list interface. Although there was no significant difference between them, the results illustrate that the automatically generated hierarchy menu does support the image retrieval process. The concept hierarchy menu can group the retrieved images by specific terms related to the participant's initial query, narrowing the number of results returned to the screen. Based on the observation notes, when participants used the menu interface the majority of them preferred to browse the concept hierarchy menu and choose an appropriate term instead of changing their query or viewing a large number of results page by page. According to the evaluation questionnaire, participants using the menu interface were more satisfied with their task results than those using the list interface.
Secondly, from the discussion of task two above, although there was no significant difference in retrieval performance between the menu group and the list group, using the concept hierarchy menu can be seen as a benefit to the image retrieval process. The terms displayed on the concept hierarchy menu provided useful clues that improved users' success rate in finding photos. Browsing the concept hierarchy menu can be seen as providing an alternative route by which users can successfully find images, especially when their queries do not work well.
Finally, based on the results of the evaluation questionnaire, the majority of participants thought the menu interface was not as easy to use as the list interface. However, the menu interface is easy to learn to use: none of the participants had used the experimental system before, yet after the training session they could easily learn to use it to complete the two search tasks. The learnability of the menu interface can therefore be considered acceptable. In addition, the majority of participants gave positive remarks on the concept hierarchy menu used in image retrieval. The satisfaction rating for the menu interface was slightly higher than for the list interface. The majority of participants were satisfied with using the concept hierarchy menu to organize the retrieved results, and they also mentioned that they would prefer to use the menu interface to retrieve images in the future.
However, some participants had a number of negative opinions about the menu interface. For example, two participants who favoured the list interface mentioned that some terms displayed on the menu thoroughly confused them; they had no idea why those terms had been generated. Other participants stated that some terms led them down the wrong path, wasting a lot of time and sometimes sidetracking their original train of thought.

7. Conclusions
Overall, the participants' impression of the experimental CiQuest system as an image retrieval system was encouraging. They were satisfied with the search results and retrieval performance. Although both interfaces of the experimental system had a similar capability to retrieve relevant images in response to users' queries, the majority of participants in the current study preferred the menu interface for organising their retrieved results. Participants indicated that the concept hierarchy menu could provide an intuitive preview of large numbers of retrieved results, giving them a better idea of the topics of the retrieved images; they could then effectively narrow a large returned result set by choosing a specific relevant topic, avoiding wasting time browsing large numbers of results page by page. Participants were also willing to consider browsing the concept hierarchy menu as an alternative way to retrieve images successfully and effectively, especially when their queries did not work well.

8. Acknowledgements
The work in this paper was part of a student Masters dissertation project conducted at the University of Sheffield and was also supported in part by the EU's BRICKS project IST-2002-2.3.1.12.

9. References
Barnard, K. and Forsyth, D. (2001) Learning the Semantics of Words and Pictures. In: Proceedings of the International Conference on Computer Vision, vol. 2, pp. 408-415.
Cai, D., He, X., Li, Z., Ma, W-Y., and Wen, J-R. (2004) Hierarchical clustering of WWW image search results using visual, textual and link information. In: Proceedings of the 12th Annual ACM International Conference on Multimedia, pp. 952-959.
Clough, P., Mueller, H. and Sanderson, M. (2005) The CLEF 2004 Cross Language Image Retrieval Track. In: Peters, C., Clough, P., Gonzalo, J., Jones, G., Kluck, M. and Magnini, B. (eds.), Multilingual Information Access for Text, Speech and Images: Results of the Fifth CLEF Evaluation Campaign, Lecture Notes in Computer Science, Springer, to appear.
Cutting, D.R., Karger, D.R., Pedersen, J.O., and Tukey, J.W. (1992) Scatter/Gather: a cluster-based approach to browsing large document collections. In: Proceedings of ACM SIGIR.
Hearst, M. (1999) User Interfaces and Visualization. In: Baeza-Yates, R. & Ribeiro-Neto, B. (eds.), Modern Information Retrieval, pp. 257-323. New York: ACM Press.
Joho, H., Sanderson, M., and Beaulieu, M. (2004) A Study of User Interaction with a Concept-based Interactive Query Expansion Support Tool. In: McDonald, S. & Tait, J. (eds.), Advances in Information Retrieval, 26th European Conference on Information Retrieval, pp. 42-56.
Markkula, M. and Sormunen, E. (2000) End-user searching challenges indexing practices in the digital newspaper photo archive. Information Retrieval, 1, pp. 259-285.
Park, G., Baek, Y., and Lee, H-K. (2005) Re-ranking algorithm using post-retrieval clustering for content-based image retrieval. Information Processing and Management, 41(2), pp. 177-194.
Robertson, S.E., Walker, S., Beaulieu, M.M., Gatford, M. & Payne, A. (1995) Okapi at TREC-4. In: Harman, D.K. (ed.), NIST Special Publication 500-236: The Fourth Text REtrieval Conference (TREC-4), Gaithersburg, MD, pp. 73-97.
Rodden, K., Basalaj, W., Sinclair, D., and Wood, K. (2001) Does Organisation by Similarity Assist Image Browsing? In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 190-197.
Sanderson, M. and Croft, B. (1999) Deriving concept hierarchies from text. In: Proceedings of the 22nd ACM Conference of the Special Interest Group in Information Retrieval, pp. 206-213.
Sanderson, M. and Lawrie, D. (2000) Building, Testing, and Applying Concept Hierarchies. In: W. Bruce Croft (ed.), Advances in Information Retrieval: Recent Research from the CIIR, Kluwer Academic Publishers, pp. 235-266.
Spärck Jones, K. (1972) A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation, 28(1), pp. 11-21.
Yee, K-P., Swearingen, K., and Hearst, M. (2003) Faceted metadata for image search and browsing. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 401-408.

(extended abstract)

CLiMB: Computational Linguistics for Metadata Building

Judith L. Klavans, Ph.D.
University of Maryland
[email protected]

The goal of the Computational Linguistics for Metadata Building (CLiMB) project[i] is to discover to what extent and under which circumstances automatic techniques can be used to extract descriptive subject-oriented metadata from scholarly and authoritative texts associated with image collections. Although manual cataloging is an established field, what is novel about the CLiMB approach is the notion that high-precision cataloging might be accomplished automatically. The achievement of CLiMB's goals will not only address the cataloging bottleneck that arises as the volume of available data increases with new technology; it will also benefit end-users by providing a set of tools to aid in the access of information across collections and vocabularies. As a research project, CLiMB has created a cataloger's platform in which to determine whether such an approach is indeed useful, and to what extent the results can be incorporated into existing metadata schemas, e.g. VRA, MARC, Dublin Core. The initial CLiMB Toolkit was fully implemented and evaluated at Columbia University as a Web-based application, operated from within a standard browser. This paper will describe the results of that implementation and the uses of the prototype toolkit.
CLiMB employs text sources that are tightly coupled with a digital image collection to automatically extract descriptive metadata from those texts – in effect, making the writings of specialist scholars useful for enriching the catalog entry. In many cases, researchers have already described aspects of selected images in contexts such as scholarly monographs and subject-specific encyclopedias. The challenge is to identify the meaningful facts (or metadata) in the written material and distinguish them from among the thousands of other words that make up the text in its primary form. Ordinarily, descriptive metadata (in the form of catalog records and indexes) are compiled manually, a process that is slow, expensive, and often tailored to the purpose of a given collection. Our goal as a research project is to explore the potential for employing computational linguistic techniques to alleviate some of the obstacles that prevent wide access to digital collections. In the short run, by enhancing the identification of descriptive metadata through the use of automatic procedures, the CLiMB project has enabled the selection of candidate terms for review by catalogers. These candidate terms are extracted from written, tightly-coupled material associated with images in digital collections.

The CLiMB Architecture

Figure 1 shows the CLiMB process flow. The text to be loaded in Step One refers to the text selected for processing by CLiMB. This text must, of course, be in electronically readable form, and must be either free of copyright or have the permissions arranged in advance with the CLiMB team. Although a user will never see, nor be able to recreate, the source text, the cataloger must be able to explore the context of a selected term, thus bringing in the questions of rights and permissions.
For Step Two, the TOI (target object identifier) list refers to predefined named entities provided by the minimal catalog record created upon the initial intake of an image collection.

Figure 1: CLiMB Toolkit Process

Step Three will be the focus of the longer version of this paper, since the LREC audience is likely to be more interested in NLP methodologies than in issues such as rights and permissions, although all of these topics are essential to the success of the project as a whole. In the CLiMB-1 toolkit, a segmentation algorithm was implemented based on Kan, Klavans, and McKeown (1998). This step permits association between a text unit and a related image; for texts such as a catalogue raisonné, no such step is generally required, since text entries tend to be short and closely tied to a single image. The segmentation-association step is one where future research will enable more accurate linking of relevant text sections with images; it is part of the association process flow, which offers some key research areas for the future. A standard tagger (we have used the MITRE public toolkit, and we have experimented with other taggers), along with a named entity recognizer, is applied to the segmented text, and lookup in the Art and Architecture Thesaurus (AAT) is performed. At this point the user (a cataloger, in this implementation of the Toolkit) is shown potential metadata for selection and feedback. The cataloger can, at Step Four, select subject access terms to be loaded into the catalog record. Step Five, review, permits the user to alter, delete, or insert any changes before the final load.
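To summarise Steps Two through Four, here is a heavily hedged sketch of the processing flow. Every component is a stand-in: the real toolkit uses the MITRE tagger, a named entity recognizer, and the licensed AAT, none of which is reproduced here, and the function and variable names are ours, not CLiMB's.

```python
# Hedged sketch of the CLiMB Steps Two-Four flow described above. All of
# the components are stand-ins: segmentation replaces the Kan/Klavans/
# McKeown algorithm with a blank-line split, and a plain dict stands in
# for the AAT.
import re

def segment(text):
    # Stand-in for linear text segmentation (Kan, Klavans, McKeown, 1998).
    return [s.strip() for s in text.split("\n\n") if s.strip()]

def process_segment(segment_text, toi_list, thesaurus):
    """toi_list: target object identifiers from the minimal catalog record;
    thesaurus: stand-in AAT mapping of surface forms to preferred terms."""
    words = re.findall(r"[A-Za-z][A-Za-z-]*", segment_text)
    # Step Two: associate the segment with any TOIs it mentions.
    tois = [t for t in toi_list if t.lower() in segment_text.lower()]
    # Step Three: thesaurus lookup yields candidate subject terms.
    candidates = sorted({thesaurus[w.lower()] for w in words
                         if w.lower() in thesaurus})
    return tois, candidates

# Step Four is interactive: the cataloger reviews `candidates` for each
# associated image and selects the terms to load into the catalog record;
# Step Five is a final review pass before the load.
```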
CLiMB includes an extensive evaluation component, including formative evaluations with a wide variety of user types and iterative evaluation with cataloging specialists. These results, to be reported in Passonneau et al. (2006), will make it possible to assess the usefulness of CLiMB metadata once it is included in image search platforms. To our knowledge, CLiMB is engaged in a novel approach to the issues of automatic metadata extraction from selected authoritative texts, combined with thesauri and other authority lists, to assign weights to potential terms for use in image access. Thus, running studies with catalogers requires some initial training to familiarize them with the types of information, and types of error, they are likely to encounter. By enlisting catalogers to judge the output, we can then collect additional feedback for the ultimate application of new techniques.

CLiMB Achievements

Although the focus of this paper will be on the NLP components of CLiMB, at the service of the application, we will also discuss three aspects of the project that are required for success. The first is the identification of collections appropriate for use by CLiMB; we have developed selection criteria guidelines for use in the project (e.g. rights and permissions, digital format of text, etc.). Secondly, we will discuss some of the issues in creating a toolkit that are not necessarily NLP-centric, e.g. client-server architecture, implementation issues, and system-dependent usability issues such as speed. Thirdly, we will review the various types of evaluation that we have considered, including formative, iterative, and summative. We have explored utility across several sets of users (ranging from our own computer science graduate students to highly trained catalogers), and have observations about the toolkit that apply either to specific user groups or to applications.

Future NLP Research in CLiMB

In our next phase, we will explore methods to add ranked terms for selection, using relations with high-quality domain-specific thesauri such as the AAT; at the moment, no disambiguation is performed. We will explore the applicability of machine-learning techniques over data collected from experiments with CLiMB output and catalogers. We will compare combinations of taggers and chunkers to find the optimal configuration for this application. We will expand our texts to those having a less tightly-coupled relationship to an image, in order to push our techniques beyond tidy data to the more intractable (such as the Web). We will test utility with catalogers ranging from the most naïve (namely, our information studies graduate students) to the more sophisticated (from our new CLiMB-2 partners). We will load two, and possibly three, collections into the CLiMB platform, to test with image searchers, not just with catalogers. Among our objectives is the creation of a set of client-side downloadable tools to enhance access by labeling descriptive metadata for review by experts, as well as, ultimately, to enable sophisticated automatic analysis procedures for the wider digital library community.

Very Selected and Incomplete References (full set with final paper)

www.umiacs.umd.edu/~climb
Alembic Sentence Tagger (MITRE).
Brill Tagger (Eric Brill). NP-Chunk.
Davis, Peter T., David K. Elson, and Judith L. Klavans (2003). Methods for Precise Named Entity Matching in Digital Collections. Third ACM/IEEE Joint Conference on Digital Libraries (JCDL).
Kan, Min-Yen, Judith L. Klavans, and Kathleen R. McKeown (1998). "Linear segmentation and segment relevance." Sixth Annual Workshop on Very Large Corpora (WVLC-6), Montreal, Quebec, Canada.
Moëllic, Pierre-Alain, Patrick Hède, Gregory Grefenstette, and Christophe Millet (2005). "Evaluating Content Based Image Retrieval Techniques with the One Million Images CLIC TestBed." Proceedings of the Second World Enformatika Congress, WEC'05, February 25-27, 2005, Istanbul, Turkey, pp. 171-174.
Passonneau et al. (2006). Evaluation of CLiMB. LREC.
Town, C. and D. Sinclair (2004). Language-based querying of image collections on the basis of an extensible ontology. Image and Vision Computing, 22(3):251-267.

[i] CLiMB-1 was funded by the Andrew W. Mellon Foundation to Columbia University in the City of New York, 2002-2004. CLiMB-2 was funded by the Mellon Foundation to the University of Maryland, College of Information Studies, 2005-2007.

Author Index

Alcantara, Manuel ........ 9
Bloch, Isabelle .......... 34
Burgos, Diego ............ 5
Clough, Paul ............. 13, 44
Declerck, Thierry ........ 9
Deselaers, Thomas ........ 13
Grefenstette, Gregory .... 34
Grubinger, Michael ....... 13
Hanbury, Allan ........... 24
Hède, Patrick ............ 34
Klavans, Judith L. ....... 51
Millet, Christophe ....... 34
Moëllic, Pierre-Alain .... 34
Müller, Henning .......... 13
Pastra, Katerina ......... 40
Sanderson, Mark .......... 44
Tian, Jian ............... 44
Wanner, Leo .............. 5