Content-Based Audio Search: From Fingerprinting to Semantic Audio Retrieval

Pedro Cano
University Pompeu Fabra, Barcelona, Spain (April, 2007)


This dissertation is about content-based audio search. Specifically, it explores promising paths for bridging the semantic gap that currently prevents wide deployment of content-based audio search engines. Music and sound search engines rely on metadata, mostly human-generated, to manage collections of audio assets. Even though it is time-consuming and error-prone, human labeling is common practice. Audio content-based methods, algorithms that automatically extract descriptions from audio files, are generally not mature enough to provide the user-friendly representation that users demand when interacting with audio content. For the most part, content-based methods provide low-level descriptions, while high-level or semantic descriptions are beyond current capabilities. This dissertation has two parts. In the first part we explore the strengths and limitations of a pure low-level audio description technique: audio fingerprinting. We demonstrate, by implementing different systems, that automatically extracted low-level descriptions of audio can successfully solve a series of tasks such as linking unlabeled audio to the corresponding metadata, duplicate detection, or integrity verification. We show that the different audio fingerprinting systems can be explained with respect to a general fingerprinting framework. We then suggest that the fingerprinting framework, which shares many functional blocks with content-based audio search engines, can eventually be extended to allow for content-based similarity search, such as "find similar" or "query-by-example". However, low-level audio description cannot provide a semantic interaction with audio content. It cannot generate verbose and detailed descriptions in unconstrained domains, for instance asserting that a sound corresponds to "fast male footsteps on wood"; it yields only signal-level descriptions.

In the second part of the thesis we hypothesize that one of the problems hindering the closing of the semantic gap is the lack of intelligence that encodes common-sense knowledge, and that a knowledge base of this kind is a primary step toward bridging the gap. For the specific case of sound effects, we propose a general sound classifier capable of generating verbose descriptions in a representation that computers and users alike can understand. We conclude the second part with the description of a sound effects retrieval system that leverages both low-level and semantic technologies and allows for intelligent interaction with audio collections.
