This thesis establishes links between the fields of audio indexing and video sequence analysis, through the problem of drum signal analysis.
In the first part, the problem of drum track transcription from polyphonic music signals is addressed. After presenting several pre-processing methods for drum track enhancement, along with a large set of relevant features, a statistical machine learning approach to drum track transcription is proposed. Novel supervised and unsupervised sequence modeling methods are also introduced to improve the detection of drum strokes by taking into account the regularity of drum patterns. We conclude this part by evaluating various drum track separation algorithms and by underlining the duality between transcription and source separation.
In the second part, we extend this transcription system by incorporating the video information provided by cameras filming the drummer. Various approaches are introduced to segment the scene and map each region of interest to a drum instrument. Motion intensity features are then used to detect drum strokes. Our results show that a multimodal approach can resolve some ambiguities inherent to audio-only transcription.
In the final part, we extend our work to a broader range of music videos, which may not show the musicians. In particular, we address the problem of understanding how a piece of music can be illustrated by images. After presenting existing segmentation techniques for audio and video streams and introducing new ones, we define synchrony measures on their structures. These measures can be used both for retrieval applications (e.g., music retrieval by video) and for content classification.