Music is built from sound, ultimately resulting from an elaborate interaction between the sound-generating properties of physical objects (i.e. musical instruments) and the sound perception abilities of the human auditory system. Humans, even without any kind of formal music training, are typically able to extract, almost unconsciously, a great amount of relevant information from a musical signal. Features such as the beat of a musical piece, the main melody of a complex musical arrangement, the sound sources and events occurring in a complex musical mixture, the song structure (e.g. verse, chorus, bridge), and the musical genre of a piece are just some examples of the level of knowledge that a naive listener is commonly able to extract just from listening to a musical piece. In order to do so, the human auditory system uses a variety of cues for perceptual grouping, such as similarity, proximity, harmonicity, and common fate.
This dissertation proposes a flexible and extensible Computational Auditory Scene Analysis framework for modeling perceptual grouping in music listening. The goal of the proposed framework is to partition a monaural acoustical mixture into a perceptually motivated topological description of the sound scene (similar to the way a naive listener would perceive it) instead of attempting to accurately separate the mixture into its original and physical sources. The presented framework takes the view that perception primarily depends on the use of low-level sensory information, and therefore requires no training or prior knowledge about the audio signals under analysis. It is however designed to be efficient and flexible enough to allow the use of prior knowledge and high-level information in the segregation procedure.
The proposed system is based on a sinusoidal modeling analysis front-end, from which spectral components are segregated into sound events using perceptually inspired grouping cues. A novel similarity cue based on harmonicity (termed “Harmonically-Wrapped Peak Similarity”) is also introduced. The segregation process is based on spectral clustering methods, a technique originally proposed to model perceptual grouping tasks in the computer vision field. One of the main advantages of this approach is the ability to incorporate various perceptually-inspired grouping criteria into a single framework without requiring multiple processing stages.
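The segregation idea described above can be illustrated with a minimal sketch: sinusoidal peaks (each with a frequency and an amplitude) are compared pairwise under perceptually inspired similarity cues, and the resulting affinity matrix is partitioned with spectral clustering. The cue combination shown here (Gaussian similarity on log-frequency and log-amplitude distance) and all parameter values are illustrative assumptions, not the dissertation's actual front-end or its Harmonically-Wrapped Peak Similarity cue.

```python
import numpy as np

def similarity_matrix(freqs, amps, sigma_f=0.1, sigma_a=0.5):
    """Pairwise affinity of sinusoidal peaks from frequency and
    amplitude similarity cues (illustrative parameter choices)."""
    lf = np.log2(freqs)    # log-frequency: perceptual pitch distance
    la = np.log10(amps)    # log-amplitude: perceptual loudness distance
    df = lf[:, None] - lf[None, :]
    da = la[:, None] - la[None, :]
    # Product of Gaussian kernels combines both cues in one matrix
    return np.exp(-df**2 / (2 * sigma_f**2) - da**2 / (2 * sigma_a**2))

def spectral_cluster(W, k):
    """Partition peaks into k groups via the normalized graph
    Laplacian (Ng/Jordan/Weiss-style spectral clustering)."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    _, vecs = np.linalg.eigh(L)            # ascending eigenvalues
    X = vecs[:, :k]                        # spectral embedding
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Simple k-means on the embedded rows; farthest-point init
    centers = np.array([X[0], X[np.argmax(((X - X[0])**2).sum(1))]])
    for _ in range(50):
        labels = np.argmin(((X[:, None] - centers[None])**2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy example: peaks from two spectrally distant sources
freqs = np.array([200.0, 210.0, 220.0, 2000.0, 2100.0, 2200.0])
amps = np.ones(6)
labels = spectral_cluster(similarity_matrix(freqs, amps), k=2)
```

A key point the abstract makes is visible even in this toy version: adding a further cue (e.g. a harmonicity-based similarity) only requires multiplying another kernel into the affinity matrix, so all grouping criteria act within a single clustering stage rather than in cascaded processing steps.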
Experimental validation of the perceptual grouping cues shows that the novel harmonicity-based similarity cue presented in this dissertation compares favourably to other state-of-the-art harmonicity cues, and that its combination with other grouping cues, such as frequency and amplitude similarity, improves the overall separation performance. In addition, experimental results for several Music Information Retrieval tasks are presented, including predominant melodic source segregation, main melody pitch estimation, voicing detection, and timbre identification in polyphonic music signals. The use of segregated signals in these tasks yields final results that compare favourably with, or outperform, typical and state-of-the-art audio content analysis systems, which traditionally represent the entire polyphonic sound mixture statistically. Although a specific implementation of the proposed framework is presented in this dissertation and made available as open source software, the proposed approach is flexible enough to utilize different analysis front-ends and grouping criteria in a straightforward and efficient manner.