Perception and Modeling of Segment Boundaries in Popular Music
Michael J. Bruderer, Technische Universiteit Eindhoven, Eindhoven, Netherlands, 2008.
Music Recommendation and Discovery in the Long Tail
Óscar Celma, Universitat Pompeu Fabra, Barcelona, Spain, 2008.
[BibTex, Abstract, External Link]
Music consumption is biased towards a few popular artists. For instance, in 2007 only 1% of all digital tracks accounted for 80% of all sales. Similarly, 1,000 albums accounted for 50% of all album sales, and 80% of all albums sold were purchased fewer than 100 times. There is a need to help people filter, discover, personalise, and receive recommendations from the huge amount of music content available along the Long Tail.
Current music recommendation algorithms try to accurately predict what people want to listen to. However, quite often these algorithms tend to recommend popular music, or music already well known to the user, decreasing the effectiveness of the recommendations. These approaches focus on improving the accuracy of the recommendations; that is, they try to make accurate predictions about what a user might listen to or buy next, independently of how useful the provided recommendations are to the user.
In this Thesis we stress the importance of the user's perceived quality of the recommendations. We model the Long Tail curve of artist popularity to predict potentially interesting, unknown music hidden in the tail of the popularity curve. Effective recommendation systems should promote novel and relevant material (non-obvious recommendations), taken primarily from the tail of a popularity distribution.
The main contributions of this Thesis are: (i) a novel network-based approach for recommender systems, based on the analysis of the item (or user) similarity graph, and the popularity of the items, (ii) a user-centric evaluation that measures the user's relevance and novelty of the recommendations, and (iii) two prototype systems that implement the ideas derived from the theoretical work. Our findings have significant implications for recommender systems that assist users to explore the Long Tail, digging for content they might like.
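The head/tail concentration that motivates this work is easy to quantify. The sketch below is purely illustrative (the function name and the Zipf-like toy catalogue are assumptions for the example, not Celma's actual Long Tail model): it computes what fraction of a catalogue's most popular items accounts for a given share of total plays.

```python
def head_fraction(play_counts, share=0.8):
    """Fraction of items (most popular first) needed to reach
    `share` of total plays: a crude summary of how concentrated
    consumption is at the head of the Long Tail."""
    counts = sorted(play_counts, reverse=True)
    total = sum(counts)
    running, needed = 0, 0
    for c in counts:
        running += c
        needed += 1
        if running >= share * total:
            break
    return needed / len(counts)

# Zipf-like toy catalogue: the item at popularity rank r gets ~1000/r plays.
catalogue = [1000 // r for r in range(1, 1001)]
```

On this toy catalogue, well under a third of the items already cover 80% of the plays; the real sales figures quoted above are even more skewed.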
Automatic Transcription of Pitch Content in Music and Selected Applications
Matti Ryynänen, Tampere University of Technology, Tampere, Finland, December 2008.
[BibTex, Abstract, PDF]
Transcription of music refers to the analysis of a music signal in order to produce a parametric representation of the sounding notes in the signal. This is conventionally carried out by listening to a piece of music and writing down the symbols of common musical notation to represent the occurring notes in the piece. Automatic transcription of music refers to the extraction of such representations using signal-processing methods.
This thesis concerns the automatic transcription of pitched notes in musical audio and its applications. Emphasis is laid on the transcription of realistic polyphonic music, where multiple pitched and percussive instruments are sounding simultaneously. The methods included in this thesis are based on a framework which combines both low-level acoustic modeling and high-level musicological modeling. The emphasis in the acoustic modeling has been set to note events so that the methods produce discrete-pitch notes with onset times and durations as output. Such transcriptions can be efficiently represented as MIDI files, for example, and the transcriptions can be converted to common musical notation via temporal quantization of the note onsets and durations. The musicological model utilizes musical context and trained models of typical note sequences in the transcription process. Based on the framework, this thesis presents methods for generic polyphonic transcription, melody transcription, and bass line transcription. A method for chord transcription is also presented.
All the proposed methods have been extensively evaluated using realistic polyphonic music. In our evaluations with 91 half-a-minute music excerpts, the generic polyphonic transcription method correctly found 39% of all the pitched notes (recall), while 41% of the transcribed notes were correct (precision). Despite the seemingly low recognition rates in our simulations, this method was top-ranked in the polyphonic note tracking task in the international MIREX evaluations in 2007 and 2008. The methods for melody, bass line, and chord transcription were evaluated using hours of music, where an F-measure of 51% was achieved for both melodies and bass lines. The chord transcription method was evaluated using the first eight albums by The Beatles, and it produced correct frame-based labeling about 70% of the time.
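The recall, precision, and F-measure figures quoted here follow the standard note-event evaluation: a transcribed note counts as correct if it matches a reference note in pitch and, within a tolerance, in onset time. A minimal sketch, assuming greedy one-to-one matching and a hypothetical 50 ms onset tolerance (the actual MIREX protocol differs in its details):

```python
def note_scores(reference, transcribed, onset_tol=0.05):
    """Greedy one-to-one matching of (midi_pitch, onset_sec) note events.
    A transcribed note is a hit if some unmatched reference note has the
    same pitch and an onset within `onset_tol` seconds."""
    unmatched = list(reference)
    hits = 0
    for pitch, onset in transcribed:
        for ref in unmatched:
            if ref[0] == pitch and abs(ref[1] - onset) <= onset_tol:
                unmatched.remove(ref)
                hits += 1
                break
    precision = hits / len(transcribed) if transcribed else 0.0
    recall = hits / len(reference) if reference else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f
```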
The transcriptions are not only useful as human-readable musical notation but also in several other application areas, including music information retrieval and content-based audio modification. This is demonstrated by two applications included in this thesis. The first application is a query-by-humming system which is capable of searching for melodies similar to a user query directly from commercial music recordings. In our evaluation with a database of 427 full commercial audio recordings, the method retrieved the correct recording in the top-three list for 58% of the 159 hummed queries. The method was also top-ranked in the “query by singing/humming” task in MIREX 2008 for a database of 2048 MIDI melodies and 2797 queries. The second application uses automatic melody transcription for accompaniment and vocals separation. The transcription also enables tuning the user's singing to the original melody in a novel karaoke application.
Modeling musical anticipation: From the time of music to the music of time
Arshia Cont, University of California, San Diego, CA, USA, October 2008.
[BibTex, Abstract, External Link]
This thesis studies musical anticipation, both as a cognitive process and design principle for applications in music information retrieval and computer music. For this study, we reverse the problem of modeling anticipation addressed mostly in music cognition literature for the study of musical behavior, to anticipatory modeling, a cognitive design principle for modeling artificial systems. We propose anticipatory models and applications concerning three main preoccupations of expectation: What to expect?, How to expect?, and When to expect?
For the first question, we introduce a mathematical framework for music information geometry combining information theory, differential geometry, and statistical learning, with the aim of representing information content and gaining access to music structures. The second question is addressed as a machine learning planning problem in an environment, where interactive learning methods are employed on parallel agents to learn anticipatory profiles of actions to be used for decision making. To address the third question, we provide a novel anticipatory design for the problem of synchronizing a live performer to a pre-written music score, leading to Antescofo, a preliminary tool for the writing of time and interaction in computer music. Despite the variety of topics present in this thesis, the anticipatory design concept is common to all propositions, with the following premises: that an anticipatory design can reduce the structural and computational complexity of modeling, and can help address complex problems in computational aesthetics and, most importantly, computer music.
Novel Techniques for Audio Music Classification and Search
Kris West, University of East Anglia, UK, September 2008.
[BibTex, Abstract, PDF]
This thesis presents a number of modified or novel techniques for the analysis of music audio for the purposes of classifying it according to genre or implementing so-called ‘search-by-example’ systems, which recommend music to users and generate playlists and personalised radio stations. Novel procedures for the parameterisation of music audio are introduced, including an audio event-based segmentation of the audio feature streams and methods of encoding rhythmic information in the audio signal. A large number of experiments are performed to estimate the performance of different classification algorithms when applied to the classification of various sets of music audio features. The experiments show differing trends regarding the best-performing type of classification procedure to use for different feature sets and segmentations of feature streams.
A novel machine learning algorithm (MVCART), based on the classic Decision Tree algorithm (CART), is introduced to more effectively deal with multi-variate audio features and the additional challenges introduced by event-based segmentation of audio feature streams. This algorithm achieves the best results on the classification of event-based music audio features and approaches the performance of state-of-the-art techniques based on summaries of the whole audio stream.
Finally, a number of methods of extending music classifiers, including those based on event-based segmentations and the MVCART algorithm, to build music similarity estimation and search procedures are introduced. Conventional methods of audio music search are based solely on music audio profiles, whereas the methods introduced allow audio music search and recommendation indices to utilise cultural information (in the form of music genres) to enhance or scale their recommendations, without requiring this information to be present for every track. These methods are shown to yield very significant reductions in computational complexity over existing techniques (such as those based on the KL-Divergence) whilst providing a comparable or greater level of performance. Not only does the significantly reduced complexity of these techniques allow them to be applied to much larger collections than the KL-Divergence, but they also produce metric similarity spaces, allowing the use of standard techniques for the scaling of metric search spaces.
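For context, the KL-divergence baseline referred to above is typically a symmetrised Kullback-Leibler divergence between Gaussian models of each track's frame-level features. A sketch for the diagonal-covariance case (the diagonal-Gaussian parameterisation and the function name are assumptions for this example, not the thesis's exact formulation):

```python
import math

def sym_kl_diag_gauss(mu1, var1, mu2, var2):
    """Symmetrised KL divergence between two diagonal-covariance
    Gaussians, each given as (mean vector, variance vector): a classic
    audio-similarity baseline."""
    def kl(ma, va, mb, vb):
        return 0.5 * sum(math.log(vb_i / va_i) + (va_i + (ma_i - mb_i) ** 2) / vb_i - 1
                         for ma_i, va_i, mb_i, vb_i in zip(ma, va, mb, vb))
    return kl(mu1, var1, mu2, var2) + kl(mu2, var2, mu1, var1)
```

Note that this quantity violates the triangle inequality, so it does not induce a metric space, which is one of the scaling drawbacks the methods above avoid.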
Cross-Domain Content-Based Retrieval of Audio Music through Transcription
Iman S. H. Suyoto, Royal Melbourne Institute of Technology (RMIT), Melbourne, September 2008.
[BibTex, Abstract, External Link]
Research in the field of music information retrieval (MIR) is concerned with methods to effectively retrieve a piece of music based on a user's query. An important goal in MIR research is the ability to successfully retrieve music stored as recorded audio using note-based queries.
In this work, we consider the searching of musical audio using symbolic queries. We first examined the effectiveness of using a relative pitch approach to represent queries and pieces. Our experimental results revealed that this technique, while effective, is optimal when the whole tune is used as a query. We then suggested an algorithm involving the use of pitch classes in conjunction with the longest common subsequence algorithm between a query and target, also using the whole tune as a query. We also proposed an algorithm that works effectively when only a small part of a tune is used as a query. The algorithm makes use of a sliding window in addition to pitch classes and the longest common subsequence algorithm between a query and target. We examined the algorithm using queries based on the beginning, middle, and ending parts of pieces.
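The matching machinery described here, pitch classes combined with the longest common subsequence and a query-length sliding window, can be sketched as follows (illustrative only; the function names and the normalisation by query length are assumptions for this example):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences,
    via the standard O(len(a) * len(b)) dynamic programme."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def best_window_score(query_pcs, target_pcs):
    """Slide a query-length window over the target's pitch-class
    sequence and keep the best LCS match, normalised by query length."""
    w = len(query_pcs)
    best = 0
    for start in range(max(1, len(target_pcs) - w + 1)):
        best = max(best, lcs_length(query_pcs, target_pcs[start:start + w]))
    return best / w
```

For example, best_window_score([0, 4, 7, 0], [2, 2, 0, 4, 7, 0, 5, 5]) returns 1.0: the query is found intact in the middle of the target, mirroring the middle-of-piece queries examined above.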
We performed experiments on an audio collection and manually-constructed symbolic queries. Our experimental evaluation revealed that our techniques are highly effective, with most queries used in our experiments being able to retrieve a correct answer in the first rank position.
In addition, we examined the effectiveness of duration-based features for improving retrieval effectiveness over the use of pitch only. We investigated note durations and inter-onset intervals. For this purpose, we used solely symbolic music so that we could focus on the core of the problem. A relative pitch approach alongside a relative duration representation were used in our experiments. Our experimental results showed that durations fail to significantly improve retrieval effectiveness, whereas inter-onset intervals significantly improve retrieval effectiveness.
From Sparse Models to Timbre Learning: New Methods for Musical Source Separation
Juan José Burred, Technical University of Berlin, Berlin, Germany, September 2008.
[BibTex, Abstract, External Link]
The goal of source separation is to detect and extract the individual signals present in a mixture. Its application to sound signals and, in particular, to music signals, is of interest for content analysis and retrieval applications arising in the context of online music services. Other applications include unmixing and remixing for post-production, restoration of old recordings, object-based audio compression and upmixing to multichannel setups.
This work addresses the task of source separation from monaural and stereophonic linear musical mixtures. In both cases, the problem is underdetermined, meaning that there are more sources to separate than channels in the observed mixture. This requires taking strong statistical assumptions and/or learning a priori information about the sources in order for a solution to be feasible. On the other hand, constraining the analysis to instrumental music signals allows exploiting specific cues such as spectral and temporal smoothness, note-based segmentation and timbre similarity for the detection and extraction of sound events.
The statistical assumptions and, if present, the a priori information, are both captured by a given source model that can greatly vary in complexity and extent of application. The approach used here is to consider source models of increasing levels of complexity, and to study their implications on the separation algorithm.
The starting point is sparsity-based separation, which makes the general assumption that the sources can be represented in a transformed domain with few high-energy coefficients. It will be shown that sparsity, and consequently separation, can both be improved by using nonuniform-resolution time-frequency representations. To that end, several types of frequency-warped filter banks will be used as signal front-ends in conjunction with an unsupervised stereo separation approach.
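The sparsity assumption can be made concrete with a toy binary-masking example: if each transform coefficient is dominated by a single source, assigning every coefficient of the mixture to its strongest source recovers the sources almost exactly. The sketch below assumes the per-source energy templates are known, which a real unsupervised system must of course estimate; it is an illustration of the principle, not of the thesis's algorithms.

```python
def binary_mask_separate(mixture_coeffs, source_templates):
    """Toy sparsity-based separation: each transform coefficient is
    assigned entirely to the source whose (assumed known) energy
    template is strongest at that coefficient."""
    estimates = [[0.0] * len(mixture_coeffs) for _ in source_templates]
    for i, c in enumerate(mixture_coeffs):
        owner = max(range(len(source_templates)),
                    key=lambda s: source_templates[s][i])
        estimates[owner][i] = c
    return estimates
```

The better the transform concentrates each source into few high-energy coefficients, the closer this disjoint-support idealisation holds, which is why nonuniform-resolution front-ends can improve separation.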
As a next step, more sophisticated models based on sinusoidal modeling and statistical training will be considered in order to improve separation and to allow the consideration of the maximally underdetermined problem: separation from single-channel signals. An emphasis is given in this work to a detailed but compact approach to train models of the timbre of musical instruments. An important characteristic of the approach is that it aims at a close description of the temporal evolution of the spectral envelope. The proposed method uses a formant-preserving, dimension-reduced representation of the spectral envelope based on spectral interpolation and Principal Component Analysis. It then describes the timbre of a given instrument as a Gaussian Process that can be interpreted either as a prototype curve in a timbral space or as a time-frequency template in the spectral domain.
A monaural separation method based on sinusoidal modeling and on the mentioned timbre modeling approach will be presented. It exploits common-fate and good-continuation cues to extract groups of sinusoidal tracks corresponding to the individual notes. Each group is compared to each one of the timbre templates on the database using a specially-designed measure of timbre similarity, followed by a Maximum Likelihood decision. Subsequently, overlapping and missing parts of the sinusoidal tracks are retrieved by interpolating the selected timbre template. The method is later extended to stereo mixtures by using a preliminary spatial-based blind separation stage, followed by a set of refinements performed by the above sinusoidal modeling and timbre matching methods and aiming at reducing interferences with the undesired sources.
A notable characteristic of the proposed separation methods is that they do not assume harmonicity, and are thus not based on a previous multipitch estimation stage, nor on the input of detailed pitch-related information. Instead, grouping and separation relies solely on the dynamic behavior of the amplitudes of the partials. This also allows separating highly inharmonic sounds and extracting chords played by a single instrument as individual entities.
The fact that the presented approaches are supervised and based on classification and similarity allows using them (or parts thereof) for other content analysis applications. In particular the use of the timbre models, and the timbre matching stages of the separation systems will be evaluated in the tasks of musical instrument classification and detection of instruments in polyphonic mixtures.
Real Time Automatic Harmonisation (in French)
Giordano Cabra, University of Paris 6, Paris, France, July 2008.
We define real-time automatic harmonization systems (HATR, from the French acronym) as computer programs that accompany a possibly improvised melody by finding an appropriate harmony to be applied to a rhythmic pattern. In a real-time harmonization situation, besides performance and scope constraints, the system and the user are in symbiosis. Consequently, the system incorporates elements of accompaniment, composition, and improvisation, and remains a challenging project, not only because of its complexity but also because of the lack of solid references in the scientific literature.
In this work, we propose some extensions to techniques developed in the recent past by the music information retrieval (MIR) community, in order to create programs able to work directly with audio signals. We have performed a series of experiments, which allowed us to systematize the main parameters involved in the development of such systems. This systematization led us to the construction of a HATR framework to explore possible solutions, instead of individual applications.
We compared the applications implemented with this framework using an objective measure as well as a human subjective evaluation. This thesis presents the pros and cons of each solution, and estimates its musical level by comparing it to real musicians. The results of these experiments show that a simple solution may outperform complex ones. Further experiments were made to test the robustness and scalability of the framework solutions.
Finally, the technology we constructed has been tested in novel situations, in order to explore possibilities of future work.
Automatically Extracting, Analyzing, and Visualizing Information on Music Artists from the World Wide Web
Markus Schedl, Johannes Kepler University, Linz, Austria, July 2008.
[BibTex, Abstract, PDF]
In the context of this PhD thesis, methods for automatically extracting music-related information from the World Wide Web have been elaborated, implemented, and analyzed. Such information is becoming more and more important in times of digital music distribution via the Internet as users of online music stores nowadays expect to be offered additional music-related information beyond the pure digital music file. Novel techniques have been developed as well as existing ones refined in order to gather information about music artists and bands from the Web. These techniques are related to the research fields of music information retrieval, Web mining, and information visualization. More precisely, on sets of Web pages that are related to a music artist or band, Web content mining techniques are applied to address the following categories of information:
- similarities between music artists or bands
- prototypicality of an artist or a band for a genre
- descriptive properties of an artist or a band
- band members and instrumentation
- images of album cover artwork
Different approaches to retrieve the corresponding pieces of information for each of these categories have been elaborated and evaluated thoroughly on a considerable variety of music repositories. The results and main findings of these assessments are reported. Moreover, visualization methods and user interaction models for prototypical and similar artists as well as for descriptive terms have evolved from this work.
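As a toy illustration of Web-based artist similarity (not Schedl's actual measures, which are considerably more elaborate), relatedness can be estimated from page co-occurrence counts in Jaccard style; the dictionary layout and function name here are assumptions for the example:

```python
def artist_similarity(pages_with, pages_with_both, a, b):
    """Toy Web-co-occurrence similarity: the fraction of pages
    mentioning either artist that mention both.  The counts would
    come from a search engine index or a Web crawl."""
    both = pages_with_both[(a, b)]
    union = pages_with[a] + pages_with[b] - both
    return both / union if union else 0.0
```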
Based on the insights gained by the various experiments and evaluations conducted, the core application of this thesis, the "Automatically Generated Music Information System" (AGMIS), was built. AGMIS demonstrates the applicability of the elaborated techniques on a large collection of more than 600,000 artists by providing a Web-based user interface to access a database that has been populated automatically with the extracted information.
Although AGMIS does not always give perfectly accurate results, the automatic approaches to information retrieval have some advantages in comparison with those employed in existing music information systems, which are either based on labor-intensive information processing by music experts or on community knowledge that is vulnerable to distortion of information.
A System for Acoustic Chord Transcription and Key Extraction from Audio Using Hidden Markov Models Trained on Synthesized Audio
Kyogu Lee, Stanford University, CA, USA, March 2008.
[BibTex, Abstract, PDF]
Extracting high-level information of musical attributes such as melody, harmony, key, or rhythm from the raw waveform is a critical process in Music Information Retrieval (MIR) systems. Using one or more of such features in a front end, one can efficiently and effectively search, retrieve, and navigate through a large collection of musical audio. Among those musical attributes, harmony is a key element in Western tonal music. Harmony can be characterized by a set of rules stating how simultaneously sounding (or inferred) tones create a single entity (commonly known as a chord), how the elements of adjacent chords interact melodically, and how sequences of chords relate to one another in a functional hierarchy. Patterns of chord changes over time allow for the delineation of structural features such as phrases, sections and movements. In addition to structural segmentation, harmony often plays a crucial role in projecting emotion and mood.
This dissertation focuses on two aspects of harmony, chord labeling and chord progressions in diatonic functional tonal music. Recognizing the musical chords from the raw audio is a challenging task. In this dissertation, a system that accomplishes this goal using hidden Markov models is described.
In order to avoid the enormously time-consuming and laborious process of manual annotation, which must be done in advance to provide the ground-truth to the supervised learning models, symbolic data like MIDI files are used to obtain a large amount of labeled training data. To this end, harmonic analysis is first performed on noise-free symbolic data to obtain chord labels with precise time boundaries. In parallel, a sample-based synthesizer is used to create audio files from the same symbolic files. The feature vectors extracted from synthesized audio are in perfect alignment with the chord labels, and are used to train the models.
Sufficient training data allows for key- or genre-specific models, where each model is trained on music of a specific key or genre to estimate key- or genre-dependent model parameters. In other words, music of a certain key or genre reveals its own characteristics reflected by chord progression, which result in unique model parameters represented by the transition probability matrix. In order to extract key or identify genre, when the observation input sequence is given, the forward-backward or Baum-Welch algorithm is used to efficiently compute the likelihood of the models, and the model with the maximum likelihood gives the key or genre information. Then the Viterbi decoder is applied to the corresponding model to extract the optimal state path in a maximum-likelihood sense, which is identical to the frame-level chord sequence. The experimental results show that the proposed system not only yields chord recognition performance comparable to or better than other previously published systems, but also provides additional information about key and/or genre without using any other algorithms or feature sets for those tasks. It is also demonstrated that the chord sequence with precise timing information can be successfully used to find cover songs from audio and to detect musical phrase boundaries by recognizing cadences or harmonic closures.
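The decoding step described above is the standard Viterbi algorithm. A minimal log-domain sketch over generic chord states (a generic HMM illustration; the thesis's actual emission models are trained on synthesized audio and its states are chord labels):

```python
def viterbi(obs_loglik, log_trans, log_init):
    """Maximum-likelihood state path through an HMM.
    obs_loglik[t][s]: log-likelihood of frame t under state s;
    log_trans[p][s]:  log transition probability p -> s;
    log_init[s]:      log initial probability of state s."""
    n_states = len(log_init)
    delta = [log_init[s] + obs_loglik[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(obs_loglik)):
        back.append([])
        new = []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: delta[p] + log_trans[p][s])
            back[-1].append(best_prev)
            new.append(delta[best_prev] + log_trans[best_prev][s] + obs_loglik[t][s])
        delta = new
    # Backtrack from the best final state.
    path = [max(range(n_states), key=lambda s: delta[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With sticky transitions (high self-transition log-probability) the decoder smooths over noisy frames, which is exactly why frame-level chord labeling benefits from the HMM formulation.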
This dissertation makes a substantial contribution to the music information retrieval community in many aspects. First, it presents a probabilistic framework that combines two closely related musical tasks — chord recognition and key extraction from audio — and achieves state-of-the-art performance in both applications. Second, it suggests a solution to a bottleneck problem in machine learning approaches by demonstrating a method of automatically generating a large amount of labeled training data from symbolic music documents. This will help free researchers from the laborious task of manual annotation. Third, it makes use of a more efficient and robust feature vector called the tonal centroid and proves, via a thorough quantitative evaluation, that it consistently outperforms the conventional chroma feature, which was almost exclusively used by other algorithms. Fourth, it demonstrates that the basic model can easily be extended to key- or genre-specific models, not only to improve chord recognition but also to estimate key or genre. Lastly, it demonstrates the usefulness of the recognized chord sequence in several practical applications such as cover song finding and structural music segmentation.
Computer-Based Music Theory and Acoustics
Matt Wright, Stanford University, CA, USA, March 2008.
[BibTex, Abstract, External Link]
A musical event’s Perceptual Attack Time (“PAT”) is its perceived moment of rhythmic placement; in general it is after physical or perceptual onset. If two or more events sound like they occur rhythmically together it is because their PATs occur at the same time, and the perceived rhythm of a sequence of events is the timing pattern of the PATs of those events. A quantitative model of PAT is useful for the synthesis of rhythmic sequences with a desired perceived timing as well as for computer-assisted rhythmic analysis of recorded music. Musicians do not learn to make their notes' physical onsets have a certain rhythm; rather, they learn to make their notes' perceptual attack times have a certain rhythm.
PAT is notoriously difficult to measure, because all known methods can measure a test sound’s PAT only in relationship to a physical action or to a second sound, both of which add their own uncertainty to the measurements. A novel aspect of this work is the use of the ideal impulse (the shortest possible digital audio signal) as a reference sound. Although the ideal impulse is the best possible reference in the sense of being perfectly isolated in time and having a very clear and percussive attack, it is quite difficult to use as a reference for most sounds because it has a perfectly broad frequency spectrum, and it is more difficult to perceive the relative timing of sounds when their spectra differ greatly. This motivates another novel contribution of this work, Spectrally Matched Click Synthesis, the creation of arbitrarily short duration clicks whose magnitude frequency spectra approximate those of arbitrary input sounds.
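One simple way to realise a click whose magnitude spectrum matches an input sound is zero-phase resynthesis: keep the magnitude spectrum, discard the phase, and invert the transform, so that all components peak together at t = 0 and the result is maximally impulse-like for that spectrum. This is a sketch of the idea, not necessarily the thesis's exact synthesis procedure:

```python
import cmath, math

def spectrally_matched_click(x):
    """Zero-phase click: take the magnitude spectrum of x, discard
    the phase, and inverse-DFT.  The result is an impulse-like signal
    whose magnitude spectrum equals that of x.  Naive O(N^2) DFT,
    fine for short frames."""
    n = len(x)
    mags = [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]
    return [sum(mags[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]
```

For real use one would replace the naive loops with an FFT; the naive form is kept here only for self-containment.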
All existing models represent the PAT of each event as a single instant. However, there is often a range of values that sound equally correct when aligning sounds rhythmically, and this range depends on perceptual characteristics of the specific sounds, such as the sharpness of their attacks. Therefore this work represents each event’s PAT as a continuous probability density function indicating how likely a typical listener would be to hear the sound’s PAT at each possible time. The methodological problem of deriving each sound’s own PAT from measurements comparing pairs of sounds therefore becomes the problem of estimating the distributions of the random variables for each sound’s intrinsic PAT, given only observations of a random variable corresponding to the difference between the intrinsic PAT distributions for the two sounds plus noise. The methods presented to address this draw from maximum likelihood estimation and the graph-theoretical shortest-path problem.
This work describes an online listening test, in which subjects download software that presents a series of PAT measurement trials and allows them to adjust their relative timing until they sound synchronous. This establishes perceptual “ground truth” for the PAT of a collection of 20 sounds compared against each other in various combinations. As hoped, subjects were indeed able to align a sound more reliably to one of that sound’s spectrally matched clicks than to other sounds of the same duration.
The representation of PAT with probability density functions provides a new perspective on the long-standing problem of predicting PAT directly from acoustical signals. Rather than choosing a single moment for PAT given a segment of sound known a priori to contain a single musical event, these regression methods estimate continuous shapes of PAT distributions from continuous (not necessarily presegmented) audio signals, formulated as a supervised machine learning regression problem whose inputs are DSP functions computed from the sound, the detection functions used in the automatic onset detection literature. This work concludes with some preliminary musical applications of the resulting models.
Studies on Hybrid Music Recommendation Using Timbral and Rhythmic Features
Kazuyoshi Yoshii, Kyoto University, Kyoto, Japan, March 2008.
[BibTex, Abstract, PDF]
The importance of music recommender systems is increasing because most online services that manage large music collections cannot provide users with entirely satisfactory access to their collections. Many users of music streaming services want to discover “unknown” pieces that truly match their musical tastes even if these pieces are scarcely known by other users. Recommender systems should thus be able to select musical pieces that will likely be preferred by estimating user tastes. To develop a satisfactory system, it is essential to take into account the contents and ratings of musical pieces. Note that the content should be automatically analyzed from multiple viewpoints such as timbral and rhythmic aspects whereas the ratings are provided by users.
We deal with hybrid music recommendation in this thesis based on users’ ratings and timbral and rhythmic features extracted from polyphonic musical audio signals. Our goal was to simultaneously satisfy six requirements: (1) accuracy, (2) diversity, (3) coverage, (4) promptness, (5) adaptability, and (6) scalability. To achieve this, we focused on a model-based hybrid filtering method that has been proposed in the field of document recommendation. This method can be used to make accurate and prompt recommendations with wide coverage and diversity in musical pieces, using a probabilistic generative model that unifies both content-based and rating-based data in a statistical way.
To apply this method to enable satisfying music recommendation, we tackled four issues: (i) lack of adaptability, (ii) lack of scalability, (iii) no capabilities for using musical features, and (iv) no flexibility for integrating multiple aspects of music. To solve issue (i), we propose an incremental training method that partially updates the model to promptly reflect partial changes in the data (addition of rating scores and registration of new users and pieces) instead of training the entire model from scratch. To solve issue (ii), we propose a cluster-based training method that efficiently constructs the model at a fixed computational cost regardless of the numbers of users and pieces. To solve issue (iii), we propose a bag-of-features model that represents the time-series features of a musical piece as a set of existence probabilities of predefined features. To solve issue (iv), we propose a flexible method that integrates the musical features of timbral and rhythmic aspects into bag-of-features representations.
In Chapter 3, we first explain the model-based method of hybrid filtering that takes into account both rating-based and content-based data, i.e., rating scores awarded by users and the musical features of audio signals. The probabilistic model can be used to formulate a generative mechanism that is assumed to lie behind the observed data from the viewpoint of probability theory. We then present incremental training and its application to cluster-based training. The model formulation enables us to incrementally update the partial parameters of the model according to the increase in observed data. Cluster-based training initially builds a compact model called a core model for fixed numbers of representative users and pieces, which are the centroids of clusters of similar users and pieces. To obtain the complete model, the core model is then extended by registering all users and pieces with incremental training. Finally, we describe the bag-of-features model to enable hybrid filtering to deal with musical features extracted from polyphonic audio signals. To capture the timbral aspects of music, we created a model for the distribution of Mel frequency cepstral coefficients (MFCCs). To also take into account the rhythmic aspects of music, we effectively combined rhythmic features based on drum-sound onsets with timbral features (MFCCs) by using principal component analysis (PCA). The onsets of drum sounds were automatically obtained as described in the next chapter.
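As a rough illustration of the two ingredients described above, the following sketch shows a bag-of-features summary (per-frame soft assignment to a codebook of predefined features, averaged over time into existence probabilities) and a PCA-based fusion of timbral and rhythmic frame features. This is a hedged sketch only: the soft-assignment scheme, the codebook, and all function and parameter names (`bag_of_features`, `pca_fuse`, `temperature`) are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def pca_fuse(timbral, rhythmic, n_components=2):
    """Concatenate timbral (e.g. MFCC) and rhythmic frame features,
    then project onto the top principal components to obtain fused
    per-frame features (illustrative PCA via SVD)."""
    X = np.hstack([timbral, rhythmic])      # (T, D_timbral + D_rhythmic)
    X = X - X.mean(axis=0)                  # center before PCA
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_components].T          # (T, n_components)

def bag_of_features(frames, codebook, temperature=1.0):
    """Represent a time series of feature vectors as existence
    probabilities of predefined prototype features.

    frames:   (T, D) per-frame feature vectors
    codebook: (K, D) predefined prototypes
    Returns a K-dim probability vector: how strongly each prototype
    "exists" in the piece, averaged over all frames."""
    # squared Euclidean distance from every frame to every prototype
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # softmax over prototypes -> per-frame assignment probabilities
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return p.mean(axis=0)                   # fixed-length piece summary
```

The fixed-length output of `bag_of_features` is what makes the time-series audio features usable inside a probabilistic model alongside rating data.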
Chapter 4 describes a system that detects onsets of the bass drum, snare drum, and hi-hat cymbals in polyphonic audio signals. The system takes a template-matching-based approach that uses the power spectrograms of drum sounds as templates. Two problems arise: first, no single template is appropriate for every song; second, it is difficult to detect drum-sound onsets in mixtures containing various other sounds. To solve these, we propose two methods: template adaptation and harmonic-structure suppression. First, an initial template for each drum sound (a seed template) is prepared. Template adaptation then adapts it to the actual drum-sound spectrograms appearing in the song spectrogram. To make the system robust against harmonic sounds overlapping drum sounds, harmonic-structure suppression attenuates harmonic components in the song spectrogram. Experimental results with 70 popular songs demonstrated that our methods improved recognition accuracy, achieving 83%, 58%, and 46% in detecting the onsets of the bass drum, snare drum, and hi-hat cymbals, respectively.
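A toy sketch of the two building blocks, harmonic suppression followed by spectrogram template matching, is given below. The frequency-axis median filter is a generic stand-in for the thesis's harmonic-structure suppression (narrow tonal peaks are removed while broadband drum energy survives), and all names and parameters are hypothetical.

```python
import numpy as np

def suppress_harmonics(spec, kernel=17):
    """Attenuate harmonic (tonal) components in a power spectrogram.

    Tonal partials appear as narrow peaks along the frequency axis,
    while drum hits are broadband; a running median across frequency
    keeps the broadband floor and discards narrow harmonic peaks.
    (Illustrative stand-in, not the thesis's method.)"""
    F, _ = spec.shape
    half = kernel // 2
    out = np.empty_like(spec)
    for f in range(F):
        lo, hi = max(0, f - half), min(F, f + half + 1)
        out[f] = np.median(spec[lo:hi], axis=0)   # median over a frequency band
    return np.minimum(spec, out)                   # only ever attenuate

def match_template(spec, template):
    """Slide a drum-sound spectrogram template over the song spectrogram
    and return a per-frame normalized-correlation matching score."""
    F, T = spec.shape
    Ft, Tt = template.shape
    assert Ft == F, "template must cover the same frequency bins"
    tnorm = np.linalg.norm(template) + 1e-12
    scores = np.zeros(T - Tt + 1)
    for t in range(T - Tt + 1):
        seg = spec[:, t:t + Tt]
        scores[t] = (seg * template).sum() / (np.linalg.norm(seg) * tnorm + 1e-12)
    return scores
```

Peaks in the returned score curve would then be picked as candidate drum onsets; template adaptation would iteratively re-estimate `template` from the best-matching segments of the song itself.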
In Chapter 5, we discuss the evaluation of our system using audio signals from commercial CDs and their corresponding rating scores obtained from an e-commerce site. The results revealed that our system accurately recommended pieces, including non-rated ones, from a wide diversity of artists, and maintained a high degree of accuracy even when new rating scores, users, and pieces were added. Cluster-based training, which can speed up model training a hundredfold, had the potential to improve the accuracy of recommendations. That is, we found a way to overcome the trade-off between accuracy and efficiency that had been considered unavoidable. In addition, we verified the importance of timbral and rhythmic features in making accurate recommendations.
Chapter 6 discusses the major contributions of this study to different research fields, particularly to music recommendation and music analysis. We also discuss issues that still remain to be resolved and future directions of research.
Chapter 7 concludes this thesis.
A Study on Developing Application Systems Based on Singing Understanding and Singing Expression (in Japanese)
Tomoyasu Nakano, University of Tsukuba, Tsukuba, Japan, March 2008.
Since the singing voice is one of the most familiar ways of expressing music, music information processing systems that utilize the singing voice are a promising research topic with a wide range of potential applications. The singing voice has been the subject of studies in various fields, including physiology, anatomy, acoustics, psychology, and engineering. Recently, research interest has been directed towards systems that support non-musician users, such as singing training assistance and music information retrieval using a hummed melody (query-by-humming) or singing voice timbre. Basic studies of the singing voice can thus be applied to broadening the scope of music information processing.
The aim of this research is to develop systems that enrich the relationship between humans and music, through the study of singing understanding and singing expression. The specific research themes treated in this thesis are the evaluation of singing skill, within the scope of singing understanding, and voice percussion recognition, within the scope of singing expression.
This thesis consists of two parts, corresponding to the two major topics of the research work. Part 1 deals with the study of human singing skill evaluation, as part of the broader research domain of understanding human singing. Part 2 deals with the study of voice percussion, as part of the study of singing expression.
In both studies, the approach and methodology follows what is called the HMI approach, which is a unification of three research approaches investigating the Human (H), Machine (M), and Interaction/Interface (I) aspects of singing.
Part 1: Singing understanding (singing skill)
Chapter 2 presents the results of two experiments on singing skill evaluation, where human subjects (raters) judge the subjective quality of previously unheard melodies (H domain). This serves as a preliminary basis for developing an automatic singing skill evaluation method for unknown melodies. Such an evaluation system can be a useful tool for improving singing skills, and can also be applied to broadening the scope of music information retrieval and singing voice synthesis. Previous research on singing skill evaluation for unknown melodies has focused on analyzing the characteristics of the singing voice, but its findings were not directly applied to automatic evaluation or compared with evaluations by human subjects.
The two experiments used the rank ordering method, where the subjects ordered a group of given stimuli according to their preference ratings. Experiment 1 was intended to explore the criteria that human subjects use in judging singing skill and the stability of their judgments, using unaccompanied singing sequences (solo singing) as the stimuli. Experiment 2 used F0 sequences (F0 singing) extracted from the solo singing and resynthesized as sinusoidal waves; it was intended to identify the contribution of F0 to the judgment. In experiment 1, six key features were extracted from the subjects' introspective reports as being significant for judging singing skill. The results of experiment 1 show that 88.9% of the correlations between the subjects' evaluations were significant at the 5% level. This drops to 48.6% in experiment 2, meaning that the contribution of F0 alone is relatively low, although the median ratings of stimuli evaluated as good were higher than those of stimuli evaluated as poor in all cases.
Human subjects can thus be seen to consistently evaluate singing skill for unknown melodies. This suggests that their evaluation utilizes easily discernible features which are independent of the particular singer or melody. The approach presented in Chapter 3 uses pitch interval accuracy and vibrato (intentional, periodic fluctuation of F0), which are independent of specific characteristics of the singer or melody (M domain). These features were tested in a 2-class (good/poor) classification test with 600 song sequences, achieving a classification rate of 83.5%.
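To make the two features concrete, the sketch below shows one plausible way to compute them from an F0 contour: pitch interval accuracy as the deviation from the nearest semitone of a relative equal-tempered grid, and vibrato as the fraction of F0 modulation power in a roughly 5-8 Hz band. These exact definitions are hedged assumptions for illustration, not the thesis's precise feature formulas.

```python
import numpy as np

def pitch_interval_accuracy(f0_hz):
    """Mean deviation (in semitones) of each frame's pitch from the
    nearest note of a relative semitone grid. The singer's own constant
    tuning offset is removed first, so the measure is key-independent;
    smaller values suggest more accurate intervals."""
    semitones = 12.0 * np.log2(np.asarray(f0_hz) / 440.0)
    offset = np.median(semitones - np.round(semitones))  # global tuning offset
    dev = semitones - offset
    return np.abs(dev - np.round(dev)).mean()

def vibrato_strength(f0_hz, frame_rate=100.0, band=(5.0, 8.0)):
    """Fraction of F0-contour modulation power falling in the typical
    vibrato band (periodic F0 fluctuation around 5-8 Hz, an assumed range)."""
    semitones = 12.0 * np.log2(np.asarray(f0_hz) / 440.0)
    x = semitones - semitones.mean()                 # remove the DC component
    spec = np.abs(np.fft.rfft(x)) ** 2               # modulation power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / frame_rate)
    total = spec[1:].sum() + 1e-12
    in_band = spec[(freqs >= band[0]) & (freqs <= band[1])].sum()
    return in_band / total
```

A simple good/poor classifier could then threshold (or learn a boundary over) these two scalar features per song sequence.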
Following the results of the subjective evaluation (H domain), MiruSinger, a singing skill visualization interface, was implemented (I domain). MiruSinger provides real-time visual feedback of the singing voice, focusing on the visualization of two key features: F0 (for pitch accuracy improvement) and vibrato sections (for singing technique improvement). Unlike previous systems, real-world music CD recordings are used as reference data. The F0 of the vocal part is estimated automatically from the CD recordings and can be further hand-corrected interactively using a graphical interface on the MiruSinger screen.
Part 2: Singing expression (voice percussion)
Voice percussion in our context is the mimicking of drum sounds by voice, expressed in a verbal form that can be transcribed into phonemic representation, or onomatopoeia (e.g. don-tan-do-do-tan). Chapter 5 describes a psychological experiment, the voice percussion expression experiment, which gathers data on how subjects express drum patterns (H domain). This serves as a preliminary basis for developing a voice percussion recognition method. Previous studies on query-by-humming focused on pitch detection and melodic feature extraction, but these features have less relevance to voice percussion recognition, which is primarily concerned with classification of timbre and identification of articulation methods. Methods for handling such features can be useful for music notation interfaces and have promising applications in widening the scope of music information retrieval.
A "drum pattern" in our context means a sequence of drum beats that forms a minimum unit (one measure). In this thesis, drum patterns consist of only two percussion instruments: bass drum (BD) and snare drum (SD). In the expression experiment, there were 17 subjects aged 19 to 31 (two with experience in percussion). The voice percussion sung by the subjects was recorded and analyzed. Significant findings from the expression experiment include: "the onomatopoeic expressions corresponded to the length and rhythmic patterns of the beats" and "some subjects verbally expressed rest notes".
Chapter 6 describes a voice percussion recognition method. The voice percussion is compared with all the patterns in a drum pattern database, and the pattern estimated to be acoustically closest to the voice percussion is selected as the recognition result (M domain). The search first matches drum patterns over onomatopoeic sequences, selecting the instrument sequences with the highest likelihood ratings, which are then checked for their onset timings. The pattern with the highest ranking is output as the final result. The recognition method was tested in recognition experiments over combinations of different settings of the acoustic model and the pronunciation dictionary. The following four conditions were evaluated.
(A) General acoustic model of speech
(B) Acoustic model tuned by voice percussion utterances not in evaluation data
(C) Acoustic model tuned to individual subjects
(D) Same acoustic model, with the pronunciation dictionary restricted to the expressions used by the subject
The recognition rates in the evaluation experiments were (A) 58.5%, (B) 58.5%, (C) 85.0%, and (D) 92.0%.
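A minimal sketch of the two-stage search (instrument-sequence matching first, onset-timing check second) could look like the following. The syllable-to-instrument mapping, the database layout, and the edit-distance plus mean-timing-error scoring are all simplified assumptions standing in for the thesis's likelihood-based method.

```python
import numpy as np

# Hypothetical onomatopoeia mapping: "don" -> bass drum, "tan" -> snare drum.
SYLLABLE = {"don": "BD", "tan": "SD"}

def edit_distance(a, b):
    """Levenshtein distance between two instrument sequences."""
    m, n = len(a), len(b)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + cost) # substitution
    return int(d[m, n])

def recognize(syllables, onsets, database, top_k=3):
    """Two-stage matching: (1) rank database patterns by instrument-sequence
    similarity to the uttered onomatopoeia; (2) re-rank the closest candidates
    by onset-timing deviation; return the best pattern's name."""
    seq = [SYLLABLE[s] for s in syllables]
    ranked = sorted(database, key=lambda p: edit_distance(seq, p["instruments"]))
    best, best_cost = None, float("inf")
    for p in ranked[:top_k]:
        if len(p["onsets"]) != len(onsets):
            continue  # timing check only applies to same-length candidates
        cost = float(np.abs(np.array(onsets) - np.array(p["onsets"])).mean())
        if cost < best_cost:
            best, best_cost = p["name"], cost
    return best if best is not None else ranked[0]["name"]
```

For example, an utterance "don-tan-don-tan" with roughly quarter-beat onsets would match a BD-SD-BD-SD pattern in the database even when the sung timings deviate slightly from the stored grid.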
Following the encouraging results of the proposed method as a practical tool for voice percussion recognition, a score input interface, Voice Drummer, was developed, as its application (I domain). Voice Drummer consists of a score input mode which is used for drum pattern input intended for use in composition, and an arrangement mode which edits drum patterns in a given music piece. There is also a practice/adaptation mode where the user can practice and adapt the system to his/her voice, thus increasing the recognition rate.
Part 1 presented the results of the subjective evaluation experiments, and proposed two acoustic features, pitch interval accuracy and vibrato, as key features for evaluating singing skill. The results of the subjective evaluation suggested that the singing skill evaluations of human listeners are generally consistent and in mutual agreement. In the classification experiment, the acoustic features were shown to be effective for evaluating singing skill without score information.
Part 2 presented the results of the voice percussion expression experiment, and presented a voice percussion recognition method. The onomatopoeic expressions utilized in the recognition experiment were extracted from the expression experiment. In the recognition experiment, the voice percussion recognition method achieved a recognition rate of 91.0% for the highest-tuned setting.
The results of these two studies were applied to the development of two application systems: MiruSinger for singing training assistance and Voice Drummer for percussion instrument notation. Trial usage of the systems suggests that both would be useful and enjoyable tools for average users.
The presented work can be seen as pioneering work in the fields of singing understanding and expression, contributing to the advance of singing voice research.