Perception and Modeling of Segment Boundaries in Popular Music
Michael J. Bruderer, Technische Universiteit Eindhoven, Eindhoven, Netherlands, 2008.
Music Recommendation and Discovery in the Long Tail
Óscar Celma, Universitat Pompeu Fabra, Barcelona, Spain, 2008.
[BibTex, Abstract, External Link]
Music consumption is biased towards a few popular artists. For instance, in 2007 only 1% of all digital tracks accounted for 80% of all sales. Similarly, 1,000 albums accounted for 50% of all album sales, and 80% of all albums sold were purchased fewer than 100 times. There is a need to help people filter, discover, personalise, and receive recommendations from the huge amount of music content available along the Long Tail.
Current music recommendation algorithms try to accurately predict what people want to listen to. However, quite often these algorithms tend to recommend popular music, or music already well known to the user, decreasing the effectiveness of the recommendations. These approaches focus on improving the accuracy of the recommendations; that is, they try to make accurate predictions about what a user might listen to or buy next, independently of how useful the provided recommendations are to the user.
In this Thesis we stress the importance of the user's perceived quality of the recommendations. We model the Long Tail curve of artist popularity to predict potentially interesting, unknown music hidden in the tail of the popularity curve. Effective recommendation systems should promote novel and relevant material (non-obvious recommendations), taken primarily from the tail of a popularity distribution.
The main contributions of this Thesis are: (i) a novel network-based approach for recommender systems, based on the analysis of the item (or user) similarity graph, and the popularity of the items, (ii) a user-centric evaluation that measures the user's relevance and novelty of the recommendations, and (iii) two prototype systems that implement the ideas derived from the theoretical work. Our findings have significant implications for recommender systems that assist users to explore the Long Tail, digging for content they might like.
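The head/tail concentration that motivates this work is easy to quantify. The sketch below is purely illustrative (the function name and the Zipf-like toy catalogue are assumptions for the example, not Celma's actual Long Tail model): it computes what fraction of a catalogue's most popular items accounts for a given share of total plays.

```python
def head_fraction(play_counts, share=0.8):
    """Fraction of items (most popular first) needed to reach
    `share` of total plays: a crude summary of how concentrated
    consumption is at the head of the Long Tail."""
    counts = sorted(play_counts, reverse=True)
    total = sum(counts)
    running, needed = 0, 0
    for c in counts:
        running += c
        needed += 1
        if running >= share * total:
            break
    return needed / len(counts)

# Zipf-like toy catalogue: the item at popularity rank r gets ~1000/r plays.
catalogue = [1000 // r for r in range(1, 1001)]
```

On this toy catalogue, well under a third of the items already cover 80% of the plays; the real sales figures quoted above are even more skewed.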
Automatic Transcription of Pitch Content in Music and Selected Applications
Matti Ryynänen, Tampere University of Technology, Tampere, Finland, December 2008.
[BibTex, Abstract, PDF]
Transcription of music refers to the analysis of a music signal in order to produce a parametric representation of the sounding notes in the signal. This is conventionally carried out by listening to a piece of music and writing down the symbols of common musical notation to represent the occurring notes in the piece. Automatic transcription of music refers to the extraction of such representations using signal-processing methods.
This thesis concerns the automatic transcription of pitched notes in musical audio and its applications. Emphasis is laid on the transcription of realistic polyphonic music, where multiple pitched and percussive instruments are sounding simultaneously. The methods included in this thesis are based on a framework which combines both low-level acoustic modeling and high-level musicological modeling. The emphasis in the acoustic modeling has been set to note events so that the methods produce discrete-pitch notes with onset times and durations as output. Such transcriptions can be efficiently represented as MIDI files, for example, and the transcriptions can be converted to common musical notation via temporal quantization of the note onsets and durations. The musicological model utilizes musical context and trained models of typical note sequences in the transcription process. Based on the framework, this thesis presents methods for generic polyphonic transcription, melody transcription, and bass line transcription. A method for chord transcription is also presented.
All the proposed methods have been extensively evaluated using realistic polyphonic music. In our evaluations with 91 half-a-minute music excerpts, the generic polyphonic transcription method correctly found 39% of all the pitched notes (recall), while 41% of the transcribed notes were correct (precision). Despite the seemingly low recognition rates in our simulations, this method was top-ranked in the polyphonic note tracking task in the international MIREX evaluations in 2007 and 2008. The methods for melody, bass line, and chord transcription were evaluated using hours of music, where an F-measure of 51% was achieved for both melodies and bass lines. The chord transcription method was evaluated using the first eight albums by The Beatles, and it produced correct frame-based labeling about 70% of the time.
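The recall, precision, and F-measure figures quoted here follow the standard note-event evaluation: a transcribed note counts as correct if it matches a reference note in pitch and, within a tolerance, in onset time. A minimal sketch, assuming greedy one-to-one matching and a hypothetical 50 ms onset tolerance (the actual MIREX protocol differs in its details):

```python
def note_scores(reference, transcribed, onset_tol=0.05):
    """Greedy one-to-one matching of (midi_pitch, onset_sec) note events.
    A transcribed note is a hit if some unmatched reference note has the
    same pitch and an onset within `onset_tol` seconds."""
    unmatched = list(reference)
    hits = 0
    for pitch, onset in transcribed:
        for ref in unmatched:
            if ref[0] == pitch and abs(ref[1] - onset) <= onset_tol:
                unmatched.remove(ref)
                hits += 1
                break
    precision = hits / len(transcribed) if transcribed else 0.0
    recall = hits / len(reference) if reference else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f
```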
The transcriptions are not only useful as human-readable musical notation but also in several other application areas, including music information retrieval and content-based audio modification. This is demonstrated by two applications included in this thesis. The first application is a query-by-humming system which is capable of searching for melodies similar to a user query directly from commercial music recordings. In our evaluation with a database of 427 full commercial audio recordings, the method retrieved the correct recording in the top-three list for 58% of the 159 hummed queries. The method was also top-ranked in the “query by singing/humming” task in MIREX 2008 for a database of 2048 MIDI melodies and 2797 queries. The second application uses automatic melody transcription for accompaniment and vocals separation. The transcription also enables tuning the user's singing to the original melody in a novel karaoke application.
Modeling musical anticipation: From the time of music to the music of time
Arshia Cont, University of California, San Diego, CA, USA, October 2008.
[BibTex, Abstract, External Link]
This thesis studies musical anticipation, both as a cognitive process and design principle for applications in music information retrieval and computer music. For this study, we reverse the problem of modeling anticipation addressed mostly in music cognition literature for the study of musical behavior, to anticipatory modeling, a cognitive design principle for modeling artificial systems. We propose anticipatory models and applications concerning three main preoccupations of expectation: What to expect?, How to expect?, and When to expect?
For the first question, we introduce a mathematical framework for music information geometry combining information theory, differential geometry, and statistical learning, with the aim of representing information content and gaining access to music structures. The second question is addressed as a machine learning planning problem in an environment, where interactive learning methods are employed on parallel agents to learn anticipatory profiles of actions to be used for decision making. To address the third question, we provide a novel anticipatory design for the problem of synchronizing a live performer to a pre-written music score, leading to Antescofo, a preliminary tool for the writing of time and interaction in computer music. Despite the variety of topics present in this thesis, the anticipatory design concept is common to all propositions, with the following premises: that an anticipatory design can reduce the structural and computational complexity of modeling, and can help address complex problems in computational aesthetics and, most importantly, computer music.
Novel Techniques for Audio Music Classification and Search
Kris West, University of East Anglia, UK, September 2008.
[BibTex, Abstract, PDF]
This thesis presents a number of modified or novel techniques for the analysis of music audio for the purposes of classifying it according to genre or implementing so-called ‘search-by-example’ systems, which recommend music to users and generate playlists and personalised radio stations. Novel procedures for the parameterisation of music audio are introduced, including an audio event-based segmentation of the audio feature streams and methods of encoding rhythmic information in the audio signal. A large number of experiments are performed to estimate the performance of different classification algorithms when applied to the classification of various sets of music audio features. The experiments show differing trends regarding the best-performing type of classification procedure to use for different feature sets and segmentations of feature streams.
A novel machine learning algorithm (MVCART), based on the classic Decision Tree algorithm (CART), is introduced to more effectively deal with multi-variate audio features and the additional challenges introduced by event-based segmentation of audio feature streams. This algorithm achieves the best results on the classification of event-based music audio features and approaches the performance of state-of-the-art techniques based on summaries of the whole audio stream.
Finally, a number of methods of extending music classifiers, including those based on event-based segmentations and the MVCART algorithm, to build music similarity estimation and search procedures are introduced. Conventional methods of audio music search are based solely on music audio profiles, whereas the methods introduced allow audio music search and recommendation indices to utilise cultural information (in the form of music genres) to enhance or scale their recommendations, without requiring this information to be present for every track. These methods are shown to yield very significant reductions in computational complexity over existing techniques (such as those based on the KL-Divergence) whilst providing a comparable or greater level of performance. Not only does the significantly reduced complexity of these techniques allow them to be applied to much larger collections than the KL-Divergence, but they also produce metric similarity spaces, allowing the use of standard techniques for the scaling of metric search spaces.
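For context, the KL-divergence baseline referred to above is typically a symmetrised Kullback-Leibler divergence between Gaussian models of each track's frame-level features. A sketch for the diagonal-covariance case (the diagonal-Gaussian parameterisation and the function name are assumptions for this example, not the thesis's exact formulation):

```python
import math

def sym_kl_diag_gauss(mu1, var1, mu2, var2):
    """Symmetrised KL divergence between two diagonal-covariance
    Gaussians, each given as (mean vector, variance vector): a classic
    audio-similarity baseline."""
    def kl(ma, va, mb, vb):
        return 0.5 * sum(math.log(vb_i / va_i) + (va_i + (ma_i - mb_i) ** 2) / vb_i - 1
                         for ma_i, va_i, mb_i, vb_i in zip(ma, va, mb, vb))
    return kl(mu1, var1, mu2, var2) + kl(mu2, var2, mu1, var1)
```

Note that this quantity violates the triangle inequality, so it does not induce a metric space, which is one of the scaling drawbacks the methods above avoid.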
Cross-Domain Content-Based Retrieval of Audio Music through Transcription
Iman S. H. Suyoto, Royal Melbourne Institute of Technology (RMIT), Melbourne, September 2008.
[BibTex, Abstract, External Link]
Research in the field of music information retrieval (MIR) is concerned with methods to effectively retrieve a piece of music based on a user's query. An important goal in MIR research is the ability to successfully retrieve music stored as recorded audio using note-based queries.
In this work, we consider the searching of musical audio using symbolic queries. We first examined the effectiveness of using a relative pitch approach to represent queries and pieces. Our experimental results revealed that this technique, while effective, is optimal when the whole tune is used as a query. We then suggested an algorithm involving the use of pitch classes in conjunction with the longest common subsequence algorithm between a query and target, also using the whole tune as a query. We also proposed an algorithm that works effectively when only a small part of a tune is used as a query. The algorithm makes use of a sliding window in addition to pitch classes and the longest common subsequence algorithm between a query and target. We examined the algorithm using queries based on the beginning, middle, and ending parts of pieces.
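The matching machinery described here, pitch classes combined with the longest common subsequence and a query-length sliding window, can be sketched as follows (illustrative only; the function names and the normalisation by query length are assumptions for this example):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences,
    via the standard O(len(a) * len(b)) dynamic programme."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def best_window_score(query_pcs, target_pcs):
    """Slide a query-length window over the target's pitch-class
    sequence and keep the best LCS match, normalised by query length."""
    w = len(query_pcs)
    best = 0
    for start in range(max(1, len(target_pcs) - w + 1)):
        best = max(best, lcs_length(query_pcs, target_pcs[start:start + w]))
    return best / w
```

For example, best_window_score([0, 4, 7, 0], [2, 2, 0, 4, 7, 0, 5, 5]) returns 1.0: the query is found intact in the middle of the target, mirroring the middle-of-piece queries examined above.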
We performed experiments on an audio collection and manually-constructed symbolic queries. Our experimental evaluation revealed that our techniques are highly effective, with most queries used in our experiments being able to retrieve a correct answer in the first rank position.
In addition, we examined the effectiveness of duration-based features for improving retrieval effectiveness over the use of pitch only. We investigated note durations and inter-onset intervals. For this purpose, we used solely symbolic music so that we could focus on the core of the problem. A relative pitch approach alongside a relative duration representation were used in our experiments. Our experimental results showed that durations fail to significantly improve retrieval effectiveness, whereas inter-onset intervals significantly improve retrieval effectiveness.
From Sparse Models to Timbre Learning: New Methods for Musical Source Separation
Juan José Burred, Technical University of Berlin, Berlin, Germany, September 2008.
[BibTex, Abstract, External Link]
The goal of source separation is to detect and extract the individual signals present in a mixture. Its application to sound signals and, in particular, to music signals, is of interest for content analysis and retrieval applications arising in the context of online music services. Other applications include unmixing and remixing for post-production, restoration of old recordings, object-based audio compression and upmixing to multichannel setups.
This work addresses the task of source separation from monaural and stereophonic linear musical mixtures. In both cases, the problem is underdetermined, meaning that there are more sources to separate than channels in the observed mixture. This requires taking strong statistical assumptions and/or learning a priori information about the sources in order for a solution to be feasible. On the other hand, constraining the analysis to instrumental music signals allows exploiting specific cues such as spectral and temporal smoothness, note-based segmentation and timbre similarity for the detection and extraction of sound events.
The statistical assumptions and, if present, the a priori information, are both captured by a given source model that can greatly vary in complexity and extent of application. The approach used here is to consider source models of increasing levels of complexity, and to study their implications on the separation algorithm.
The starting point is sparsity-based separation, which makes the general assumption that the sources can be represented in a transformed domain with few high-energy coefficients. It will be shown that sparsity, and consequently separation, can both be improved by using nonuniform-resolution time-frequency representations. To that end, several types of frequency-warped filter banks will be used as signal front-ends in conjunction with an unsupervised stereo separation approach.
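The sparsity assumption can be made concrete with a toy binary-masking example: if each transform coefficient is dominated by a single source, assigning every coefficient of the mixture to its strongest source recovers the sources almost exactly. The sketch below assumes the per-source energy templates are known, which a real unsupervised system must of course estimate; it is an illustration of the principle, not of the thesis's algorithms.

```python
def binary_mask_separate(mixture_coeffs, source_templates):
    """Toy sparsity-based separation: each transform coefficient is
    assigned entirely to the source whose (assumed known) energy
    template is strongest at that coefficient."""
    estimates = [[0.0] * len(mixture_coeffs) for _ in source_templates]
    for i, c in enumerate(mixture_coeffs):
        owner = max(range(len(source_templates)),
                    key=lambda s: source_templates[s][i])
        estimates[owner][i] = c
    return estimates
```

The better the transform concentrates each source into few high-energy coefficients, the closer this disjoint-support idealisation holds, which is why nonuniform-resolution front-ends can improve separation.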
As a next step, more sophisticated models based on sinusoidal modeling and statistical training will be considered in order to improve separation and to allow the consideration of the maximally underdetermined problem: separation from single-channel signals. An emphasis is given in this work to a detailed but compact approach to train models of the timbre of musical instruments. An important characteristic of the approach is that it aims at a close description of the temporal evolution of the spectral envelope. The proposed method uses a formant-preserving, dimension-reduced representation of the spectral envelope based on spectral interpolation and Principal Component Analysis. It then describes the timbre of a given instrument as a Gaussian Process that can be interpreted either as a prototype curve in a timbral space or as a time-frequency template in the spectral domain.
A monaural separation method based on sinusoidal modeling and on the mentioned timbre modeling approach will be presented. It exploits common-fate and good-continuation cues to extract groups of sinusoidal tracks corresponding to the individual notes. Each group is compared to each one of the timbre templates on the database using a specially-designed measure of timbre similarity, followed by a Maximum Likelihood decision. Subsequently, overlapping and missing parts of the sinusoidal tracks are retrieved by interpolating the selected timbre template. The method is later extended to stereo mixtures by using a preliminary spatial-based blind separation stage, followed by a set of refinements performed by the above sinusoidal modeling and timbre matching methods and aiming at reducing interferences with the undesired sources.
A notable characteristic of the proposed separation methods is that they do not assume harmonicity, and are thus not based on a previous multipitch estimation stage, nor on the input of detailed pitch-related information. Instead, grouping and separation relies solely on the dynamic behavior of the amplitudes of the partials. This also allows separating highly inharmonic sounds and extracting chords played by a single instrument as individual entities.
The fact that the presented approaches are supervised and based on classification and similarity allows using them (or parts thereof) for other content analysis applications. In particular the use of the timbre models, and the timbre matching stages of the separation systems will be evaluated in the tasks of musical instrument classification and detection of instruments in polyphonic mixtures.
Real Time Automatic Harmonisation (in French)
Giordano Cabra, University of Paris 6, Paris, France, July 2008.
We define real-time automatic harmonization systems (HATR, from the French acronym) as computer programs that accompany a possibly improvised melody by finding an appropriate harmony to be applied to a rhythmic pattern. In a real-time harmonization situation, besides performance and scope constraints, the system and the user are in symbiosis. Consequently, the system incorporates elements of accompaniment, composition, and improvisation, and remains a challenging project, not only because of its complexity but also because of the lack of solid references in the scientific literature.
In this work, we propose some extensions to techniques developed in the recent past by the music information retrieval (MIR) community, in order to create programs able to work directly with audio signals. We have performed a series of experiments, which allowed us to systematize the main parameters involved in the development of such systems. This systematization led us to the construction of a HATR framework to explore possible solutions, instead of individual applications.
We compared the applications implemented with this framework using an objective measure as well as a human subjective evaluation. This thesis presents the pros and cons of each solution, and estimates its musical level by comparing it to real musicians. The results of these experiments show that a simple solution may outperform complex ones. Further experiments were made to test the robustness and scalability of the framework solutions.
Finally, the technology we constructed has been tested in novel situations, in order to explore possibilities of future work.
Automatically Extracting, Analyzing, and Visualizing Information on Music Artists from the World Wide Web
Markus Schedl, Johannes Kepler University, Linz, Austria, July 2008.
[BibTex, Abstract, PDF]
In the context of this PhD thesis, methods for automatically extracting music-related information from the World Wide Web have been elaborated, implemented, and analyzed. Such information is becoming more and more important in times of digital music distribution via the Internet as users of online music stores nowadays expect to be offered additional music-related information beyond the pure digital music file. Novel techniques have been developed as well as existing ones refined in order to gather information about music artists and bands from the Web. These techniques are related to the research fields of music information retrieval, Web mining, and information visualization. More precisely, on sets of Web pages that are related to a music artist or band, Web content mining techniques are applied to address the following categories of information:
- similarities between music artists or bands
- prototypicality of an artist or a band for a genre
- descriptive properties of an artist or a band
- band members and instrumentation
- images of album cover artwork
Different approaches to retrieve the corresponding pieces of information for each of these categories have been elaborated and evaluated thoroughly on a considerable variety of music repositories. The results and main findings of these assessments are reported. Moreover, visualization methods and user interaction models for prototypical and similar artists as well as for descriptive terms have evolved from this work.
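As a toy illustration of Web-based artist similarity (not Schedl's actual measures, which are considerably more elaborate), relatedness can be estimated from page co-occurrence counts in Jaccard style; the dictionary layout and function name here are assumptions for the example:

```python
def artist_similarity(pages_with, pages_with_both, a, b):
    """Toy Web-co-occurrence similarity: the fraction of pages
    mentioning either artist that mention both.  The counts would
    come from a search engine index or a Web crawl."""
    both = pages_with_both[(a, b)]
    union = pages_with[a] + pages_with[b] - both
    return both / union if union else 0.0
```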
Based on the insights gained by the various experiments and evaluations conducted, the core application of this thesis, the "Automatically Generated Music Information System" (AGMIS), was built. AGMIS demonstrates the applicability of the elaborated techniques on a large collection of more than 600,000 artists by providing a Web-based user interface to access a database that has been populated automatically with the extracted information.
Although AGMIS does not always give perfectly accurate results, the automatic approaches to information retrieval have some advantages in comparison with those employed in existing music information systems, which are either based on labor-intensive information processing by music experts or on community knowledge that is vulnerable to distortion of information.
A System for Acoustic Chord Transcription and Key Extraction from Audio Using Hidden Markov Models Trained on Synthesized Audio
Kyogu Lee, Stanford University, CA, USA, March 2008.
[BibTex, Abstract, PDF]
Extracting high-level information of musical attributes such as melody, harmony, key, or rhythm from the raw waveform is a critical process in Music Information Retrieval (MIR) systems. Using one or more of such features in a front end, one can efficiently and effectively search, retrieve, and navigate through a large collection of musical audio. Among those musical attributes, harmony is a key element in Western tonal music. Harmony can be characterized by a set of rules stating how simultaneously sounding (or inferred) tones create a single entity (commonly known as a chord), how the elements of adjacent chords interact melodically, and how sequences of chords relate to one another in a functional hierarchy. Patterns of chord changes over time allow for the delineation of structural features such as phrases, sections and movements. In addition to structural segmentation, harmony often plays a crucial role in projecting emotion and mood.
This dissertation focuses on two aspects of harmony, chord labeling and chord progressions in diatonic functional tonal music. Recognizing the musical chords from the raw audio is a challenging task. In this dissertation, a system that accomplishes this goal using hidden Markov models is described.
In order to avoid the enormously time-consuming and laborious process of manual annotation, which must be done in advance to provide the ground-truth to the supervised learning models, symbolic data like MIDI files are used to obtain a large amount of labeled training data. To this end, harmonic analysis is first performed on noise-free symbolic data to obtain chord labels with precise time boundaries. In parallel, a sample-based synthesizer is used to create audio files from the same symbolic files. The feature vectors extracted from synthesized audio are in perfect alignment with the chord labels, and are used to train the models.
Sufficient training data allows for key- or genre-specific models, where each model is trained on music of a specific key or genre to estimate key- or genre-dependent model parameters. In other words, music of a certain key or genre reveals its own characteristics reflected by chord progression, which result in unique model parameters represented by the transition probability matrix. In order to extract key or identify genre, when the observation input sequence is given, the forward-backward or Baum-Welch algorithm is used to efficiently compute the likelihood of the models, and the model with the maximum likelihood gives the key or genre information. Then the Viterbi decoder is applied to the corresponding model to extract the optimal state path in a maximum-likelihood sense, which is identical to the frame-level chord sequence. The experimental results show that the proposed system not only yields chord recognition performance comparable to or better than other previously published systems, but also provides additional information about key and/or genre without using any other algorithms or feature sets for those tasks. It is also demonstrated that the chord sequence with precise timing information can be successfully used to find cover songs from audio and to detect musical phrase boundaries by recognizing cadences or harmonic closures.
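The decoding step described above is the standard Viterbi algorithm. A minimal log-domain sketch over generic chord states (a generic HMM illustration; the thesis's actual emission models are trained on synthesized audio and its states are chord labels):

```python
def viterbi(obs_loglik, log_trans, log_init):
    """Maximum-likelihood state path through an HMM.
    obs_loglik[t][s]: log-likelihood of frame t under state s;
    log_trans[p][s]:  log transition probability p -> s;
    log_init[s]:      log initial probability of state s."""
    n_states = len(log_init)
    delta = [log_init[s] + obs_loglik[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(obs_loglik)):
        back.append([])
        new = []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: delta[p] + log_trans[p][s])
            back[-1].append(best_prev)
            new.append(delta[best_prev] + log_trans[best_prev][s] + obs_loglik[t][s])
        delta = new
    # Backtrack from the best final state.
    path = [max(range(n_states), key=lambda s: delta[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With sticky transitions (high self-transition log-probability) the decoder smooths over noisy frames, which is exactly why frame-level chord labeling benefits from the HMM formulation.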
This dissertation makes a substantial contribution to the music information retrieval community in many aspects. First, it presents a probabilistic framework that combines two closely related musical tasks — chord recognition and key extraction from audio — and achieves state-of-the-art performance in both applications. Second, it suggests a solution to a bottleneck problem in machine learning approaches by demonstrating a method of automatically generating a large amount of labeled training data from symbolic music documents. This will help free researchers from the laborious task of manual annotation. Third, it makes use of a more efficient and robust feature vector called the tonal centroid and proves, via a thorough quantitative evaluation, that it consistently outperforms the conventional chroma feature, which was almost exclusively used by other algorithms. Fourth, it demonstrates that the basic model can easily be extended to key- or genre-specific models, not only to improve chord recognition but also to estimate key or genre. Lastly, it demonstrates the usefulness of the recognized chord sequence in several practical applications such as cover song finding and structural music segmentation.
Computer-Based Music Theory and Acoustics
Matt Wright, Stanford University, CA, USA, March 2008.
[BibTex, Abstract, External Link]
A musical event’s Perceptual Attack Time (“PAT”) is its perceived moment of rhythmic placement; in general it is after physical or perceptual onset. If two or more events sound like they occur rhythmically together it is because their PATs occur at the same time, and the perceived rhythm of a sequence of events is the timing pattern of the PATs of those events. A quantitative model of PAT is useful for the synthesis of rhythmic sequences with a desired perceived timing as well as for computer-assisted rhythmic analysis of recorded music. Musicians do not learn to make their notes' physical onsets have a certain rhythm; rather, they learn to make their notes' perceptual attack times have a certain rhythm.
PAT is notoriously difficult to measure, because all known methods can measure a test sound’s PAT only in relationship to a physical action or to a second sound, both of which add their own uncertainty to the measurements. A novel aspect of this work is the use of the ideal impulse (the shortest possible digital audio signal) as a reference sound. Although the ideal impulse is the best possible reference in the sense of being perfectly isolated in time and having a very clear and percussive attack, it is quite difficult to use as a reference for most sounds because it has a perfectly broad frequency spectrum, and it is more difficult to perceive the relative timing of sounds when their spectra differ greatly. This motivates another novel contribution of this work, Spectrally Matched Click Synthesis, the creation of arbitrarily short duration clicks whose magnitude frequency spectra approximate those of arbitrary input sounds.
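One simple way to realise a click whose magnitude spectrum matches an input sound is zero-phase resynthesis: keep the magnitude spectrum, discard the phase, and invert the transform, so that all components peak together at t = 0 and the result is maximally impulse-like for that spectrum. This is a sketch of the idea, not necessarily the thesis's exact synthesis procedure:

```python
import cmath, math

def spectrally_matched_click(x):
    """Zero-phase click: take the magnitude spectrum of x, discard
    the phase, and inverse-DFT.  The result is an impulse-like signal
    whose magnitude spectrum equals that of x.  Naive O(N^2) DFT,
    fine for short frames."""
    n = len(x)
    mags = [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]
    return [sum(mags[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]
```

For real use one would replace the naive loops with an FFT; the naive form is kept here only for self-containment.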
All existing models represent the PAT of each event as a single instant. However, there is often a range of values that sound equally correct when aligning sounds rhythmically, and this range depends on perceptual characteristics of the specific sounds, such as the sharpness of their attacks. Therefore this work represents each event’s PAT as a continuous probability density function indicating how likely a typical listener would be to hear the sound’s PAT at each possible time. The methodological problem of deriving each sound’s own PAT from measurements comparing pairs of sounds therefore becomes the problem of estimating the distributions of the random variables for each sound’s intrinsic PAT, given only observations of a random variable corresponding to the difference between the intrinsic PAT distributions for the two sounds plus noise. The methods presented to address this draw from maximum likelihood estimation and the graph-theoretical shortest-path problem.
This work describes an online listening test, in which subjects download software that presents a series of PAT measurement trials and allows them to adjust their relative timing until they sound synchronous. This establishes perceptual “ground truth” for the PAT of a collection of 20 sounds compared against each other in various combinations. As hoped, subjects were indeed able to align a sound more reliably to one of that sound’s spectrally matched clicks than to other sounds of the same duration.
The representation of PAT with probability density functions provides a new perspective on the long-standing problem of predicting PAT directly from acoustical signals. Rather than choosing a single moment for PAT given a segment of sound known a priori to contain a single musical event, these regression methods estimate continuous shapes of PAT distributions from continuous (not necessarily presegmented) audio signals, formulated as a supervised machine learning regression problem whose inputs are DSP functions computed from the sound, the detection functions used in the automatic onset detection literature. This work concludes with some preliminary musical applications of the resulting models.
Studies on Hybrid Music Recommendation Using Timbral and Rhythmic Features
Kazuyoshi Yoshii, Kyoto University, Kyoto, Japan, March 2008.
[BibTex, Abstract, PDF]
The importance of music recommender systems is increasing because most online services that manage large music collections cannot provide users with entirely satisfactory access to their collections. Many users of music streaming services want to discover “unknown” pieces that truly match their musical tastes even if these pieces are scarcely known by other users. Recommender systems should thus be able to select musical pieces that will likely be preferred by estimating user tastes. To develop a satisfactory system, it is essential to take into account the contents and ratings of musical pieces. Note that the content should be automatically analyzed from multiple viewpoints such as timbral and rhythmic aspects whereas the ratings are provided by users.
We deal with hybrid music recommendation in this thesis based on users’ ratings and timbral and rhythmic features extracted from polyphonic musical audio signals. Our goal was to simultaneously satisfy six requirements: (1) accuracy, (2) diversity, (3) coverage, (4) promptness, (5) adaptability, and (6) scalability. To achieve this, we focused on a model-based hybrid filtering method that has been proposed in the field of document recommendation. This method can be used to make accurate and prompt recommendations with wide coverage and diversity in musical pieces, using a probabilistic generative model that unifies both content-based and rating-based data in a statistical way.
To apply this method to enable satisfying music recommendation, we tackled four issues: (i) lack of adaptability, (ii) lack of scalability, (iii) no capabilities for using musical features, and (iv) no flexibility for integrating multiple aspects of music. To solve issue (i), we propose an incremental training method that partially updates the model to promptly reflect partial changes in the data (addition of rating scores and registration of new users and pieces) instead of training the entire model from scratch. To solve issue (ii), we propose a cluster-based training method that efficiently constructs the model at a fixed computational cost regardless of the numbers of users and pieces. To solve issue (iii), we propose a bag-of-features model that represents the time-series features of a musical piece as a set of existence probabilities of predefined features. To solve issue (iv), we propose a flexible method that integrates the musical features of timbral and rhythmic aspects into bag-of-features representations.
In Chapter 3, we first explain the model-based method of hybrid filtering that takes into account both rating-based and content-based data, i.e., rating scores awarded by users and the musical features of audio signals. The probabilistic model can be used to formulate a generative mechanism that is assumed to lie behind the observed data from the viewpoint of probability theory. We then present incremental training and its application to cluster-based training. The model formulation enables us to incrementally update the partial parameters of the model according to the increase in observed data. Cluster-based training initially builds a compact model called a core model for fixed numbers of representative users and pieces, which are the centroids of clusters of similar users and pieces. To obtain the complete model, the core model is then extended by registering all users and pieces with incremental training. Finally, we describe the bag-of-features model to enable hybrid filtering to deal with musical features extracted from polyphonic audio signals. To capture the timbral aspects of music, we created a model for the distribution of Mel frequency cepstral coefficients (MFCCs). To also take into account the rhythmic aspects of music, we effectively combined rhythmic features based on drum-sound onsets with timbral features (MFCCs) by using principal component analysis (PCA). The onsets of drum sounds were automatically obtained as described in the next chapter.
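As a rough illustration of the two ingredients described above, the following sketch shows a bag-of-features summary (per-frame soft assignment to a codebook of predefined features, averaged over time into existence probabilities) and a PCA-based fusion of timbral and rhythmic frame features. This is a hedged sketch only: the soft-assignment scheme, the codebook, and all function and parameter names (`bag_of_features`, `pca_fuse`, `temperature`) are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def pca_fuse(timbral, rhythmic, n_components=2):
    """Concatenate timbral (e.g. MFCC) and rhythmic frame features,
    then project onto the top principal components to obtain fused
    per-frame features (illustrative PCA via SVD)."""
    X = np.hstack([timbral, rhythmic])      # (T, D_timbral + D_rhythmic)
    X = X - X.mean(axis=0)                  # center before PCA
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_components].T          # (T, n_components)

def bag_of_features(frames, codebook, temperature=1.0):
    """Represent a time series of feature vectors as existence
    probabilities of predefined prototype features.

    frames:   (T, D) per-frame feature vectors
    codebook: (K, D) predefined prototypes
    Returns a K-dim probability vector: how strongly each prototype
    "exists" in the piece, averaged over all frames."""
    # squared Euclidean distance from every frame to every prototype
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # softmax over prototypes -> per-frame assignment probabilities
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return p.mean(axis=0)                   # fixed-length piece summary
```

The fixed-length output of `bag_of_features` is what makes the time-series audio features usable inside a probabilistic model alongside rating data.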
Chapter 4 describes a system that detects onsets of the bass drum, snare drum, and hi-hat cymbals in polyphonic audio signals. The system takes a template-matching-based approach that uses the power spectrograms of drum sounds as templates. Two problems arise: first, no single template is appropriate for every song; second, it is difficult to detect drum-sound onsets in mixtures containing various other sounds. To solve these, we propose two methods: template adaptation and harmonic-structure suppression. First, an initial template for each drum sound (a seed template) is prepared. Template adaptation then adapts it to the actual drum-sound spectrograms appearing in the song spectrogram. To make the system robust against harmonic sounds overlapping drum sounds, harmonic-structure suppression attenuates harmonic components in the song spectrogram. Experimental results with 70 popular songs demonstrated that our methods improved recognition accuracy, achieving 83%, 58%, and 46% in detecting the onsets of the bass drum, snare drum, and hi-hat cymbals, respectively.
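A toy sketch of the two building blocks, harmonic suppression followed by spectrogram template matching, is given below. The frequency-axis median filter is a generic stand-in for the thesis's harmonic-structure suppression (narrow tonal peaks are removed while broadband drum energy survives), and all names and parameters are hypothetical.

```python
import numpy as np

def suppress_harmonics(spec, kernel=17):
    """Attenuate harmonic (tonal) components in a power spectrogram.

    Tonal partials appear as narrow peaks along the frequency axis,
    while drum hits are broadband; a running median across frequency
    keeps the broadband floor and discards narrow harmonic peaks.
    (Illustrative stand-in, not the thesis's method.)"""
    F, _ = spec.shape
    half = kernel // 2
    out = np.empty_like(spec)
    for f in range(F):
        lo, hi = max(0, f - half), min(F, f + half + 1)
        out[f] = np.median(spec[lo:hi], axis=0)   # median over a frequency band
    return np.minimum(spec, out)                   # only ever attenuate

def match_template(spec, template):
    """Slide a drum-sound spectrogram template over the song spectrogram
    and return a per-frame normalized-correlation matching score."""
    F, T = spec.shape
    Ft, Tt = template.shape
    assert Ft == F, "template must cover the same frequency bins"
    tnorm = np.linalg.norm(template) + 1e-12
    scores = np.zeros(T - Tt + 1)
    for t in range(T - Tt + 1):
        seg = spec[:, t:t + Tt]
        scores[t] = (seg * template).sum() / (np.linalg.norm(seg) * tnorm + 1e-12)
    return scores
```

Peaks in the returned score curve would then be picked as candidate drum onsets; template adaptation would iteratively re-estimate `template` from the best-matching segments of the song itself.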
In Chapter 5, we discuss the evaluation of our system using audio signals from commercial CDs and their corresponding rating scores obtained from an e-commerce site. The results revealed that our system accurately recommended pieces, including non-rated ones, from a wide diversity of artists, and maintained a high degree of accuracy even when new rating scores, users, and pieces were added. Cluster-based training, which can speed up model training a hundredfold, had the potential to improve the accuracy of recommendations. That is, we found a way to overcome the trade-off between accuracy and efficiency that had been considered unavoidable. In addition, we verified the importance of timbral and rhythmic features in making accurate recommendations.
Chapter 6 discusses the major contributions of this study to different research fields, particularly to music recommendation and music analysis. We also discuss issues that still remain to be resolved and future directions of research.
Chapter 7 concludes this thesis.
A Study on Developing Application Systems Based on Singing Understanding and Singing Expression (in Japanese)
Tomoyasu Nakano, University of Tsukuba, Tsukuba, Japan, March 2008.
Since the singing voice is one of the most familiar ways of expressing music, music information processing systems that utilize the singing voice are a promising research topic with a wide range of potential applications. The singing voice has been the subject of studies in various fields, including physiology, anatomy, acoustics, psychology, and engineering. Recently, research interest has been directed towards systems that support non-musician users, such as singing training assistance and music information retrieval using a hummed melody (query-by-humming) or singing voice timbre. Basic studies of the singing voice can thus be applied to broadening the scope of music information processing.
The aim of this research is to develop systems that enrich the relationship between humans and music, through the study of singing understanding and singing expression. The specific research themes treated in this thesis are the evaluation of singing skill, within the scope of singing understanding, and voice percussion recognition, within the scope of singing expression.
This thesis consists of two parts, corresponding to the two major topics of the research work. Part 1 deals with the study of human singing skill evaluation, as part of the broader research domain of understanding human singing. Part 2 deals with the study of voice percussion, as part of the study of singing expression.
In both studies, the approach and methodology follows what is called the HMI approach, which is a unification of three research approaches investigating the Human (H), Machine (M), and Interaction/Interface (I) aspects of singing.
Part 1: Singing understanding (singing skill)
Chapter 2 presents the results of two experiments on singing skill evaluation, where human subjects (raters) judge the subjective quality of previously unheard melodies (H domain). This serves as a preliminary basis for developing an automatic singing skill evaluation method for unknown melodies. Such an evaluation system can be a useful tool for improving singing skills, and can also be applied to broadening the scope of music information retrieval and singing voice synthesis. Previous research on singing skill evaluation for unknown melodies has focused on analyzing the characteristics of the singing voice, but its findings were not directly applied to automatic evaluation or compared with evaluations by human subjects.
The two experiments used the rank ordering method, where the subjects ordered a group of given stimuli according to their preference ratings. Experiment 1 was intended to explore the criteria that human subjects use in judging singing skill and the stability of their judgments, using unaccompanied singing sequences (solo singing) as the stimuli. Experiment 2 used F0 sequences (F0 singing) extracted from the solo singing and resynthesized as sinusoidal waves; it was intended to identify the contribution of F0 to the judgment. In experiment 1, six key features were extracted from the subjects' introspective reports as being significant for judging singing skill. The results of experiment 1 show that 88.9% of the correlations between the subjects' evaluations were significant at the 5% level. This drops to 48.6% in experiment 2, meaning that the contribution of F0 alone is relatively low, although the median ratings of stimuli evaluated as good were higher than those of stimuli evaluated as poor in all cases.
Human subjects can thus be seen to consistently evaluate singing skill for unknown melodies. This suggests that their evaluation utilizes easily discernible features which are independent of the particular singer or melody. The approach presented in Chapter 3 uses pitch interval accuracy and vibrato (intentional, periodic fluctuation of F0), which are independent of specific characteristics of the singer or melody (M domain). These features were tested in a 2-class (good/poor) classification test with 600 song sequences, achieving a classification rate of 83.5%.
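To make the two features concrete, the sketch below shows one plausible way to compute them from an F0 contour: pitch interval accuracy as the deviation from the nearest semitone of a relative equal-tempered grid, and vibrato as the fraction of F0 modulation power in a roughly 5-8 Hz band. These exact definitions are hedged assumptions for illustration, not the thesis's precise feature formulas.

```python
import numpy as np

def pitch_interval_accuracy(f0_hz):
    """Mean deviation (in semitones) of each frame's pitch from the
    nearest note of a relative semitone grid. The singer's own constant
    tuning offset is removed first, so the measure is key-independent;
    smaller values suggest more accurate intervals."""
    semitones = 12.0 * np.log2(np.asarray(f0_hz) / 440.0)
    offset = np.median(semitones - np.round(semitones))  # global tuning offset
    dev = semitones - offset
    return np.abs(dev - np.round(dev)).mean()

def vibrato_strength(f0_hz, frame_rate=100.0, band=(5.0, 8.0)):
    """Fraction of F0-contour modulation power falling in the typical
    vibrato band (periodic F0 fluctuation around 5-8 Hz, an assumed range)."""
    semitones = 12.0 * np.log2(np.asarray(f0_hz) / 440.0)
    x = semitones - semitones.mean()                 # remove the DC component
    spec = np.abs(np.fft.rfft(x)) ** 2               # modulation power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / frame_rate)
    total = spec[1:].sum() + 1e-12
    in_band = spec[(freqs >= band[0]) & (freqs <= band[1])].sum()
    return in_band / total
```

A simple good/poor classifier could then threshold (or learn a boundary over) these two scalar features per song sequence.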
Following the results of the subjective evaluation (H domain), MiruSinger, a singing skill visualization interface, was implemented (I domain). MiruSinger provides real-time visual feedback of the singing voice, focusing on the visualization of two key features: F0 (for pitch accuracy improvement) and vibrato sections (for singing technique improvement). Unlike previous systems, real-world music CD recordings are used as reference data. The F0 of the vocal part is estimated automatically from the CD recordings and can be further hand-corrected interactively using a graphical interface on the MiruSinger screen.
Part 2: Singing expression (voice percussion)
Voice percussion in our context is the mimicking of drum sounds by voice, expressed in a verbal form that can be transcribed into phonemic representation, or onomatopoeia (e.g. don-tan-do-do-tan). Chapter 5 describes a psychological experiment, the voice percussion expression experiment, which gathers data on how subjects express drum patterns (H domain). This serves as a preliminary basis for developing a voice percussion recognition method. Previous studies on query-by-humming focused on pitch detection and melodic feature extraction, but these features have less relevance to voice percussion recognition, which is primarily concerned with classification of timbre and identification of articulation methods. Methods for handling such features can be useful for music notation interfaces and have promising applications in widening the scope of music information retrieval.
A "drum pattern" in our context means a sequence of drum beats that forms a minimum unit (one measure). In this thesis, drum patterns consist of only two percussion instruments: bass drum (BD) and snare drum (SD). In the expression experiment, there were 17 subjects aged 19 to 31 (two with experience in percussion). The voice percussion sung by the subjects was recorded and analyzed. Significant findings from the expression experiment include: "the onomatopoeic expressions corresponded to the length and rhythmic patterns of the beats" and "some subjects verbally expressed rest notes".
Chapter 6 describes a voice percussion recognition method. The voice percussion is compared with all the patterns in a drum pattern database, and the pattern estimated to be acoustically closest to the voice percussion is selected as the recognition result (M domain). The search first matches drum patterns over onomatopoeic sequences, selecting the instrument sequences with the highest likelihood ratings, which are then checked for their onset timings. The pattern with the highest ranking is output as the final result. The recognition method was tested in recognition experiments over combinations of different settings of the acoustic model and the pronunciation dictionary. The following four conditions were evaluated.
(A) General acoustic model of speech
(B) Acoustic model tuned by voice percussion utterances not in evaluation data
(C) Acoustic model tuned to individual subjects
(D) Same acoustic model, with the pronunciation dictionary restricted to the expressions used by the subject
The recognition rates in the evaluation experiments were (A) 58.5%, (B) 58.5%, (C) 85.0%, and (D) 92.0%.
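A minimal sketch of the two-stage search (instrument-sequence matching first, onset-timing check second) could look like the following. The syllable-to-instrument mapping, the database layout, and the edit-distance plus mean-timing-error scoring are all simplified assumptions standing in for the thesis's likelihood-based method.

```python
import numpy as np

# Hypothetical onomatopoeia mapping: "don" -> bass drum, "tan" -> snare drum.
SYLLABLE = {"don": "BD", "tan": "SD"}

def edit_distance(a, b):
    """Levenshtein distance between two instrument sequences."""
    m, n = len(a), len(b)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + cost) # substitution
    return int(d[m, n])

def recognize(syllables, onsets, database, top_k=3):
    """Two-stage matching: (1) rank database patterns by instrument-sequence
    similarity to the uttered onomatopoeia; (2) re-rank the closest candidates
    by onset-timing deviation; return the best pattern's name."""
    seq = [SYLLABLE[s] for s in syllables]
    ranked = sorted(database, key=lambda p: edit_distance(seq, p["instruments"]))
    best, best_cost = None, float("inf")
    for p in ranked[:top_k]:
        if len(p["onsets"]) != len(onsets):
            continue  # timing check only applies to same-length candidates
        cost = float(np.abs(np.array(onsets) - np.array(p["onsets"])).mean())
        if cost < best_cost:
            best, best_cost = p["name"], cost
    return best if best is not None else ranked[0]["name"]
```

For example, an utterance "don-tan-don-tan" with roughly quarter-beat onsets would match a BD-SD-BD-SD pattern in the database even when the sung timings deviate slightly from the stored grid.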
Following the encouraging results of the proposed method as a practical tool for voice percussion recognition, a score input interface, Voice Drummer, was developed, as its application (I domain). Voice Drummer consists of a score input mode which is used for drum pattern input intended for use in composition, and an arrangement mode which edits drum patterns in a given music piece. There is also a practice/adaptation mode where the user can practice and adapt the system to his/her voice, thus increasing the recognition rate.
Part 1 presented the results of the subjective evaluation experiments, and proposed two acoustic features, pitch interval accuracy and vibrato, as key features for evaluating singing skill. The results of the subjective evaluation suggested that the singing skill evaluations of human listeners are generally consistent and in mutual agreement. In the classification experiment, the acoustic features were shown to be effective for evaluating singing skill without score information.
Part 2 presented the results of the voice percussion expression experiment, and presented a voice percussion recognition method. The onomatopoeic expressions utilized in the recognition experiment were extracted from the expression experiment. In the recognition experiment, the voice percussion recognition method achieved a recognition rate of 91.0% for the highest-tuned setting.
The results of these two studies were applied to the development of two application systems: MiruSinger for singing training assistance and Voice Drummer for percussion instrument notation. Trial usage of the systems suggests that both would be useful and enjoyable tools for average users.
The presented work can be seen as pioneering work in the fields of singing understanding and expression, contributing to the advance of singing voice research.