Signal Processing Methods for the Automatic Transcription of Music

Anssi Klapuri
Tampere University of Technology, Finland (March 2004)


Signal processing methods for the automatic transcription of music are developed in this thesis. Music transcription is here understood as the process of analyzing a music signal so as to write down the parameters of the sounds that occur in it. The applied notation can be the traditional musical notation or any symbolic representation which gives sufficient information for performing the piece using the available musical instruments. Recovering the musical notation automatically for a given acoustic signal allows musicians to reproduce and modify the original performance. Another principal application is structured audio coding: a MIDI-like representation is extremely compact yet retains the identifiability and characteristics of a piece of music to an important degree.

This thesis is concerned with the automatic transcription of the harmonic and melodic parts of real-world music signals. Detecting or labeling the sounds of percussive instruments (drums) is not attempted, although such sounds may be present in the target signals. Algorithms are proposed that address two distinct subproblems of music transcription. The main part of the thesis is dedicated to multiple fundamental frequency (F0) estimation, that is, estimation of the F0s of several concurrent musical sounds. The other subproblem addressed is musical meter estimation, which concerns the rhythmic aspects of music and refers to the estimation of the regular pattern of strong and weak beats in a piece of music.

For multiple-F0 estimation, two different algorithms are proposed. Both methods are based on an iterative approach, where the F0 of the most prominent sound is estimated, the sound is cancelled from the mixture, and the process is repeated for the residual. The first method is derived in a pragmatic manner and is based on the acoustic properties of musical sound mixtures. For the estimation stage, an algorithm is proposed which utilizes the frequency relationships of simultaneous spectral components, without assuming ideal harmonicity. For the cancelling stage, a new processing principle, spectral smoothness, is proposed as an efficient mechanism for separating the detected sounds from the mixture signal.
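The iterative estimate-and-cancel loop described above can be illustrated with a minimal sketch. This is not the algorithm of the thesis: it assumes ideal harmonicity, works on a toy spectrum with exact integer-Hz partials, uses a simple 1/h-weighted harmonic sum as the salience measure, and cancels a detected sound by deleting its partials outright instead of applying spectral smoothness. All function names and parameter values are hypothetical.

```python
def harmonic_spectrum(f0, amplitude, n_partials=10):
    # Toy magnitude spectrum: partial h sits exactly at h * f0 with amplitude / h.
    return {h * f0: amplitude / h for h in range(1, n_partials + 1)}

def mix(*spectra):
    # Add magnitudes of coinciding partials (phases are ignored in this toy model).
    out = {}
    for spectrum in spectra:
        for freq, mag in spectrum.items():
            out[freq] = out.get(freq, 0.0) + mag
    return out

def salience(spectrum, f0, n_partials=10):
    # Assumed salience measure: harmonic sum with 1/h weights, which
    # penalizes sub-harmonic candidates such as f0 / 2.
    return sum(spectrum.get(h * f0, 0.0) / h for h in range(1, n_partials + 1))

def cancel(spectrum, f0, n_partials=10):
    # Naive cancellation: drop every partial of the detected sound.
    # Shared partials of other sounds are lost too, which is the kind of
    # damage a smoothness-based cancellation stage is meant to limit.
    removed = {h * f0 for h in range(1, n_partials + 1)}
    return {f: m for f, m in spectrum.items() if f not in removed}

def iterative_f0s(spectrum, n_sounds, candidates=range(100, 501)):
    found = []
    residual = dict(spectrum)
    for _ in range(n_sounds):
        best = max(candidates, key=lambda f: salience(residual, f))
        found.append(best)                 # most prominent sound
        residual = cancel(residual, best)  # repeat on the residual
    return found

mixture = mix(harmonic_spectrum(220, 1.0), harmonic_spectrum(330, 0.8))
estimated_f0s = iterative_f0s(mixture, n_sounds=2)
```

In this example the two sounds share the partials at 660, 1320, and 1980 Hz, so the naive cancellation removes part of the second sound along with the first. The second F0 is still recoverable here, but the example shows why the cancelling stage matters.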

The other method is derived from known properties of the human auditory system. More specifically, it is assumed that the peripheral parts of hearing can be modelled by a bank of bandpass filters, followed by half-wave rectification and compression of the subband signals. It is shown that this basic structure allows the combined use of time-domain periodicity and frequency-domain periodicity for F0 extraction. In the derived algorithm, the higher-order (unresolved) harmonic partials of a sound are processed collectively, without the need to detect or estimate individual partials. This has the consequence that the method works reasonably accurately for short analysis frames. Computational efficiency of the method is based on calculating a frequency-domain approximation of the summary autocorrelation function, a physiologically-motivated representation of sound.
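The role of rectification for unresolved partials can be illustrated with a small sketch. It is not the auditory model of the thesis: a single subband is synthesized directly instead of being produced by a filterbank, and the sampling rate, frame length, and compression exponent are assumed values. The sketch shows that a band containing only harmonics 10 to 12 of a 200 Hz sound has no spectral component at 200 Hz, while the half-wave rectified and compressed band does, so the F0 becomes observable without resolving individual partials.

```python
import math

FS = 16000   # sampling rate in Hz (assumed)
F0 = 200.0   # fundamental frequency of the synthetic sound (assumed)
N = 800      # 50 ms frame holding an integer number of periods of every component

def subband_signal():
    # Stand-in for one filterbank channel: harmonics 10, 11 and 12 of F0 only.
    return [sum(math.cos(2 * math.pi * h * F0 * n / FS) for h in (10, 11, 12))
            for n in range(N)]

def rectify_and_compress(x, exponent=0.5):
    # Half-wave rectification followed by power-law compression.
    # The exponent is a hypothetical choice for illustration.
    return [max(v, 0.0) ** exponent for v in x]

def amplitude_at(x, freq):
    # Amplitude of a single frequency component, computed as one DFT bin.
    re = sum(v * math.cos(2 * math.pi * freq * n / FS) for n, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq * n / FS) for n, v in enumerate(x))
    return 2.0 * math.hypot(re, im) / len(x)

band = subband_signal()
processed = rectify_and_compress(band)

f0_before = amplitude_at(band, F0)       # essentially zero: no partial at F0
f0_after = amplitude_at(processed, F0)   # clearly nonzero: beating at F0
```

The nonlinearity creates difference frequencies between neighbouring partials, which is why the envelope of the subband carries the F0 even though the fundamental itself is absent from the band.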

Both of the proposed multiple-F0 estimation methods operate within a single time frame and arrive at approximately the same error rates. However, the auditorily-motivated method is superior in short analysis frames. On the other hand, the pragmatically-oriented method is “complete” in the sense that it includes mechanisms for suppressing additive noise (drums) and for estimating the number of concurrent sounds in the analyzed signal. In musical interval and chord identification tasks, both algorithms outperformed the average of ten trained musicians.

For musical meter estimation, a method is proposed which performs meter analysis jointly at three different time scales: at the temporally atomic tatum pulse level, at the tactus pulse level which corresponds to the tempo of a piece, and at the musical measure level. Acoustic signals from arbitrary musical genres are considered. For the initial time-frequency analysis, a new technique is proposed which measures the degree of musical accent as a function of time at four different frequency ranges. This is followed by a bank of comb filter resonators which perform feature extraction for estimating the periods and phases of the three pulses. The features are processed by a probabilistic model which represents primitive musical knowledge and performs joint estimation of the tatum, tactus, and measure pulses. The model takes into account the temporal dependencies between successive estimates and enables both causal and noncausal estimation. In simulations, the method worked robustly for different types of music and improved over two state-of-the-art reference methods. Also, the problem of detecting the beginnings of discrete sound events in acoustic signals, onset detection, is separately discussed.
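The idea of using comb filter resonators for period estimation can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the thesis: the accent signal is a synthetic impulse train, a single pulse level is estimated, all resonators share one feedback gain, and the period is picked as the delay with the highest output energy instead of being passed to a probabilistic model. The function names and constants are hypothetical.

```python
def comb_filter_energy(accent, delay, alpha=0.9):
    # Feedback comb filter: y[n] = alpha * y[n - delay] + (1 - alpha) * x[n].
    # Returns the mean output energy, which is largest when the delay
    # matches a periodicity of the input.
    y = [0.0] * len(accent)
    for n, x in enumerate(accent):
        feedback = y[n - delay] if n >= delay else 0.0
        y[n] = alpha * feedback + (1.0 - alpha) * x
    return sum(v * v for v in y) / len(y)

def estimate_period(accent, delays):
    # Bank of resonators, one per candidate delay; pick the strongest.
    return max(delays, key=lambda d: comb_filter_energy(accent, d))

true_period = 12  # in accent-signal samples (assumed)
accent_signal = [1.0 if n % true_period == 0 else 0.0 for n in range(480)]

estimated_period = estimate_period(accent_signal, range(4, 21))
```

Resonators whose delay is an integer multiple of the true period respond strongly as well, so the candidate range here is deliberately kept below twice the period. Resolving that kind of ambiguity between metrical levels is the task the source assigns to the joint probabilistic model.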
