Computational Musical Instrument Recognition and Its Application to Content-based Music Information Retrieval

Tetsuro Kitahara
Kyoto University, Japan (March, 2007)


The current capability of computers to recognize auditory events is severely limited compared with human ability. Computers can accurately recognize sounds that closely match those they were trained on and that occur in isolation, but they break down whenever the input is degraded by competing sounds.

In this thesis, we address the computational recognition of non-percussive musical instruments in polyphonic music. Music is a good domain for the computational recognition of auditory events because multiple instruments are usually played simultaneously. The difficulty in handling music lies in the fact that signals (events to be recognized) and noise (events to be ignored) are not uniquely defined; this is the main difference from studies of speech recognition in noisy environments. Musical instrument recognition is also important from an industrial standpoint. Recent developments in digital audio and network technologies have enabled us to handle a tremendous number of musical pieces, so efficient music information retrieval (MIR) is required. Musical instrument recognition will serve as one of the key technologies for sophisticated MIR because the types of instruments played characterize musical pieces; some musical forms, in fact, are defined by their instrumentation, for example the “piano sonata” and the “string quartet.”

Despite the importance of musical instrument recognition, studies have until recently dealt mainly with monophonic sounds. Although the number of studies dealing with polyphonic music has been increasing, their techniques have not yet reached a level sufficient for application to MIR or other real-world systems. We investigate musical instrument recognition in two stages. In the first stage, we address instrument recognition for monophonic sounds to develop basic technologies for handling musical instrument sounds. Here we deal with two issues: (1) the pitch dependency of timbre and (2) the input of non-registered instruments. Because musical instrument sounds have wide pitch ranges, in contrast to other kinds of sounds, the pitch dependency of timbre is an important issue. The second issue, handling instruments that are not contained in the training data, is also unavoidable: it is impossible in practice to build an exhaustive training data set because the number of instruments is virtually infinite. In the second stage, we address instrument recognition in polyphonic music. To deal with polyphonic music, we must solve two further issues: (3) the overlapping of simultaneously played notes and (4) the unreliability of the preceding note estimation process. When multiple instruments play simultaneously, the partials (harmonic components) of their sounds overlap and interfere, making the acoustic features differ from those of monophonic sounds. The overlapping of simultaneous notes is therefore an essential problem for polyphonic music. In addition, note estimation, that is, estimating the onset time and fundamental frequency (F0) of each note, is usually used as a preprocessing step in a typical instrument recognition framework; it remains, however, a challenging problem for polyphonic music.

In Chapter 3, we propose an F0-dependent multivariate normal distribution to resolve the first issue. The F0-dependent multivariate normal distribution is an extension of the multivariate normal distribution in which the mean vector is defined as a function of F0. The key idea is to approximate the variation of each acoustic feature from pitch to pitch as a function of F0. This approximation makes it possible to model the pitch-dependent and pitch-independent aspects of timbre separately. We also investigate acoustic features for musical instrument recognition in this chapter. Experimental results with 6,247 solo tones of 19 instruments showed an improvement in the average recognition rate from 75.73% to 79.73%.
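The idea of an F0-dependent mean can be sketched as follows. In this illustrative NumPy implementation (the class name, polynomial degree, covariance regularization, and training procedure are assumptions for illustration, not the thesis's exact formulation), each feature dimension's mean is fitted as a polynomial in F0, so an observed feature vector is scored against a mean that shifts with pitch:

```python
import numpy as np

class F0DependentGaussian:
    """Multivariate normal whose mean vector is a polynomial function of F0.
    Illustrative sketch: degree and regularization are assumptions."""

    def __init__(self, degree=3):
        self.degree = degree

    def fit(self, features, f0s):
        # One polynomial per feature dimension: coeffs has shape (degree+1, dim).
        self.coeffs = np.polyfit(f0s, features, self.degree)
        residuals = features - self.mean(f0s)
        cov = np.cov(residuals.T) + 1e-6 * np.eye(features.shape[1])
        self.cov_inv = np.linalg.inv(cov)
        self.log_det = np.linalg.slogdet(cov)[1]
        return self

    def mean(self, f0):
        # Horner's rule, broadcast over scalar or array F0.
        f0 = np.asarray(f0, dtype=float)
        mu = np.zeros(f0.shape + (self.coeffs.shape[1],))
        for c in self.coeffs:
            mu = mu * f0[..., None] + c
        return mu

    def log_likelihood(self, x, f0):
        # Score a feature vector against the F0-shifted mean.
        d = np.asarray(x) - self.mean(f0)
        k = d.shape[-1]
        return -0.5 * (d @ self.cov_inv @ d + self.log_det + k * np.log(2 * np.pi))
```

Classification would then pick the instrument model giving the highest log-likelihood for an observed feature vector at its estimated F0.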

In Chapter 4, we solve the second issue by recognizing non-registered instruments at the category level. When a given sound is registered, its instrument name, e.g., violin, is recognized; even if it is not registered, its category name, e.g., strings, can still be recognized. The key to achieving such recognition is to adopt a musical instrument taxonomy that reflects acoustical similarity. We present a method for acquiring such a taxonomy by applying hierarchical clustering to a large-scale musical instrument sound database. Experimental results showed that around 77% of non-registered instrument sounds, on average, were correctly recognized at the category level.
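Acquiring a category-level taxonomy by hierarchical clustering might look like the sketch below, where each instrument is summarized by a mean acoustic feature vector. The choice of features, the Ward linkage, and the fixed number of categories are illustrative assumptions; the thesis's actual procedure may differ:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def acoustic_taxonomy(mean_features, names, n_categories):
    """Group instruments into acoustically similar categories by hierarchically
    clustering their mean feature vectors (Ward linkage is an assumption)."""
    Z = linkage(mean_features, method="ward")
    labels = fcluster(Z, t=n_categories, criterion="maxclust")
    categories = {}
    for name, label in zip(names, labels):
        categories.setdefault(label, []).append(name)
    return list(categories.values())
```

A non-registered sound would then be assigned to the nearest category rather than forced onto a specific instrument name.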

In Chapter 5, we tackle the third issue by weighting features according to how much they are affected by overlapping; that is, we give lower weights to heavily affected features and higher weights to less affected ones. Such weighting requires evaluating the influence of the overlapping on each feature. In previous studies, however, it was impossible to evaluate this influence by analyzing training data, because the training data were taken only from monophonic sounds. Taking training data from polyphonic music (called a mixed-sound template), we evaluate the influence as the ratio of the within-class variance to the between-class variance in the distribution of the training data. We then generate feature axes as weighted mixtures that minimize the influence by means of linear discriminant analysis. We also introduce musical context to avoid musically unnatural errors (e.g., a single clarinet note within a sequence of flute notes). Experimental results showed that the recognition rates obtained with these techniques were 84.1% for duo music, 77.6% for trio music, and 72.3% for quartet music.
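The core of this weighting scheme, minimizing the within-class to between-class variance ratio via linear discriminant analysis, can be sketched as below. This is textbook LDA applied to a mixed-sound training set; the regularization constant and eigen-solver details are assumptions:

```python
import numpy as np

def discriminative_axes(X, y, n_axes):
    """Find projection axes minimizing the within-class / between-class
    variance ratio (textbook LDA; regularization is an assumption)."""
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - overall_mean)[:, None]
        Sb += len(Xc) * diff @ diff.T
    # Eigenvectors of Sw^-1 Sb with the largest eigenvalues give the axes
    # least disturbed by overlap-induced within-class spread.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_axes]]
```

Features heavily corrupted by overlapping notes show inflated within-class variance in the mixed-sound template, so the resulting axes automatically down-weight them.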

In Chapter 6, we describe a new framework for musical instrument recognition that solves the fourth issue. We formulate musical instrument recognition as the problem of calculating instrument existence probabilities at every point on the time-frequency plane. The instrument existence probabilities are calculated by multiplying two kinds of probabilities, one calculated using PreFEst and the other calculated using hidden Markov models. The instrument existence probabilities are visualized in a spectrogram-like graphical representation called the instrogram. Because the calculation is performed for each time and each frequency, not for each note, estimating the onset time and F0 of each note is unnecessary. We obtained promising results for both synthesized music and recordings of real performances of classical and jazz music.
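The probability combination at the heart of the instrogram is a pointwise product over the time-frequency plane. The sketch below assumes both input probability maps have already been computed (by a PreFEst-style pitch estimator and by HMM-based instrument models, respectively); the array shapes are illustrative:

```python
import numpy as np

def instrogram(p_pitch, p_inst_given_pitch):
    """Instrument existence probabilities on the time-frequency plane.

    p_pitch:            (T, F) probability that any pitch exists at (t, f)
                        (a PreFEst-style nonspecific estimate).
    p_inst_given_pitch: (T, F, K) conditional probability over K instruments
                        given that a pitch exists (HMM-based in the thesis).
    Returns (T, F, K): the product, one existence map per instrument.
    """
    return p_pitch[..., None] * p_inst_given_pitch
```

Because both factors are defined per time-frequency point, no note-level segmentation ever enters the computation.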

In Chapter 7, we describe an application of instrogram analysis to similarity-based MIR. Because most previous similarity-based MIR systems used low-level features such as MFCCs, similarities for individual musical elements such as melody, rhythm, harmony, and instrumentation could not be measured separately. As a first step toward measuring such music similarity, we develop a music similarity measure, based on the instrogram representation, that reflects instrumentation only. We confirmed that the instrogram can be applied to content-based MIR by developing a prototype system that searches for musical pieces whose instrumentation is similar to that specified by the user.
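As a toy illustration of instrumentation-based similarity, two pieces' instrograms (here already marginalized over frequency into time-by-instrument matrices) can be compared through their average instrument profiles. This frame-averaged Euclidean distance is a simple stand-in for illustration, not the thesis's actual measure:

```python
import numpy as np

def instrumentation_distance(gram_a, gram_b):
    """Dissimilarity of instrumentation between two pieces, each given as a
    (time, instruments) existence-probability matrix. Averaging over time
    yields a per-piece instrument profile; the profiles are then compared
    (an illustrative stand-in for the thesis's measure)."""
    profile_a = gram_a.mean(axis=0)  # average instrument activity over time
    profile_b = gram_b.mean(axis=0)
    return float(np.linalg.norm(profile_a - profile_b))
```

A retrieval system would rank candidate pieces by this distance from the query's profile, returning those with the most similar instrumentation first.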

In Chapter 8, we discuss the major contributions of this study toward research fields including computational auditory scene analysis, content-based MIR, and music visualization. We also discuss remaining issues and future directions of research.

Finally, we present the conclusions of this work in Chapter 9.
