Transcription of music refers to the analysis of a music signal in order to produce a parametric representation of the sounding notes in the signal. This is conventionally carried out by listening to a piece of music and writing down the symbols of common musical notation to represent the occurring notes in the piece. Automatic transcription of music refers to the extraction of such representations using signal-processing methods.
This thesis concerns the automatic transcription of pitched notes in musical audio and its applications. Emphasis is laid on the transcription of realistic polyphonic music, where multiple pitched and percussive instruments are sounding simultaneously. The methods included in this thesis are based on a framework which combines low-level acoustic modeling with high-level musicological modeling. The acoustic modeling focuses on note events, so that the methods produce discrete-pitch notes with onset times and durations as output. Such transcriptions can be efficiently represented as MIDI files, for example, and they can be converted to common musical notation via temporal quantization of the note onsets and durations. The musicological model utilizes musical context and trained models of typical note sequences in the transcription process. Based on the framework, this thesis presents methods for generic polyphonic transcription, melody transcription, and bass line transcription. A method for chord transcription is also presented.
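To make the note-event output and the temporal quantization step concrete, the following is a minimal sketch and not the method used in the thesis. The `Note` class, the `quantize` function, the fixed tempo, and the sixteenth-note grid are all assumptions introduced for illustration; the thesis text here does not specify how tempo or grid resolution are obtained.

```python
from dataclasses import dataclass


@dataclass
class Note:
    """A transcribed note event (hypothetical container for illustration)."""
    pitch: int       # discrete pitch as a MIDI note number (60 = middle C)
    onset: float     # onset time in seconds
    duration: float  # duration in seconds


def quantize(notes, tempo_bpm=120.0, steps_per_beat=4):
    """Snap onsets and durations to a regular metrical grid.

    Assumes a known, constant tempo. Returns (pitch, onset_step,
    duration_steps) tuples, where one step is 1/steps_per_beat of a beat.
    """
    step = 60.0 / tempo_bpm / steps_per_beat  # seconds per grid step
    quantized = []
    for n in notes:
        onset_step = round(n.onset / step)
        # Keep every note at least one grid step long.
        duration_steps = max(1, round(n.duration / step))
        quantized.append((n.pitch, onset_step, duration_steps))
    return quantized


# Made-up note events with slightly inexact timing.
notes = [Note(60, 0.02, 0.49), Note(64, 0.51, 0.26), Note(67, 1.0, 0.01)]
grid = quantize(notes)
```

With the assumed tempo of 120 BPM and four steps per beat, each grid step lasts 0.125 s, so an onset at 0.51 s lands on step 4 and a duration of 0.49 s becomes four steps (one quarter note).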
All the proposed methods have been extensively evaluated using realistic polyphonic music. In our evaluations with 91 half-minute music excerpts, the generic polyphonic transcription method correctly found 39% of all the pitched notes (recall), and 41% of the transcribed notes were correct (precision). Despite the seemingly low recognition rates in our simulations, this method was top-ranked in the polyphonic note tracking task in the international MIREX evaluation in 2007 and 2008. The methods for melody, bass line, and chord transcription were evaluated using hours of music, and an F-measure of 51% was achieved for both melodies and bass lines. The chord transcription method was evaluated using the first eight albums by The Beatles, and it produced correct frame-based labeling about 70% of the time.
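The relation between the recall, precision, and F-measure figures quoted above can be sketched as follows. This uses the standard definitions, with the F-measure taken as the harmonic mean of precision and recall. The function names and the note counts are hypothetical, and the criterion for counting a transcribed note as correct is not stated in this section.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(n_correct, n_transcribed, n_reference):
    """Compute (precision, recall, F-measure) from note counts.

    n_correct:     transcribed notes that match a reference note
    n_transcribed: all notes produced by the transcriber
    n_reference:   all notes in the reference annotation
    """
    precision = n_correct / n_transcribed if n_transcribed else 0.0
    recall = n_correct / n_reference if n_reference else 0.0
    return precision, recall, f_measure(precision, recall)


# Combining the rates reported for the generic polyphonic method.
f_polyphonic = f_measure(0.41, 0.39)

# Hypothetical counts, chosen only to give rates close to those reported.
p, r, f = precision_recall_f(n_correct=39, n_transcribed=95, n_reference=100)
```

Under these definitions, a precision of 41% and a recall of 39% correspond to an F-measure of roughly 40%.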
The transcriptions are useful not only as human-readable musical notation but also in several other application areas, including music information retrieval and content-based audio modification. This is demonstrated by two applications included in this thesis. The first application is a query by humming system which is capable of searching for melodies similar to a user query directly from commercial music recordings. In our evaluation with a database of 427 full commercial audio recordings, the method retrieved the correct recording in the top-three list for 58% of the 159 hummed queries. The method was also top-ranked in the “query by singing/humming” task in MIREX 2008 for a database of 2048 MIDI melodies and 2797 queries. The second application uses automatic melody transcription for accompaniment and vocals separation. The transcription also enables tuning the user's singing to the original melody in a novel karaoke application.