MIR PhD Thesis: David Gerhard (2003)

Computationally Measurable Differences Between Speech and Song

David Gerhard
Simon Fraser University, Canada (April 2003)

ABSTRACT

Automatic audio signal classification is one of the general research areas in which algorithms are developed to allow computer systems to understand and interact with the audio environment. Human utterance classification is a specific subset of audio signal classification in which the domain of audio signals is restricted to those likely to be encountered when interacting with humans. Speech recognition software performs classification in a domain restricted to human speech, but human utterances can also include singing, shouting, poetry and prosodic speech, for which current recognition engines are not designed.

Another recent and relevant audio signal classification task is the discrimination between speech and music. Many radio stations have periods of speech (news, information reports, commercials) interspersed with periods of music, and systems have been designed to search for one type of sound in preference over another. Many of the current systems used to distinguish between speech and music use characteristics of the human voice, so such systems are not able to distinguish between speech and music when the music is an individual unaccompanied singer.

This thesis presents research into the problem of human utterance classification, specifically the differentiation between talking and singing. It addresses the question: “Are there measurable differences between the auditory waveforms produced by talking and singing?” Preliminary background is presented to acquaint the reader with some of the science used in the algorithm development. A corpus of sounds was collected to study the physical and perceptual differences between singing and talking, and the procedures and results of this collection are presented. A set of 17 features is developed to differentiate between talking and singing, and to investigate the intermediate vocalizations between talking and singing. The results of these features are examined and evaluated.
