Data-Driven Concatenative Sound Synthesis

Diemo Schwarz
Ircam - Centre Pompidou, Paris, France (January, 2004)


Concatenative data-driven sound synthesis methods use a large database of source sounds, segmented into heterogeneous units, and a unit selection algorithm that finds the units that best match the sound or musical phrase to be synthesised, called the target. The selection is performed according to the features of the units. These are characteristics extracted from the source sounds, e.g. pitch, or attributed to them, e.g. instrument class. The selected units are then transformed to fully match the target specification, and concatenated. However, if the database is sufficiently large, the probability is high that a matching unit will be found, so the need to apply transformations is reduced.

Usual synthesis methods are based on a model of the sound signal. It is very difficult to build a model that would preserve all the fine details of sound. Concatenative synthesis achieves this by using actual recordings. This data-driven approach (as opposed to a rule-based approach) takes advantage of the information contained in the many sound recordings. For example, very natural-sounding transitions can be synthesised, since unit selection is aware of the context of the database units.

In speech synthesis, concatenative synthesis methods are the most widely used. They resulted in a considerable gain of naturalness and intelligibility. Results in other fields, for instance speech recognition, confirm the general superiority of data-driven approaches. Concatenative data-driven approaches have made their way into some musical synthesis applications which are briefly presented.

The CATERPILLAR software system developed in this thesis allows data-driven musical sound synthesis from a large database. However, musical creation is an artistic activity and thus, unlike speech synthesis, not based on clearly definable criteria. The system is therefore designed for flexible, interactive use, which allows composers to obtain new sounds.

To constitute a unit database, alignment of music to a score is used to segment musical instrument recordings. The alignment is based on spectral peak structure matching, and two approaches, using Dynamic Time Warping and Hidden Markov Models, are compared.
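As an illustration of the Dynamic Time Warping approach to alignment, here is a minimal sketch in Python. It is not the implementation from the thesis: the frames are reduced to a single pitch number each, and the absolute pitch difference stands in for the spectral peak structure matching mentioned above. The names `dtw_align` and `note_onsets` are hypothetical.

```python
import math


def dtw_align(score_frames, audio_frames, dist):
    """Classic DTW: return (total_cost, path) where path is a list of
    (score_index, audio_index) pairs aligning the two sequences."""
    n, m = len(score_frames), len(audio_frames)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(score_frames[i - 1], audio_frames[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance score only
                                 cost[i][j - 1],      # advance audio only
                                 cost[i - 1][j - 1])  # advance both
    # Backtrack from the end to recover the cheapest alignment path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


def note_onsets(path):
    """First audio frame aligned to each score note: the segment boundaries."""
    onsets = {}
    for score_idx, audio_idx in path:
        onsets.setdefault(score_idx, audio_idx)
    return onsets


# Assumed toy data: three score notes, five analysis frames of a recording.
score = [60, 62, 64]
audio = [60, 60, 62, 64, 64]
total, path = dtw_align(score, audio, lambda a, b: abs(a - b))
print(note_onsets(path))
```

The onsets recovered from the path give the audio frames at which each score note starts, which is the information needed to cut a recording into units.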

Descriptor extraction analyses the sounds for their signal, spectral, harmonic, and perceptual characteristics, and temporal modeling techniques characterise the temporal evolution of the units uniformly. In addition, it is possible to attribute score information such as playing style, or arbitrary information, to the units, which can later be used for selection.

The database is implemented using a relational SQL database management system for optimal flexibility and reliability. A database interface cleanly separates the synthesis system from the database.
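To make the idea of a database interface concrete, the following sketch hides a small relational schema behind a Python class. Everything here is an assumption made for illustration: the abstract does not describe the schema or name the database management system, so SQLite (from the Python standard library) is used as a stand-in, and the table names, column names, and the `UnitDatabase` class are hypothetical.

```python
import sqlite3


class UnitDatabase:
    """Hypothetical interface: callers never see SQL, only units and features."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        # Assumed schema: one row per unit, one row per (unit, feature) pair.
        self.conn.executescript("""
            CREATE TABLE IF NOT EXISTS unit (
                id INTEGER PRIMARY KEY,
                sound_file TEXT NOT NULL,
                start_s REAL NOT NULL,
                end_s REAL NOT NULL
            );
            CREATE TABLE IF NOT EXISTS feature (
                unit_id INTEGER NOT NULL REFERENCES unit(id),
                name TEXT NOT NULL,
                value REAL NOT NULL,
                PRIMARY KEY (unit_id, name)
            );
        """)

    def add_unit(self, sound_file, start_s, end_s, features):
        """Store a unit with its feature values and return its id."""
        cur = self.conn.execute(
            "INSERT INTO unit (sound_file, start_s, end_s) VALUES (?, ?, ?)",
            (sound_file, start_s, end_s))
        unit_id = cur.lastrowid
        self.conn.executemany(
            "INSERT INTO feature (unit_id, name, value) VALUES (?, ?, ?)",
            [(unit_id, name, value) for name, value in features.items()])
        self.conn.commit()
        return unit_id

    def units_in_range(self, name, low, high):
        """Ids of units whose feature `name` lies in [low, high]."""
        rows = self.conn.execute(
            "SELECT unit_id FROM feature "
            "WHERE name = ? AND value BETWEEN ? AND ? ORDER BY unit_id",
            (name, low, high))
        return [row[0] for row in rows]


db = UnitDatabase()
first = db.add_unit("violin.wav", 0.0, 0.5, {"pitch": 60.0})
second = db.add_unit("violin.wav", 0.5, 1.0, {"pitch": 67.0})
print(db.units_in_range("pitch", 59.0, 61.0))
```

Because the synthesis code only calls methods such as `add_unit` and `units_in_range`, the storage backend could be swapped without touching the selection algorithms, which is the kind of separation the paragraph above describes.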

The best matching sequence of units is found by a Viterbi unit selection algorithm. To allow a more flexible specification of the resulting sequence of units, the constraint solving algorithm of adaptive local search has also been applied to unit selection as an alternative. Both algorithms are based on two distance functions: the target distance expresses the similarity of a target unit to the database units, and the concatenation distance expresses the quality of the join of two database units.
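The interplay of the two distance functions in Viterbi unit selection can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm as implemented in the thesis: units carry only a pitch and their position in a source recording, the target distance is the absolute pitch difference, and the concatenation distance is 0 for units that were adjacent in the same recording and 1 otherwise. The function and field names are hypothetical.

```python
def viterbi_select(targets, units, target_dist, concat_dist):
    """Return (total_cost, unit_indices) minimising the sum of target
    distances plus concatenation distances between successive units."""
    n_units = len(units)
    # best[k]: cost of the cheapest unit sequence ending in unit k so far.
    best = [target_dist(targets[0], u) for u in units]
    back = []  # back[t][k]: best predecessor of unit k at target t + 1
    for target in targets[1:]:
        new_best, pointers = [], []
        for unit in units:
            prev = min(range(n_units),
                       key=lambda p: best[p] + concat_dist(units[p], unit))
            new_best.append(best[prev] + concat_dist(units[prev], unit)
                            + target_dist(target, unit))
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Pick the cheapest final unit and follow the back-pointers.
    last = min(range(n_units), key=lambda k: best[k])
    total = best[last]
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return total, path


def pitch_distance(target_pitch, unit):
    # Assumed target distance: absolute pitch difference.
    return abs(target_pitch - unit["pitch"])


def join_distance(left, right):
    # Assumed concatenation distance: free if the units were consecutive
    # in the same source recording, a fixed penalty otherwise.
    consecutive = (left["file"] == right["file"]
                   and left["pos"] + 1 == right["pos"])
    return 0 if consecutive else 1


units = [
    {"pitch": 60, "file": "a", "pos": 0},
    {"pitch": 62, "file": "a", "pos": 1},
    {"pitch": 64, "file": "b", "pos": 0},
    {"pitch": 62, "file": "b", "pos": 1},
    {"pitch": 64, "file": "a", "pos": 2},
]
total, path = viterbi_select([60, 62, 64], units, pitch_distance, join_distance)
print(total, path)
```

In this toy corpus two units match the final target pitch equally well, and the concatenation distance decides between them in favour of the unit that continues the same recording.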

Data-driven concatenative synthesis is then applied to instrument synthesis with high level control, explorative free synthesis from arbitrary sound databases, resynthesis of a recording with sounds from the database, and artistic speech synthesis. For these applications, unit corpora of violin sounds, environmental noises, and speech have been built.
