Sound source separation refers to the task of estimating the signals produced by individual sound sources from a complex acoustic mixture. It has several applications, since monophonic signals can be processed more eciently and flexibly than polyphonic mixtures.|
This thesis deals with the separation of monaural, or, one-channel music recordings. We concentrate on separation methods, where the sources to be separated are not known beforehand. Instead, the separation is enabled by utilizing the common properties of real-world sound sources, which are their continuity, sparseness, and repetition in time and frequency, and their harmonic spectral structures. One of the separation approaches taken here use unsupervised learning and the other uses model-based inference based on sinusoidal modeling.
Most of the existing unsupervised separation algorithms are based on a linear instantaneous signal model, where each frame of the input mixture signal is modeled as a weighted sum of basis functions. We review the existing algorithms which use independent component analysis, sparse coding, and non-negative matrix factorization to estimate the basis functions from an input mixture signal.
Our proposed unsupervised separation algorithm based on the instantaneous model combines non-negative matrix factorization with sparseness and temporal continuity objectives. The algorithm is based on minimizing the reconstruction error between the magnitude spectrogram of the observed signal and the model, while restricting the basis functions and their gains to non-negative values, and the gains to be sparse and continuous in time. In the minimization, we consider iterative algorithms which are initialized with random values and updated so that the value of the total objective cost function decreases at each iteration. Both multiplicative update rules and a steepest descent algorithm are proposed for this task. To improve the convergence of the projected steepest descent algorithm, we propose an augmented divergence to measure the reconstruction error. Simulation experiments on generated mixtures of pitched instruments and drums were run to monitor the behavior of the proposed method. The proposed method enables average signal-to-distortion ratio (SDR) of 7.3 dB, which is higher than the SDRs obtained with the other tested methods based on the instantaneous signal model.
To enable separating entities which correspond better to real-world sound objects, we propose two convolutive signal models which can be used to represent time-varying spectra and fundamental frequencies. We propose unsupervised learning algorithms extended from non-negative matrix factorization for estimating the model parameters from a mixture signal. The objective in them is to minimize the reconstruction error between the magnitude spectrogram of the observed signal and the model while restricting the parameters to non-negative values. Simulation experiments show that time-varying spectra enable better separation quality of drum sounds, and time-varying frequencies representing different fundamental frequency values of pitched instruments conveniently.
Another class of studied separation algorithms is based on the sinusoidal model, where the periodic components of a signal are represented as the sum of sinusoids with time-varying frequencies, amplitudes, and phases. The model provides a good representation for pitched instrument sounds, and the robustness of the parameter estimation is here increased by restricting the sinusoids of each source to harmonic frequency relationships.
Our proposed separation algorithm based on sinusoidal modeling minimizes the reconstruction error between the observed signal and the model. Since the rough shape of spectrum of natural sounds is continuous as a function of frequency, the amplitudes of overlapping overtones can be approximated by interpolating from adjacent overtones, for which we propose several methods. Simulation experiments on generated mixtures of pitched musical instruments show that the proposed methods allow average SDR above 15 dB for two simultaneous sources, and the quality decreases gradually as the number of sources increases.