This paper addresses the problem of estimating the "information" of each sound source separately from the acoustic signal of a compound sound. Here "information" is used in a wide sense to include not only the waveform of each separate source signal but also its power spectrum, fundamental frequency (F0), spectral envelope and other features. Such a technique could be useful for a wide range of applications, such as robot auditory sensors, robust speech recognition, automatic transcription of music, waveform encoding for audio CODEC (compression-decompression) systems, new equalizer systems enabling bass and treble controls for separate sources, and indexing of music for music retrieval systems.
Generally speaking, if the compound signal were separated, it would be a simple matter to obtain an F0 estimate from each stream using a single-voice F0 estimation method; conversely, if the F0s were known in advance, they would be very useful information for separation algorithms. Source separation and F0 estimation are therefore essentially a "chicken-and-egg problem", and it is perhaps better to formulate these two tasks as a joint optimization problem. In Chapter 2, we introduce a method called "Harmonic Clustering", which searches for the optimal spectral masking function and the optimal F0 estimate for each source by performing the source separation step and the F0 estimation step iteratively.
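The alternation between the separation step and the F0 estimation step can be illustrated with a minimal sketch. It is not the formulation used in Chapter 2: it assumes, purely for illustration, that each source is modelled in the log-frequency domain by equally weighted Gaussians of fixed width `sigma` centred at `mu_k + log n`, and the function name, arguments and defaults are all hypothetical.

```python
import numpy as np

def harmonic_clustering(log_freq, power, mu_init, n_harmonics=8, sigma=0.03, n_iter=50):
    """Alternate soft masking and log-F0 updates on one power-spectrum frame.

    Illustrative assumptions: Gaussian harmonic clusters in log-frequency,
    fixed width `sigma`, equal weights for all sources and harmonics.
    """
    log_freq = np.asarray(log_freq, dtype=float)
    power = np.asarray(power, dtype=float)
    mu = np.array(mu_init, dtype=float)               # log-F0 of each source
    offsets = np.log(np.arange(1, n_harmonics + 1))   # log n for n = 1..N
    for _ in range(n_iter):
        # Cluster centres: centers[k, n] = mu_k + log n
        centers = mu[:, None] + offsets[None, :]
        diff = log_freq[None, None, :] - centers[:, :, None]
        # Small floor avoids 0/0 in bins far away from every cluster
        affinity = np.exp(-0.5 * (diff / sigma) ** 2) + 1e-300
        # Separation step: soft masks that sum to one over (source, harmonic)
        masks = affinity / affinity.sum(axis=(0, 1), keepdims=True)
        weighted = masks * power[None, None, :]
        # F0 estimation step: power-weighted mean of (x - log n) per source
        shifted = log_freq[None, None, :] - offsets[None, :, None]
        mu = (weighted * shifted).sum(axis=(1, 2)) / weighted.sum(axis=(1, 2))
    # Per-source mask: sum the masks of all harmonics of that source
    return mu, masks.sum(axis=1)
```

Calling the function with a log-frequency grid, the observed power on that grid and rough initial log-F0 values returns the refined log-F0 of each source together with one soft spectral mask per source. Like any alternating scheme of this kind, the result depends on the initial values.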
In Chapter 3, we establish a generalized principle of Harmonic Clustering by showing that it can be understood as the minimization of the distortion between the power spectrum of the mixed sound and a mixture of spectral cluster models. From this viewpoint, it becomes clear that the problem amounts to a maximum likelihood problem with the continuous Poisson distribution as the likelihood function. Building on this probabilistic view, a Bayesian reformulation enables us not only to impose empirical constraints on the parameters by introducing prior probabilities, as is usually necessary for any underdetermined problem, but also to derive a model selection criterion that leads to an estimate of the number of sources. Through experiments, we confirmed the effectiveness of the two techniques introduced in this chapter: multiple F0 estimation and source number estimation.
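The link between distortion minimization and Poisson maximum likelihood can be written out briefly. The sketch below assumes that the distortion is the I-divergence; the symbols $Y(x)$ (observed power spectrum), $X(x;\theta)$ (mixture of spectral cluster models) and $\theta$ (model parameters) are notation introduced here for illustration only.

```latex
% I-divergence between the observed spectrum Y and the model X (assumed distortion)
D(Y \,\|\, X) = \int \Big( Y(x) \log \frac{Y(x)}{X(x;\theta)} - Y(x) + X(x;\theta) \Big)\, dx

% Log-likelihood when each Y(x) follows a continuous Poisson distribution with mean X(x;\theta)
\log p(Y \mid \theta) = \int \Big( Y(x) \log X(x;\theta) - X(x;\theta) - \log \Gamma\big(Y(x)+1\big) \Big)\, dx

% The two differ only by a term that does not depend on \theta
\log p(Y \mid \theta) = -D(Y \,\|\, X) + \int \Big( Y(x) \log Y(x) - Y(x) - \log \Gamma\big(Y(x)+1\big) \Big)\, dx
```

Under these assumptions, minimizing $D(Y \,\|\, X)$ over $\theta$ and maximizing $\log p(Y \mid \theta)$ give the same estimate, and a prior $p(\theta)$ can then be added to the log-likelihood to obtain a maximum a posteriori objective.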
Human listeners are able to concentrate on a target sound without difficulty even when many speakers are talking at the same time or many instruments are being played together. Recent efforts have been directed toward implementing this human ability, called "auditory stream segregation", on a computer. Such an approach is referred to as "Computational Auditory Scene Analysis (CASA)". In Chapter 4, we aim at developing a computational algorithm that decomposes the time-frequency components of the signal of interest into distinct clusters such that each of them is associated with a single auditory stream. To do so, we directly design a spectro-temporal model whose shape can vary freely within the constraints known as "Bregman's grouping cues", and then try to fit a mixture of this model to the observed spectrogram as well as possible. We call this approach "Harmonic-Temporal Clustering". While most conventional methods perform the extraction of instantaneous features at each discrete time point and the estimation of the whole tracks of these features as separate steps, the method described in this chapter performs these procedures simultaneously. We confirmed the advantage of the proposed method over conventional methods through experimental evaluations.
Although intensive efforts have been devoted to both F0 estimation and spectral envelope estimation in the speech processing area, the two problems seem to have been tackled independently. If the F0 were known in advance, the spectral envelope could be estimated very reliably. Conversely, if the spectral envelope were known in advance, subharmonic errors could easily be corrected. Because F0 estimation and spectral envelope estimation have such a chicken-and-egg relationship, they should be done jointly rather than independently in successive procedures. From this standpoint, in Chapter 5 we propose a new speech analyzer that jointly estimates F0 and spectral envelope using a parametric source-filter model of speech. Through experiments, we found that joint estimation brings a significant advantage in both F0 estimation and spectral envelope estimation.
The approaches of the preceding chapters are based on the approximate assumption of additivity of the power spectra (neglecting the terms corresponding to interference between frequency components). However, when two voices with close F0s are mixed, it usually becomes difficult to infer the F0s as long as we look only at the power spectrum. In this case, not only the harmonic structure but also the phase difference of each signal becomes an important cue for separation. Moreover, with a view to future source separation methods designed for multi-channel signals from multiple sensors, analysis methods in the complex spectrum domain that include phase estimation are indispensable. Given the effectiveness and advantages of the approach described in the preceding chapters, we were motivated to extend it to a complex-spectrum-domain approach without losing its essential characteristics. The main topic of Chapter 6 is the development of a nonlinear optimization algorithm for obtaining the maximum likelihood parameters of the superimposed periodic signal model. The difficulty of single-tone frequency estimation and of fundamental frequency estimation, which are at the core of the parameter estimation problem for the sinusoidal signal model, comes essentially from the nonlinearity of the model in the frequency parameter. Focusing on this fact, we introduce a new iterative estimation algorithm based on a principle called the "auxiliary function method". This idea was inspired by the principle of the EM algorithm. Through simulations, we confirmed the advantage of the proposed method over the existing gradient descent-based method in both its ability to avoid local solutions and its convergence speed. We also confirmed the basic performance of our method through single-channel speech separation experiments on real speech signals.
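The general principle behind an auxiliary function method can be stated in a few lines. The sketch below is generic and uses notation introduced here for illustration: $J(\theta)$ is an objective to be minimized (for example a negative log-likelihood), $\xi$ is an auxiliary variable, and $J^{+}(\theta, \xi)$ is an auxiliary function. The specific auxiliary function designed for the sinusoidal model is not reproduced here.

```latex
% Defining property of an auxiliary function J^{+} for the objective J
J(\theta) = \min_{\xi} J^{+}(\theta, \xi)

% Alternating updates
\xi^{(t+1)} = \arg\min_{\xi} J^{+}\big(\theta^{(t)}, \xi\big), \qquad
\theta^{(t+1)} = \arg\min_{\theta} J^{+}\big(\theta, \xi^{(t+1)}\big)

% Each iteration cannot increase the objective
J\big(\theta^{(t+1)}\big) \le J^{+}\big(\theta^{(t+1)}, \xi^{(t+1)}\big)
  \le J^{+}\big(\theta^{(t)}, \xi^{(t+1)}\big) = J\big(\theta^{(t)}\big)
```

The scheme is useful when $J^{+}$ is easy to minimize in $\theta$, for instance in closed form, even though $J$ itself is strongly nonlinear in $\theta$. The EM algorithm follows the same pattern, with the E-step playing the role of the $\xi$ update.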