Elias Pampalk - MA Toolbox - Documentation


The Matlab functions and scripts in the MA toolbox are:

- ma_g1c_ExtractFeatures    extract the features necessary to compute the G1C distance (using the MIREX'06 input format)
- ma_g1c_ComputeSimilarities    compute the G1C distance (using MIREX'06 output format)
- ma_sone    wav (PCM) to sone (specific loudness sensation)
- ma_mfcc   wav (PCM) to MFCCs (Mel Frequency Cepstrum Coefficients)
- ma_sh   sone to Spectrum Histogram
- ma_ph   sone to Periodicity Histogram
- ma_fp   sone to Fluctuation Pattern
- ma_fc   frame based representation (MFCCs or sone) to cluster model (Frame Clustering)
- ma_cms   cluster models to distance (Cluster Model Similarity)
- ma_kmeans   kmeans clustering (used by "ma_fc")
- ma_cm_visu   visualize a cluster model (as returned by "ma_fc")
- ma_simple_eval   script for a simple evaluation of similarity measures
- ma_simple_iom   script for a simple islands of music interface

Most of the functions run a simple test when called without an input argument. These built-in tests are an easy way to play around with the parameters.

Most functions have an input argument called "p". This is a Matlab structure variable and is used to store all parameter values conveniently. See the test methods on how to set them, in some cases default values are defined.

To use the functions it is necessary to load the WAV (PCM) files using Matlab's "wavread ".


%%         G1C, implementation as described in thesis
%%         for mirex'06, "music audio search" (was audio music similarity)
%%       FeatureExtraction(in_file,out_dir)
%%       FeatureExtraction('mypath/myInputFile.txt','mypath/myOutputDirectory')
%%       FeatureExtraction('mypath/myInputFile','mypath/myOutputDirectory/')
%% in_file: path fo file containing list of wav files to extract features from
%%          all files are 22050Hz mono wav (this function checks if this is true)
%% out_dir: directory to which this function can write output. (logfiles,
%%          and extracted features), files written to out_dir must be read
%%          by function "DistanceCumputation"
%% in_file format (text):
%%      path/to/audiofile/0000001.wav
%%      path/to/audiofile/0000002.wav
%%      path/to/audiofile/0000003.wav
%%      ...
%%      path/to/audiofile/9999999.wav


%%         G1C, implementation as described in thesis
%%         for mirex, "music audio search" (was audio music similarity)
%%       FeatureExtraction(in_file,out_dir)
%%       ComputeSimilarities(out_dir,'somepath/distance_matrix.txt')
%% in_dir:   directory to which function "FeatureExtraction" was writing to.
%%           a log file will be created in this directory
%% out_file: whole distance matrix in the following format
%%     1   path\file1.wav
%%     2   path\file2.wav
%%     ...
%%     N  path\fileN.wav
%%     Q\R   1       2      ...      N
%%     1     0.000  10.234  ...   123.32
%%     2    10.234   0.000  ...    23.45
%%     .     ...     ...    0.000   ...
%%     N     4.1234  6.345  ...     0.0
%%   delimiter: tabulator space, number format: float


Estimates the loudness sensation per frequency band using auditory models. The main components are (1) outer-ear model, (2) critical-band rate scale (Bark-scale), (3) spectral masking, and (4) Sone.
Compute sonogram for a pcm signal.
   [sone, Ntot, p] = ma_sone(wav,p)

  ma_sone;                %% uses test-sound and creates figures
  sone = ma_sone(wav,p);  %% normal usage
  [sone, Ntot, p] = ma_sone(wav,p);

  wav (vector) obtained from wavread (use mono input! 11kHz recommended)
  p (struct) parameters e.g. 
   p.fs          = 11025;     %% sampling frequency of given wav (unit: Hz)
 * p.do_visu     = 0;         %% create some figures
 * p.do_sone     = 1;         %% compute sone (otherwise dB)
 * p.do_spread   = 1;         %% apply spectral masking
 * p.outerear = 'terhardt';   %% outer ear model {'terhardt' | 'none'}
 * p.fft_size = 256;          %% window size (unit: samples) 
                              %% 256 are ~23ms @ 11kHz 
 * p.hopsize  = 128;          %% fft window hopsize (unit: samples)
 * p.bark_type   = 'table';   %% type of bark scale to use (either:
                              %% 'table' lookup (=default) with max 24 bands 
                              %% (for 44kHz), or 
                              %% vector 
[min_freq max_freq num_bands]
 * p.dB_max      = 96;        %% max dB of input wav 
                              %% (for 16 bit input 96dB is SPL)
all fields marked with * are optional (defaults are defined)

  sone (matrix) rows are bark bands, columns are time intervalls,
                values are sone (or dB) 
  Ntot (vector) total loudness (stevens method, see "signal sound 
                and sensation" p73, hartmann)


Figure 1. The main characteristics of the sone model. The upper subplot shows the width of the critical-bands and the outer-ear model. The lower left shows the relationship between the Bark-scale and Hz. The lower right shows the spectral masking function. Some things to notice: The Bark-scale is linear up to about 500Hz. The spectral masking is not symmetric. Higher tones are masked stronger by lower ones than vice versa. The outer-ear is most responsive to frequencies around 3-4kHz. For details on the models see the source code (e.g. "type ma_sone").

Figure 2. The characteristics of the sone model can be demonstrated with a simple sine-tone. The tone is generated at different frequencies (center frequencies of the critical-bands) with fixed amplitude (3 times, each time the fixed amplitude is increased). Note the difference between the linear frequency scale (used in the outer-ear and FFT representations) and the Bark-scale (the lower 3 subplots). The sinusoids in each sweep from low to high have the same amplitudes, they are not perceived equally loud (uncomment the "sound(wav,p.fs)" line in the test method code to hear the sound, this pitch dependent loudness sensation is reflected in the model (in contrast to the MFCC model).


Mel Frequency Cepstrum Coefficients are very popular in speech processing. They are based on the Mel-scale. The Mel scale is defined such that a tone of 1000Hz equals 1000mel. If tone A is perceived twice as high as tone B then its mel value is twice as high. (The Bark-scale and in particular the width of the critical-bands are defined by masking tresholds, i.e., how close can two tones be too each other on the frequency scale so that they are both perceived and one doesn't mask the other).
In addition to using the Mel-scale, dB values are computed and the spectrum is compressed (and smoothed) using a Discrete Cosine Transform (DCT). This compressed representation is well suited for computational heavy algorithms. (Note that a DCT compression could also be applied to the sonogram.)

Compute Mel Frequency Cepstrum Coefficients.
  [mfcc,DCT] = ma_mfcc(wav, p)

  ma_mfcc;                         %% test mode
  mfcc = ma_mfcc(wav, p);          %% normal usage
  [mfcc, DCT] = ma_mfcc(wav, p);   %% DCT matrix is needed to decode the 
                                   %% coefficients (e.g. for visualization)

based on function "mfcc" in Auditory Toolbox by Malcolm Slaney 
(http://www.slaney.org/malcom) see Auditory Toolbox for details

  wav (vector) obtained from wavread or ma_mp3read 
               (use mono input! 11kHz recommended)
  p (struct) parameters e.g. 
     p.fs              = 11025; %% sampling frequency of given wav (unit: Hz)
   * p.visu            = 0;     %% create some figures
   * p.fft_size        = 256;   %% (unit: samples) 256 are about 23ms @ 11kHz 
   * p.hopsize         = 128;   %% (unit: samples) aka overlap
   * p.num_ceps_coeffs = 20;   
   * p.use_first_coeff = 1;     %% aka 0th coefficient (contains information 
                                %% on average loudness)

   * p.mel_filt_bank   = 'auditory-toolbox'; 
                                %% mel filter bank choice 
                                %% {'auditory-toolbox' | [f_min f_max num_bands]}
                                %% e.g. [20 16000 40], (default)
                                %% note: auditory-toobox is optimized for 
                                %%       speech (133Hz...6.9kHz)

   * p.dB_max      = 96;        %% max dB of input wav 
                                %% (for 16 bit input 96dB is SPL)
   all fields marked with * are optional (defaults are defined)
  mfcc (matrix) size coefficients x frames
  DCT (matrix) 

Figure 3. The basic model of the MFCCs. The upper subplot shows the filterbank of triangular filters. The lower subplot shows the Discrete Cosine Transform matrix.

Figure 4. Using the same input sound as in Figure 2 the characteristics of the MFCC can be studied. Some things to notice: The mel-scale differs from the bark-scale (the line is not straight). Lower frequencies have higher values. Artefacts from the DCT transform are clearly visible.


The Spectrum Histogram is a very simple approach to summarize the variations in the dynamics. It counts for each frequency band how many times certain loudness levels are exceeded.

compute Spectrum Histogram (SH)
  very simple approach to summarize the sonogram
  see Pampalk et al. ISMIR'03 for details

sh = ma_sh(sone,hist_res)

  [wav, p.fs] = wavread(myfile);
  sone = ma_sone(wav,p);
  sh = ma_sh(sone,hist_res)

  sone (matrix) as returned from "ma_sone" (size: critical-bands x time)
  hist_res = 25; %% loudness resolution in histogram (used 50 for ismir03)

  sh (matrix) size: critical-bands x hist_res

note: to compute distances between two SHs use Euclidean distance, e.g.
      distance = norm(sh1(:)-sh2(:));

Figure 5. Same test sound as in the examples above. The spectrum histogram counts how often a loudness level was exceeded in each critical-band.


The Periodicity Histogram is based on a comb-filter and describes periodically reoccurring sharp attacks in the audio. Although the PH is not a useful similarity measure it serves as a good starting point to compute various descriptors.

compute Periodicity Histogram (PH)
  see Pampalk et al. ISMIR'03 for details

[ph, ph_low, p] = ma_ph(sone,p)

  sone (matrix) as returned from "ma_sone" (size: critical-bands x time)
  parameter structure p
      p.hist_res = 30; energy resolution in histogram (used 50 for ismir03)
      p.sequence.length = 1024; %% ~12 sec @ 11025, size of sequence (window) to 
                                %% analyze periodicities
      p.sequence.hopsize = 512; %% smaller values make sense, 
      p.sequence.windowfunction = 'hann';
      p.fft_hopsize = 256; %% (~12ms @ 11kHz) hopsize used to create 
                           %% sonogram (see ma_sone)
      p.acc = 5;       %% ismir version orig: 5 (recommended for 12ms hopsize in 
                       %% sone), combfilter accuracy
      p.minbpm = 40;   %% ismir version orig: 40, minimum bpm to look for
      p.maxbpm = 240;  %% ismir version orig: 240
      p.hist_res = 30; %% resolution of histogram
      p.maxval = 0.4;  %% highest energy to expect in histogram
      p.minval = 0;    %% 1/p.hist_res
      p.visu = 0;      %% do some visualizations
      p.preferred_tempo = 'moleants'; %% {'moleants' | 'none'}

  ph (matrix) size: critical-bands x hist_res
  ph_low (matrix), same as ph but only for the lowest frequency bands

note: to compute distances between two PHs use Euclidean distance, e.g.
      distance = norm(ph1(:)-ph2(:));
      however, PHs best serve as starting point for higher level descriptors e.g. 
      using sum(ph(:)) (= total energy contained in periodic beats)

Figure 6. The different steps in computing the periodicities for a 12sec time frame. The input is a very simple test sound generated with "ma_test_create_wav".

Figure 7. If the input is long enough, then the results from each 12sec frame are summarized in a histogram, counting for each periodicity (bpm) how often specific energy levels were exceeded.


The Fluctuation Pattern is based on a model of the perceived fluctuation for amplitude modulated tones.

compute Fluctuation Pattern (FP)
  see Pampalk et al. ACM-MM'02 for details
  december 2001 version (islands of music, master's thesis)

fp = ma_fp(sone,p)

  sone (matrix) as returned from "ma_sone" (size: critical-bands x time)
  parameter structure p
      p.sequence.length = 512;   %% windows size (ca 6sec @ 11kHz with 
                                 %% 128 sone hopsize)

      p.sequence.hopsize = 256;  
      p.sequence.windowfunction = 'boxcar';
      p.fs = 11025;              %% sampling frequency of wav file
      p.fft_hopsize = 128;       %% (~12ms @ 11kHz) hopsize used to create sone 
      p.visu = 0;                %% do some visualizations
  fp (matrix) size: critical-bands x 60 (modulation frequencies)

note: to compute distances between two FPs use Euclidean distance, e.g.
      distance = norm(fp1(:)-fp2(:));

Figure 8. The different steps in computing the fluctuation pattern for a 6sec sequence.


Frame Clustering takes a frame based representation as input (e.g. MFCCs) and creates a cluster model. The cluster models can be compared to each other with "ma_cms" (Cluster Model Similarity).
A variation using the Earth Mover's Distance was originally published by Beth Logan and Ariel Salomon (2001). A second variation using monte carlo sampling was publised by Jean-Julien Aucouturier and Francois Pachet (2002).

To run "ma_fc" with the "gmm" (Gaussian Mixture Model) option, the directory of the Netlab toolbox must be added to the Matlab path (e.g. "addpath ../netlab").

compute Frame-based Clustering (FC)
  see Beth Logan and Ariel Salomon, ICME'01
  and Jean-Julien Aucouturier and Francois Pachet, ISMIR'02
  for details

frames = ma_mfcc(wav,p); or frames = ma_sone(wav,p);
cm = ma_fc(frames,p);
distance = ma_cms(cm1,cm2,p);

specific parameters for k-means and gmm need to be set in the code.

  frames (matrix) as returned from "ma_sone" or "ma_mfcc" 
                  (size: frequency-bands x time)
  parameter structure p
      p.cluster_type = 'kmeans'; %% {'kmeans' | 'gmm'} 
                                 %% 'gmm' requires NETLAB toolbox
      p.num_clusters = 3;        %% number of clusters
      p.covar_type   = 'diag';   %% {'spherical' | 'diag' | 'full'}

  cm (structure) cluster model (netlab gmm model structure)
         type: 'gmm'
          nin: 19
     ncentres: 25
   covar_type: 'full'
       priors: [1x25 double]
      centres: [25x19 double]
       covars: [19x19x25 double]
         nwts: 9525

note: to compute distances between two CMs use "ma_cms"
      to visualize a CM use "ma_cm_visu"

For figures see "ma_cms".


Cluster Model Similarity takes the models computed by "ma_fc" as input and outputs a distance. There are basically two ways to compute the similarity: using the EMD-KL (Earth Movers Distance and Kullback Leibler Divergence), or by Monte Carlo sampling.

EMD-KL requires functions implemented by Yossi Rubner. Monte Carlo sampling requires the NETLAB toolbox.

 compute Cluster Model Similarity (CMS)
   see Beth Logan and Ariel Salomon, ICME 2001
       (using Kulback Leibler Divergence and Earth Mover's Distance)
   and Jean-Julien Aucouturier and Francois Pachet, ISMIR 2002 + 
         IEEE Workshop on Model Based Processing and Coding of Audio 2002
       (using Gaussian Mixture Models and monte carlo sampling)
   for details

   frames = ma_mfcc(wav,p);  or frames = ma_sone(wav,p);
   cm = ma_fc(frames,p);
   distance = ma_cms(cm1,cm2,p);

 note: 'KL_EMD' needs functions supplied by Yossi Rubner
       'monte_carlo' sampling needs netlab toolbox

  if called without arguments, 3 test sounds are created using 
  "ma_test_create_wav" and the distances between these are computed

   cluster models (structure) as returned from "ma_fc" 
   parameter structure p
       p.cm_similarity = 'monte_carlo';  %% {'KL_EMD' | 'monte_carlo'}
       p.mc_samples = 2000;  %% only needed for 'monte_carlo'
   d (scalar) distance

Figure 9. The three sounds used in this test.

Figure 10. The cluster models can be visualized using "ma_cm_visu". In this case each sound was represented by 3 Gaussian clusters. The shapes and variances of each clusters (cummulated probability) are shown.

Figure 11. The distance matrix of the 3 sounds.


 simple batch k-means

 [C,Qe,N,W,Q] = ma_kmeans(X, iter, c, C)

   X    .. Data
   iter .. Iterations (if set to 0, only initialize)
   c    .. number of clusters
   C    .. initial Codebooks (default = randperm select k)
   C    .. centers
   Qe   .. final quantisation error
   N    .. number of items mapped to each cluster
   W    .. winner (best matching cluster for each item)
   Q    .. quantisation error of each item (Qe = mean(Q))
 used by ma_fc (frame clustering)


 my_pdf = ma_cm_visu(cm,p,s)

 visualize Cluster Model (CM)
   see "ma_fc" for details
   cm (structure) as returned from "ma_fc"
   parameter structure p
       p.cluster_type = 'kmeans';  %% {'kmeans' | 'gmm'}
       p.DCT (matrix) as returned from ma_mfcc, 
                      if not given assume no DCT compression used
   s (handle to subplot) optional, if given no new figure is created

   some visualization parameters can only be set in the code

   my_pdf (matrix) dimensions: dB vs frequency band (same as visualized)

 note: to compute distances between two CMs use "ma_cms"

See "ma_cms" for figures.


This is a simple script to demonstrate how to combine the different functions if the goal is to evaluate different parameter settings. Input is a bunch of wav files sorted into directories such that similar pieces are in the same directory. At the end the R-precision is computed.


This is a simple script to demonstrate how the SOM toolbox (GNU-GPL) and the SDH toolbox (GNU-GPL) can be used to create "islands of music". (Both toolboxes need to be in the Matlab path.)

Last update: 27 May 2007