Wednesday, August 3, 2011

Non-Negative Matrix Factorization with Eigen

My Masters thesis focuses on source separation of drum sounds from a mono (1-channel) mixdown. The basic idea is to split a drum loop into separate loops for the kick, snare, hi-hat, etc. A number of source-separation techniques exist, but a particularly appealing strategy utilizes a mathematical process known as Non-Negative Matrix Factorization, also known as NMF.

In much the same way that we can factor a number into it's constituent factors (6 = 2 x 3), we can factorize a matrix into two constituent matrices (V = GW, where V, W, and H are all matrices). In this representation, V represents the spectrogram of the input audio, W represents a set of N stationary spectra (spectra is pretty much the FFT), and H represents a set of N volume envelopes (Where N is the number of sources/drums/instrument types present in the recording). Each of the spectra in W is related to a particular volume envelope in H. The combination of a spectra and its corresponding volume envelope represents a single source/drum/instrument. The image below shows the result of a factorization where N = 3. The spectrogram in the upper right is V, the Basis Matrix is W, and the Gain Matrix is H. The first row of the Gain Matrix is the volume envelope for the bass drum, the second row is the volume envelope for a snare drum, and the third row is the volume envelope for a hi-hat (note the wide-band nature of it's spectra in the corresponding Basis Matrix column).



The 'Non-Negative' bit comes from the constraint that no element in V, W or H may be negative. In other words, we can only add up bits of sound, we can not subtract bits of sound in our factorization. The nice bit about this constraint is that it encourages the factorization to be parts-based.

If you're more interested in the mathematics behind NMF, Tuomas Virtanen is a bit of an NMF expert and he has a number of quite nice papers on the subject as it applies to Audio.

There are a number of nice MATLAB implementations of NMF available, chiefly - it seems - because MATLAB comes with all of the necessary data structures and algorithms to easily deal with matrices. C++ is well suited for matrix manipulations, but not out of the box. I'm not a terribly experienced C programmer, but after a bit of research, it seems like the best available options for matrix mathematics are BLAS, LAPACK, Armadillo, and Eigen. I have decided to go with Eigen, mainly because the BLAS/LAPACK libraries have a quite scary interface/syntax, and because of this interview with the Eigen developers.

The Eigen syntax is quite nice if you come from an object-oriented background, and possibly most appealing is the fact that you do not need to build or compile any source when using Eigen; you simply copy some header files into your project, and you are ready to create and manipulate matrices.

The following code implements a version of Virtanen's algorithm in the Eigen C++ library (an appropriate number of iterations/updates is generally something like 300. You can also re-write this sort of algorithm to terminate when the reconstruction error between GW and V is low):




void factorizeMatrix(MatrixXf V_, MatrixXf W_, MatrixXf H_, int num_updates_) {
 
int xSize = V_.cols();
int ySize = V_.rows();
// Create our Gain matrix with randomized initial values
H_ = MatrixXf(num_sources_, xSize);
H_.setRandom();
H_ = H_.array().abs();
// Create our Spectra matrix with randomized initial values
W_ = MatrixXf(ySize, num_sources_);
W_.setRandom();
W_ = W_.array().abs();
// Create an identity matrix
MatrixXf ones = MatrixXf::Constant(xSize,ySize,1.0);
// Run the update rules for numUpdate iterations
for(int i = 0; i < num_updates_; i++) {
// update spectra
MatrixXf WH = (W_ * H_).array() + 0.00001;
MatrixXf numerator = (V_.array() / WH.array());
numerator = numerator * H_.transpose();
MatrixXf denominator = (H_ * ones).array() + 0.00001;
W_ = W_.array() * numerator.array();
W_ = W_.array() / denominator.transpose().array();

// update gains
WH = (W_ * H_).array() + 0.00001;
numerator = (V_.array() / WH.array());
numerator = W_.transpose() * numerator;
denominator = (ones * W_).array() + 0.00001;
H_ = H_.array() * numerator.array();
H_ = H_.array() / denominator.transpose().array();
}

// At this point, you have the relevant spectra and volume envelopes in W and H!
}

Magnitude and Phase of a Complex Number

As a follow-up to the post about the STFT, here are two functions that will come in handy if you are looking to get magnitude or phase information from the complex-valued FFT output. These are similar to the abs() and angle() functions in MATLAB. Note that MATLAB uses the atan2() function as opposed to simply the atan() routine.

Remember that a complex number specifies a vector in the complex plane, so the overloaded nature of abs() in MATLAB can be slightly confusing. What you really want is the magnitude, or length of the vector, which you can get with Pythagoras' theorem.




Check out this link for a nice primer on complex numbers.


// Computes the magnitude of a complex number/vector in the complex plane (similar to abs() in MATLAB)
float computeMag(fftw_complex num) {
double real = num[0];
double imag = num[1];
return sqrt((real * real) + (imag * imag));
}

// Computes the phase of a complex number/vector in the complex plane (similar to angle() in MATLAB)
float computePhase(fftw_complex num) {
double real = num[0];
double imag = num[1];
//SOHCAHTOA
// Tan(@) = imag/real
return atan2(imag, real);
}

Short Time Fourier Transform with FFTW3

For an OpenFrameworks project I was recently working on, I needed a spectrogram, or STFT representation of some audio data. The spectrogram gives you an idea of how the frequency content of a signal changes over time. While the formula may look a little hairy, you can really just think of the STFT as  a process that chops up your signal into chunks, and applies a DFT or FFT to each chunk. When the outputs of each FFT are lined up side by side, you have yourself a spectrogram.

The basic process is as follows:

1) Slice out a chunk of the signal
2) 'window' the chunk*
3) Compute the DFT/FFT of the chunk
4) Store the DFT/FFT output somewhere
5) Slice the next chunk out of the signal**
6) Keep repeating until you hit the end of the signal


* Because of the eccentricities of the DFT and FFT (spectral leakage, etc), the formulation gets a little confused with notions of windowing and other miscellany. The question you might ask, is why we are 'windowing' the chunk, and what does 'windowing' mean? This link will give you the proper answers, I'm sure, but I'll throw out an intuitive answer.

First off, 'windowing' simply means multiplying one signal by another signal. This 'other' signal is known as a window. If our window was an infinitely long series of zeros, the result of windowing would be to completely zero out our signal. If our window was an infinitely long series of zeros save for a single value of 1 at time 0 (also known as an 'impulse'), the result of windowing would be to zero out every element in our signal except for the sample at time 0.

This means that if we slice out a 'chunk' of samples from our signal, we have in strict terms just windowed our signal. For example, if we slice out the first 1024 samples from our signal, we have just windowed our signal with a 'rectangular window'. The first 1024 samples in our rectangular window have a value of 1, and the rest of the window's samples have a value of 0.

Generally, a rectangular window shape is undesirable, as it introduces unwanted frequency content. If you're feeling interested in alternative window shapes and their effect on DFT/FFT calculations, look here. Also, remember the convolution theorem, which states that multiplication in the time-domain is equivalent to convolution in the frequency domain. This should help you intuitively understand why certain window shapes are more or less desirable. For our purposes, we are using a Hamming window, as it is the default window used in the MATLAB STFT/spectrogram routine.  Here is C++ code to create a Hamming window in a float buffer, equivalent I believe, to the MATLAB function hamming():


// Create a hamming window of windowLength samples in buffer
void hamming(int windowLength, float *buffer) {
 for(int i = 0; i < windowLength; i++) {
   buffer[i] = 0.54 - (0.46 * cos( 2 * PI * (i / ((windowLength - 1) * 1.0))));
 }
}

** Each 'chunk' that we slice out of the signal will have the same length (It is wise to choose a length that is a power of two, as FFT routines are optimized to deal with signals whose length is a power of two. You also need to ensure that the length is neither too long, nor too short. For audio applications, it seems like 2048 is usually a good choice. If your chunks are too short, you cannot properly analyze the frequency content of low-frequency signals, and if your chunks are too long, you get poor time localization of frequency analysis. This whole issue actually relates to Heisenberg's Uncertainty Principle!)

Let's assume that we decide to divide up our signal into chunks of 2048 samples. The first chunk will be composed of the first 2048 samples of the signal, or the 0th to 2047th samples.

We now have to decide what samples make up the second 'chunk'. One option is to have no overlap in our chunks, which would amount to taking the 2048th to 4095th samples from our audio file. Another option would be to have lots of overlap in our chunks: we could use the 1st to 2048th samples for the second chunk. In the former case, we say that the STFT has a 'hop size' of 2048 samples (no overlap). In the latter case, we say that the STFT has a 'hop size' of 1 sample (lots of overlap). The 'hop size' can be thought of as the distance between the starting point/index of of successive chunks.

Your choice of hop size depends largely upon the specific concerns of your analysis. If you are looking for a rough, general picture of how the frequency content of the signal evolves over time, a large (little overlap) hop size is probably fine. If you need a very faithful representation, or are interested in re synthesizing the STFT data as audio in the time-domain, you should choose a small hop size. You also need to think about your storage and performance requirements. A small hop size means more DFT/FFT calculations!

Ok, I think that's enough to get the general picture of what's going on. Here is some rough, probably buggy C++ code that implements an STFT routine using the FFTW3 library. It's probably not optimal performance-wise, but might give you a good starting point for writing your own:

void STFT(std::vector <float> *signal, int signalLength, int windowSize, int hopSize) {

    fftw_complex    *data, *fft_result, *ifft_result;
    fftw_plan       plan_forward, plan_backward;
    int             i;

    data        = ( fftw_complex* ) fftw_malloc( sizeof( fftw_complex ) * windowSize );
    fft_result  = ( fftw_complex* ) fftw_malloc( sizeof( fftw_complex ) * windowSize );
    ifft_result = ( fftw_complex* ) fftw_malloc( sizeof( fftw_complex ) * windowSize );
    plan_forward  = fftw_plan_dft_1d( windowSize, data, fft_result, FFTW_FORWARD, FFTW_ESTIMATE );
    // Create a hamming window of appropriate length
    float window[windowSize];
    hamming(windowSize, window);
    int chunkPosition = 0;
    int readIndex;
    // Should we stop reading in chunks? 
    int bStop = 0;
int numChunks = 0;
// Process each chunk of the signal
while(chunkPosition < signalLength && !bStop) {
// Copy the chunk into our buffer
for(i = 0; i < windowSize; i++) {
        readIndex = chunkPosition + i;
        if(readIndex < signalLength) {
              // Note the windowing!
data[i][0] = (*signal)[readIndex] * window[i]; 
data[i][1] = 0.0
        } else {
 
             // we have read beyond the signal, so zero-pad it!
data[i][0] = 0.0
data[i][1] = 0.0
bStop = 1;
}
}
// Perform the FFT on our chunk
fftw_execute( plan_forward );
/* Uncomment to see the raw-data output from the FFT calculation
std::cout << "Column: " << chunkPosition << std::endl;
for(i = 0 ; i < windowSize ; i++ ) {
fprintf( stdout, "fft_result[%d] = { %2.2f, %2.2f }\n",
i, fft_result[i][0], fft_result[i][1] );
}
*/
// Copy the first (windowSize/2 + 1) data points into your spectrogram.
// We do this because the FFT output is mirrored about the nyquist 
// frequency, so the second half of the data is redundant. This is how
  // Matlab's spectrogram routine works.

for (i = 0; i < windowSize/2 + 1; i++) {
yourDataStructureGoesHere = fft_result[i][0];
yourDataStructureGoesHere = fft_result[i][1];
}

chunkPosition += hopSize;
numChunks++;
// Excuse the formatting, the while ends here.
fftw_destroy_plan( plan_forward );
fftw_free( data );
fftw_free( fft_result );
fftw_free( ifft_result );
}