Wednesday, August 3, 2011

Short Time Fourier Transform with FFTW3

For an OpenFrameworks project I was recently working on, I needed a spectrogram, or STFT representation of some audio data. The spectrogram gives you an idea of how the frequency content of a signal changes over time. While the formula may look a little hairy, you can really just think of the STFT as  a process that chops up your signal into chunks, and applies a DFT or FFT to each chunk. When the outputs of each FFT are lined up side by side, you have yourself a spectrogram.

The basic process is as follows:

1) Slice out a chunk of the signal
2) 'window' the chunk*
3) Compute the DFT/FFT of the chunk
4) Store the DFT/FFT output somewhere
5) Slice the next chunk out of the signal**
6) Keep repeating until you hit the end of the signal

* Because of the eccentricities of the DFT and FFT (spectral leakage, etc), the formulation gets a little confused with notions of windowing and other miscellany. The question you might ask, is why we are 'windowing' the chunk, and what does 'windowing' mean? This link will give you the proper answers, I'm sure, but I'll throw out an intuitive answer.

First off, 'windowing' simply means multiplying one signal by another signal. This 'other' signal is known as a window. If our window was an infinitely long series of zeros, the result of windowing would be to completely zero out our signal. If our window was an infinitely long series of zeros save for a single value of 1 at time 0 (also known as an 'impulse'), the result of windowing would be to zero out every element in our signal except for the sample at time 0.

This means that if we slice out a 'chunk' of samples from our signal, we have in strict terms just windowed our signal. For example, if we slice out the first 1024 samples from our signal, we have just windowed our signal with a 'rectangular window'. The first 1024 samples in our rectangular window have a value of 1, and the rest of the window's samples have a value of 0.

Generally, a rectangular window shape is undesirable, as it introduces unwanted frequency content. If you're feeling interested in alternative window shapes and their effect on DFT/FFT calculations, look here. Also, remember the convolution theorem, which states that multiplication in the time-domain is equivalent to convolution in the frequency domain. This should help you intuitively understand why certain window shapes are more or less desirable. For our purposes, we are using a Hamming window, as it is the default window used in the MATLAB STFT/spectrogram routine.  Here is C++ code to create a Hamming window in a float buffer, equivalent I believe, to the MATLAB function hamming():

// Create a hamming window of windowLength samples in buffer
void hamming(int windowLength, float *buffer) {
 for(int i = 0; i < windowLength; i++) {
   buffer[i] = 0.54 - (0.46 * cos( 2 * PI * (i / ((windowLength - 1) * 1.0))));

** Each 'chunk' that we slice out of the signal will have the same length (It is wise to choose a length that is a power of two, as FFT routines are optimized to deal with signals whose length is a power of two. You also need to ensure that the length is neither too long, nor too short. For audio applications, it seems like 2048 is usually a good choice. If your chunks are too short, you cannot properly analyze the frequency content of low-frequency signals, and if your chunks are too long, you get poor time localization of frequency analysis. This whole issue actually relates to Heisenberg's Uncertainty Principle!)

Let's assume that we decide to divide up our signal into chunks of 2048 samples. The first chunk will be composed of the first 2048 samples of the signal, or the 0th to 2047th samples.

We now have to decide what samples make up the second 'chunk'. One option is to have no overlap in our chunks, which would amount to taking the 2048th to 4095th samples from our audio file. Another option would be to have lots of overlap in our chunks: we could use the 1st to 2048th samples for the second chunk. In the former case, we say that the STFT has a 'hop size' of 2048 samples (no overlap). In the latter case, we say that the STFT has a 'hop size' of 1 sample (lots of overlap). The 'hop size' can be thought of as the distance between the starting point/index of of successive chunks.

Your choice of hop size depends largely upon the specific concerns of your analysis. If you are looking for a rough, general picture of how the frequency content of the signal evolves over time, a large (little overlap) hop size is probably fine. If you need a very faithful representation, or are interested in re synthesizing the STFT data as audio in the time-domain, you should choose a small hop size. You also need to think about your storage and performance requirements. A small hop size means more DFT/FFT calculations!

Ok, I think that's enough to get the general picture of what's going on. Here is some rough, probably buggy C++ code that implements an STFT routine using the FFTW3 library. It's probably not optimal performance-wise, but might give you a good starting point for writing your own:

void STFT(std::vector <float> *signal, int signalLength, int windowSize, int hopSize) {

    fftw_complex    *data, *fft_result, *ifft_result;
    fftw_plan       plan_forward, plan_backward;
    int             i;

    data        = ( fftw_complex* ) fftw_malloc( sizeof( fftw_complex ) * windowSize );
    fft_result  = ( fftw_complex* ) fftw_malloc( sizeof( fftw_complex ) * windowSize );
    ifft_result = ( fftw_complex* ) fftw_malloc( sizeof( fftw_complex ) * windowSize );
    plan_forward  = fftw_plan_dft_1d( windowSize, data, fft_result, FFTW_FORWARD, FFTW_ESTIMATE );
    // Create a hamming window of appropriate length
    float window[windowSize];
    hamming(windowSize, window);
    int chunkPosition = 0;
    int readIndex;
    // Should we stop reading in chunks? 
    int bStop = 0;
int numChunks = 0;
// Process each chunk of the signal
while(chunkPosition < signalLength && !bStop) {
// Copy the chunk into our buffer
for(i = 0; i < windowSize; i++) {
        readIndex = chunkPosition + i;
        if(readIndex < signalLength) {
              // Note the windowing!
data[i][0] = (*signal)[readIndex] * window[i]; 
data[i][1] = 0.0
        } else {
             // we have read beyond the signal, so zero-pad it!
data[i][0] = 0.0
data[i][1] = 0.0
bStop = 1;
// Perform the FFT on our chunk
fftw_execute( plan_forward );
/* Uncomment to see the raw-data output from the FFT calculation
std::cout << "Column: " << chunkPosition << std::endl;
for(i = 0 ; i < windowSize ; i++ ) {
fprintf( stdout, "fft_result[%d] = { %2.2f, %2.2f }\n",
i, fft_result[i][0], fft_result[i][1] );
// Copy the first (windowSize/2 + 1) data points into your spectrogram.
// We do this because the FFT output is mirrored about the nyquist 
// frequency, so the second half of the data is redundant. This is how
  // Matlab's spectrogram routine works.

for (i = 0; i < windowSize/2 + 1; i++) {
yourDataStructureGoesHere = fft_result[i][0];
yourDataStructureGoesHere = fft_result[i][1];

chunkPosition += hopSize;
// Excuse the formatting, the while ends here.
fftw_destroy_plan( plan_forward );
fftw_free( data );
fftw_free( fft_result );
fftw_free( ifft_result );