Tinkering with ideas about using an RNN to process audio signals for speech segmentation and separation from other acoustic sources.
This project has basic support for speech segmentation using an RNN "autoencoder"-type architecture. It is written in Python, using TensorFlow for the neural network.
I got the best results so far using BasicRNNCell rather than GRUCell. I trained the model by giving it isolated speech and an 'ideal' speech mask based on Short-Time Fourier Transform (STFT) analysis of 3-second-long sound clips. I also trained using other (non-speech) sounds with a mask of all zeros.
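For reference, here is a minimal sketch of how an 'ideal' binary mask of this kind could be derived from the STFT magnitude of an isolated speech clip. The threshold rule, the sampling rate, and the use of scipy are my illustrative assumptions, not necessarily the exact recipe used in this project.

```python
import numpy as np
from scipy.signal import stft

def ideal_speech_mask(clean_speech, fs=16000, n_fft=1024, thresh_db=-40.0):
    """Binary time/frequency mask: 1 where the isolated speech has energy
    above a threshold relative to the clip's peak, else 0.
    Threshold and parameters are illustrative assumptions."""
    _, _, Z = stft(clean_speech, fs=fs, nperseg=n_fft)      # Z: (513, n_frames), complex
    mag_db = 20.0 * np.log10(np.abs(Z) + 1e-10)
    mask = (mag_db > mag_db.max() + thresh_db).astype(np.float32)
    return mask.T                                           # (n_frames, 513), time-major

# Non-speech training clips would get a target mask of all zeros, e.g.:
# mask = np.zeros((n_frames, 513), dtype=np.float32)
```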
Here are a few details of the model:
I used two layers of stacked RNN cells...a hidden layer and an output layer. These are dynamically unrolled in time over 95 time steps.
n_steps=95, # for a 3 second training clip.
n_inputs=513, # for a 1024 point long STFT window
n_hidden=256,
n_outputs=513,
keep_prob=0.5 # dropout applied to inputs and outputs of the hidden layer.
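For concreteness, below is a minimal sketch of how a two-layer stacked RNN with these parameters can be wired up in TensorFlow 1.x. The sigmoid activation on the output cell and the exact dropout wiring are my assumptions for illustration, not necessarily what this project's code does.

```python
import tensorflow as tf

n_steps, n_inputs, n_hidden, n_outputs, keep_prob = 95, 513, 256, 513, 0.5

# Input: batch of STFT magnitude frames, shape (batch, time, freq_bins)
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

# Hidden layer: BasicRNNCell with dropout on its inputs and outputs
hidden_cell = tf.nn.rnn_cell.DropoutWrapper(
    tf.nn.rnn_cell.BasicRNNCell(n_hidden),
    input_keep_prob=keep_prob, output_keep_prob=keep_prob)

# Output layer: another RNN cell sized to the mask; sigmoid keeps the
# estimated mask values in [0, 1] (this activation choice is an assumption)
output_cell = tf.nn.rnn_cell.BasicRNNCell(n_outputs, activation=tf.sigmoid)

stacked = tf.nn.rnn_cell.MultiRNNCell([hidden_cell, output_cell])

# Dynamically unrolled over the 95 time steps
mask_estimate, _ = tf.nn.dynamic_rnn(stacked, X, dtype=tf.float32)
```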
Here are a few examples:
The following figure shows isolated speech from a female speaker, with no interfering sounds mixed into it. The speech segmentation mask (third subplot) was estimated by the trained BasicRNNCell model. The final subplot shows the result of applying the mask to the original sound clip. When I listen to the original vs masked audio, they sound substantially the same. This shows that the model is capable of generating a mask to segment the time/frequency points of speech (from an isolated speaker).
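Applying the estimated mask and resynthesizing the audio can be done with an inverse STFT. The sketch below uses scipy for illustration and assumes the mask is aligned frame-for-frame with the STFT of the clip; it is not taken from this project's code.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_mask(audio, mask, fs=16000, n_fft=1024):
    """Multiply the clip's complex STFT by the estimated mask and invert.
    mask is assumed to have shape (n_frames, 513), matching the RNN output."""
    _, _, Z = stft(audio, fs=fs, nperseg=n_fft)      # Z: (513, n_frames), complex
    Z_masked = Z * mask.T                            # scale each time/frequency point
    _, masked_audio = istft(Z_masked, fs=fs, nperseg=n_fft)
    return masked_audio
```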
Similar to example 1 (above), but this is a different speaker reading a different text passage. The speech segmentation mask is remarkably good.
This example presents a complex musical sound clip (Techno- or Synth-pop). The speech segmentation mask is almost all zeros...i.e., no speech. This shows that the trained model can recognize the difference between speech and other complex sounds, and generate a mask that keeps the speech and reduces other sounds, at least for the isolated presentations shown so far.
This example shows the inference results for a complex auditory scene consisting of many voices chattering in the background, but no clear isolated speaker. The speech mask generated is interesting...it identifies speech-like sounds even though I cannot tell what anyone is saying. I was hoping the model would suppress these non-isolated speakers...I guess I need a more sophisticated model to learn to do that.
This final example shows a female speaker mixed with Techno-music in the background. The estimated speech mask looks pretty good, although the result of applying it to the mixed sound clip leaves room for improvement. The background music in the post-masked sound clip is definitely much reduced, but the speech is also reduced and sounds a bit muffled rather than crisp. So far, I have only trained using isolated speech or isolated other sounds...perhaps I need to do some training using mixed clips like this one.
I developed and tested this using Enthought Python 3.6.0 and TensorFlow 1.3.0. See the edm_bundle.json file for the details of how to recreate my development environment.