
Listen Attend Spell – Speech to Text Model

In this project I built a speech-to-text transcription model that transcribes a given speech utterance (Mel spectrograms from the WSJ dataset) into its corresponding transcript. The model uses an encoder-decoder architecture, whose two components are called the Listener and the Speller respectively.

The Listener consists of a pyramidal Bi-LSTM network that takes in the given utterances and compresses them to produce high-level representations for the Speller network.
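A single pyramidal Bi-LSTM layer can be sketched roughly as follows: adjacent time steps are concatenated before the Bi-LSTM, halving the sequence length at each layer (the class name and dimensions here are illustrative, not taken from the project code):

```python
import torch
import torch.nn as nn

class pBLSTM(nn.Module):
    """One pyramidal Bi-LSTM layer: concatenates pairs of adjacent
    frames, halving the time dimension before the Bi-LSTM."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # Input size doubles because two frames are concatenated.
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim,
                             bidirectional=True, batch_first=True)

    def forward(self, x):                     # x: (batch, time, feat)
        batch, time, feat = x.shape
        time = time - (time % 2)              # drop a trailing odd frame
        x = x[:, :time, :].reshape(batch, time // 2, feat * 2)
        out, _ = self.blstm(x)                # (batch, time // 2, 2 * hidden)
        return out
```

Stacking three such layers, as in the original LAS paper, reduces the time resolution by a factor of 8, which shortens the sequence the Speller must attend over.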

The Speller takes in the high-level feature output from the Listener network and uses it, together with an attention mechanism, to compute a probability distribution over sequences of characters.

Attention can intuitively be understood as learning a mapping from the current decoder state to the relevant regions of the utterance. The Listener produces a high-level representation of the given utterance, and at each step the Speller attends to the parts of that representation that are most useful for predicting the next character in the sequence.
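A minimal dot-product attention step, the form used in LAS-style decoders, can be sketched as follows (function names and shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(query, keys, values):
    """One attention step.
    query:  (d,)   current decoder (Speller) state projection
    keys:   (T, d) Listener outputs projected to key space
    values: (T, d) Listener outputs projected to value space
    """
    energy = keys @ query            # (T,) similarity of state to each frame
    weights = softmax(energy)        # attention distribution over time
    context = weights @ values       # (d,) weighted sum of Listener features
    return context, weights
```

The returned context vector is concatenated with the decoder state to predict the next character; the weights show which parts of the utterance the model is "listening" to at that step.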

Another problem we faced was that, given a particular state as input, the model always generated the same next output: once trained, the network is deterministic, producing a fixed output for a given input state. To introduce randomness into the predictions, we added noise to the logits, specifically Gumbel noise, during generation time only. The final model obtained a Levenshtein distance of 8.9.
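The Gumbel-noise idea above is the Gumbel-max trick: adding independent Gumbel(0, 1) noise to the logits and taking the argmax is equivalent to sampling from the softmax distribution. A small sketch (not the project's actual decoding code):

```python
import numpy as np

def gumbel_sample(logits, rng):
    """Sample a character index from softmax(logits) via the
    Gumbel-max trick: argmax(logits + Gumbel noise)."""
    u = rng.uniform(low=1e-12, high=1.0, size=np.shape(logits))
    gumbel = -np.log(-np.log(u))          # Gumbel(0, 1) noise
    return int(np.argmax(np.asarray(logits) + gumbel))
```

This keeps greedy decoding deterministic at training/evaluation time while letting generation explore different, still-probable character sequences.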