Speech Emotion Recognition (SER) on CREMA-D (Crowd Sourced Emotional Multimodal Actors Dataset)
Speech emotion recognition is the task of recognizing the emotion conveyed by speech, regardless of the specific meaning or semantic content being expressed. It's challenging even for humans to annotate emotions accurately, since they are subjective.
In this article, we will tackle Speech Emotion Recognition (SER) with Convolutional Neural Networks (CNNs) on the CREMA-D dataset. The goal is to train a model that automatically classifies the emotion conveyed in a speech clip from its spectrogram.
Dataset Description
First, let’s take a look at the dataset.
You can download the CREMA-D dataset here.
After extracting the zip file, you'll see 7,442 audio files in .wav format. Each file name encodes, among other things, the emotion expressed in the clip (ANG, DIS, FEA, HAP, NEU or SAD). The clips come from 91 different actors with diverse ethnic backgrounds.
We'll put the paths of the audio files and their emotion labels into a dataframe, as sketched below.
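A minimal sketch of how this could look, assuming the extracted files sit in a folder named AudioWAV (a hypothetical path) and follow the usual ActorID_Sentence_Emotion_Intensity naming convention:

```python
import os
import pandas as pd

# Hypothetical folder containing the extracted CREMA-D .wav files
AUDIO_DIR = "AudioWAV"

# Map the emotion codes embedded in the file names to readable labels
EMOTION_MAP = {
    "ANG": "anger", "DIS": "disgust", "FEA": "fear",
    "HAP": "happy", "NEU": "neutral", "SAD": "sad",
}

rows = []
for fname in os.listdir(AUDIO_DIR):
    if not fname.endswith(".wav"):
        continue
    # File names look like 1001_DFA_ANG_XX.wav; the third field is the emotion code
    code = fname.split("_")[2]
    rows.append({"path": os.path.join(AUDIO_DIR, fname),
                 "emotion": EMOTION_MAP[code]})

df = pd.DataFrame(rows)
print(df["emotion"].value_counts())
```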
The data is distributed quite evenly: 1,271 clips each for anger, disgust, fear, happy and sad, and 1,087 for neutral.
Now, let's visualize an audio file as a waveform plot, a spectrogram and a Mel-spectrogram.
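A sketch of the three plots for a single clip using librosa (the clip is taken from the dataframe built above):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load one example clip at its native sampling rate
y, sr = librosa.load(df["path"][0], sr=None)

fig, ax = plt.subplots(3, 1, figsize=(10, 10))

# Waveform
librosa.display.waveshow(y, sr=sr, ax=ax[0])
ax[0].set_title("Waveform")

# Linear-frequency (STFT) spectrogram in dB
S = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
librosa.display.specshow(S, sr=sr, x_axis="time", y_axis="hz", ax=ax[1])
ax[1].set_title("Spectrogram")

# Mel-spectrogram in dB
M = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr), ref=np.max)
librosa.display.specshow(M, sr=sr, x_axis="time", y_axis="mel", ax=ax[2])
ax[2].set_title("Mel-spectrogram")

plt.tight_layout()
plt.show()
```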
Also, here is a great article if you want to understand more about Mel-spectrograms.
Data Preprocessing
Let’s convert all audio files into Mel-spectrogram images.
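One way to do this, rendering each clip's Mel-spectrogram to a small image file in a hypothetical spectrograms/ folder:

```python
import os
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

SPEC_DIR = "spectrograms"  # hypothetical output folder
os.makedirs(SPEC_DIR, exist_ok=True)

def save_mel_spectrogram(wav_path, out_path):
    """Render a clip's Mel-spectrogram as an image file with no axes."""
    y, sr = librosa.load(wav_path, sr=None)
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr), ref=np.max)
    plt.figure(figsize=(3, 3))
    librosa.display.specshow(mel, sr=sr)
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close()

image_paths = []
for wav_path in df["path"]:
    out_path = os.path.join(SPEC_DIR, os.path.basename(wav_path).replace(".wav", ".png"))
    save_mel_spectrogram(wav_path, out_path)
    image_paths.append(out_path)

df["image"] = image_paths
```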
Now that we have spectrogram images, let’s encode “emotions” into numerical representations and split the data into 80% training and 20% testing.
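A minimal sketch using scikit-learn, with a stratified split so every emotion appears in both sets:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Encode the string labels ("anger", "disgust", ...) as integers 0..5
encoder = LabelEncoder()
df["label"] = encoder.fit_transform(df["emotion"])

# 80% training / 20% testing, stratified by emotion
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
```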
Model Architecture
We'll create a simple CNN using Keras.
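Something along these lines; the layer sizes and image resolution here are illustrative assumptions, not the exact original configuration:

```python
from tensorflow import keras
from tensorflow.keras import layers

IMG_SIZE = (128, 128)   # assumed spectrogram image size
NUM_CLASSES = 6         # six emotion classes

model = keras.Sequential([
    layers.Input(shape=(*IMG_SIZE, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Sparse categorical cross-entropy matches the integer labels from LabelEncoder
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```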
We also need a helper to convert the spectrogram images into NumPy arrays to feed into the model's input layer.
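A simple sketch of such a helper, assuming the IMG_SIZE defined with the model above:

```python
import numpy as np
from tensorflow.keras.utils import load_img, img_to_array

def images_to_array(paths, img_size=IMG_SIZE):
    """Load spectrogram images, resize them, and scale pixels to [0, 1]."""
    arrays = [
        img_to_array(load_img(p, target_size=img_size)) / 255.0
        for p in paths
    ]
    return np.stack(arrays)

X_train = images_to_array(train_df["image"])
y_train = train_df["label"].to_numpy()
X_test = images_to_array(test_df["image"])
y_test = test_df["label"].to_numpy()
```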
Now, we’ll train the model for 50 epochs.
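For example (the batch size here is an assumption):

```python
history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=50,
    batch_size=32,
)
```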
Performance Evaluation
Here is the confusion matrix for the test dataset after training for 50 epochs.
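It can be produced with scikit-learn roughly like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Predicted class = index of the highest softmax probability
y_pred = np.argmax(model.predict(X_test), axis=1)

ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=encoder.classes_
)
plt.show()
```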
The model only reaches about 35% accuracy on the test dataset, so I tweaked the architecture a little.
Here is the new model.
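The exact tweaks aren't reproduced here; one plausible direction is a deeper convolutional stack with batch normalization and global average pooling, sketched below:

```python
model = keras.Sequential([
    layers.Input(shape=(*IMG_SIZE, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```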
And here is the confusion matrix on the test dataset.
Conclusion and Future Work
In conclusion, our current model achieved an accuracy of approximately 42% on the test dataset after training for 20 epochs. While this performance is not optimal, it provides a promising starting point for further exploration in speech emotion recognition.
To improve the model's performance, several avenues can be explored: adding more data, changing the model architecture, and applying data augmentation.
Furthermore, instead of using spectrogram images, a promising avenue for future work is extracting features with Mel-frequency cepstral coefficients (MFCCs). MFCCs capture the acoustic characteristics of speech signals more directly, potentially giving the model more relevant information for emotion recognition.
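A minimal sketch of such a feature extractor with librosa, summarizing each clip as the mean MFCC vector over time (the number of coefficients is an assumption):

```python
import librosa
import numpy as np

def extract_mfcc(wav_path, n_mfcc=40):
    """Return a fixed-size MFCC feature vector (mean over time) for one clip."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

features = np.stack([extract_mfcc(p) for p in df["path"]])
```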
Thanks for reading.
References:
https://importchris.medium.com/how-to-create-understand-mel-spectrograms-ff7634991056