Speech Emotion Recognition (SER) on CREMA-D (Crowd Sourced Emotional Multimodal Actors Dataset)

Phonn Pyae Kyaw
Jun 11


SER using Mel-spectrograms and CNN architecture

Speech emotion recognition is the task of recognizing the emotion conveyed by speech (duh?), regardless of the specific meaning or semantic content being expressed. It’s challenging even for humans to annotate emotions accurately, since they are subjective.

In this article, we will build a Speech Emotion Recognition (SER) system using a Convolutional Neural Network (CNN) on the CREMA-D dataset. The goal is to train a model that automatically classifies the emotion conveyed in a speech clip, independent of what is actually said.

Dataset Description

First, let’s take a look at the dataset.

You can download the CREMA-D dataset here.

After extracting the zip file, you’ll see 7,442 audio files in .wav format. Each filename encodes the clip’s metadata, including its emotion label (for example, 1001_DFA_ANG_XX.wav is an anger clip). The clips come from 91 different actors with diverse ethnic backgrounds.

We’ll put the paths of the audio files and their emotion labels into a dataframe for easier processing.

Adding file paths and creating labels for emotions
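The original notebook code isn’t shown here; a minimal sketch of this step, assuming the extracted .wav files live in a folder called AudioWAV (the dataset’s default folder name), could look like this:

```python
import os

import pandas as pd

# CREMA-D filenames look like 1001_DFA_ANG_XX.wav:
# actor ID, sentence code, emotion code, intensity level.
EMOTION_MAP = {
    "ANG": "anger", "DIS": "disgust", "FEA": "fear",
    "HAP": "happy", "NEU": "neutral", "SAD": "sad",
}

def emotion_from_filename(filename):
    """Extract the emotion label from a CREMA-D filename."""
    code = filename.split("_")[2]
    return EMOTION_MAP[code]

def build_dataframe(audio_dir="AudioWAV"):
    """Collect every .wav file's path and emotion label into a DataFrame."""
    rows = [
        {"path": os.path.join(audio_dir, f), "emotion": emotion_from_filename(f)}
        for f in sorted(os.listdir(audio_dir))
        if f.endswith(".wav")
    ]
    return pd.DataFrame(rows)
```

The function and folder names are my own choices; only the filename convention comes from the dataset itself.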

The data is distributed like this.

Chart of the data distribution

There are 1,271 clips each for anger, disgust, fear, happy and sad, and 1,087 for neutral, so the classes are quite evenly distributed.

Now, let’s visualize an audio file by converting it into a waveform plot, a spectrogram, and a Mel-spectrogram.

Waveform plot for an audio file
Spectrogram for an audio file
Mel-spectrogram for an audio file

Also, here is a great article if you want to understand more about Mel-spectrograms.

Data Preprocessing

Let’s convert all audio files into Mel-spectrogram images.

Convert audio files into Mel-spectrogram images

Now that we have spectrogram images, let’s encode the emotion labels into numerical representations and split the data into 80% training and 20% testing.

Encoding labels and splitting data into training and testing sets
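A minimal sketch of this step with scikit-learn. The stratified split and the fixed seed are my own choices; the 80/20 ratio comes from the text:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

def encode_and_split(image_paths, emotions, test_size=0.2, seed=42):
    """Encode emotion strings as integers and split 80/20, stratified by class."""
    encoder = LabelEncoder()
    y = encoder.fit_transform(emotions)  # e.g. "anger" -> 0, "disgust" -> 1, ...
    X_train, X_test, y_train, y_test = train_test_split(
        image_paths, y, test_size=test_size, random_state=seed, stratify=y
    )
    return X_train, X_test, y_train, y_test, encoder
```

Keeping the fitted `encoder` around lets you map predicted integers back to emotion names later with `encoder.inverse_transform`.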

Model Architecture

We’ll create a simple CNN using Keras.

Simple Convolutional Neural Network
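The original architecture isn’t shown here; a plausible minimal sketch of a “simple CNN” for this task could look like the following. The layer sizes, dropout rate, and the 128×128 input resolution are my own assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 6                 # anger, disgust, fear, happy, neutral, sad
INPUT_SHAPE = (128, 128, 3)     # assumed size of the resized spectrogram images

def build_model():
    """A small Conv -> Pool stack followed by a dense classification head."""
    model = keras.Sequential([
        keras.Input(shape=INPUT_SHAPE),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # labels are integer-encoded
        metrics=["accuracy"],
    )
    return model
```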

We also need a few helper functions to convert the images into NumPy arrays to pass into the model’s input layer.

Preprocess data to pass into the model
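A sketch of these helpers using Pillow; the 128×128 target size is an assumption and must match whatever input shape the model was built with:

```python
import numpy as np
from PIL import Image

IMG_SIZE = (128, 128)  # assumed target size; must match the model's input shape

def image_to_array(path, size=IMG_SIZE):
    """Load a spectrogram image, resize it, and scale pixels to [0, 1]."""
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0

def paths_to_batch(paths, size=IMG_SIZE):
    """Stack a list of image paths into one (N, H, W, 3) batch array."""
    return np.stack([image_to_array(p, size) for p in paths])
```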

Now, we’ll train the model for 50 epochs.

Model training for 50 epochs
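A self-contained sketch of the training call. Random arrays stand in for the real spectrogram batches, a tiny inline model stands in for the CNN, and only 2 epochs are run here to keep the demo fast (the article trains for 50); the batch size is my own choice:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy stand-in data; in the article, X_train and y_train hold the
# spectrogram-image arrays and the integer-encoded emotion labels.
X_train = np.random.rand(32, 128, 128, 3).astype("float32")
y_train = np.random.randint(0, 6, size=32)

# Minimal stand-in classifier with a matching input shape.
model = keras.Sequential([
    keras.Input(shape=(128, 128, 3)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# The article uses epochs=50; 2 epochs here just to keep the demo quick.
history = model.fit(X_train, y_train, epochs=2, batch_size=8, verbose=0)
```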

Performance Evaluation

Here is the confusion matrix for the test dataset after training for 50 epochs.

Confusion Matrix-1
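A confusion matrix like this can be produced with scikit-learn; here is a sketch with toy labels standing in for the real predictions (in practice, `y_pred` would come from arg-maxing `model.predict` over the six classes on the test spectrograms):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Toy three-class stand-in; the real evaluation uses all six emotions.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 1, 2, 1, 2])

cm = confusion_matrix(y_true, y_pred)
ConfusionMatrixDisplay(cm, display_labels=["anger", "disgust", "fear"]).plot(cmap="Blues")
```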

The model only achieves about 35% accuracy on the test dataset, so I tweaked the architecture a little.

Here is the new model.

New model architecture

And here is the confusion matrix on the test dataset.

Confusion Matrix-2

Conclusion and Future Work

In conclusion, our current model achieved an accuracy of approximately 42% on the test dataset after training for 20 epochs. While this performance is not optimal, it provides a promising starting point for further exploration in speech emotion recognition.

To improve the model’s performance, several avenues can be explored: adding more data, changing the model architecture, and applying data augmentation.

Furthermore, instead of using spectrogram images, a promising avenue for future work involves extracting features using Mel-frequency cepstral coefficients (MFCCs). MFCCs capture acoustic characteristics of speech signals more directly, allowing the model to potentially capture relevant information for emotion recognition with greater precision.

Thanks for reading.





