Analysis Of Emotions Through Speech Recognition

Speech emotion recognition (SER) is a burgeoning field in AI that analyzes vocal characteristics to understand human emotions. It delves deeper than the literal meaning of words, uncovering emotional cues hidden within speech patterns. Pitch, loudness, and speech rate are just a few features that vary with emotional state. SER utilizes machine learning algorithms to classify these features into categories like happiness, sadness, or anger. This technology offers a treasure trove of possibilities, from enhancing human-computer interaction to revolutionizing customer service and even aiding in mental health assessments. As SER continues to evolve, it holds the potential to transform how we connect with machines, fostering deeper understanding and richer emotional experiences.


I. Introduction
Emotions play a vital role in human communication. The human voice carries attributes such as pitch, loudness, and vocal tone. To interact successfully with people, systems must be able to read the emotional content of a person's speech; if we want machines to communicate as humans do, we need to design technologies that can read our emotions. Human-machine interaction can be improved and made more natural by developing systems that comprehend paralinguistic information (emotions). Speech emotion recognition has been one of the fastest-growing engineering technologies in recent years and has become an active research topic, with potential applications in automated translation systems, human-machine interaction, and text-to-speech synthesis.

Speech
Speech, as we all know, is a way of expressing one's thoughts and emotions through voice and hand gestures. It is a signal of variable duration that conveys information as well as feelings. Depending on the strategy used, one may extract local or global characteristics from it. Statistical measures such as the mean, maximum, minimum, and standard deviation are considered long-term, or supra-segmental, properties; because the emotional content of an audio signal is not spread evenly across the utterance, such stable statistics are important. We use Python libraries to capture the human voice as input and analyze it against trained datasets.
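As a rough sketch of how such local contours and long-term statistics can be computed with Python libraries (the file name, pitch range, and choice of descriptors below are assumptions for illustration, not values taken from this work):

```python
import librosa
import numpy as np

# Hypothetical input clip; any mono WAV file would do.
signal, sr = librosa.load("sample.wav", sr=16000)

# Frame-level (local) descriptors: fundamental frequency and energy.
f0 = librosa.yin(signal, fmin=65, fmax=400, sr=sr)   # pitch contour
rms = librosa.feature.rms(y=signal)[0]               # loudness contour

# Utterance-level (global / supra-segmental) statistics.
def summarize(x):
    return np.array([x.mean(), x.max(), x.min(), x.std()])

global_features = np.concatenate([summarize(f0), summarize(rms)])
print(global_features)   # 8 long-term descriptors for this clip
```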

Emotion
Emotions are the mental states we feel about situations and events. From our emotions, most people can read our attitude towards people, places, food, theatre, art, and more; recognizing them lets a listener interpret our thoughts correctly without getting the wrong impression. In our project we use the IEMOCAP dataset, whose recordings are labelled with various emotions.

II. Existing System
Digital signal processing is now an established field of study, and many researchers have advanced various SER methods over the last decade. New SER approaches based on digital audio signals have recently been developed; one example is the work of Mustaqeem, Muhammad Sajjad, and Soonil Kwon (May 2020) on speech emotion recognition using clustering, learned characteristics, and deep learning (BiLSTM). Existing CNN-based SER systems face several challenges, notably improving accuracy while reducing the computational complexity of the whole model. In that approach, RBF (radial basis function kernel) based K-means clustering is used to select key segments from the audio, and the STFT is applied to convert them into spectrograms. A ResNet-style CNN with an "FC1000" layer is then used to extract salient and discriminating characteristics from the spectrograms, and the extracted features are normalized to make them comparable. Once aligned, these features are fed to a deep BiLSTM to learn the hidden temporal information, identify the final state of the sequence, and classify the emotional content of the audio.
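For reference, the STFT-based spectrogram conversion used in such approaches can be sketched with Librosa as follows (the window and hop sizes, and the file name, are assumed here, not taken from the cited work):

```python
import librosa
import numpy as np

signal, sr = librosa.load("segment.wav", sr=16000)   # hypothetical key segment

# Short-time Fourier transform -> magnitude spectrogram in decibels.
stft = librosa.stft(signal, n_fft=512, hop_length=256)
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# Each column is one time frame; each row is one frequency bin.
print(spectrogram_db.shape)
```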

III. Proposed System
The training data for our system comes from a voice database created from simulated, collected, or natural sources, and the data is available for download. We use IEMOCAP, which is described in more detail in the Dataset section below. The signal is first preprocessed to make it suitable for feature extraction. Prosodic and spectral properties, which can represent a broad range of emotions, are commonly used in SER systems, and an additional modality, such as visual or verbal information, may be used to improve the results. Convolutional neural networks (CNNs) are the deep neural networks most often used for visual image processing, and they are currently among the most effective tools for sifting through data and uncovering hidden patterns. In our work, we apply a CNN to voice signals that have been split into several segments, each covering a distinct section of the utterance. For feature extraction we use Librosa, a Python library for analyzing audio and music; it offers a clear package layout, consistent interfaces and naming, backwards compatibility, modular functions, and readable code.
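A minimal sketch of the kind of preprocessing described here, i.e. loading a clip, trimming silence, and splitting it into segments before feature extraction (the sample rate, segment length, silence threshold, and file name are assumptions, not parameters given in the paper):

```python
import librosa
import numpy as np

def preprocess(path, sr=16000, segment_seconds=3.0):
    """Load a clip, trim leading/trailing silence, and split it
    into fixed-length segments for the CNN."""
    signal, _ = librosa.load(path, sr=sr)
    signal, _ = librosa.effects.trim(signal, top_db=30)   # remove silence

    seg_len = int(segment_seconds * sr)
    segments = []
    for start in range(0, len(signal), seg_len):
        seg = signal[start:start + seg_len]
        if len(seg) < seg_len:                             # pad the last segment
            seg = np.pad(seg, (0, seg_len - len(seg)))
        segments.append(seg)
    return np.stack(segments)

segments = preprocess("Ses01F_impro01_F000.wav")   # hypothetical IEMOCAP file name
print(segments.shape)   # (num_segments, segment_samples)
```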

Dataset
IEMOCAP is among the most fascinating collections of paralinguistic signals in human interaction: it captures the subject's emotional state through both language and gesture. In normal human communication, a variety of non-trivial factors come into play, including the pitch and intensity of speech, facial expressions, trunk and head posture, hand gestures, and eye contact. Research into how expressive human communication might be modelled and exploited benefits from this corpus, whose collection took roughly 20 months from design to post-processing; its creators intended the database to extend and generalize earlier findings.

Features
Each voice and piece of music has different characteristics, and choosing them is a significant part of emotional speech identification. Various feature sets are used in SER systems; as a result, there is no universally agreed-upon set of traits that classifies emotions accurately and unambiguously, and prior studies have relied on experimentation. We extract characteristics from audio files using Librosa, a Python library for music and audio analysis that provides all the features necessary to build a voice-recognition system. Using Librosa, we obtained the MFCCs (Mel-frequency cepstral coefficients), a common characteristic in voice recognition and speaker identification systems. We also used the identifiers provided with the dataset to distinguish between female and male voices; experiments showed that separating male and female voices improved accuracy by 15%, probably because the pitch of the voice affects the result. Plotting time on the x-axis and amplitude on the y-axis gives a waveform of the signal; Fig. 2 shows a sample audio waveform illustrating the features of a sound signal.
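A short sketch of how the MFCC features and the waveform shown in Fig. 2 can be obtained with Librosa (the number of coefficients and the file name are assumptions, not values stated in the paper):

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

signal, sr = librosa.load("sample.wav", sr=16000)    # hypothetical clip

# Mel-frequency cepstral coefficients, one vector per frame.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)

# One fixed-length feature vector per file: the mean over time frames.
feature_vector = np.mean(mfcc, axis=1)
print(feature_vector.shape)   # (40,)

# Waveform as in Fig. 2: time on the x-axis, amplitude on the y-axis.
librosa.display.waveshow(signal, sr=sr)
plt.title("Audio waveform")
plt.show()
```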

Fig. 2. Audio waveform
Each audio file yields many features, which are essentially sets of values; these features are paired with the labels created in the previous step. The next step was to fix the missing features of some short audio files. We doubled the sample rate to preserve the distinctive characteristics of each emotional utterance, but did not increase the sampling frequency any further, as doing so can pick up noise and affect the results.
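A minimal sketch of how short files can be padded and loaded at the higher sample rate (the exact rate and clip length below are assumptions, since the paper does not state them):

```python
import librosa
import numpy as np

TARGET_SR = 32000             # assumed "doubled" rate; the paper gives no exact value
TARGET_LEN = 3 * TARGET_SR    # assumed fixed clip length in samples

def load_fixed(path):
    """Load a clip at the higher sample rate and pad short files with
    zeros so every example yields the same number of features."""
    signal, _ = librosa.load(path, sr=TARGET_SR)
    if len(signal) < TARGET_LEN:
        signal = np.pad(signal, (0, TARGET_LEN - len(signal)))
    return signal[:TARGET_LEN]
```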

Analysis
A convolutional neural network (CNN) is a special type of neural network designed to process data with a grid-like topology, such as images. Through the application of several relevant filters, a CNN can capture temporal and spatial dependencies in an input. The input is reduced to a compact representation, which lowers computational complexity and increases the success rate of the algorithm. A CNN consists of several layers, such as convolution layers, pooling layers, and fully connected layers, and it is currently among the most powerful tools for representing and discovering information hidden in data. In our case, we converted the audio signal into multiple segments, because raw audio signals are computationally expensive and contain a great deal of redundant information that affects the overall efficiency of the model. Using a plain convolutional strategy, we propose a new CNN architecture that learns salient features in its convolutional layers; it uses strided convolutions to down-sample the feature maps instead of pooling layers. This DSCNN is designed specifically for the SER problem and operates on spectrograms. The proposed architecture learns deep, distinctive, identifiable features from audio spectrograms using a small number of convolutional layers, each with a small receptive field; with this simple structure, the model increases accuracy and reduces computational complexity, as shown experimentally.
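As an illustration of a CNN that uses strided convolutions instead of pooling layers, a minimal sketch in Keras follows; the framework choice, layer sizes, input shape, and number of emotion classes are assumptions for illustration, not the architecture actually used in this work:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 4               # assumed emotion classes (e.g. angry/happy/sad/neutral)
INPUT_SHAPE = (128, 128, 1)   # assumed spectrogram size: freq bins x frames x 1 channel

# Stride-2 convolutions down-sample the feature maps, so no pooling layers are used.
model = models.Sequential([
    layers.Input(shape=INPUT_SHAPE),
    layers.Conv2D(32, kernel_size=3, strides=2, padding="same", activation="relu"),
    layers.Conv2D(64, kernel_size=3, strides=2, padding="same", activation="relu"),
    layers.Conv2D(128, kernel_size=3, strides=2, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Replacing pooling with stride-2 convolutions halves the spatial resolution at each layer while letting the network learn how to down-sample, which is the behaviour the text attributes to the stride-based design.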

V. Conclusion
This paper gives a detailed introduction to speech emotion recognition, illustrated by a speech emotion recognition flow diagram. Another important part of the approach is the use of deep learning, specifically a CNN (convolutional neural network). The enhanced audio signal is converted into a spectrogram to improve accuracy and reduce the computational complexity of the proposed model. Using a CNN architecture on spectrograms, we learned the most prominent and discriminating features in the convolutional layers, using a special stride setting that down-samples the feature maps instead of pooling layers.

Fig. 3. Spectrogram of a sample audio clip, showing frequency against time.