EmoVoice - Real-time emotion recognition from speech

Project start: 01.01.2005
Funding body: EU (European Union)
Local project lead: Prof. Dr. Elisabeth André
Participating researchers at the University of Augsburg: Dr. Johannes Wagner
Dr. Thurid Vogt


EmoVoice is a comprehensive framework for the real-time recognition of emotions from acoustic properties of speech (not using word information). Its output can easily be linked to other applications, which makes it straightforward to implement prototypes of affective interfaces.



EmoVoice has recently been integrated as a toolbox into the Social Signal Interpretation (SSI) framework, which is also developed at the Lab for Multimedia Concepts and Applications.
In combination with SSI, EmoVoice includes the following modules:
  • database creation
  • feature extraction + classifier building and testing
  • online recognition
Overview of EmoVoice architecture

Database creation

ModelUI, the graphical user interface of SSI, supports the creation of an emotional speech database. Stimuli to elicit emotions can be provided by the interface, for example by reading a set of emotional sentences. We have defined a set of sentences loosely based on the Velten mood induction technique [1], which is intended to help speakers genuinely experience the emotions. The sentences can also be personalised so that the reader can immerse more easily into the emotional states. This procedure reduces the effort of building a prototypical personalised emotion recogniser to just a few minutes.
Of course, already available emotional speech databases can also be used with EmoVoice.

Feature extraction + classifier building

The phonetic analysis largely uses algorithms from the Praat phonetic software and the ESMERALDA environment for speech recognition. Features are based on global statistics derived from pitch, energy, MFCCs, duration, voice quality and spectral information.
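To illustrate what such global statistics look like, the sketch below collapses a frame-level contour (e.g. pitch values per frame) into a fixed-length vector of per-segment statistics. This is only an illustrative five-value summary under assumed conventions (unvoiced frames encoded as 0); the actual EmoVoice feature set is considerably larger.

```python
import statistics

def global_stats(contour):
    """Collapse a frame-level contour (e.g. pitch in Hz per frame) into a
    fixed-length vector of global statistics for one voice segment.
    Illustrative sketch only; assumes unvoiced frames are encoded as 0."""
    voiced = [v for v in contour if v > 0]  # drop unvoiced frames
    if not voiced:
        return [0.0] * 5
    return [
        statistics.mean(voiced),          # mean
        statistics.pstdev(voiced),        # standard deviation
        min(voiced),                      # minimum
        max(voiced),                      # maximum
        max(voiced) - min(voiced),        # range
    ]
```

Statistics like these yield one feature vector per segment regardless of segment length, which is what a segment-level classifier requires.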
Currently, two classifiers are integrated into the framework: Naive Bayes as a fast but simple classifier, and Support Vector Machines as a more sophisticated classifier.
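To illustrate the "fast but simple" option, here is a minimal Gaussian Naive Bayes sketch in plain Python. It is not the actual EmoVoice implementation, just a demonstration of the technique: per class, it stores the mean and variance of each feature and classifies by the highest log posterior.

```python
import math
from collections import defaultdict

def train_gnb(samples):
    """Fit a Gaussian Naive Bayes model from (feature_vector, label) pairs:
    per class, store the prior plus per-feature mean and variance."""
    by_label = defaultdict(list)
    for vec, label in samples:
        by_label[label].append(vec)
    model = {}
    for label, vecs in by_label.items():
        n = len(vecs)
        means = [sum(col) / n for col in zip(*vecs)]
        # floor the variance to avoid division by zero for constant features
        variances = [max(sum((x - m) ** 2 for x in col) / n, 1e-9)
                     for col, m in zip(zip(*vecs), means)]
        model[label] = (n / len(samples), means, variances)
    return model

def classify_gnb(model, vec):
    """Return the label with the highest log posterior for one vector."""
    best, best_lp = None, float("-inf")
    for label, (prior, means, variances) in model.items():
        lp = math.log(prior)
        for x, m, v in zip(vec, means, variances):
            lp += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

The independence assumption behind Naive Bayes is what keeps it fast enough for real-time use; Support Vector Machines trade some of that speed for better decision boundaries.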

Online recognition

Online recognition runs as a command line application that writes its results to the command line or sends them over a socket using the Open Sound Control (OSC) protocol. The tool continuously reads from the microphone and extracts suitable voice segments by voice activity detection. After feature extraction, each segment is directly assigned an emotion label by a previously trained classifier.
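The OSC output can be consumed by any application that speaks the protocol. As a sketch, the parser below decodes a single binary OSC message using only the standard library; the address `/emovoice` and the (label, confidence) payload are assumptions for illustration, not EmoVoice's documented message format.

```python
import struct

def _read_padded_string(data, offset):
    """Read an OSC string: ASCII, null-terminated, padded to 4 bytes."""
    end = data.index(b"\x00", offset)
    s = data[offset:end].decode("ascii")
    offset = (end + 4) & ~3  # skip the null and padding
    return s, offset

def parse_osc_message(data):
    """Parse one binary OSC message into (address, arguments).
    Supports 's' (string) and 'f' (big-endian float32) arguments, which
    suffice for a hypothetical message such as ('/emovoice', 'joy', 0.87)."""
    address, offset = _read_padded_string(data, 0)
    tags, offset = _read_padded_string(data, offset)  # e.g. ",sf"
    args = []
    for tag in tags.lstrip(","):
        if tag == "s":
            s, offset = _read_padded_string(data, offset)
            args.append(s)
        elif tag == "f":
            (f,) = struct.unpack(">f", data[offset:offset + 4])
            args.append(f)
            offset += 4
    return address, args
```

A receiving application would read one datagram per recognised segment and route the decoded label to its visualisation or dialogue logic.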


So far, we have developed a number of demo applications that use EmoVoice, and it has been used for showcases by partners in the Callas project.
Examples include:
  • an emotional kaleidoscope [2]:

    [Images: kaleidoscope visualisations for joy, sadness and anger]

  • the virtual agent Greta [3] as an emotionally reacting listener. Greta mirrors the emotion of the speaker with her face and gives emotionally appropriate verbal feedback:

    [Images: Greta reacting with joy, sadness and anger]

  • E-Tree [4], a Callas showcase, is an AR art installation of a tree that grows and changes colour and shape according to multimodal input, including emotional valence and arousal recognised from the voice:

    [Images: E-Tree in positive-active, neutral and negative-passive states]



EmoVoice is part of SSI and freely available for download under the GNU General Public License.


If you use EmoVoice for your own projects or publications, please cite the following papers:

T. Vogt, E. André and N. Bee, "EmoVoice - A framework for online recognition of emotions from voice," in Proceedings of Workshop on Perception and Interactive Technologies for Speech-Based Systems, 2008.

J. Wagner, F. Lingenfelser, and E. André, "The Social Signal Interpretation Framework (SSI) for Real Time Signal Processing and Recognition," in Proceedings of INTERSPEECH 2011, Florence, Italy, 2011.




[1] Velten, E. (1968). A laboratory task for induction of mood states. Behaviour Research and Therapy, 6:473-482.
[2] Sichert, J. (2008). Visualisierung des emotionalen Ausdrucks aus der Stimme. Vdm Verlag Dr. Müller. Saarbrücken, Germany.
[3] de Rosis, F., Pelachaud, C., Poggi, I., Carofiglio, V., and de Carolis, B. (2003). From Greta's mind to her face: modelling the dynamics of affective states in a conversational embodied agent. International Journal of Human-Computer Studies, 59:81-118.
[4] Gilroy, S. W., Cavazza, M., Chaignon, R., Mäkelä, S.-M., Niiranen, M., André, E., Vogt, T., Billinghurst, M., Seichter, H., and Benayoun, M. (2007). An emotionally responsive AR art installation. In Proceedings of ISMAR Workshop 2: Mixed Reality Entertainment and Art, Nara, Japan.