In Your Face (and Voice) – Can Digital Facial and Voiced Expression Tools Enhance the Social Sciences?
MRIC 2009/10
Please note: this is archived content harvested from a web page and may not display as originally intended. Some images, links, and functionality may be broken or out of date.
"In Your Face (and Voice) – Can Digital Facial and Voiced Expression Tools Enhance the Social Sciences?"
October 27, 2009
Akira Tokuhiro - Mechanical Engineering (CAES)
Abstract: Research on communication dynamics has shown that human beings, across nations and cultures, ‘talk’ to each other using facial and voiced expressions, gestures, body language and, of course, by listening. In fact, facial expressions can carry as much as 70% of the information (content) that is transmitted and received. Further, it is likely that one’s ‘state of being’, ‘state of mind’ or ‘emotional state’ is expressed through communication; that is, facial and voiced expressions can directly represent human emotions. Most of us ‘understand’ facial and voiced expressions effortlessly.
With the recent accessibility of digital images and audio, the speaker and his students saw a natural foray into the social sciences using ‘tools’ well developed in digital signal processing engineering. The team developed an initial multimodal emotion recognition system using cues from digitally recorded facial images and voice recordings. This is achieved by extracting features from each modality using signal processing techniques and then classifying those features with artificial neural networks (ANNs). The features extracted from the face are the eyes, eyebrows, mouth and nose; this is done using image processing techniques such as the seeded region growing algorithm, particle swarm optimization and general properties of the feature being extracted. In contrast, the features of interest in speech are pitch, frequencies and spectra, along with some of their statistical properties and rates of change. These features are extracted using techniques such as the Fourier transform.
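As a rough illustration of the speech-side feature extraction described above (a sketch, not the team’s actual toolbox), the example below computes a crude pitch estimate and simple spectral features from a single audio frame with the Fourier transform; the frame length, sample rate and feature set are assumptions made for the example.

import numpy as np

def speech_features(frame, sample_rate=16000):
    """Return a small feature vector (pitch estimate, spectral centroid,
    spectral energy) for one windowed frame of audio samples."""
    frame = frame * np.hanning(len(frame))        # taper to reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(frame))         # magnitude spectrum via FFT
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    energy = float(np.sum(spectrum ** 2))
    centroid = float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))

    # Crude pitch estimate: strongest frequency bin in a plausible voice range.
    voice_band = (freqs >= 80) & (freqs <= 400)
    pitch = float(freqs[voice_band][np.argmax(spectrum[voice_band])])

    return np.array([pitch, centroid, energy])

# Example: a synthetic 200 Hz tone stands in for one voiced frame (32 ms at 16 kHz).
t = np.arange(0, 0.032, 1.0 / 16000)
print(speech_features(np.sin(2 * np.pi * 200 * t)))

In a full system, feature vectors like this (and their facial counterparts) would be fed to the ANN classifiers described next.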
In the course of this research the team developed a toolbox that can read an audio and/or video file and ‘perform emotion recognition’ on the face in the video and the speech in the audio channel. The features extracted from the face and voice are independently classified into emotions using two separate feed-forward ANNs. The toolbox then presents the output of the networks from one or both modalities on a synchronized time scale. One interesting result from this research is the consistent misclassification of facial expressions between two databases (one European, one Asian), suggesting a cultural basis for this misinterpretation. Adding the voice component has been shown to partially improve classification.
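The following is a minimal sketch of that arrangement, under stated assumptions: the emotion labels, feature dimensions and weights are placeholders (random, untrained) rather than the authors’ networks. It shows two independent feed-forward networks, one per modality, whose per-frame outputs are reported on a shared time axis and combined by simple averaging (late fusion).

import numpy as np

EMOTIONS = ["anger", "happiness", "sadness", "surprise", "neutral"]   # assumed label set

def feedforward(x, w1, b1, w2, b2):
    """One hidden layer with tanh units; softmax output over emotion classes."""
    h = np.tanh(x @ w1 + b1)
    z = h @ w2 + b2
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)

def random_net(n_in, n_hidden=8, n_out=len(EMOTIONS)):
    """Placeholder weights; a real system would train these on labelled data."""
    return (rng.normal(size=(n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(size=(n_hidden, n_out)), np.zeros(n_out))

face_net = random_net(n_in=4)    # e.g. eye/eyebrow/mouth/nose measurements
voice_net = random_net(n_in=3)   # e.g. pitch, spectral centroid, energy

# Per-frame feature vectors, assumed already aligned to a common time scale.
face_frames = rng.normal(size=(5, 4))
voice_frames = rng.normal(size=(5, 3))

for t, (face_feats, voice_feats) in enumerate(zip(face_frames, voice_frames)):
    face_scores = feedforward(face_feats, *face_net)
    voice_scores = feedforward(voice_feats, *voice_net)
    combined = (face_scores + voice_scores) / 2          # simple late fusion
    print(f"t={t}: face={EMOTIONS[face_scores.argmax()]}, "
          f"voice={EMOTIONS[voice_scores.argmax()]}, "
          f"combined={EMOTIONS[combined.argmax()]}")

Keeping the two modality networks separate, as the toolbox does, is what makes it possible to display each channel’s classification on its own track of the synchronized timeline and to see where the voice cue corrects a facial misclassification.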
Original url: http://www.uidaho.edu/class/mric/archives/pre-2010/fall2009/tokuhiro