A Survey of Affect Recognition Methods: Audio,
Visual, and Spontaneous Expressions
Zhihong Zeng, Member, IEEE Computer Society, Maja Pantic, Senior Member, IEEE,
Glenn I. Roisman, and Thomas S. Huang, Fellow, IEEE
Abstract—Automated analysis of human affective behavior has attracted increasing attention from researchers in psychology,
computer science, linguistics, neuroscience, and related disciplines. However, the existing methods typically handle only deliberately
displayed and exaggerated expressions of prototypical emotions, despite the fact that deliberate behavior differs in visual appearance,
audio profile, and timing from spontaneously occurring behavior. To address this problem, efforts to develop algorithms that can
process naturally occurring human affective behavior have recently emerged. Moreover, an increasing number of efforts are reported
toward multimodal fusion for human affect analysis, including audiovisual fusion, linguistic and paralinguistic fusion, and multicue
visual fusion based on facial expressions, head movements, and body gestures. This paper introduces and surveys these recent
advances. We first discuss human emotion perception from a psychological perspective. Next, we examine available approaches for
solving the problem of machine understanding of human affective behavior and discuss important issues like the collection and
availability of training and test data. We finally outline some of the scientific and engineering challenges to advancing human affect
sensing technology.
Index Terms—Evaluation/methodology, human-centered computing, affective computing, introductory, survey.
1 INTRODUCTION
A widely accepted prediction is that computing will move to the background, weaving itself into the fabric
of our everyday living spaces and projecting the human
user into the foreground. Consequently, the future
“ubiquitous computing” environments will need to have
human-centered designs instead of computer-centered
designs [26], [31], [100], [107], [109]. Current human-
computer interaction (HCI) designs, however, usually
involve traditional interface devices such as the keyboard
and mouse and are constructed to emphasize the transmis-
sion of explicit messages while ignoring implicit informa-
tion about the user, such as changes in the affective state.
Yet, a change in the user’s affective state is a fundamental
component of human-human communication. Some affec-
tive states motivate human actions, and others enrich the
meaning of human communication. Consequently, the
traditional HCI, which ignores the user’s affective states,
filters out a large portion of the information available in
the interaction process. As a result, such interactions are
frequently perceived as cold, incompetent, and socially
inept. The human computing paradigm suggests that user
interfaces of the future need to be anticipatory and human
centered, built for humans, and based on naturally
occurring multimodal human communication [100], [109].
Specifically, human-centered interfaces must have the
ability to detect subtleties of and changes in the user’s
behavior, especially his/her affective behavior, and to
initiate interactions based on this information rather than
simply responding to the user’s commands.
Examples of affect-sensitive multimodal HCI systems
include the following:
1. the system of Lisetti and Nasoz [85], which combines
facial expression and physiological signals to recog-
nize the user’s emotions, like fear and anger, and
then to adapt an animated interface agent to mirror
the user’s emotion,
2. the multimodal system of Duric et al. [39], which
applies a model of embodied cognition that can
be seen as a detailed mapping between the
user’s affective states and the types of interface
adaptations,
3. the proactive HCI tool of Maat and Pantic [89],
which is capable of learning and analyzing the
user’s context-dependent behavioral patterns from
multisensory data and of adapting the interaction
accordingly,
4. the automated Learning Companion of Kapoor et al.
[72], which combines information from cameras, a
sensing chair and mouse, a wireless skin sensor, and
task state to detect frustration in order to predict
when the user needs help, and
5. the multimodal computer-aided learning system developed at the Beckman Institute, University of Illinois at Urbana-Champaign (UIUC) (http://itr.beckman.uiuc.edu), where a computer avatar offers an appropriate tutoring strategy based on the user's facial expression, keywords, eye movement, and task state.
These systems represent initial efforts toward the future
human-centered multimodal HCI.
Beyond standard HCI scenarios, potential commercial
applications of automatic human affect recognition include
affect-sensitive systems for customer services, call centers,
intelligent automobile systems, and game and entertain-
ment industries. These systems will change the ways in
which we interact with computer systems. For example, an
automatic service call center with an affect detector would
be able to make an appropriate response or pass control
over to human operators [83], and an intelligent automobile
system with a fatigue detector could monitor the vigilance
of the driver and apply an appropriate action to avoid
accidents [69].
Another important application of automated systems for
human affect recognition is in affect-related research (e.g.,
in psychology, psychiatry, behavioral science, and neu-
roscience), where such systems can improve the quality of
the research by improving the reliability of measurements
and speeding up the currently tedious manual task of
processing data on human affective behavior [47]. The
research areas that would reap substantial benefits from
such automatic tools include social and emotional devel-
opment research [111], mother-infant interaction [29],
tutoring [54], psychiatric disorders [45], and studies on
affective expressions (e.g., deception) [65], [47]. Automated
detectors of affective states and moods, including fatigue,
depression, and anxiety, could also form an important step
toward personal wellness and assistive technologies [100].
Because of this practical importance and the theoretical
interest of cognitive scientists, automatic human affect
analysis has attracted the interest of many researchers in
the last three decades. Suwa et al. [127] presented an early
attempt in 1978 to automatically analyze facial expressions.
Vocal emotion analysis has an even longer history,
starting with the study of Williams and Stevens in 1972
[145]. Since the late 1990s, an increasing number of efforts
toward automatic affect recognition have been reported in the
literature. Early efforts toward machine affect recognition
from face images include those of Mase [90], and Kobayashi
and Hara [76] in 1991. Early efforts toward the machine
analysis of basic emotions from vocal cues include studies
like that of Dellaert et al. in 1996 [33]. The study of Chen
et al. in 1998 [22] represents an early attempt toward
audiovisual affect recognition. For exhaustive surveys of the
past work in the machine analysis of affective expressions,
readers are referred to [115], [31], [102], [49], [96], [105],
[130], [121], and [98], published between 1992 and 2007.
Overall, most of the existing approaches to automatic
human affect analysis are the following:
. approaches that are trained and tested on a
deliberately displayed series of exaggerated affective
expressions,
. approaches that are aimed at recognition of a small
number of prototypical (basic) expressions of emo-
tion (i.e., happiness, sadness, anger, fear, surprise,
and disgust), and
. single-modal approaches, where information pro-
cessed by the computer system is limited to either
face images or speech signals.
Accordingly, previously published survey papers have focused on reviewing efforts toward single-modal analysis of deliberately displayed (artificial) affective expressions; among them, the papers of Cowie et al. in 2001 [31] and of Pantic and Rothkrantz in 2003 [102] have been the most comprehensive and widely cited in this field to date. At
the time when these surveys were written, most of the
available data sets of affective displays were small and
contained only deliberate affective displays (mainly of the
six prototypical emotions) recorded under highly con-
strained conditions. Multimedia data were rare; there were no 3D data on facial affective behavior, no data of combined face and body displays of affective behavior, and only rarely data that included spontaneous displays of affective behavior.
Hence, while automatic detection of the six basic
emotions in posed controlled audio or visual displays can
be done with reasonably high accuracy, detecting these
expressions or any expression of human affective behavior
in less constrained settings is still a very challenging
problem because deliberate behavior differs in visual appearance, audio profile, and timing from spontaneously occurring behavior. In response to this criticism from both cognitive and computer scientists, the focus of research in the field has started to shift to the automatic analysis of spontaneously displayed affective behavior.
Several studies have recently emerged on the machine
analysis of spontaneous facial expressions (e.g., [10], [28],
[135], and [4]) and vocal expressions (e.g., [12] and [83]).
Also, several experimental studies have shown that integrating audio and video information improves the performance of affective behavior
recognition. The improved reliability of audiovisual ap-
proaches in comparison to single-modal approaches can be
explained as follows: Current techniques for the detection
and tracking of facial expressions are sensitive to head
pose, clutter, and variations in lighting conditions, while
current techniques for speech processing are sensitive to
auditory noise. Audiovisual fusion can make use of the
complementary information from these two channels. In
addition, many psychological studies have theoretically
and empirically demonstrated the importance of the
integration of information from multiple modalities (vocal
and visual expression in this paper) to yield a coherent
representation and inference of emotions [1], [113], [117].
As a result, an increased number of studies on audiovisual
human affect recognition have emerged in recent years
(e.g., [17], [53], and [151]).
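To make the idea of exploiting complementary channels concrete, the sketch below shows a simple decision-level (late) fusion of independent audio and visual classifiers, with per-channel weights reflecting assumed channel reliability. It is a minimal illustration only: the label set, posteriors, and weights are hypothetical and stand in for whatever models and fusion scheme a particular system in the surveyed literature actually uses.

```python
import numpy as np

# Hypothetical emotion label set shared by both channels (illustrative only).
LABELS = ["happiness", "sadness", "anger", "fear", "surprise", "disgust"]

def fuse_decisions(p_audio, p_video, w_audio=0.4, w_video=0.6):
    """Weighted late fusion of audio and visual class posteriors.

    p_audio, p_video: per-class probabilities from two independently
    trained classifiers (each sums to 1).
    w_audio, w_video: assumed channel-reliability weights, e.g., the video
    weight could be lowered when face tracking is known to be unreliable.
    """
    p_audio, p_video = np.asarray(p_audio), np.asarray(p_video)
    fused = w_audio * p_audio + w_video * p_video
    fused /= fused.sum()  # renormalize to a valid distribution
    return LABELS[int(fused.argmax())], fused

# Example: the audio channel favors "anger", the visual channel "fear";
# the fused decision weighs both according to channel reliability.
label, posterior = fuse_decisions(
    p_audio=[0.05, 0.05, 0.55, 0.20, 0.10, 0.05],
    p_video=[0.05, 0.05, 0.25, 0.45, 0.10, 0.10])
print(label, posterior.round(2))
```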
This paper introduces and surveys these recent advances
in the research on human affect recognition. In contrast to
previously published survey papers in the field, it focuses
on the approaches that can handle audio and/or visual
recordings of spontaneous (as opposed to posed) displays of
affective states. It also examines the state-of-the-art methods
that have not been reviewed in previous survey papers but
are important, specifically for advancing human affect
sensing technology. Finally, we discuss the collection and
availability of training and test data in detail. This paper is
organized as follows: Section 2 describes the human
perception of affect from a psychological perspective.
Section 3 provides a detailed review of the related studies,
including multimedia emotion databases and existing
human affect recognition methods. Section 4 discusses
some of the challenges that researchers face in this field. A
summary and closing remarks conclude this paper.
2 HUMAN AFFECT (EMOTION) PERCEPTION
Automatic affect recognition is inherently a multidisciplin-
ary enterprise involving different research fields, including
psychology, linguistics, computer vision, speech analysis,
and machine learning. There is no doubt that the progress
in automatic affect recognition is contingent on the progress
of the research in each of those fields [44].
2.1 The Description of Affect
We begin by briefly introducing three primary ways that
affect has been conceptualized in psychological research.
Research on the basic structure and description of affect is
important in that these conceptualizations provide informa-
tion about the affective displays that automatic emotion
recognition systems are designed to detect.
Perhaps the most long-standing way that affect has been
described by psychologists is in terms of discrete categories,
an approach that is rooted in the language of daily life [40],
[41], [46], [131]. The most popular example of this
description is the prototypical (basic) emotion categories,
which include happiness, sadness, fear, anger, disgust, and
surprise. This description of basic emotions was particularly supported by the cross-cultural studies of Ekman [40], [42], which indicate that humans perceive certain basic emotions from facial expressions in the same way regardless of culture. Owing to the influence of basic emotion theory, most existing studies of automatic affect recognition focus on recognizing these basic emotions. The main advantage of a categorical representation is that people use this scheme to describe observed emotional displays in daily life; categorical labeling is intuitive and matches everyday experience. However, discrete lists of
emotions fail to describe the range of emotions that occur in
natural communication settings. For example, although
prototypical emotions are key points of emotion reference,
they cover a rather small part of our daily emotional
displays. Selection of affect categories that can describe the
wide variety of affective displays that people show in daily
interpersonal interactions needs to be done in a pragmatic
and context-dependent manner [102], [105].
An alternative to the categorical description of human
affect is the dimensional description [58], [114], [140], where
an affective state is characterized in terms of a small
number of latent dimensions rather than in terms of a small
number of discrete emotion categories. These dimensions
include evaluation, activation, control, power, etc. In
particular, the evaluation and activation dimensions are
expected to reflect the main aspects of emotion. The
evaluation dimension measures how positive or negative a human feels, while the activation dimension measures how likely the human is to take action in that emotional state, ranging from active to passive. In contrast
to categorical representation, dimensional representation
enables raters to label a range of emotions. However, the
projection of the high-dimensional emotional states onto a
rudimentary 2D space results, to some degree, in the loss of
information. Some emotions become indistinguishable (e.g.,
fear and anger), and some emotions lie outside the space
(e.g., surprise). This representation is not intuitive, and
raters need special training to use the dimensional labeling
system (e.g., the Feeltrace system [30]). In automatic
emotion recognition systems that are based on the 2D
dimensional emotion representation (e.g., [17] and [53]), the
problem is often further simplified to two-class (positive
versus negative and active versus passive) or four-class
(quadrants of 2D space) classification.
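To make this simplification concrete, the following sketch maps a continuous (evaluation, activation) rating, often called valence and arousal, onto the four quadrants of the 2D space, with two-class variants obtained by thresholding a single axis. The coordinate range, neutral midpoint, and example emotions are illustrative assumptions, not conventions taken from any specific system cited above.

```python
def quadrant_label(valence, arousal):
    """Map a point in the 2D evaluation/activation (valence/arousal) space
    to one of four coarse classes. Both inputs are assumed to lie in
    [-1, 1], with 0 as the neutral midpoint (an illustrative convention).
    """
    if valence >= 0 and arousal >= 0:
        return "positive-active"      # e.g., excited, happy
    if valence < 0 and arousal >= 0:
        return "negative-active"      # e.g., angry, fearful
    if valence < 0 and arousal < 0:
        return "negative-passive"     # e.g., sad, bored
    return "positive-passive"         # e.g., relaxed, content

def is_positive(valence):
    """Two-class simplification along the evaluation axis."""
    return valence >= 0

def is_active(arousal):
    """Two-class simplification along the activation axis."""
    return arousal >= 0

print(quadrant_label(0.7, 0.6))    # -> positive-active
print(quadrant_label(-0.4, 0.8))   # -> negative-active
```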
One of the most influential emotion theories in modern
psychology is the appraisal-based approach [117], which
can be regarded as the extension of the dimensional
approach described above. In this representation, an
emotion is described through a set of stimulus evaluation
checks, including the novelty, intrinsic pleasantness, goal-
based significance, coping potential, and compatibility with
standards. However, translating this scheme into an engineering framework for the purposes of automatic emotion
recognition remains challenging [116].
2.2 Association between Affect, Audio, and Visual
Signals
Affective arousal modulates all human communicative
signals. Psychologists and linguists have various opinions
about the importance of different cues (audio and visual
cues in this paper) in human affect judgment. Ekman [41]
found that the relative contributions of facial expression,
speech, and body gestures to affect judgment depend both
on the affective state and the environment where the
affective behavior occurs. On the other hand, some studies (e.g., [1] and [92]) indicate that facial expression is the most important affective cue and that it correlates well with cues from the body and voice. Many studies have
theoretically and empirically demonstrated the advantage
of the integration of multiple modalities (vocal and visual
expression) in human affect perception over single mod-
alities [1], [113], [117].
In contrast to traditional message judgment, in which the aim is to infer what underlies a displayed behavior (such as affect or personality), another major approach to human behavior measurement is sign judgment [26]. The aim of sign judgment is to describe the appearance, rather than the meaning, of the shown behavior, such as a facial signal, body gesture, or speech rate. While message judgment focuses on interpretation, sign judgment attempts an objective description, leaving the inference about the conveyed message to high-level
decision making. As indicated by Cohn [26], the most
commonly used sign judgment method for the manual
labeling of facial behavior is the Facial Action Coding
System (FACS) proposed by Ekman et al. [43]. FACS is a
comprehensive and anatomically based system that is used
to measure all visually discernible facial movements in
terms of atomic facial actions called Action Units (AUs). As
AUs are independent of interpretation, they can be used for
any high-level decision-making process, including the
recognition of basic emotions according to Emotional FACS
(EMFACS) rules2, the recognition of various affective states
according to the FACS Affect Interpretation Database
(FACSAID)2 introduced by Ekman et al. [43], and the
recognition of other complex psychological states such as
depression [47] or pain [144]. AUs of the FACS are very
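As a minimal illustration of how interpretation-free AU codes can feed such a higher-level decision step, the sketch below matches a set of detected AUs against a few prototypical AU combinations of the kind commonly cited in the FACS literature for the basic emotions. The table and matching rule are simplified stand-ins for illustration only and do not reproduce the actual EMFACS rule set, which encodes intensities and alternative combinations.

```python
# Illustrative prototypical AU combinations for the basic emotions,
# loosely based on combinations commonly cited in the FACS literature;
# the real EMFACS rules are considerably more detailed.
PROTOTYPES = {
    "happiness": {6, 12},           # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},        # inner brow raiser + brow lowerer + lip corner depressor
    "surprise":  {1, 2, 5, 26},     # brow raisers + upper lid raiser + jaw drop
    "fear":      {1, 2, 4, 5, 20},  # brow actions + upper lid raiser + lip stretcher
    "anger":     {4, 5, 7, 23},     # brow lowerer + lid actions + lip tightener
    "disgust":   {9, 15},           # nose wrinkler + lip corner depressor
}

def interpret_aus(detected_aus):
    """Return basic-emotion labels whose prototypical AU set is fully
    contained in the detected AUs (a deliberately crude matching rule)."""
    detected = set(detected_aus)
    return [emotion for emotion, aus in PROTOTYPES.items() if aus <= detected]

print(interpret_aus({6, 12, 25}))   # -> ['happiness']
print(interpret_aus({1, 4, 15}))    # -> ['sadness']
```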
suitable to use in studies on human naturalistic facial
behavior, as the thousands of anatomically possible facial
expressions (independent of their high-le