A Survey of Affect Recognition Methods: Audio,
Visual, and Spontaneous Expressions
Zhihong Zeng, Member, IEEE Computer Society, Maja Pantic, Senior Member, IEEE,
Glenn I. Roisman, and Thomas S. Huang, Fellow, IEEE
Abstract—Automated analysis of human affective behavior has attracted increasing attention from researchers in psychology,
computer science, linguistics, neuroscience, and related disciplines. However, the existing methods typically handle only deliberately
displayed and exaggerated expressions of prototypical emotions, despite the fact that deliberate behavior differs in visual appearance,
audio profile, and timing from spontaneously occurring behavior. To address this problem, efforts to develop algorithms that can
process naturally occurring human affective behavior have recently emerged. Moreover, an increasing number of efforts are reported
toward multimodal fusion for human affect analysis, including audiovisual fusion, linguistic and paralinguistic fusion, and multicue
visual fusion based on facial expressions, head movements, and body gestures. This paper introduces and surveys these recent
advances. We first discuss human emotion perception from a psychological perspective. Next, we examine available approaches for
solving the problem of machine understanding of human affective behavior and discuss important issues like the collection and
availability of training and test data. We finally outline some of the scientific and engineering challenges to advancing human affect
sensing technology.
Index Terms—Evaluation/methodology, human-centered computing, affective computing, introductory, survey.
1 INTRODUCTION
A widely accepted prediction is that computing will move to the background, weaving itself into the fabric
of our everyday living spaces and projecting the human
user into the foreground. Consequently, the future
“ubiquitous computing” environments will need to have
human-centered designs instead of computer-centered
designs [26], [31], [100], [107], [109]. Current human-
computer interaction (HCI) designs, however, usually
involve traditional interface devices such as the keyboard
and mouse and are constructed to emphasize the transmis-
sion of explicit messages while ignoring implicit informa-
tion about the user, such as changes in the affective state.
Yet, a change in the user’s affective state is a fundamental
component of human-human communication. Some affec-
tive states motivate human actions, and others enrich the
meaning of human communication. Consequently, the
traditional HCI, which ignores the user’s affective states,
filters out a large portion of the information available in
the interaction process. As a result, such interactions are
frequently perceived as cold, incompetent, and socially
inept. The human computing paradigm suggests that user
interfaces of the future need to be anticipatory and human
centered, built for humans, and based on naturally
occurring multimodal human communication [100], [109].
Specifically, human-centered interfaces must have the
ability to detect subtleties of and changes in the user’s
behavior, especially his/her affective behavior, and to
initiate interactions based on this information rather than
simply responding to the user’s commands.
Examples of affect-sensitive multimodal HCI systems
include the following:
1. the system of Lisetti and Nasoz [85], which combines
facial expression and physiological signals to recog-
nize the user’s emotions, like fear and anger, and
then to adapt an animated interface agent to mirror
the user’s emotion,
2. the multimodal system of Duric et al. [39], which
applies a model of embodied cognition that can
be seen as a detailed mapping between the
user’s affective states and the types of interface
adaptations,
3. the proactive HCI tool of Maat and Pantic [89],
which is capable of learning and analyzing the
user’s context-dependent behavioral patterns from
multisensory data and of adapting the interaction
accordingly,
4. the automated Learning Companion of Kapoor et al.
[72], which combines information from cameras, a
sensing chair and mouse, a wireless skin sensor, and
task state to detect frustration in order to predict
when the user needs help, and
5. the multimodal computer-aided learning system developed at the Beckman Institute, University of Illinois at Urbana-Champaign (UIUC) (http://itr.beckman.uiuc.edu), where a computer avatar offers an appropriate tutoring strategy based on the user's facial expression, keywords, eye movement, and task state.
These systems represent initial efforts toward the future
human-centered multimodal HCI.
Beyond standard HCI scenarios, potential commercial
applications of automatic human affect recognition include
affect-sensitive systems for customer services, call centers,
intelligent automobile systems, and game and entertain-
ment industries. These systems will change the ways in
which we interact with computer systems. For example, an
automatic service call center with an affect detector would
be able to make an appropriate response or pass control
over to human operators [83], and an intelligent automobile
system with a fatigue detector could monitor the vigilance
of the driver and apply an appropriate action to avoid
accidents [69].
Another important application of automated systems for
human affect recognition is in affect-related research (e.g.,
in psychology, psychiatry, behavioral science, and neu-
roscience), where such systems can improve the quality of
the research by improving the reliability of measurements
and speeding up the currently tedious manual task of
processing data on human affective behavior [47]. The
research areas that would reap substantial benefits from
such automatic tools include social and emotional devel-
opment research [111], mother-infant interaction [29],
tutoring [54], psychiatric disorders [45], and studies on
affective expressions (e.g., deception) [65], [47]. Automated
detectors of affective states and moods, including fatigue,
depression, and anxiety, could also form an important step
toward personal wellness and assistive technologies [100].
Because of this practical importance and the theoretical
interest of cognitive scientists, automatic human affect
analysis has attracted the interest of many researchers in
the last three decades. Suwa et al. [127] presented an early
attempt in 1978 to automatically analyze facial expressions.
Vocal emotion analysis has an even longer history,
starting with the study of Williams and Stevens in 1972
[145]. Since the late 1990s, an increasing number of efforts
toward automatic affect recognition have been reported in the
literature. Early efforts toward machine affect recognition
from face images include those of Mase [90], and Kobayashi
and Hara [76] in 1991. Early efforts toward the machine
analysis of basic emotions from vocal cues include studies
like that of Dellaert et al. in 1996 [33]. The study of Chen
et al. in 1998 [22] represents an early attempt toward
audiovisual affect recognition. For exhaustive surveys of the
past work in the machine analysis of affective expressions,
readers are referred to [115], [31], [102], [49], [96], [105],
[130], [121], and [98], published between 1992 and 2007.
Overall, most of the existing approaches to automatic
human affect analysis are the following:
. approaches that are trained and tested on a
deliberately displayed series of exaggerated affective
expressions,
. approaches that are aimed at recognition of a small
number of prototypical (basic) expressions of emo-
tion (i.e., happiness, sadness, anger, fear, surprise,
and disgust), and
. single-modal approaches, where information pro-
cessed by the computer system is limited to either
face images or speech signals.
Accordingly, previously published survey papers have focused on reviewing efforts toward single-modal analysis of deliberately displayed (artificial) affective expressions; among them, the papers of Cowie et al. in 2001 [31] and of Pantic and Rothkrantz in 2003 [102] have been the most comprehensive and widely cited in this field to date. At
the time when these surveys were written, most of the
available data sets of affective displays were small and
contained only deliberate affective displays (mainly of the
six prototypical emotions) recorded under highly con-
strained conditions. Multimedia data were rare; there were no 3D data on facial affective behavior, no data of combined face and body displays of affective behavior, and only rarely data that included spontaneous displays of affective behavior.
Hence, while automatic detection of the six basic
emotions in posed controlled audio or visual displays can
be done with reasonably high accuracy, detecting these
expressions or any expression of human affective behavior
in less constrained settings is still a very challenging
problem because deliberate behavior differs in visual appearance, audio profile, and timing from spontaneously occurring behavior. In response to this criticism from both cognitive and computer scientists, the focus of research in the field has started to shift to the automatic analysis of spontaneously displayed affective behavior.
Several studies have recently emerged on the machine
analysis of spontaneous facial expressions (e.g., [10], [28],
[135], and [4]) and vocal expressions (e.g., [12] and [83]).
Also, several experimental studies have shown that integrating audio and video information improves the performance of affective behavior
recognition. The improved reliability of audiovisual ap-
proaches in comparison to single-modal approaches can be
explained as follows: Current techniques for the detection
and tracking of facial expressions are sensitive to head
pose, clutter, and variations in lighting conditions, while
current techniques for speech processing are sensitive to
auditory noise. Audiovisual fusion can make use of the
complementary information from these two channels. In
addition, many psychological studies have theoretically
and empirically demonstrated the importance of the
integration of information from multiple modalities (vocal
and visual expression in this paper) to yield a coherent
representation and inference of emotions [1], [113], [117].
As a result, an increased number of studies on audiovisual
human affect recognition have emerged in recent years
(e.g., [17], [53], and [151]).
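To make the idea of exploiting complementary channels concrete, the sketch below shows a simple decision-level (late) fusion of independent audio and visual classifiers, with per-channel weights reflecting assumed channel reliability. It is a minimal illustration only: the label set, posteriors, and weights are hypothetical and stand in for whatever models and fusion scheme a particular system in the surveyed literature actually uses.

```python
import numpy as np

# Hypothetical emotion label set shared by both channels (illustrative only).
LABELS = ["happiness", "sadness", "anger", "fear", "surprise", "disgust"]

def fuse_decisions(p_audio, p_video, w_audio=0.4, w_video=0.6):
    """Weighted late fusion of audio and visual class posteriors.

    p_audio, p_video: per-class probabilities from two independently
    trained classifiers (each sums to 1).
    w_audio, w_video: assumed channel-reliability weights, e.g., the video
    weight could be lowered when face tracking is known to be unreliable.
    """
    p_audio, p_video = np.asarray(p_audio), np.asarray(p_video)
    fused = w_audio * p_audio + w_video * p_video
    fused /= fused.sum()  # renormalize to a valid distribution
    return LABELS[int(fused.argmax())], fused

# Example: the audio channel favors "anger", the visual channel "fear";
# the fused decision weighs both according to channel reliability.
label, posterior = fuse_decisions(
    p_audio=[0.05, 0.05, 0.55, 0.20, 0.10, 0.05],
    p_video=[0.05, 0.05, 0.25, 0.45, 0.10, 0.10])
print(label, posterior.round(2))
```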
This paper introduces and surveys these recent advances
in the research on human affect recognition. In contrast to
previously published survey papers in the field, it focuses
on the approaches that can handle audio and/or visual
recordings of spontaneous (as opposed to posed) displays of
affective states. It also examines the state-of-the-art methods
that have not been reviewed in previous survey papers but
are important, specifically for advancing human affect
sensing technology. Finally, we discuss the collection and
availability of training and test data in detail. This paper is
organized as follows: Section 2 describes the human
perception of affect from a psychological perspective.
Section 3 provides a detailed review of the related studies,
including multimedia emotion databases and existing
human affect recognition methods. Section 4 discusses
some of the challenges that researchers face in this field. A
summary and closing remarks conclude this paper.
2 HUMAN AFFECT (EMOTION) PERCEPTION
Automatic affect recognition is inherently a multidisciplin-
ary enterprise involving different research fields, including
psychology, linguistics, computer vision, speech analysis,
and machine learning. There is no doubt that the progress
in automatic affect recognition is contingent on the progress
of the research in each of those fields [44].
2.1 The Description of Affect
We begin by briefly introducing three primary ways that
affect has been conceptualized in psychological research.
Research on the basic structure and description of affect is
important in that these conceptualizations provide informa-
tion about the affective displays that automatic emotion
recognition systems are designed to detect.
Perhaps the most long-standing way that affect has been
described by psychologists is in terms of discrete categories,
an approach that is rooted in the language of daily life [40],
[41], [46], [131]. The most popular example of this
description is the prototypical (basic) emotion categories,
which include happiness, sadness, fear, anger, disgust, and
surprise. This description of basic emotions was particularly supported by the cross-cultural studies of Ekman [40], [42], which indicate that humans perceive certain basic emotions from facial expressions in the same way regardless of culture. Owing to the influence of basic emotion theory, most existing studies of automatic affect recognition focus on recognizing these basic emotions. The main advantage of a categorical representation is that people use this scheme to describe observed emotional displays in daily life; categorical labeling is intuitive and matches everyday experience. However, discrete lists of
emotions fail to describe the range of emotions that occur in
natural communication settings. For example, although
prototypical emotions are key points of emotion reference,
they cover a rather small part of our daily emotional
displays. Selection of affect categories that can describe the
wide variety of affective displays that people show in daily
interpersonal interactions needs to be done in a pragmatic
and context-dependent manner [102], [105].
An alternative to the categorical description of human
affect is the dimensional description [58], [114], [140], where
an affective state is characterized in terms of a small
number of latent dimensions rather than in terms of a small
number of discrete emotion categories. These dimensions
include evaluation, activation, control, power, etc. In
particular, the evaluation and activation dimensions are
expected to reflect the main aspects of emotion. The
evaluation dimension measures how positive or negative a human feels, while the activation dimension measures how likely the human is to take action in that emotional state, ranging from active to passive. In contrast
to categorical representation, dimensional representation
enables raters to label a range of emotions. However, the
projection of the high-dimensional emotional states onto a
rudimentary 2D space results, to some degree, in the loss of
information. Some emotions become indistinguishable (e.g.,
fear and anger), and some emotions lie outside the space
(e.g., surprise). This representation is not intuitive, and
raters need special training to use the dimensional labeling
system (e.g., the Feeltrace system [30]). In automatic
emotion recognition systems that are based on the 2D
dimensional emotion representation (e.g., [17] and [53]), the
problem is often further simplified to two-class (positive
versus negative and active versus passive) or four-class
(quadrants of 2D space) classification.
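To make this simplification concrete, the following sketch maps a continuous (evaluation, activation) rating, often called valence and arousal, onto the four quadrants of the 2D space, with two-class variants obtained by thresholding a single axis. The coordinate range, neutral midpoint, and example emotions are illustrative assumptions, not conventions taken from any specific system cited above.

```python
def quadrant_label(valence, arousal):
    """Map a point in the 2D evaluation/activation (valence/arousal) space
    to one of four coarse classes. Both inputs are assumed to lie in
    [-1, 1], with 0 as the neutral midpoint (an illustrative convention).
    """
    if valence >= 0 and arousal >= 0:
        return "positive-active"      # e.g., excited, happy
    if valence < 0 and arousal >= 0:
        return "negative-active"      # e.g., angry, fearful
    if valence < 0 and arousal < 0:
        return "negative-passive"     # e.g., sad, bored
    return "positive-passive"         # e.g., relaxed, content

def is_positive(valence):
    """Two-class simplification along the evaluation axis."""
    return valence >= 0

def is_active(arousal):
    """Two-class simplification along the activation axis."""
    return arousal >= 0

print(quadrant_label(0.7, 0.6))    # -> positive-active
print(quadrant_label(-0.4, 0.8))   # -> negative-active
```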
One of the most influential emotion theories in modern
psychology is the appraisal-based approach [117], which
can be regarded as the extension of the dimensional
approach described above. In this representation, an
emotion is described through a set of stimulus evaluation
checks, including the novelty, intrinsic pleasantness, goal-
based significance, coping potential, and compatibility with
standards. However, translating this scheme into an engineering framework for the purposes of automatic emotion
recognition remains challenging [116].
2.2 Association between Affect, Audio, and Visual
Signals
Affective arousal modulates all human communicative
signals. Psychologists and linguists have various opinions
about the importance of different cues (audio and visual
cues in this paper) in human affect judgment. Ekman [41]
found that the relative contributions of facial expression,
speech, and body gestures to affect judgment depend both
on the affective state and the environment where the
affective behavior occurs. On the other hand, some studies (e.g., [1] and [92]) indicate that facial expression is the most important affective cue and that it correlates well with cues from the body and voice. Many studies have
theoretically and empirically demonstrated the advantage
of the integration of multiple modalities (vocal and visual
expression) in human affect perception over single mod-
alities [1], [113], [117].
In contrast to traditional message judgment, in which the aim is to infer what underlies a displayed behavior (such as affect or personality), another major approach to human behavior measurement is sign judgment [26]. The aim of sign judgment is to describe the appearance, rather than the meaning, of the shown behavior, such as a facial signal, body gesture, or speech rate. While message judgment focuses on interpretation, sign judgment attempts an objective description, leaving the inference about the conveyed message to high-level
decision making. As indicated by Cohn [26], the most
commonly used sign judgment method for the manual
labeling of facial behavior is the Facial Action Coding
System (FACS) proposed by Ekman et al. [43]. FACS is a
comprehensive and anatomically based system that is used
to measure all visually discernible facial movements in
terms of atomic facial actions called Action Units (AUs). As
AUs are independent of interpretation, they can be used for
any high-level decision-making process, including the
recognition of basic emotions according to Emotional FACS
(EMFACS) rules2, the recognition of various affective states
according to the FACS Affect Interpretation Database
(FACSAID)2 introduced by Ekman et al. [43], and the
recognition of other complex psychological states such as
depression [47] or pain [144]. AUs of the FACS are very
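As a minimal illustration of how interpretation-free AU codes can feed such a higher-level decision step, the sketch below matches a set of detected AUs against a few prototypical AU combinations of the kind commonly cited in the FACS literature for the basic emotions. The table and matching rule are simplified stand-ins for illustration only and do not reproduce the actual EMFACS rule set, which encodes intensities and alternative combinations.

```python
# Illustrative prototypical AU combinations for the basic emotions,
# loosely based on combinations commonly cited in the FACS literature;
# the real EMFACS rules are considerably more detailed.
PROTOTYPES = {
    "happiness": {6, 12},           # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},        # inner brow raiser + brow lowerer + lip corner depressor
    "surprise":  {1, 2, 5, 26},     # brow raisers + upper lid raiser + jaw drop
    "fear":      {1, 2, 4, 5, 20},  # brow actions + upper lid raiser + lip stretcher
    "anger":     {4, 5, 7, 23},     # brow lowerer + lid actions + lip tightener
    "disgust":   {9, 15},           # nose wrinkler + lip corner depressor
}

def interpret_aus(detected_aus):
    """Return basic-emotion labels whose prototypical AU set is fully
    contained in the detected AUs (a deliberately crude matching rule)."""
    detected = set(detected_aus)
    return [emotion for emotion, aus in PROTOTYPES.items() if aus <= detected]

print(interpret_aus({6, 12, 25}))   # -> ['happiness']
print(interpret_aus({1, 4, 15}))    # -> ['sadness']
```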
suitable to use in studies on human naturalistic facial
behavior, as the thousands of anatomically possible facial
expressions (independent of their high-le