Affective and human-centered computing have attracted considerable
attention in recent years, mainly due to the abundance of
environments and applications able to exploit and adapt to multimodal
input from the users. The combination of facial expressions with prosody
information allows us to capture the users’ emotional state in an
unobtrusive manner, relying on the best-performing modality in cases
where one modality suffers from noise or poor sensing conditions. In this
paper, we describe a multi-cue, dynamic approach to detect emotion in
naturalistic video sequences, where input is taken from near-real-world
situations rather than from audiovisual material recorded under
controlled conditions. Recognition is performed via a recurrent neural
network, whose short-term memory and approximation capabilities are
suited to modeling dynamic events in facial and prosodic expressivity. This
approach also differs from existing work in that it models user
expressivity using a dimensional representation, instead of detecting
discrete ‘universal emotions’, which are scarce in everyday
human-machine interaction. The algorithm is applied to an audiovisual
database that was recorded to simulate human-human discourse and
therefore contains less extreme expressivity and subtle variations of a
number of emotion labels. Results show that, in turns lasting more than a
few frames, recognition rates rise to 98%.
Published in: Journal on Multimodal User Interfaces, Springer,
ISSN 1783-7677 (Print), 1783-8738 (Online), Volume 3, pp. 49-66