Online Multimodal Laughter Detection System

Posted on Tue 16 Sep 2014
Right on time, and coinciding with the end of the project, ILHAIRE's online multimodal laughter detection system has been finalised. In Year 3 the framework was further developed and brought to its final form.

User interaction is captured in real time using a headset delivering high-quality audio at 48 kHz; a Microsoft Kinect providing RGB video and a depth image at 25 Hz, as well as tracking of facial points and action units; and a respiration belt recording exhalation at 125 Hz. The sensor streams are collected and synchronised through the Social Signal Interpretation (SSI) framework. If desired, the raw signals are stored for later analysis. When the interaction of several users is captured, SSI uses a global synchronisation signal to keep the machines involved in sync.

Activity recognisers (the VAD and FAD modules) detect activity in voice and face; the detected segments are then analysed by pre-trained models that convert the input signals into probabilities for Smile (from action units) as well as Laughter and Speech (from voice). In addition, voiced parts of the audio signal are analysed for laughter intensity using a pre-trained Weka model. The raw probabilities, together with combined decisions obtained through vector fusion, are then provided to the Dialog Manager (DM) through ActiveMQ (a simplified sketch of this step is given below the figure).

The RGB and depth streams are forwarded to EyesWeb, where silhouette and shoulder features are extracted; likewise, the raw respiration signal is sent to EyesWeb for further processing. The RGB video is also published as a UDP stream on the network using FFmpeg. In this way, multi-user sessions can be recorded in separate rooms, with each participant watching the video streams of the other users. The following figure shows the overall final architecture of the ILHAIRE laughter analysis and fusion framework.

[Figure: overall architecture of the ILHAIRE laughter analysis and fusion framework]

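To give a concrete feel for the last step of the pipeline, here is a minimal Python sketch of how per-modality probabilities could be fused and handed to the Dialog Manager over ActiveMQ (using the third-party stomp.py client, since ActiveMQ exposes a STOMP connector). The topic name, message format, fusion weights, and decision threshold are illustrative assumptions; the actual system performs the fusion inside SSI and uses its own message format.

```python
import json
import time

import stomp  # stomp.py client; assumes an ActiveMQ broker with its STOMP connector enabled


def fuse(probabilities, weights):
    """Weighted vector fusion (illustrative): combine per-modality
    probabilities into a single laughter score in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[m] * probabilities.get(m, 0.0) for m in weights) / total


def publish(conn, topic, probabilities, weights):
    """Send the raw probabilities and the fused decision to the Dialog Manager."""
    fused = fuse(probabilities, weights)
    message = {
        "timestamp": time.time(),
        "raw": probabilities,        # e.g. smile (from action units), laughter/speech (from voice)
        "fused_laughter": fused,     # combined decision
        "is_laughing": fused > 0.5,  # illustrative threshold
    }
    conn.send(destination=topic, body=json.dumps(message))


if __name__ == "__main__":
    # Assumed broker location and topic name; adjust to the actual deployment.
    conn = stomp.Connection([("localhost", 61613)])
    conn.connect(wait=True)

    # One example frame of per-modality probabilities.
    frame = {"smile": 0.82, "laughter": 0.74, "speech": 0.10}
    weights = {"smile": 0.4, "laughter": 0.6}  # illustrative fusion weights

    publish(conn, "/topic/ilhaire.laughter", frame, weights)
    conn.disconnect()
```

Publishing both the raw per-modality probabilities and the fused decision, as the system does, lets the Dialog Manager either rely on the combined estimate or apply its own fusion policy.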
----- References:
J. Urbain, R. Niewiadomski, J. Hofmann, E. Bantegnie, T. Baur, N. Berthouze, H. Cakmak, R. T. Cruz, S. Dupont, M. Geist, H. Griffin, F. Lingenfelser, M. Mancini, M. Miranda, G. McKeown, S. Pammi, O. Pietquin, B. Piot, T. Platt, W. Ruch, A. Sharma, G. Volpe, and J. Wagner. 2012. Laugh Machine. In Proceedings of the 8th International Summer Workshop on Multimodal Interfaces (eNTERFACE'12), pp. 13-34, July 2012, Metz, France.

Johannes Wagner, Florian Lingenfelser, Tobias Baur, Ionut Damian, Felix Kistler, and Elisabeth André. 2013. The social signal interpretation (SSI) framework: multimodal signal processing and recognition in real-time. In Proceedings of the 21st ACM international conference on Multimedia (MM ’13). ACM, New York, NY, USA, 831-834. DOI=10.1145/2502081.2502223 http://doi.acm.org/10.1145/2502081.2502223

Antonio Camurri, Shuji Hashimoto, Matteo Ricchetti, Andrea Ricci, Kenji Suzuki, Riccardo Trocca, and Gualtiero Volpe. 2000. EyesWeb: Toward Gesture and Affect Recognition in Interactive Dance and Music Systems. Comput. Music J. 24, 1 (April 2000), 57-69. DOI=10.1162/014892600559182 http://dx.doi.org/10.1162/014892600559182