
Online Multimodal-Laughter Detection System

Posted on Tue 16 Sep 2014
Right on time to coincide with the end of the project, ILHAIRE's online multimodal laughter detection system has been finalised. In Year 3, the framework was further developed and completed. User interaction is captured in real time using a headset delivering high-quality audio at 48 kHz; a Microsoft Kinect providing RGB video and a depth image at 25 Hz, as well as tracking of facial points and action units; and a respiration belt recording exhalation at 125 Hz. Sensor streams are collected and synchronised through the Social Signal Interpretation (SSI) framework. If desired, raw signals are stored for later analysis. If the interaction of several users is captured, SSI uses a global synchronisation signal to keep the involved machines in sync.

Activity recognisers (VAD and FAD modules) detect actions in voice and face, which are further analysed using pre-trained models that convert the input signals into probabilities for Smile (from action units) as well as Laughter and Speech (from voice). In addition, voiced parts of the audio signal are analysed for laughter intensity using a pre-trained Weka model. Raw probabilities, as well as combined decisions (obtained through vector fusion), are then provided to the Dialog Manager (DM) through ActiveMQ. RGB and depth streams are forwarded to EyesWeb, where silhouette and shoulder features are extracted. Likewise, the raw respiration signal is sent to EyesWeb for further processing. The RGB video is also published as a UDP stream on the network using FFmpeg. In this way, multi-user sessions can be recorded in separate rooms, allowing each participant to watch the video streams of the other users. The following figure shows the overall final architecture of the ILHAIRE laughter analysis and fusion framework.
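The vector-fusion step can be illustrated with a minimal sketch. The names, weights, and values below are hypothetical; the actual fusion inside the SSI pipeline is considerably more elaborate. The idea is simply that each modality contributes probabilities for the events it can observe, and a weighted combination yields the joint decision handed to the Dialog Manager.

```python
# Hypothetical sketch of vector fusion over per-modality event
# probabilities; modality names, events, and weights are invented.

def fuse(modality_probs, weights=None):
    """modality_probs: dict mapping modality -> {event: probability}.
    Returns fused probabilities over all events any modality reported."""
    if weights is None:
        weights = {m: 1.0 for m in modality_probs}
    total = sum(weights.values())
    events = {e for probs in modality_probs.values() for e in probs}
    return {
        event: sum(
            weights[m] * probs.get(event, 0.0)
            for m, probs in modality_probs.items()
        ) / total
        for event in events
    }

# Toy input: the face model reports Smile, the voice model reports
# Laughter vs. Speech (all values invented for illustration).
probs = fuse({
    "face":  {"smile": 0.8},
    "voice": {"laughter": 0.7, "speech": 0.2},
})
decision = max(probs, key=probs.get)
```

Even a rule this simple exposes a trade-off the fusion module has to manage: a modality that says nothing about an event effectively votes against it.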

----- References:
J. Urbain, R. Niewiadomski, J. Hofmann, E. Bantegnie, T. Baur, N. Berthouze, H. Cakmak, R. T. Cruz, S. Dupont, M. Geist, H. Griffin, F. Lingenfelser, M. Mancini, M. Miranda, G. McKeown, S. Pammi, O. Pietquin, B. Piot, T. Platt, W. Ruch, A. Sharma, G. Volpe, and J. Wagner. 2012. "Laugh Machine". Proceedings of the 8th International Summer Workshop on Multimodal Interfaces (eNTERFACE'12), pp. 13-34, July, Metz, France.

Johannes Wagner, Florian Lingenfelser, Tobias Baur, Ionut Damian, Felix Kistler, and Elisabeth André. 2013. The social signal interpretation (SSI) framework: multimodal signal processing and recognition in real-time. In Proceedings of the 21st ACM international conference on Multimedia (MM ’13). ACM, New York, NY, USA, 831-834. DOI=10.1145/2502081.2502223

Antonio Camurri, Shuji Hashimoto, Matteo Ricchetti, Andrea Ricci, Kenji Suzuki, Riccardo Trocca, and Gualtiero Volpe. 2000. EyesWeb: Toward Gesture and Affect Recognition in Interactive Dance and Music Systems. Comput. Music J. 24, 1 (April 2000), 57-69. DOI=10.1162/014892600559182

Real-time acoustic laughter synthesis

Posted on Fri 05 Sep 2014 by Jerome Urbain
In the framework of the eNTERFACE'14 Summer School held in June in Bilbao (Spain), the ILHAIRE team has investigated easily controllable real-time laughter synthesis.

To enable convenient control over the synthesis, the algorithm that automatically generates phonetic transcriptions from intensity curves has been refined, ported to real time, and integrated into a graphical user interface built with Puredata. The user can hence control an intensity slider in real time and listen to the synthesized laugh. The generation algorithm receives the intensity value point by point and selects laughter syllables one by one in real time using a unit-selection approach[1] (Urbain, Cakmak, Charlier, Denti, & Dutoit, 2014).
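The unit-selection step can be sketched in a few lines. This is a toy version with invented syllables and a caller-supplied stand-in for the concatenation cost; the published approach derives that cost from N-gram probabilities over syllable transcriptions (see footnote [1]).

```python
# Toy unit selection: pick the syllable minimising the sum of a
# target cost (distance to the requested intensity) and a
# concatenation cost (a stand-in here for the N-gram-based score
# used in the actual algorithm).

def select_syllable(target_intensity, repository, concat_cost):
    """repository: list of (transcription, intensity) pairs."""
    def total_cost(unit):
        transcription, intensity = unit
        return abs(target_intensity - intensity) + concat_cost(transcription)
    return min(repository, key=total_cost)

repo = [("ha", 0.9), ("he", 0.5), ("hh", 0.1)]  # invented syllables
unit = select_syllable(0.55, repo, concat_cost=lambda t: 0.0)
# with a zero concatenation cost, the closest intensity wins
```

In the real system this selection runs once per incoming intensity point, so the laugh is built syllable by syllable as the user moves the slider.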

The MAGE platform (Astrinaki, D'Alessandro, Reboursière, Moinet, & Dutoit, 2013) has been used to perform real-time HMM-based laughter synthesis, using the phonetic transcriptions received in real time from the generation algorithm. MAGE implements HMM-synthesis techniques similar to those of HTS, which was previously used in ILHAIRE for offline laughter synthesis. The main difference introduced by MAGE is real-time processing, which means the waveform is computed piece by piece without knowledge of future input. This poses a significant challenge for optimizing the trajectories and ensuring a smooth, coherent output. After tuning the method to laughter, we were able to synthesize laughs in real time using MAGE. MAGE also offers real-time control over the synthesis parameters, for example pitch. A demonstration video is presented below (Video 1). The synthesis quality does appear lower than in the offline (HTS) setting, which has been confirmed by perceptual evaluations conducted at the University of Mons.

Real-time HMM-based laughter synthesis. The user controls the intensity slider in the middle of the screen. Synthesis parameters can also be easily modified, as illustrated with pitch in the video.
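The streaming constraint mentioned above can be made concrete with a toy smoother. This is not MAGE's actual trajectory optimization, just an illustration of the difference: an offline filter may look at future samples, while a causal (streaming) one only sees the past, which is why incremental synthesis makes smooth trajectories harder.

```python
# Toy illustration of the streaming constraint: a centred moving
# average (offline, can see the future) vs. a causal one (online,
# past samples only). Both use the same total window length.

def smooth_offline(xs, k=1):
    """Centred window: may use up to k future samples."""
    out = []
    for i in range(len(xs)):
        window = xs[max(0, i - k): i + k + 1]
        out.append(sum(window) / len(window))
    return out

def smooth_online(xs, k=1):
    """Causal window: only past samples are available."""
    out = []
    for i in range(len(xs)):
        window = xs[max(0, i - 2 * k): i + 1]
        out.append(sum(window) / len(window))
    return out

step = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
# at the transition, the causal filter reacts one step later than
# the centred one, because it cannot anticipate the change
```

The same effect, at much larger scale, is what the real-time synthesizer has to fight when generating parameter trajectories piece by piece.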

Hence, it was decided to explore another synthesis technique, laughter concatenation, which is known for better audio quality than HMM-based synthesis but is also less flexible. With concatenation, instead of sending the phonetic transcriptions from the generation algorithm to a synthesizer, we directly play the sound of the syllable selected by the generation algorithm. An example video is displayed below (Video 2). The audio quality is indeed better than with HMM-based synthesis, at the cost of flexibility. For instance, changing the pitch in real time is difficult with this technique and could only be done at the cost of a drop in audio quality. Some repetitions of sounds can also appear, which may be disturbing over longer interactions.

Real-time laughter synthesis performed by concatenating syllables. The user controls the green intensity slider at the top of the screen. The choice of selected syllables can be modified by clicking on the square at the top left, as illustrated in the video by successively favoring the vowel “a”, the vowel “e”, and breathing sounds.
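A minimal sketch of the concatenation step follows. The representation is assumed: each syllable is a pre-recorded waveform, shown here as a plain list of samples, and a short linear crossfade is one common way to soften the joins. We make no claim that this matches the eNTERFACE implementation in detail.

```python
# Concatenate syllable waveforms with a short linear crossfade to
# soften the joins. Waveforms are plain lists of float samples
# (a real system would use sample buffers at the audio rate).

def concatenate(units, fade=32):
    """units: list of waveforms; fade: overlapping samples per join."""
    out = list(units[0])
    for unit in units[1:]:
        overlap = min(fade, len(out), len(unit))
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # ramp weight toward the new unit
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * unit[i]
        out.extend(unit[overlap:])
    return out

laugh = concatenate([[1.0] * 4, [0.0] * 4], fade=2)
# the join is blended rather than jumping abruptly from 1.0 to 0.0
```

A crossfade hides clicks at unit boundaries, but it cannot fix the deeper flexibility limits noted above, such as real-time pitch changes.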

The competition between the two synthesis techniques remains open, and the preference for one method over the other depends on the application (e.g., how important is flexibility?). Further research is currently being carried out at UMONS to improve and evaluate both techniques.

----- References:

Astrinaki, M., D'Alessandro, N., Reboursière, L., Moinet, A., & Dutoit, T. (2013). MAGE 2.0: New features and its application in the development of a talking guitar. 13th Conference on New Interfaces for Musical Expression (NIME'13). Daejeon and Seoul, South Korea.

Urbain, J., Cakmak, H., Charlier, A., Denti, M., & Dutoit, T. (2014). Arousal-driven Synthesis of Laughter. IEEE Journal of Selected Topics in Signal Processing, 8(2), 273-284.

[1] The syllable is selected from a repository of available syllables. For each syllable in the repository, a target cost is defined as the distance between the target intensity (given by the user) and the syllable intensity, and a concatenation cost is defined from the N-gram probability of adding the syllable transcription to the already selected transcription. The selected syllable is the one with the lowest total cost (target + concatenation).

Laughter data collection in Peru

Posted on Mon 01 Sep 2014
One of the goals of Work Package 1 in the ILHAIRE project was to create a multicultural and multimodal database of laughter. The multimodal part of the database was achieved through the collaboration of the Queen's University Belfast and University of Augsburg teams. Together we set up a multimodal recording installation capable of recording four people in conversation at a time. It used four Kinect systems to capture movement data, four HD webcams for the visual components, and four high-quality microphones for the auditory streams. The installation required nine computers and a Network Attached Storage device capable of storing fifteen terabytes of data. All the various streams of data were synchronised using the Social Signal Interpretation (SSI) software developed by the University of Augsburg team.

We used this setup to record two parts of the ILHAIRE laughter database: the Belfast Storytelling Database, which sought to capture hilarious laughter, and the Belfast Conversational Dyads, which sought to capture conversational laughter. In the end, both contained large quantities of laughter that could be categorised as hilarious and conversational. In the Belfast Storytelling Database we addressed some of our multicultural goals by using two linguistic groups: English speakers and Spanish speakers. However, the multicultural goal of the work package had been to test in a place that was not European and had relatively little interaction with WEIRD (Western, Educated, Industrialized, Rich, and Democratic) people. We chose to do some testing in Peru, as the Queen's University team had previously collected recordings of emotional material in Peru as part of the Belfast Induced Natural Emotion Database.

Adding this truly multicultural component to the database presented many challenges, not least of which was converting a fairly static hardware installation into something more mobile. However, we also wanted to keep the setup as similar as possible to that used in the Belfast data collection sessions so that we could obtain comparable data. We decided to test dyads and groups of three using our conversational task, as it was the most natural of our tasks.

In creating our mobile strategy, we decided to bring the sensor equipment and rent computers while we were out there. We came to an agreement with a local internet café to rent their computers for the duration of the project. Storage presented a problem, but we took three 2-terabyte portable drives with us. We packed up the equipment in a flight box and headed to a town called Chincha Alta in coastal Peru.

Despite being adventurous and exciting, cross-cultural data collection is always fraught with unforeseen issues. When we arrived, we had the nasty surprise of being forced to pay an import duty on all the equipment; after haggling the duty down to about 50% of the original cost, we paid it and proceeded to Chincha. There were more problems: despite assurances to the contrary, the internet café had underpowered computers, so only one served to collect the auditory data, and we had to search for additional computers powerful enough to run the equipment. We managed to hire enough reasonably powerful laptops for a sufficient set-up, but unfortunately we could not find machines fast enough to gather the Kinect data streams. So we settled on gathering audio and video streams from dyads and groups.

We hired a house outside of town to minimise the noise disturbances that had been an issue in the Belfast Induced Natural Emotion Database. We set up the lab and data collection started. One major issue was getting sufficient lighting: the rooms were dark and had only one window. We used natural daylight with translucent gauze on the windows to diffuse the light, but for short periods of the day the sun shone directly through, causing problems. We also had several artificial up-lighters to create a more even spread of lighting; however, these were popular with many local insects that took an interest in our work, and occasionally the dead insects would build up and catch fire on the halogen bulbs. Insects also caused problems with the felt backdrops we used to provide an even background, and regular inspections had to take place to stop them eating large holes in the felt that might show up on the videos.

In the Peru data collection the participants sat opposite each other with the HD webcam placed on a table between them, replicating as closely as possible the setup used in Belfast.

We had many other issues that interfered with the data collection. Dust is a big problem in coastal Peru, and keeping it out of the equipment required effort. Power cuts were common but thankfully rare at the times of day when we were testing, and the house had a back-up generator that minimised disruption. However, at one stage thieves cut into and stole the power lines that delivered electricity to the area, which meant there was no reliable electricity for two days.

Recruitment of participants was an issue: men leave for work during the day and women are often looking after children, and both factors need to be considered; looking after children while participants were in conversation became part of the workload. As many men work during the week, the weekends become an important time to gather male and mixed-sex dyads, and planning is required to maximise opportunities for gathering data that doesn't fit the normal rhythms of local life. Illiteracy can be a problem, and most participants had never filled in questionnaires before, so many things that can be taken for granted with "WEIRD" participants take much longer with those who have never been involved in anything like this. Punctuality is also often an issue: a neat and tidy testing timetable can require constant revision to adapt to local chronological norms. Adaptability is the key, and it is also very useful if the topic of your scientific investigation is laughter!

One of the major limiting factors on the speed of recording was the size of the data files: our setup captured raw information streams from the HD webcams, creating files of up to 150 GB per one-hour session for each participant. These had to be copied from the hard drives to the storage devices between sessions, which took a considerable amount of time. We collected data until there were about five terabytes of material on the storage devices, giving us 20 dyad sessions and 4 group sessions. Unfortunately, one of the storage devices took a knock in transit on the way back, and its data was lost and cannot be recovered. Given the file sizes and the time taken to transfer files between devices, it was not possible to create backups, although some compressed versions had been made. The damaged device was the one with the least data on it, so only a small number of sessions were lost, but the loss has impacted the group session data most severely.
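The figures above imply demanding sustained write rates, which helps explain both the copying delays and why full backups were impractical. A quick back-of-the-envelope check, assuming exactly 150 GB per one-hour stream and a hypothetical ~35 MB/s portable-drive throughput (the real drive speed was not recorded):

```python
# Back-of-the-envelope data-rate check for the Peru recording setup.
# The 35 MB/s drive throughput is an assumed figure for illustration.
MB_PER_GB = 1024

rate_mb_s = 150 * MB_PER_GB / 3600          # per-participant write rate, ~43 MB/s
dyad_session_gb = 150 * 2                   # one-hour session, two participants
copy_s = dyad_session_gb * MB_PER_GB / 35   # time to copy one session to a drive
copy_hours = copy_s / 3600                  # roughly 2.4 hours per dyad session
```

At those rates, copying a single session could take longer than recording it, which is consistent with the copying bottleneck described above.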

So, despite the trials and tribulations of data collection in Peru and the best efforts of the local fauna to upset our goals, we made it back with a substantial number of synchronised, high-quality recordings of multimodal natural conversations between Peruvian participants. These form a further addition to the ILHAIRE laughter database, providing a set of data comparable to those recorded during the Belfast sessions.