To enable convenient control over the synthesis, the algorithm that automatically generates phonetic transcriptions from intensity curves has been refined, ported to real-time operation, and integrated into a graphical user interface built with Pure Data. The user can thus move an intensity slider in real time and listen to the synthesized laugh. The generation algorithm receives the intensity values point by point and selects laughter syllables one by one in real time using a unit-selection approach (Urbain, Cakmak, Charlier, Denti, & Dutoit, 2014).
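The point-by-point selection loop can be sketched as follows. This is an illustrative toy, not the ILHAIRE implementation: the syllable inventory, intensity labels, and cost weights are invented, and the real system's unit-selection costs are those described in Urbain et al. (2014).

```python
# Toy sketch of intensity-driven unit selection: each incoming intensity
# value picks the laughter syllable whose stored intensity best matches,
# while penalising abrupt jumps from the previously chosen unit.
# The inventory and costs below are purely illustrative.

SYLLABLES = [
    {"label": "ha_low",  "intensity": 0.2},
    {"label": "ha_mid",  "intensity": 0.5},
    {"label": "ha_high", "intensity": 0.9},
    {"label": "hi_mid",  "intensity": 0.55},
]

def select_syllable(intensity, previous=None, join_weight=0.3):
    """Return the unit minimising target cost + concatenation cost."""
    def cost(unit):
        target = abs(unit["intensity"] - intensity)
        # crude concatenation cost: intensity jump from the last unit
        join = abs(unit["intensity"] - previous["intensity"]) if previous else 0.0
        return target + join_weight * join
    return min(SYLLABLES, key=cost)

def transcribe(intensity_curve):
    """Consume an intensity curve point by point, emitting syllable labels."""
    out, prev = [], None
    for value in intensity_curve:
        prev = select_syllable(value, prev)
        out.append(prev["label"])
    return out
```

Because each selection depends only on the current intensity value and the previous unit, the loop can run incrementally as slider values arrive, which is what makes the real-time GUI control possible.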
The MAGE platform (Astrinaki, D'Alessandro, Reboursière, Moinet, & Dutoit, 2013) has been used to perform real-time HMM-based laughter synthesis, using the phonetic transcriptions received in real time from the generation algorithm. MAGE implements HMM-synthesis techniques similar to those of HTS, which was previously used in ILHAIRE for offline laughter synthesis. The main difference introduced by MAGE is real-time processing: the waveform is computed piece by piece, without knowledge of future input. This makes optimizing the parameter trajectories and ensuring a smooth, coherent output a major challenge. After tuning the method to laughter, we were able to synthesize laughs in real time with MAGE. MAGE also offers real-time control over synthesis parameters such as pitch. A demonstration video is presented below (Video 1). The synthesis quality is, however, lower than in the offline (HTS) case, as confirmed by perceptual evaluations conducted at the University of Mons.
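The core constraint MAGE works under, producing each output frame from past frames only, can be illustrated with a toy causal smoother. This is not MAGE's actual trajectory-generation algorithm; the smoothing factor and frame values are invented for illustration.

```python
def causal_smooth(frames, alpha=0.8):
    """Smooth a parameter trajectory frame by frame.

    Unlike offline trajectory optimisation, which sees the whole
    utterance before emitting anything, each output here depends only
    on frames already received, so it can be emitted in real time.
    """
    out, state = [], None
    for f in frames:
        state = f if state is None else alpha * state + (1 - alpha) * f
        out.append(state)  # emit immediately; the future is unknown
    return out
```

The trade-off the paragraph describes falls out directly: a causal scheme can react immediately but cannot anticipate upcoming frames, which is one reason piecewise real-time output is harder to keep smooth and coherent than whole-utterance optimisation.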
Hence, it was decided to explore another synthesis technique, laughter concatenation, which is known to yield better audio quality than HMM-based synthesis but is also less flexible. With concatenation, instead of passing the phonetic transcriptions from the generation algorithm to a synthesizer, we directly play the recorded sound of the syllable selected by the generation algorithm. An example video is displayed below (Video 2). The audio quality is indeed better than with HMM-based synthesis, at the cost of flexibility: changing the pitch in real time is difficult with this technique and can only be done at the cost of degraded audio quality. Repetitions of the same sounds can also occur, which may become disturbing over time.
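A minimal sketch of how selected syllable recordings might be joined, using a short linear crossfade to soften the junctions. The sample rate, fade length, and array-based audio representation are illustrative assumptions, not the project's actual playback code.

```python
import numpy as np

def concatenate(units, sr=16000, xfade_ms=10):
    """Join syllable waveforms with a short linear crossfade.

    `units` is a list of 1-D float arrays (one per selected syllable),
    each assumed longer than the crossfade region. The crossfade hides
    the discontinuity at each junction without altering the recorded
    audio itself, which is why the technique keeps its quality but
    offers little control over, e.g., pitch.
    """
    n = int(sr * xfade_ms / 1000)  # crossfade length in samples
    out = units[0].astype(float)
    for u in units[1:]:
        u = u.astype(float)
        ramp = np.linspace(0.0, 1.0, n)
        # blend the tail of the output with the head of the next unit
        out[-n:] = out[-n:] * (1.0 - ramp) + u[:n] * ramp
        out = np.concatenate([out, u[n:]])
    return out
```
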
The competition between the two synthesis techniques remains open, and which method is preferable depends on the application (for instance, on how important flexibility is). Further research is under way at UMONS to improve and evaluate both techniques.
Astrinaki, M., D'Alessandro, N., Reboursière, L., Moinet, A., & Dutoit, T. (2013). Mage 2.0: New features and its application in the development of a talking guitar. 13th Conference on New Interfaces for Musical Expression (NIME'13). Daejeon and Seoul, South Korea.
Urbain, J., Cakmak, H., Charlier, A., Denti, M., & Dutoit, T. (2014). Arousal-driven Synthesis of Laughter. IEEE Journal of Selected Topics in Signal Processing, 8(2), 273-284.
We used this setup to record two parts of the ILHAIRE laughter database: the Belfast Storytelling Database, which sought to capture hilarious laughter, and the Belfast Conversational Dyads, which sought to capture conversational laughter. In the end, both contained large quantities of laughter that could be categorised as hilarious and as conversational. In the Belfast Storytelling Database we addressed some of our multicultural goals by using two linguistic groups: English speakers and Spanish speakers. However, the multicultural goal of the work package had been to test in a place that was not European and had relatively little interaction with WEIRD (Western, Educated, Industrialized, Rich, and Democratic) people. We chose to do some testing in Peru, as the Queen's University team had previously collected recordings of emotional material there as part of the Belfast Induced Natural Emotion Database.
Adding this truly multicultural component to the database presented many challenges, not least of which was converting a fairly static hardware installation into something more mobile. However, we also wanted to keep the setup as similar as possible to that used in the Belfast data collection sessions so that the data would be comparable. We decided to test dyads and groups of three using our conversational task, as it was the most natural of our tasks.
In creating our mobile strategy we decided to bring the sensor equipment with us and rent computers locally. We came to an agreement with a local internet café to rent their computers for the duration of the project. Storage presented a problem, so we took three 2-terabyte portable drives with us. We packed the equipment into a flight box and headed to the town of Chincha Alta in coastal Peru.
Adventurous and exciting as it is, cross-cultural data collection is always fraught with unforeseen issues. When we arrived we had the nasty surprise of being forced to pay import duty on all the equipment; after haggling the duty down to about 50% of the original demand, we paid and proceeded to Chincha. There were more problems there: despite assurances to the contrary, the internet café had underpowered computers, so only one could be used to collect the audio data, and we had to search for additional machines powerful enough to run the equipment. We managed to hire enough reasonably powerful laptops for a sufficient setup, but unfortunately we could not find enough fast computers to gather the Kinect data streams. We therefore settled on gathering audio and video streams from dyads and groups. We hired a house outside of town to minimise the noise disturbances that had been an issue in the Belfast Induced Natural Emotion Database, set up the lab, and began data collection.

One major issue was getting sufficient lighting: the rooms were dark and had only one window. We used natural daylight with translucent gauze on the windows to diffuse the light, but for short periods of the day the sun shone directly through, causing problems. We also used several artificial up-lighters to create a more even spread of light; however, these were popular with many local insects that took an interest in our work, and occasionally the accumulation of dead insects in the halogen bulbs would catch fire. Insects also caused problems with the felt backdrops we used to provide an even background, and regular inspections were needed to stop them eating large holes in the felt that might show up on the videos.
In the Peru data collection, the participants sat opposite each other with the HD webcam placed on a table between them, replicating as closely as possible the setup used in Belfast.
We had many other issues that interfered with the data collection. Dust is a big problem in coastal Peru, and keeping it out of the equipment required constant effort. Power cuts were common but thankfully rare at the times of day when we were testing, and the house had a backup generator that minimised disruption. However, at one stage thieves cut into and stole the power lines that delivered electricity to the area, leaving no reliable electricity for two days.
Recruiting participants was an issue: men leave for work during the day and women are often looking after children, and both factors need to be considered; looking after children while participants were in conversation became part of the workload. As many men work during the week, weekends become an important time to record male and mixed-sex dyads, and planning is required to maximise opportunities for gathering data that does not fit the normal rhythms of local life. Illiteracy can be a problem, and most participants had never filled in a questionnaire before, so many things that can be taken for granted with "WEIRD" participants take much longer with those who have never been involved in anything like this. Punctuality is also often an issue: a neat and tidy testing timetable can require constant revision to adapt to local chronological norms. Adaptability is the key, and it also helps if the topic of your scientific investigation is laughter!
One of the major limiting factors on the speed of recording was the size of the data files: our setup captured raw information streams from the HD webcams, creating files of up to 150 GB per one-hour session for each participant. These had to be copied from the hard drives to the storage devices between sessions, which took a considerable amount of time. We collected data until there was about five terabytes of material on the storage devices, giving us 20 dyad sessions and 4 group sessions. Unfortunately, one of the storage devices took a knock in transit on the way back, and its data was lost and cannot be recovered. Given the file sizes and the time taken to transfer files between devices, it was not possible to create backups of these devices, although some compressed versions had been made. However, the damaged device was the one holding the least data, so only a small number of sessions were lost, though the group session data was impacted most severely.
So despite the trials and tribulations of data collection in Peru, and the best efforts of the local fauna to upset our goals, we managed to return with a substantial number of synchronised, high-quality recordings of multimodal natural conversations between Peruvian participants. These form a further addition to the ILHAIRE laughter database, providing a set of data comparable to those recorded during the Belfast sessions.