Real-time acoustic laughter synthesis

Posted on Fri 05 Sep 2014 by Jerome Urbain
In the framework of the eNTERFACE'14 Summer School held in June in Bilbao (Spain), the ILHAIRE team has investigated easily-controllable real-time laughter synthesis.

To enable convenient control over the synthesis, the algorithm that automatically generates phonetic transcriptions from intensity curves has been refined, ported to real-time operation and integrated into a graphical user interface built with Pure Data. The user can thus move an intensity slider and listen to the synthesized laugh in real-time. The generation algorithm receives the intensity value point by point and selects laughter syllables one by one using a unit selection approach[1] (Urbain, Cakmak, Charlier, Denti, & Dutoit, 2014).
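As a rough sketch of this unit selection step (the repository contents, syllable labels, bigram probabilities and cost weight below are invented for illustration; the actual system is described in footnote [1] and the cited paper):

```python
import math

# Hypothetical syllable repository: each entry has a phonetic label
# and an intensity value.
REPOSITORY = [
    {"label": "ha", "intensity": 0.8},
    {"label": "he", "intensity": 0.5},
    {"label": "hh", "intensity": 0.2},  # breathing sound
]

# Toy bigram log-probabilities P(next | previous); unseen pairs get a floor.
BIGRAM_LOGP = {
    ("ha", "ha"): -0.5, ("ha", "he"): -1.2,
    ("he", "ha"): -0.9, ("he", "hh"): -1.0,
    ("hh", "ha"): -0.7,
}
FLOOR_LOGP = -3.0

def select_syllable(target_intensity, previous_label, w_concat=0.5):
    """Pick the repository syllable with the lowest total cost:
    a target cost (distance to the requested intensity) plus a
    concatenation cost (negative bigram log-probability of following
    the previously selected syllable)."""
    best, best_cost = None, math.inf
    for syl in REPOSITORY:
        target_cost = abs(target_intensity - syl["intensity"])
        logp = BIGRAM_LOGP.get((previous_label, syl["label"]), FLOOR_LOGP)
        concat_cost = -logp  # more probable continuation -> lower cost
        cost = target_cost + w_concat * concat_cost
        if cost < best_cost:
            best, best_cost = syl, cost
    return best["label"]

# Intensity values arrive point by point (e.g. from the slider);
# syllables are chosen one by one as they come in.
prev = "ha"
for intensity in [0.9, 0.6, 0.1]:
    prev = select_syllable(intensity, prev)
    print(prev)
```

The relative weight of the two costs (here `w_concat`) trades faithfulness to the requested intensity against the naturalness of the syllable sequence.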

The MAGE platform (Astrinaki, D'Alessandro, Reboursière, Moinet, & Dutoit, 2013) has been used to perform real-time HMM-based laughter synthesis from the phonetic transcriptions received in real-time from the generation algorithm. MAGE implements HMM-based synthesis techniques similar to those of HTS, which had previously been used in ILHAIRE for offline laughter synthesis. The main difference introduced by MAGE is real-time processing: the waveform is computed piece by piece, without any knowledge of future input. This makes optimizing the parameter trajectories and ensuring a smooth, coherent output a significant challenge. After tuning the method for laughter, we were able to synthesize laughs in real-time with MAGE. MAGE also offers real-time control over the synthesis parameters, such as pitch. A demonstration is presented below (Video 1). The synthesis quality, however, is lower than in the offline (HTS) setting, a finding confirmed by perceptual evaluations conducted at the University of Mons.
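To see why the lack of future knowledge matters, here is a toy illustration (not MAGE's actual algorithm) of a streaming smoother: each output frame can only depend on past frames, so the trajectory lags behind a sudden change in the target, whereas an offline system could optimize the whole trajectory with look-ahead:

```python
def stream_smooth(targets, alpha=0.4):
    """One-pole smoother emitting one frame per incoming target.
    Each output depends only on past values: no look-ahead is possible."""
    out, state = [], targets[0]
    for t in targets:
        state = alpha * t + (1 - alpha) * state
        out.append(round(state, 3))
    return out

# A step in the target (e.g. pitch jumping from 100 to 160 Hz):
# the streamed trajectory only gradually approaches the new value,
# because future targets are unknown when each frame is emitted.
print(stream_smooth([100, 100, 160, 160, 160]))
```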

Video 1: Real-time HMM-based laughter synthesis. The user controls the intensity slider in the middle of the screen. Synthesis parameters can also easily be modified, as illustrated with pitch in the video.


Hence, it was decided to explore another synthesis technique, laughter concatenation, which is known to yield better audio quality than HMM-based synthesis but is also less flexible. With concatenation, instead of passing the phonetic transcriptions from the generation algorithm to a synthesizer, we directly play the stored sound of the syllable selected by the generation algorithm. An example is shown below (Video 2). The audio quality is indeed better than with HMM-based synthesis, at the cost of reduced flexibility. For instance, changing the pitch in real-time is difficult with this technique and could only be achieved at the cost of a drop in audio quality. Repetitions of the same sounds can also occur, which may become disturbing over longer laughs.
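A minimal sketch of the concatenative playback path, assuming syllable waveforms are stored as plain sample lists and joined with a short linear crossfade to hide the junctions (the crossfade length and data here are invented; a real system would stream audio buffers to the sound card):

```python
def concatenate(chunks, overlap=4):
    """Concatenate waveform chunks, crossfading `overlap` samples at each join."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        n = min(overlap, len(out), len(chunk))
        for i in range(n):
            w = (i + 1) / (n + 1)  # fade-in weight for the incoming chunk
            out[-n + i] = (1 - w) * out[-n + i] + w * chunk[i]
        out.extend(chunk[n:])
    return out

# Two toy "syllable" waveforms joined with a 4-sample crossfade:
a = [0.5] * 8
b = [-0.5] * 8
y = concatenate([a, b])
print(len(y))  # 8 + 8 - 4 = 12 samples
```

Because the samples are played back verbatim, quality is high, but properties baked into the recordings, such as pitch, cannot be changed without signal processing that degrades the audio.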

Video 2: Real-time laughter synthesis performed by concatenating syllables. The user controls the green intensity slider at the top of the screen. The choice of selected syllables can be modified by clicking on the square at the top left, as illustrated in the video by successively favoring the vowel “a”, the vowel “e”, and breathing sounds.


The competition between the two synthesis techniques remains open, and the preference for one method over the other depends on the application (for instance, on whether flexibility matters more than audio quality). Further research is currently being carried out at UMONS to improve and evaluate both techniques.

----- References:

Astrinaki, M., D'Alessandro, N., Reboursière, L., Moinet, A., & Dutoit, T. (2013). Mage 2.0: New features and its application in the development of a talking guitar. 13th Conference on New Interfaces for Musical Expression (NIME'13). Daejeon and Seoul, South Korea.

Urbain, J., Cakmak, H., Charlier, A., Denti, M., & Dutoit, T. (2014). Arousal-driven Synthesis of Laughter. IEEE Journal of Selected Topics in Signal Processing, 8(2), 273-284.



-------------------------------------------------------
[1] The syllable is selected from a repository of available syllables. For each candidate syllable, a target cost is defined as the distance between the target intensity (given by the user) and the syllable's intensity, and a concatenation cost is derived from the N-gram probability of appending the syllable's transcription to the already selected transcription (a more probable continuation yielding a lower cost). The selected syllable is the one with the lowest total cost (target + concatenation).
 

Jerome Urbain