Acoustic Laughter Synthesis
Examples of synthesized laughs are provided below.
Compared to speech synthesis, laughter synthesis is an almost unexplored field. Yet the problem is quite complex, and the few attempts made so far suffer from a lack of naturalness.
The best-known works are those of Sundaram and Narayanan (Sundaram & Narayanan, 2007) and Lasarcyk and Trouvain (Lasarcyk & Trouvain, 2007). Their approaches were completely different. Sundaram and Narayanan observed that the energy envelope of laughter waveforms oscillates like a physical mass-spring system. They modeled that trajectory to form the energy envelope of their synthetic laughs and synthesized the constituent vowels via Linear Prediction. Lasarcyk and Trouvain compared two synthesis strategies that had previously been explored for speech synthesis: diphone concatenation, and articulatory synthesis based on a 3D model of the human vocal organs. The latter system gave better results, but the laughs were still rated as non-natural by naïve listeners.
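To make the mass-spring idea concrete, here is a minimal sketch of a damped, rectified oscillation of the kind that could serve as an energy envelope, where each "bump" would correspond to one laughter syllable. All parameter values are illustrative assumptions, not figures taken from Sundaram and Narayanan's paper.

```python
import numpy as np

def mass_spring_envelope(duration_s=2.0, fs=100, f0_hz=4.0, damping=0.6):
    """Damped mass-spring oscillation as a rough stand-in for the energy
    envelope of a laugh. Parameters (syllable rate f0_hz, damping factor)
    are made-up illustrative values."""
    t = np.arange(0.0, duration_s, 1.0 / fs)
    omega = 2.0 * np.pi * f0_hz
    # Rectified, exponentially decaying oscillation: the absolute value
    # keeps the envelope non-negative, the exponential makes successive
    # laughter syllables progressively weaker.
    env = np.exp(-damping * t) * np.abs(np.sin(omega * t))
    return t, env

t, env = mass_spring_envelope()
```

In an actual synthesizer this envelope would then be imposed on the amplitude of the vowel signal produced by the Linear Prediction stage.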
More recently, Cox (Cox, 2010) conducted an online study to evaluate the different approaches to laughter synthesis (http://www.newscientist.com/article/dn19227-laughters-secrets-faking-it--the-results.html). Participants in the study had to judge whether a presented laugh had been produced by a computer or a human. The study included different synthetic laughs (among others, from the two aforementioned groups) as well as human laughs. 6000 people participated. No synthetic method obtained more than 25% of favorable votes (participants thinking the laugh could have been uttered by a human), except for a concatenation method proposed by the University of Mons, which fooled 60% of the raters. It must be noted, however, that this method is actually copy-synthesis (trying to reproduce as closely as possible the acoustic features of an existing laugh) and not true synthesis (starting from a textual description to generate a laughter audio signal). In conclusion, we can say that no existing laughter synthesis method is perceived as natural by naïve listeners. Another interesting fact is that the actual human laughs only reached an 80% score; in other words, on average 20% of listeners thought they had been synthesized by a machine.
Given the lack of naturalness of past laughter synthesis attempts and the good performance achieved with Hidden Markov Models (HMMs) for speech synthesis, we decided to investigate HMM-based laughter synthesis. We chose the HMM-based Speech Synthesis System (HTS), as it is freely available and widely used in speech synthesis research.
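The core idea behind HMM-based synthesis can be sketched in a few lines: each phone is modeled by a left-to-right HMM whose states carry a distribution over acoustic features, and synthesis walks the states for their predicted durations and emits feature vectors. The toy below is not HTS itself (HTS additionally models dynamic features, context-dependent trees, etc.); all numbers and the two-dimensional feature vectors are made up for illustration.

```python
import numpy as np

def generate_trajectory(states, frame_shift_ms=5):
    """Toy HMM-style parameter generation.

    states: list of (mean_vector, duration_ms) pairs, one per HMM state.
    Returns an (n_frames, dim) array of acoustic feature vectors, obtained
    by holding each state's mean for its predicted duration.
    """
    frames = []
    for mean, dur_ms in states:
        n = max(1, int(round(dur_ms / frame_shift_ms)))  # frames in state
        frames.extend([np.asarray(mean, dtype=float)] * n)
    return np.vstack(frames)

# Three states of a hypothetical /a/-like laughter vowel; a real system
# would use e.g. mel-cepstral coefficients plus log F0 as features.
states = [([0.2, 5.1], 30), ([0.8, 5.4], 60), ([0.3, 5.0], 30)]
traj = generate_trajectory(states)  # 24 frames of 2-D features at 5 ms shift
```

A real HTS pipeline would then smooth such trajectories using dynamic features and feed them to a vocoder to produce the waveform.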
Several issues had to be faced when trying to train HMMs for laughter synthesis. One important problem is the amount of available data: HMM-based synthesis requires large quantities of training data. For speech, hours of recordings from a single speaker (the voice being modeled) are generally used, and they must be phonetically transcribed. It is extremely hard to obtain such large quantities of laughter data, due to the emotional and spontaneous nature of the phenomenon. The reader who would like to learn more about laughter databases is invited to consult: http://www.ilhaire.eu/blog~Laughter-Databases. In our case we used the AVLaughterCycle database, as it already includes laughter phonetic transcriptions (Urbain & Dutoit, 2011). The problem is that even the subjects who laughed the most produced only around 5 minutes of laughs, which is little compared to usual speech synthesis databases.
Nevertheless, we obtained promising results with this method. Several modifications were introduced to the AVLaughterCycle database and to the standard HTS synthesis process in order to improve the quality of the synthesized laughs; the technical details of these improvements go beyond the scope of this blog post. Some examples of the obtained laughs are given below. The synthesis quality still depends on the speaker and on the type of sounds to be synthesized. This is probably due to the limited training data: some sounds are better modeled for one speaker than for another, simply because there are more examples of these sounds to train the models. We hope this problem will be solved in the future, with the help of a large quantity of new data specifically recorded for laughter synthesis.
It is important to note that there is currently no tool to generate laughter phonetic transcriptions: the produced laughs are currently synthesized from the phonetic transcriptions of human laughs from the AVLaughterCycle database.
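For readers unfamiliar with what such a phonetic transcription looks like: HTS-style label files commonly list a start time, an end time (in 100-nanosecond units), and a phone symbol per line. The transcription below is a hypothetical laughter fragment made up for illustration; the actual phone inventory and timings used with AVLaughterCycle may differ.

```python
# Hypothetical laughter transcription, one segment per line:
# <start> <end> <phone>, times in 100-ns units (HTS label convention).
label_text = """\
0 2000000 h
2000000 5500000 a
5500000 7000000 h
7000000 10000000 a
"""

def parse_labels(text):
    """Return a list of (phone, start_s, end_s) tuples, times in seconds."""
    segments = []
    for line in text.strip().splitlines():
        start, end, phone = line.split()
        # 1 second = 10^7 units of 100 ns
        segments.append((phone, int(start) / 1e7, int(end) / 1e7))
    return segments

segments = parse_labels(label_text)
# e.g. first segment: ('h', 0.0, 0.2)
```

A transcription tool like the one mentioned above would let a user create or edit such files directly, instead of reusing transcriptions of existing human laughs.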
A perceptual evaluation test will be carried out in the near future to quantify the improvements made so far and to form a benchmark against which future developments can be compared. Apart from the optimization of several HTS functions and parameters, future work also includes the development of a tool to create (or modify) laughter phonetic transcriptions, and synchronization with visual synthesis to animate a virtual agent.
HMM-based laughter synthesis examples
(Examples require Adobe Flash plug-in and activated audio on your computer)
- Laughs from speaker A:
- Laughs from speaker B:
- Laughs from speaker C, for whom we have less training data; hence the synthesis quality is lower than for the previous voices:
Cox, T. (2010, July 27). Laughter's secrets: faking it - the results. Retrieved July 7, 2012, from New Scientist: http://www.newscientist.com/article/dn19227-laughters-secrets-faking-it--the-results.html
Lasarcyk, E., & Trouvain, J. (2007). Imitating conversational laughter with an articulatory speech synthesizer. Proceedings of the Interdisciplinary Workshop on the Phonetics of Laughter (pp. 43-48). Saarbrücken, Germany.
Sundaram, S., & Narayanan, S. (2007, January). Automatic acoustic synthesis of human-like laughter. Journal of the Acoustical Society of America, 121(1), 527-535.
Urbain, J., & Dutoit, T. (2011). A phonetic analysis of natural laughter, for use in automatic laughter processing systems. Affective Computing and Intelligent Interaction (pp. 397-406). Memphis, Tennessee, USA.