
Year-2 kick-off meeting

Posted on Mon 05 Nov 2012 by Harry Griffin
The ILHAIRE partners came to London for the year-2 kick-off meeting on the 3rd and 4th of October. Hot on the heels of a successful and constructive year-1 review meeting at the European Commission in Brussels, we had plenty to discuss and plans to make. Year 1 saw substantial leaps forward in the technology of laughter synthesis and detection, with several publications under review and projects submitted. In year 2 we will bring these new technical opportunities to bear on some of the fundamental psychological questions of laughter.

With a long history of laughter research, the team from UZH is ideally placed to focus our experiments and keep them on a secure psychological footing. With their guidance, the consortium drew up a list of scenarios for future experiments. Brainstorming these laughter-generation tasks is a lot of fun, but ensuring that they are feasible in a multimodal recording experiment is much more of a challenge. Experiments will start very quickly in year 2, so various partners will be collaborating to gather data over the next few months.

Elisabeth André took the opportunity to give a talk at the UCLIC seminar series, which was very well received, and many of the partners came out to sample some of London’s excellent pubs and restaurants. We look forward to hosting the ILHAIRE partners in London again in 2013, when there will be plenty more to discuss and more plans to make!

Acoustic Laughter Synthesis

Posted on Thu 30 Aug 2012 by Jerome Urbain

Examples of synthesized laughs below

Compared to speech synthesis, laughter synthesis is an almost unexplored field. Yet the problem is quite complex, and the few attempts made so far suffer from a lack of naturalness.

The best-known works are those of Sundaram and Narayanan (Sundaram & Narayanan, 2007) and Lasarcyk and Trouvain (Lasarcyk & Trouvain, 2007), whose approaches were completely different. Sundaram and Narayanan observed that the energy envelope of laughter waveforms oscillates like a physical mass-spring system. They modeled that trajectory to form the energy envelope of their synthetic laughs and synthesized the constituent vowels via Linear Prediction. Lasarcyk and Trouvain compared two synthesis strategies that had previously been explored for speech synthesis: diphone concatenation, and articulatory synthesis based on a 3D model of the human vocal organs. The latter system gave better results, but the laughs were still evaluated as non-natural by naïve listeners.
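
To make the mass-spring idea concrete, here is a minimal Python sketch of the envelope part only. It is not the authors' actual model: the oscillation rate, the damping and the harmonic stand-in for the vowel are invented for illustration.

```python
import numpy as np

# Illustrative sketch: a damped mass-spring oscillation used as the amplitude
# envelope of a laughter-like "ha-ha-ha" sequence. Parameter values are made up.
fs = 16000                        # sampling rate (Hz)
t = np.arange(0, 2.0, 1 / fs)     # two seconds of signal

f_bout, damping = 4.5, 1.8        # hypothetical burst rate (Hz) and decay
envelope = np.exp(-damping * t) * np.abs(np.sin(2 * np.pi * f_bout * t))

# Crude stand-in for the vowel: a few harmonics of a 220 Hz voice.
f0 = 220.0
vowel = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 6))

laugh = envelope * vowel
laugh /= np.max(np.abs(laugh))    # normalize to [-1, 1] before playback
```

In Sundaram and Narayanan's system the vowels are obtained by Linear Prediction from recorded laughter rather than from a simple harmonic stack, but the oscillating envelope is the core of the approach.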

More recently, Cox (Cox, 2010) conducted an online study to evaluate the different approaches to laughter synthesis (http://www.newscientist.com/article/dn19227-laughters-secrets-faking-it--the-results.html). Participants had to judge whether a presented laugh had been produced by a computer or a human. The study included several synthetic laughs (among others from the two aforementioned groups) as well as human laughs, and 6000 people took part. No synthesis method obtained more than 25% favorable votes (participants thinking the laugh could have been uttered by a human), except for a concatenation method proposed by the University of Mons, which fooled 60% of the raters. It must be noted, however, that this method is actually copy-synthesis (meaning that we try to copy as closely as possible the acoustic features of an existing laugh) and not true synthesis (starting from a textual description to generate a laughter audio signal). In conclusion, we can say that no existing laughter synthesis method is perceived as natural by naïve listeners. Another interesting fact is that the actual human laughs only reached an 80% score; in other words, on average 20% of listeners thought they had been synthesized by a machine.

Given the lack of naturalness of past laughter synthesis attempts and the good performance achieved with Hidden Markov Models (HMMs) for speech synthesis, we decided to investigate HMM-based laughter synthesis. We chose the HMM-based Speech Synthesis System (HTS), as it is freely available and widely used in speech synthesis research.
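
As a rough illustration of the statistical idea behind HMM-based synthesis (this is not HTS itself, which trains context-dependent phone models over spectral, excitation and duration parameters and uses a vocoder to reconstruct the waveform), the sketch below trains a Gaussian HMM on acoustic feature frames and samples a new feature trajectory from it. It uses the hmmlearn library, with random data standing in for real laughter features.

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

# Stand-in for acoustic feature frames (e.g. MFCCs) extracted from
# phonetically transcribed laughter; here just random data of the right shape.
n_frames, n_features = 500, 13
features = rng.normal(size=(n_frames, n_features))

# Train a single HMM; a real system trains one model per laughter "phone",
# with the phonetic transcription deciding which models to chain together.
model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=50)
model.fit(features)

# "Synthesis": sample a fresh sequence of feature frames from the trained model.
generated_frames, state_sequence = model.sample(200)
# A vocoder would then turn generated_frames back into an audio waveform.
```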

Several issues had to be faced when trying to train HMMs for laughter synthesis. One important problem is the amount of available data: HMM-based synthesis requires large quantities of training data. For speech, generally hours of recordings from a single speaker (the voice that is being modeled) are used, and they must be phonetically transcribed. It is extremely hard to obtain such large quantities of laughter data, due to the emotional and spontaneous nature of the phenomenon. The reader who would like to learn more about laughter databases is invited to consult: http://www.ilhaire.eu/blog~Laughter-Databases. In our case we used the AVLaughterCycle database, as it already includes laughter phonetic transcriptions (Urbain & Dutoit, 2011). The problem is that the subjects who laughed the most produced around 5 minutes of laughs, which is little compared to usual speech synthesis databases.

Nevertheless, we obtained promising results with this method. Several modifications were introduced to the AVLaughterCycle database and to the standard HTS synthesis process in order to improve the quality of the synthesized laughs; the technical details of these improvements are beyond the scope of this blog post. Some examples of the obtained laughs are given below. The synthesis quality still depends on the speaker and on the type of sounds to be synthesized. This is probably due to the limited training data: some sounds are better modeled for one speaker than for another, simply because there are more examples of these sounds to train the models. We hope this problem will be solved in the future, with the help of a large quantity of new data specifically recorded for laughter synthesis.

It is important to note that there is currently no tool to generate laughter phonetic transcriptions: the laughs are currently synthesized from the phonetic transcriptions of human laughs in the AVLaughterCycle database.
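
A laughter phonetic transcription can be thought of as a time-aligned sequence of laughter "phones". Purely for illustration, the snippet below writes such a transcription as an HTK-style label file of the kind HMM synthesis systems typically consume; the phone names and durations are invented and do not correspond to the actual AVLaughterCycle annotation scheme.

```python
# Hypothetical laughter transcription: (label, duration in seconds).
laugh_phones = [
    ("inhalation", 0.30),
    ("h", 0.08), ("a", 0.15),
    ("h", 0.07), ("a", 0.14),
    ("h", 0.06), ("a", 0.12),
]

with open("laugh_01.lab", "w") as f:
    start = 0.0
    for phone, dur in laugh_phones:
        end = start + dur
        # HTK-style label files give start and end times in 100 ns units.
        f.write(f"{int(start * 1e7)} {int(end * 1e7)} {phone}\n")
        start = end
```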

A perceptual evaluation test will be carried out in the near future to quantify the improvements made so far and to form a benchmark against which to compare future developments. Apart from the optimization of several HTS functions and parameters, future work also includes the development of a tool to create (or modify) laughter phonetic transcriptions and the synchronization with visual synthesis to animate a virtual agent.

HMM-based laughter synthesis examples

(Audio examples of the synthesized laughs were embedded at this point on the original blog page.)

Bibliography

  • Cox, T. (2010, July 27). Laughter's secrets: faking it - the results. Retrieved July 7, 2012, from New Scientist: http://www.newscientist.com/article/dn19227-laughters-secrets-faking-it--the-results.html

  • Lasarcyk, E., & Trouvain, J. (2007). Imitating conversational laughter with an articulatory speech synthesizer. Proceedings of the Interdisciplinary Workshop on the Phonetics of Laughter, (pp. 43-48). Saarbrücken, Germany.

  • Sundaram, S., & Narayanan, S. (2007, January). Automatic acoustic synthesis of human-like laughter. Journal of the Acoustical Society of America, 121(1), 527-535.

  • Urbain, J., & Dutoit, T. (2011). A phonetic analysis of natural laughter, for use in automatic laughter processing systems. Affective Computing and Intelligent Interaction, (pp. 397-406). Memphis, Tennessee, USA.

Laughter Databases

Posted on Wed 02 May 2012 by Jerome Urbain

To analyse, characterise and model laughter, samples are necessary. Gathering natural laughter samples is a tricky task: laughter is an emotional signal and its realization is strongly influenced by our feelings. Hence natural laughter can only be obtained in realistic situations; simply asking people to laugh in front of cameras does not allow us to gather the wide variety of laughs we encounter every day.

Three main techniques have been used to build databases in the fields of emotion and laughter research (Scherer, 2003), from the most natural to the most convenient:

  • Natural expression: data is collected from the real world, with the subjects free to express themselves and, ideally, not aware that they are being recorded until the end of the data acquisition. A popular setting for emotion recognition is the use of data collected in call-centres (Morrison, Wang, & Silva, 2007; Devillers & Vidrascu, 2007). The big advantage of these techniques is the naturalness of the data. However, a lot of post-processing is needed to segment the laughter utterances, which are surrounded by many other signals, and it is difficult to ensure high-quality recordings in such “hidden” settings.
  • Induced responses: subjects are presented with a stimulus (picture, video, vocal information, etc.) chosen to elicit a target emotion (happiness, fear, etc.). For example, laughter can be induced by presenting a comedy video. Users may be aware they are being recorded, but everything is done to provoke natural reactions. This kind of setting makes it possible to keep good control over the quality of the acquired signals, e.g. by using head-mounted microphones, frontal camera views, etc. However, the scenario must be carefully designed to ensure that participants somehow forget they are being recorded and act naturally.
  • Portrayed emotions: actors, professional or not, are directly asked to portray the emotional state or, in our case, to laugh. This kind of data is the easiest to work with, as the annotation phase is quicker and everything can be done to acquire high-quality signals (audio, video, etc.). The drawback is the lack of naturalness.

While laughter can be found in virtually any database involving interacting humans, some databases have been recorded with particular attention to laughter (during the recordings or the post-processing phases). We will only mention some of these “laughter-related” databases here. Further laughter data will be recorded in the framework of the ILHAIRE project to overcome some of the limitations of the currently existing databases. According to the ILHAIRE objectives, we ideally need laughter data that is numerous, accurately annotated (not only must laughs be accurately spotted, but we also need information about the context that elicited and sustained each laughter utterance), multimodal (high-quality audio, high-quality video of facial movements, body movements, respiration signal, etc.), multi-user (among others to study contagion), spontaneous and multi-cultural. No existing database satisfies all these requirements; it is indeed unrealistic to expect a single corpus to fulfil all the conditions. When appropriate data for our research does not already exist, new data will be recorded by ILHAIRE participants, in line with the scenarios we are targeting in the project, while trying to make the new data as profitable as possible for the whole scientific community.
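
To make these requirements concrete, one annotated laughter episode in such an "ideal" corpus could be represented roughly as in the sketch below. The field names are purely illustrative and do not correspond to any actual ILHAIRE annotation scheme.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record for one annotated laughter episode in an "ideal"
# multimodal, multi-user, multi-cultural laughter corpus.
@dataclass
class LaughterEpisode:
    subject_id: str
    culture: str                    # for multi-cultural comparisons
    start_time: float               # seconds from session start
    end_time: float
    eliciting_context: str          # what elicited and sustained the laugh
    audio_file: str                 # high-quality close-talk audio
    video_file: str                 # frontal video for facial movements
    body_motion_file: str           # body movement recording
    respiration_file: str           # respiration signal
    co_laughers: List[str] = field(default_factory=list)  # for studying contagion

episode = LaughterEpisode(
    subject_id="S01", culture="UK", start_time=12.4, end_time=15.1,
    eliciting_context="joke told by S02", audio_file="s01.wav",
    video_file="s01.mp4", body_motion_file="s01.bvh",
    respiration_file="s01_resp.csv", co_laughers=["S02"],
)
```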

  1. The ICSI Meeting Corpus

    In 2000, the International Computer Science Institute (ICSI) in Berkeley launched a project to record a large speech database from meetings, called the ICSI Meeting Corpus. The purpose was to obtain speech as natural as possible, so they chose to record only meetings that would have occurred anyway (Janin, et al., 2003). All the meetings took place in a meeting room of their lab, where they would have taken place even if the ICSI Meeting Corpus project had not existed. The only, but important, unnatural constraint the project imposed on the meetings was the use of head-mounted microphones, in order to ease speech activity detection and obtain high-quality speech transcriptions, but also to avoid penalizing non-acoustic research such as dialogue structure analysis with poor acoustic signals (Janin, et al., 2003). As a consequence, all the subjects knew they were being recorded.

    The meetings involved 3 to 10 participants, with an average of 6 (Janin, et al., 2004). In total, 72 hours of meetings were recorded, involving 53 different participants. A huge effort went into annotating the data: for each meeting, there is a full speech transcription with beginning and ending times of each utterance, and laughter occurrences were also included in the transcriptions. The Corpus is available from the Linguistic Data Consortium (LDC).

    Laughter processing was not the initial purpose of the ICSI Meeting Corpus, nor was laughter the event that received the most attention. But thanks to the quality of the database, recorded in a natural environment and presenting numerous episodes of laughter in all its forms, it became a standard for laughter processing. Some of the groups using the ICSI Meeting Corpus for laughter processing carried out additional annotation work to keep only clearly audible laughs (Truong & van Leeuwen, 2007) or to localize the boundaries of the laughter segments with more accuracy (Laskowski & Burger, 2007). A sketch of how laughter segments can be pulled out of such time-stamped transcriptions is given after this list.

  2. The AMI Meeting Corpus

    The AMI Meeting Corpus (Carletta, 2007) consists of 100 hours of meeting recordings. One third consists of naturally occurring lab meetings; the remaining two thirds were elicited by a role-playing game in which participants had to take different roles in a team project. While this differs from the setting of the ICSI Meeting Corpus, it has little influence on laughter naturalness.

    The recordings include synchronized audio (individual and far-field microphones) and video (individual and room-view cameras). All of the 138 role-playing meetings involved 4 participants. Out of the 33 naturally occurring meetings, 25 also involve 4 participants, 5 have 3 conversationalists and the last 3 have 5 participants.

    The database is freely available for non-commercial purposes from the AMI website. Signals are provided with a range of annotations: some recordings include annotations about dialogue acts, emotions, actions, gestures, etc. The vocal activities of each participant in each meeting have also been manually transcribed. These transcriptions include the speech as it was uttered by the speaker (with grammatical errors, hesitations, etc.) and other non-verbal vocalisations such as laughter and coughs.

  3. TWSES corpus

    In the framework of the artistic installation “The world starts every second” (TWSES) (Lafontaine & Todoroff, 2007), voluntary laughs from children and professional singers have been recorded. Some singers portrayed different states of mind such as “lover laugh”, “hysteric laugh”, “obsessional laugh”, etc. Some are obviously exaggerated, corresponding to stereotypes of what we consider “free laughter”. They are clearly different from spontaneous laughs, but they have a strong power to elicit laughter in their listeners. The recordings include only audio.

  4. SEMAINE database

    Laughter occurs frequently in the SEMAINE SAL recordings (McKeown, Valstar, Cowie, Pantic, & Schroder, 2011), in which users converse with either an operator-driven avatar or a limited automated avatar designed to elicit particular emotions. Each file of the SAL database has been labelled with emotional states by 6 to 8 annotators, providing information about the emotional states leading to and following laughter. The database contains audio-visual recordings of the participants with frontal cameras and head-mounted microphones. The database is freely available to the research community.

  5. The Belfast Induced Natural Emotion Database

    The Belfast Induced Natural Emotion Database (BINED) (Sneddon, McRorie, McKeown, & Hanratty, 2011) contains recordings of the natural reactions of subjects participating in different tasks aiming to elicit five different emotional states: frustration, surprise, fear, disgust and amusement. Laughter frequently appeared in all these tasks. The database is freely available for research.

  6. The Green Persuasive Database

    The Green Persuasive Database consists of recordings of conversations between a persuader and a person they try to convince to adopt more ecological behaviours. Each conversational partner is recorded with a different camera. As in any natural conversation, there are laughter occurrences in the dataset. This database is also freely available for research.

  7. The AVLaughterCycle database

    The AVLaughterCycle (AVLC) database (Urbain, et al., 2010) was recorded during eNTERFACE'09. Its objective was to obtain good-quality laughs. 24 subjects were filmed while watching a 12-minute comedy video. The recorded signals consist of audio, frontal video (webcam) as well as facial motion tracking data. The obtained laughter utterances have been segmented and phonetically annotated (Urbain & Dutoit, 2011). The database is freely available for research purposes.

  8. The MAHNOB laughter database

    The MAHNOB laughter database is similar to the AVLC corpus: it was also recorded specifically to acquire laughter data, with subjects filmed alone while watching a funny video. The signals include audio, video and thermal video recordings. Subjects were also asked to speak, which provides interesting data to study the relationship between the laughter and speech styles/features of a person.

  9. Other databases including laughter

    Since laughter occurs frequently in everyday situations, most speech databases contain laughs. We can cite the Corpus of Spontaneous Japanese (Maekawa, Koiso, Kikuchi, & Yoneyama, 2003), containing around 650 hours of spontaneous speech. This huge audio database has been transcribed, including labels denoting the presence of laughter, but without time boundaries. Professor Nick Campbell has also been involved in several extensive recordings of spontaneous speech and has shown interest in spotting laughter inside his large corpora (Campbell’s website). Among others, there are 20 hours of telephone conversations in Japanese between 8 pairs of volunteers, accounting for 2001 laughter and 1129 speech-laugh utterances (Campbell, 2007).

    For his observations of laughs, Wallace Chafe (Chafe, 2007) used excerpts of the Santa Barbara Corpus of Spoken American English, which contains 60 recordings of discourse segments in a range of everyday situations (talking about studies, preparing dinner, business conversations, etc.). The corpus contains transcriptions of the audio files, including labels identifying laughter.

    John Esling (Esling, 2007) used samples from the University of Victoria Larynx Research Project, which includes nasoendoscopic videos of the larynx, to analyse the states of the larynx in laughter. Devillers and Vidrascu (Devillers & Vidrascu, 2007) were interested in the emotions conveyed by laughter and used 20 hours of telephone conversations in a call centre providing medical advice. Verbal and nonverbal contents such as laughs or tears were manually annotated. More than half of the 119 laughter utterances in this corpus were related to negative emotions.

    Dedicated laughter databases have also been recorded for studying particular aspects of laughter. For analysing laughter acoustics, Bachorowski et al. (Bachorowski, Smoski, & Owren, 2001) enrolled 139 students and let them watch videos containing humorous sequences, either alone or with a partner. Laughs from 97 individuals (52 females, 45 males) were kept for the acoustic analyses, for a total of 559 female and 465 male laughter bouts (i.e. laughter exhalation segments). For identifying acoustic correlates of different emotions in laughter, Szameitat et al. (Szameitat, Alter, Szameitat, Wildgruber, Sterr, & Darwin, 2009) asked 8 professional actors to portray laughs for 4 different affective states, namely joyous, taunting, tickling and schadenfreude (a German word meaning “pleasure in another's misfortune”) laughs. Kipper and Todt (Kipper & Todt, 2007) induced laughter while subjects were reading by playing their own voice back to them with a 200 ms delay.
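
Several of the corpora above (e.g. the ICSI and AMI Meeting Corpora) expose laughter through time-stamped transcriptions. As a hypothetical sketch, assuming a simple CSV layout with start, end and label columns (not the actual format of any of these corpora), laughter segments could be extracted like this:

```python
import csv

def laughter_segments(transcript_csv, min_duration=0.3):
    """Return (start, end) pairs of laughter segments from a hypothetical
    CSV transcription with columns start, end, label (times in seconds)."""
    segments = []
    with open(transcript_csv, newline="") as f:
        for row in csv.DictReader(f):
            start, end = float(row["start"]), float(row["end"])
            # Keep only reasonably long laughs to filter out brief noises.
            if row["label"] == "laugh" and end - start >= min_duration:
                segments.append((start, end))
    return segments

# Example: total laughter time in one (hypothetical) meeting transcription.
segs = laughter_segments("meeting_01_transcript.csv")
print(len(segs), "laughs,", round(sum(e - s for s, e in segs), 1), "seconds in total")
```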

Bibliography

  • Bachorowski, J.-A., Smoski, M. J., & Owren, M. J. (2001). The acoustic features of human laughter. Journal of the Acoustical Society of America, 1581-1597.
  • Campbell, N. (2007). Whom we laugh with affects how we laugh. Proceedings of the Interdisciplinary Workshop on the Phonetics of Laughter, (pp. 61-65). Saarbrücken, Germany.
  • Campbell, N. (n.d.). Nick's Data website. Retrieved June 5, 2011, from http://www.speech-data.jp/
  • Carletta, J. (2007). Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus. Language Resources and Evaluation Journal, 41(2), 181-190.
  • Chafe, W. (2007). The Importance of not being earnest. The feeling behind laughter and humor. (Paperback 2009 ed., Vol. 3). Amsterdam, The Netherlands: John Benjamins Publishing Company.
  • Devillers, L., & Vidrascu, L. (2007). Positive and negative emotional states behind the laughs in spontaneous spoken dialogs. Proceedings of the Interdisciplinary Workshop on the Phonetics of Laughter, (pp. 37-40). Saarbrücken, Germany.
  • Esling, J. H. (2007). States of the larynx in laughter. Proceedings of the Interdisciplinary Workshop on the Phonetics of Laughter, (pp. 15-20). Saarbrücken, Germany.
  • Janin, A., Ang, J., Bhagat, S., Dhillon, R., Edwards, J., Macias-Guarasa, J., et al. (2004). The ICSI Meeting Project: Resources and Research. NIST ICASSP 2004 Meeting Recognition Workshop. Montreal, Canada.
  • Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., et al. (2003). The ICSI Meeting Corpus. 2003 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'03), (pp. I-364). Hong Kong.
  • Kipper, S., & Todt, D. (2007). Series of similar vocal elements as a crucial acoustic structure in human laughter. Proceedings of the Interdisciplinary Workshop on the Phonetics of Laughter, (pp. 3-7). Saarbrücken.
  • Lafontaine, M.-J., & Todoroff, T. (2007). The world starts every second. Artistic installation held at the Musée des Beaux-Arts, Angers, France.
  • Laskowski, K., & Burger, S. (2007). On the Correlation between Perceptual and Contextual Aspects of Laughter in Meetings. Proceedings of the Interdisciplinary Workshop on the Phonetics of Laughter, (pp. 55-60). Saarbrücken, Germany.
  • Maekawa, K., Koiso, H., Kikuchi, H., & Yoneyama, K. (2003). Use of a large-scale spontaneous speech corpus in the study of linguistic variation. Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS 2003), (pp. 643-646).
  • McKeown, G., Valstar, M., Cowie, R., Pantic, M., & Schroder, M. (2011). The SEMAINE database: Annotated multimodal records of emotionally coloured conversations between a person and a limited agent. IEEE Transactions on Affective Computing.
  • Morrison, D., Wang, R., & Silva, L. C. (2007). Ensemble methods for spoken emotion recognition in call-centres. Speech Communication, 98-112.
  • Scherer, K. (2003). Vocal communication of emotion: a review of research paradigms. Speech Communication, 227-256.
  • Sneddon, I., McRorie, M., McKeown, G., & Hanratty, J. (2011). The Belfast Induced Natural Emotion Database. IEEE Transactions on Affective Computing.
  • Szameitat, D. P., Alter, K., Szameitat, A. J., Wildgruber, D., Sterr, A., & Darwin, C. J. (2009). Acoustic profiles of distinct emotional expressions in laughter. (ASA, Ed.) The Journal of the Acoustical Society of America, 126(1), 354-366.
  • Truong, K. P., & van Leeuwen, D. A. (2007). Automatic discrimination between laughter and speech. Speech Communication, 144-158.
  • Urbain, J., & Dutoit, T. (2011). A phonetic analysis of natural laughter, for use in automatic laughter processing systems. Affective Computing and Intelligent Interaction, (pp. 397-406). Memphis, Tennessee, USA.
  • Urbain, J., Bevacqua, E., Dutoit, T., Moinet, A., Niewiadomski, R., Pelachaud, C., et al. (2010). AVLaughterCycle: Enabling a virtual agent to join in laughing with a conversational partner using a similarity-driven audiovisual laughter animation. Journal on Multimodal User Interfaces, 47-58.