1. General information
======================
MULTISIMO is a multimodal corpus consisting of synchronized audio and video files, Kinect 2 files, annotation files, survey materials and documentation. It includes recorded interactions between groups of three people. In each interactive session two participants collaborate with each other to solve a quiz while assisted by a human facilitator.

The 18 available dialogues are annotated with information capturing:
- full speech transcription
- dialogue structure
- gaze
- hand gestures
- feedback of the facilitator
- laughter type
- head pose
- motion energy
- speech-word alignment
- external assessments of dominance, personality and collaboration.

In addition, the corpus includes:
- data analysis,
- survey materials, i.e. personality tests and experience assessment questionnaires as filled in by all participants, and
- participant profiling data.

The corpus was designed and implemented at the School of Computer Science and Statistics (SCSS), Trinity College Dublin (TCD), as part of the Marie Sklodowska-Curie Action MULTISIMO (grant agreement No 701621). The design of the recordings has been approved by the SCSS research ethics committee. Following the consent given by participants during the recording process, this corpus release includes 18 out of the 23 recorded interactions (approximately 3 hours). The corpus release process complies with the EU General Data Protection Regulation (GDPR) regarding data privacy management and has been approved by the Data Protection Officer at TCD. The licence was implemented by the Technology Transfer Office at TCD.

2. Licence
============
Access to and any use of the MULTISIMO corpus is bound by a non-commercial licence. The full licence text can be found here:
http://multisimo.eu/papers/MULTISIMO_Non-excl_non-commercial_research_licence.pdf

3. Citation
============
For any publication that derives in any form from this corpus, please cite the following paper:

Koutsombogera, Maria and Vogel, Carl (2018). Modeling Collaborative Multimodal Behavior in Group Dialogues: The MULTISIMO Corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, pp. 2945-2951 [http://www.lrec-conf.org/proceedings/lrec2018/summaries/596.html]

4. Structure and naming conventions
===================================
The corpus and related annotations are organized into 6 directories (see section 5). These directories include data related to the 18 recorded sessions. The following naming conventions are observed:

* S corresponds to a session (group conversation), followed by the number of the particular session, e.g. S23 denotes session 23.
* P, followed by a number, corresponds to a player ID, e.g. P049 denotes player 49. There are 36 players overall in the data.
* M, followed by a number, corresponds to a moderator (facilitator) ID, e.g. M001 denotes moderator (a.k.a. facilitator) 1. Overall, 3 persons share the role of facilitator across the 18 sessions.
* _S_O (part of a file name): denotes synced file in the original format
* _S_H (part of a video file name): denotes synced file in high quality format
* _S_L (part of a video file name): denotes synced file in low quality format
* Z_S_O (part of a video file name): denotes zoomed view of a synced video file in the original format
* Z_S_H (part of a video file name): denotes zoomed view of a synced video file in high quality format
* Z_S_L (part of a video file name): denotes zoomed view of a synced video file in low quality format
* _front-video (part of a video file name): denotes front video angle showing a single participant
* _360-video (part of a video file name): denotes 360 video angle showing all participants
* _all_video (part of a video file name): denotes angle showing all participants

The above conventions are combined to form complete file names, and they are needed to work out a file's contents from its name. For instance:
- "P048_S23_audio_S_O.wav" is an audio file of Player 048 in Session 23, in the original format, synced with all audio and video files of Session 23.
- "P049_S23_front-video_Z_S_H.mov" is a zoomed front angle video file of Player 049 in Session 23, in high quality format, synced with all audio and video files of Session 23.
- "S23_all_video_S_L.mov" is a video file showing all participants of Session 23, in low quality format, synced with all audio and video files of Session 23.

5. Content
===========

5.1 AUDIO
---------
* MONO.tar.gz
Includes 54 audio files (mono), 1 per speaker, organized in 18 directories (1 per session).
Format: wav
Bit rate: 768 kbps
Sample rate: 48 kHz
Codec: pcm_s16le
Channels: 1

* STEREO.tar.gz
Includes 18 audio files (stereo), 1 per session.
Format: wav
Bit rate: 1536 kbps
Sample rate: 48 kHz
Codec: pcm_s16le
Channels: 2

* STEREO_to_MONO.tar.gz
Includes 18 audio files, 1 per session; mono files converted from stereo.
Format: wav
Bit rate: 768 kbps
Sample rate: 48 kHz
Codec: pcm_s16le
Channels: 1

* SURROUND.tar.gz
Includes 18 audio files, 1 per session, in surround sound.
Format: wav
Bit rate: 4608 kbps
Sample rate: 48 kHz
Codec: pcm_s16le
Channels: 6

* audio_slices-mp3.tar.gz
Includes 65 audio slices in mp3 format (5-10 seconds each) from 30 speakers; the slices were used in the automatic personality detection task.
Format: mp3
Bit rate: 64 kbps
Sample rate: 48 kHz
Codec: mp3
Channels: 1

* audio_slices-wav.tar.gz
Includes 65 audio slices in wav format (5-10 seconds each) from 30 speakers; the slices were used in the automatic personality detection task.
Format: wav
Bit rate: 768 kbps
Sample rate: 48 kHz
Codec: pcm_s16le
Channels: 1

5.2 VIDEO
---------

5.2.1 VIDEO_HIGH_QUALITY.tar.gz
-----
Includes all video files in high quality, synchronized (*_S_H* naming convention).
- for *_front-video_* and *_all_video_* file names:
Format: QuickTime/MPEG-4
Codec: h264
Resolution: 960*540
Frame Rate: 29.98 fps
- for *_360-video_*:
Format: QuickTime/MPEG-4
Codec: h264
Resolution: 1440*814
Frame Rate: 29.98 fps

5.2.2 VIDEO_LOW_QUALITY.tar.gz
-----
Includes all video files in low quality, synchronized (*_S_L* naming convention). Recommended for use in annotation software/editors, because of the smaller file size.
- for *_front-video_* and *_all_video_* file names:
Format: QuickTime/MPEG-4
Codec: h264
Resolution: 500*282
Frame Rate: 29.98 fps
- for *_360-video_*:
Format: QuickTime/MPEG-4
Codec: h264
Resolution: 720*480
Frame Rate: 29.98 fps

5.2.3 Combined_front_angles.tar.gz
-----
Includes 18 videos (1 per session). Each video combines front angles of players and a view of all session participants. Recommended for use in cases where one needs to focus on the observation of front angles of both players.
Format: QuickTime/MPEG-4
Codec: h264
Resolution: 1920*1080
Frame Rate: 60 fps

5.2.4 mp4-front_angle.tar.gz
-----
Includes 54 videos (1 per session participant) of front angles of corpus participants in mp4.
Format: QuickTime/MPEG-4
Codec: h264
Resolution: 854*480
Frame Rate: 60 fps

5.3 ANNOTATIONS
---------------
In the annotations performed with the ELAN software (https://tla.mpi.nl/tools/tla-tools/elan/), the related .eaf files were previously linked to the corpus videos. Those links are local, so you will be prompted to select a video file to link to the annotation. Since all video angles of a session are synchronized, you can select any video type for that session (up to 4 at once).

* BFI-10-assessment.xlsx.tar.gz
Includes observations of 8 external human raters who listened to 79 audio slices and filled in a personality test according to how they perceived the personality features of the corpus participants. The test employed is BFI-10 (Rammstedt, B. & John, O. P. (2007). Measuring personality in one minute or less: A 10-item short version of the Big Five Inventory in English and German. Journal of Research in Personality, 41, 203-212).

* BFI-44-assessment.xlsx.tar.gz
Includes individual scores and percentiles on the Big Five traits for all 39 participants. These are the results of the personality test that participants took before the recordings. The test employed is BFI-44 (John, O. P., Donahue, E. M., & Kentle, R. L. (1991). The Big Five Inventory--Versions 4a and 54. Berkeley, CA: University of California, Berkeley, Institute of Personality and Social Research).

* collaboration_assessment.csv.gz
Includes observations of 2 external human raters who watched the corpus videos and assessed the collaboration levels of the quiz players. Assessment is performed on a scale from 0 (no collaboration) to 4 (very high collaboration).
* Dominance-assessment.xlsx.tar.gz
Includes observations of 5 external human raters who watched the corpus videos and assessed the dominance levels of the quiz players. Assessment is performed on a scale from 1 (not at all dominant) to 5 (very dominant).

* EAQ-ANALYSIS.xlsx.tar.gz
Includes the analysis of the EAQ questionnaire that participants filled in after the recordings.

* EMOTION-HEAD_POSE.tar.gz
Includes emotion analysis, head pose and action unit analysis for each speaker in the corpus, as measured by the EMOTIENT software.

* MOTION_ENERGY
Includes a csv file with motion energy measurements per frame, for all players in all dialogues.

* Timestamps-Sections_Subsections.tar.gz
Includes timestamps for the sections and subsections in each dialogue. Overall, there are 5 sections (Introduction, Question1, Question2, Question3, Closing); each Question section is further divided into 2 subsections: (i) identifying the answers and (ii) ranking the answers.

* WORD_ALIGNMENT
Includes csv files with the onset and offset of each word of the players' speech content in 14 dialogues. The alignment was performed with the Praat software, using the dialogue transcripts and the MONO audio files.

* facilitator_feedback.tar.gz
Includes 18 files (.eaf) annotated in ELAN with information about the types (positive, negative, neutral) and subtypes of the facilitator's feedback.

* gaze_annotation.tar.gz
Includes the manual annotation (in ELAN) of participants' gaze.

* gesture_annotation.tar.gz
Includes the manual annotation (in ELAN) of participants' gestures by 2 raters (full annotation by rater 1, partial annotation by rater 2).

* laughter_annotation.tar.gz
Includes 18 files (.eaf) annotated in ELAN with information about laughter type; laughter was annotated with 2 values, "discourse" and "mirthful".

* pauses_analysis.xlsx.tar.gz
Includes an analysis of speech pauses in each session, covering left and right context, number, duration and frequency of pauses in the whole session and in its sections.
* speech transcription_Transcriber.tar.gz
Includes 18 transcription files (.trs) of the sessions. Transcription was performed with the Transcriber software (http://trans.sourceforge.net/).

* speech transcription_Elan.tar.gz
Includes 18 ELAN files (.eaf) with speech transcription imported from Transcriber.

* transcriber_configuration.cfg.tar.gz
Configuration file used with Transcriber.

* transcripts and word frequencies.tar.gz
Includes the following directories:
- transcriptions: all transcripts in xml and txt format.
- word_frequency: index of words and their frequencies per participant and per session.
- Normalised_Transcripts: clean text of the transcript per participant.
- Normalised_word_Frequency: clean index of words and their frequencies per participant and per session.

* turn-taking_stats.xlsx.tar.gz
Includes details about turn-taking features for each session and participant, such as: number of turns, number of words, turn duration, number of words per minute, number and duration of overlaps, etc.

5.4 KINECT
----------
Includes raw Kinect 2 files for each session. Please note that these files are not synchronized with the audio and video files in the related directories.

5.5 ORIGINAL_VIDEO_AUDIO
------------------------
Includes audio and video files in the original format, synchronized. The directory includes 18 tar.gz files, one per session. Each tar.gz file includes the audio and video files of the session (following the *_S_O* naming convention described in Section 4). Please be mindful of the large size of these files, e.g. if you plan to use them in annotation software/editors.
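The naming conventions of Section 4 can also be decoded programmatically, e.g. to pick out the *_S_O* files of a session. The Python sketch below is only an illustration and is not part of the corpus tooling: the regular expression is inferred from the three example file names given in Section 4, the name parse_name is hypothetical, and actual file names in the archives may deviate from the assumed pattern (in which case the function returns None).

```python
import re

# Assumed pattern, inferred from the example names in Section 4:
#   [P|M<id>_]S<session>_<content>[_Z]_S_<O|H|L>.<extension>
PATTERN = re.compile(
    r"^(?:(?P<role>[PM])(?P<person>\d+)_)?"   # optional player/moderator ID
    r"S(?P<session>\d+)_"                     # session number
    r"(?P<content>audio|front-video|360-video|all_video)"
    r"(?P<zoomed>_Z)?"                        # optional zoomed view
    r"_S_(?P<quality>[OHL])"                  # synced + format marker
    r"\.(?P<ext>\w+)$"
)

ROLES = {"P": "player", "M": "moderator"}
QUALITY = {"O": "original", "H": "high", "L": "low"}


def parse_name(filename):
    """Split a corpus file name into its parts; None if it does not match."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    return {
        "role": ROLES.get(m.group("role")),   # None for session-level files
        "person_id": m.group("person"),       # kept as text, e.g. "048"
        "session": int(m.group("session")),
        "content": m.group("content"),
        "zoomed": m.group("zoomed") is not None,
        "quality": QUALITY[m.group("quality")],
        "extension": m.group("ext"),
    }


if __name__ == "__main__":
    for name in (
        "P048_S23_audio_S_O.wav",
        "P049_S23_front-video_Z_S_H.mov",
        "S23_all_video_S_L.mov",
    ):
        print(name, "->", parse_name(name))
```

Returning None for unmatched names, rather than raising an error, makes it easy to skip auxiliary files when walking an extracted archive.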
* Audio Files specs:
Format: wav
Bit rate: 2304 kbps
Sample rate: 48 kHz
Codec: pcm_s24le

* Video Files specs:
- for *_front-video_* and *_all_video_* filenames:
Format: QuickTime/MPEG-4
Codec: h264
Resolution: FHD, 1920*1080
Frame Rate: 29.98 fps
- for *_360-video_* filenames:
Format: QuickTime/MPEG-4
Codec: h264
Resolution: 4K, 3840*2160
Frame Rate: 29.98 fps

5.6 DOCUMENTATION
-----------------
- BFI.pdf: the personality test (Big Five Inventory) filled in by the participants before the recordings.
- EAQ.pdf: the Experience Assessment Questionnaire filled in by the participants after the recordings.
- participants_details.xlsx: a list of the participants' IDs and the sessions they participate in, their age, sex, nationality, English nativeness and level of familiarity with their group partners.

Acknowledgments
===============
The research leading to the development of the corpus has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 701621 (MULTISIMO).

We sincerely thank all the people who volunteered to participate in the recordings and the assessments. Special thanks go to the following researchers, who contributed to the data collection and created or contributed to annotations: Dr Akira Hayakawa (experimental setup, data preprocessing); Dr Fasih Haider (experimental setup); Dr Justine Reverdy (facilitator's feedback annotation and analysis); Rachel Costello (turn-taking analysis); Parth Sarthy (analysis of personality-related text and audio features); Siddhitha Sundari Bhoopathy (laughter annotation and analysis); Aine Glynn (gaze annotation); Mohamed Tousif (gesture annotation); Kyle Behan (gesture annotation); Anaïs Claire Murat (gaze annotation); Zohreh Khosrobeigi (word alignment, motion energy).
---
Maria Koutsombogera & Carl Vogel
Computational Linguistics Group, Centre for Computing and Language Studies,
School of Computer Science and Statistics,
Trinity College, the University of Dublin
Jan. 15, 2019
Last updated: December 11, 2023