The MediaEval 2011 Rich Speech Retrieval (RSR) task was an exploratory study of retrieval from an archive of semiprofessional user-generated Internet video in which user information needs were associated with specific types of speech acts. The video dataset was taken from the Internet sharing platform blip.tv, and search queries were associated with specific speech acts occurring in the video. A crowdsourcing approach was used to identify segments in the video data that contain speech acts, to create a description of the video containing each act, and to generate search queries designed to refind that speech act. I will first describe the construction of the dataset and reflect on our experiences with crowdsourcing this test collection using the Amazon Mechanical Turk platform. I will highlight the challenges we encountered in constructing this dataset, including the selection of the data source, the design of the crowdsourcing task, and the specification of queries and relevant items. The completed MediaEval 2011 RSR test collection created using this method defined a known-item search task: for each query, there is a single manually identified ideal jump-in point in the video where playback should begin.
I will then provide a summary comparison of the results from three participant groups in the MediaEval 2011 RSR task, obtained using automatic speech recognition (ASR) transcripts, metadata manually assigned to each video by the user who uploaded it to blip.tv, and the combination of the two. This analysis shows how the participants sought to use different transcript segmentation methods to maximize the rank of the relevant item and to locate the nearest match to the ideal jump-in point. The comparison indicates that the best overall results are obtained with topically homogeneous segments that overlap strongly with the relevant region associated with the jump-in point, and that the use of metadata can be beneficial when segments are unfocused or cover more than one topic.



