AVSpeech: Large-scale Audio-Visual Speech Dataset
Ariel Ephrat and Inbar Mosseri and Oran Lang and Tali Dekel and Kevin Wilson and Avinatan Hassidim and William T. Freeman and Michael Rubinstein

AVSpeech (402 files)
metadata.jsonl.zip 133.32MB
clips/xpi.tar 3.70GB
clips/xpj.tar 2.77GB
clips/xph.tar 3.58GB
clips/xpf.tar 3.66GB
clips/xpg.tar 3.45GB
clips/xpe.tar 3.57GB
clips/xpc.tar 3.43GB
clips/xpd.tar 3.63GB
clips/xpa.tar 3.57GB
clips/xpb.tar 3.57GB
clips/xoy.tar 3.60GB
clips/xoz.tar 3.82GB
clips/xox.tar 3.69GB
clips/xow.tar 3.84GB
clips/xou.tar 3.95GB
clips/xov.tar 3.93GB
clips/xot.tar 3.83GB
clips/xos.tar 3.85GB
clips/xor.tar 3.76GB
clips/xoq.tar 3.75GB
clips/xoo.tar 3.72GB
clips/xop.tar 3.86GB
clips/xon.tar 3.87GB
clips/xom.tar 3.95GB
clips/xol.tar 3.56GB
clips/xok.tar 3.91GB
clips/xoj.tar 4.06GB
clips/xoi.tar 3.90GB
clips/xoh.tar 3.88GB
clips/xog.tar 3.77GB
clips/xof.tar 3.67GB
clips/xod.tar 3.92GB
clips/xoe.tar 3.69GB
clips/xob.tar 3.66GB
clips/xoc.tar 3.82GB
clips/xoa.tar 3.62GB
clips/xnz.tar 3.94GB
clips/xny.tar 4.14GB
clips/xnx.tar 3.93GB
clips/xnv.tar 4.22GB
clips/xnw.tar 3.94GB
clips/xnt.tar 3.58GB
clips/xnu.tar 3.63GB
clips/xnr.tar 4.01GB
clips/xns.tar 3.85GB
clips/xnp.tar 3.91GB
clips/xnq.tar 3.84GB
clips/xnn.tar 3.58GB
… (remaining clip shards omitted from this listing; 402 files in total)
Type: Dataset
Tags: speech isolation, lip reading, face detection
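The clip shards are named like the output of the Unix split command (xnn through xpj). Below is a minimal sketch for unpacking every downloaded shard into one output directory; the output path clips_extracted is an arbitrary choice, and the internal layout of the archives is described in the README, so verify it on a single shard before extracting everything.

import glob
import tarfile

# Sketch: unpack every downloaded clip shard into one output directory.
# Check a single shard first (e.g. `tar tf clips/xpa.tar | head`) to see
# how paths are laid out inside the archives.
for shard in sorted(glob.glob("clips/x??.tar")):
    with tarfile.open(shard) as tar:
        tar.extractall(path="clips_extracted")
    print("extracted", shard)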

Bibtex:
@article{avspeech,
title= {AVSpeech: Large-scale Audio-Visual Speech Dataset},
author= {Ariel Ephrat and Inbar Mosseri and Oran Lang and Tali Dekel and Kevin Wilson and Avinatan Hassidim and William T. Freeman and Michael Rubinstein},
url= {https://looking-to-listen.github.io/avspeech/},
abstract= {AVSpeech is a new, large-scale audio-visual dataset comprising speech video clips with no interfering background noise. The segments are 3-10 seconds long, and in each clip the audible sound in the soundtrack belongs to a single speaking person who is visible in the video. In total, the dataset contains roughly 4700 hours* of video segments from about 290k YouTube videos, spanning a wide variety of people, languages, and face poses. For more details on how we created the dataset, see our paper, Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation (https://arxiv.org/abs/1804.03619).

* UPLOADER'S NOTE: This torrent contains roughly 3000 of the original 4700 hours of video segments. The remaining 1700 hours were not included because the source videos no longer existed on YouTube, had copyright violations, were not available in the United States, or were of poor quality. Over 1 million segments are included in this torrent, each between 3 and 10 seconds long and at 720p resolution. See the README for how to use this dataset.},
keywords= {speech isolation, lip reading, face detection}
}
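The per-clip metadata ships as a single zipped JSONL file (metadata.jsonl.zip). Here is a small sketch for loading it with the Python standard library, assuming one JSON object per line; the actual field names are documented in the README, so inspect a record rather than relying on any names here.

import json
import zipfile

# Sketch: read the per-clip metadata records from metadata.jsonl.zip.
# The schema is not reproduced here; print one record to see its keys.
with zipfile.ZipFile("metadata.jsonl.zip") as zf:
    with zf.open(zf.namelist()[0]) as f:
        records = [json.loads(line) for line in f]

print(len(records), "metadata records")
print("example record keys:", sorted(records[0].keys()))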

