# Resources

This page lists toolkits and available datasets for speech translation (ST), automatic speech recognition (ASR), and machine translation (MT).

# Toolkits

# Datasets

# ST

| Dataset | Languages | Duration | Domain |
| --- | --- | --- | --- |
| GigaST | EN→ZH, EN→DE | 10,000hrs | diverse |
| LIBRI-TRANS (Kocabiyikoglu et al., 2018) | EN→FR | 236hrs | read audiobooks |
| MuST-C (Cattoni et al., 2021) | EN→14 lang. | 237-504hrs | TED talks |
| CoVoST (Wang et al., 2020) | EN→15 lang., 21 lang.→EN | 929hrs, 30-311hrs | read, Common Voice |
| Europarl-ST (Iranzo-Sanchez et al., 2020) | 9 lang. | 10-90hrs | EP proceedings |
| Multilingual TEDx (Salesky et al., 2021) | 8 lang.→6 lang. | 11-69hrs | TED talks |
| IWSLT 2018 (Niehues et al., 2018) | EN→DE | 273hrs | TED talks |
| CIAIR (Tohyama et al., 2005) | EN→JA | 182hrs | travel conversation |
| EPIC (Bendazzoli et al., 2005) | IT↔EN↔ES | 18hrs | parliament interpret. |
| Fisher-CALLHOME (Post et al., 2013) | ES→EN | 160hrs | phone conversations |
| STC (Shimizu et al., 2014) | EN↔JA | 22hrs | simult. interpret. |
| How2 (Sanabria et al., 2018) | EN→PT | 300hrs | instructional videos |
| Griko (Boito et al., 2018) | GR→IT | 18min | conversation (linguists) |
| LibriVoxDeEn (Beilharz et al., 2020) | DE→EN | 100hrs | read audiobooks |
| MaSS (Boito et al., 2020) | 8 lang. | 20hrs | Bible readings |
| BSTC (Zhang et al., 2020) | ZH→EN | 50hrs | simult. interpret. |
| HPN (Shi et al., 2021) | Hpn→Es | 36hrs | conversation (plant) |

# ASR

| Dataset | Languages | Duration | Domain |
| --- | --- | --- | --- |
| GigaSpeech (Cheng et al., 2021) | EN | 10,000hrs | diverse |
| Wenet (Zhang et al., 2021) | ZH | 10,000hrs | diverse |
| LibriSpeech (Panayotov et al., 2015) | EN | 1,000hrs | read audiobooks |
| TED-LIUM 3 (Hernandez et al., 2018) | EN | 452hrs | TED talks |
| Common Voice en_2181h_2020-12-11 (Ardila et al., 2019) | EN | 1,686hrs validated | |
| VoxForge EN | EN | 120hrs | |
| AISHELL-2 (Du et al., 2018) | ZH | 1,000hrs | smart house, industry, ... |
| BAAI Magic Chinese Data | ZH | 100hrs | real-life dialogues |

# MT

| Dataset | Domain |
| --- | --- |
| WMT2020 | news, europarl, common crawl, paracrawl, UN |
| OpenSubtitles2018 | movie subtitles |
| CCAligned (El-Kishky et al., 2020) | common crawl |
| TED 2020 | TED talks |

Last Updated: 3/29/2022, 6:05:33 PM