# Resources

This page lists links to toolkits and available datasets.

# Toolkits

# Datasets

# ST

| Dataset | Languages | Duration | Domain |
| --- | --- | --- | --- |
| GigaST | EN→ZH, EN→DE | 10,000hrs | diverse |
| LIBRI-TRANS (Kocabiyikoglu et al., 2018) | EN→FR | 236hrs | read audiobooks |
| MuST-C (Cattoni et al., 2021) | EN→14 lang. | 237-504hrs | TED talks |
| CoVoST (Wang et al., 2020) | EN→15 lang., 21 lang.→EN | 929hrs, 30-311hrs | read, Common Voice |
| Europarl-ST (Iranzo-Sanchez et al., 2020) | 9 lang. | 10-90hrs | EP proceedings |
| Multilingual TEDx (Salesky et al., 2021) | 8 lang.→6 lang. | 11-69hrs | TED talks |
| IWSLT 2018 (Niehues et al., 2018) | EN→DE | 273hrs | TED talks |
| CIAIR (Tohyama et al., 2005) | EN→JA | 182hrs | travel conversation |
| EPIC (Bendazzoli et al., 2005) | IT↔EN↔ES | 18hrs | parliament interpret. |
| Fisher-CALLHOME (Post et al., 2013) | ES→EN | 160hrs | phone conversations |
| STC (Shimizu et al., 2014) | EN↔JA | 22hrs | simult. interpret. |
| How2 (Sanabria et al., 2018) | EN→PT | 300hrs | instructional videos |
| Griko (Boito et al., 2018) | GR→IT | 18min | conversation (linguists) |
| LibriVoxDeEn (Beilharz et al., 2020) | DE→EN | 100hrs | read audiobooks |
| MaSS (Boito et al., 2020) | 8 lang. | 20hrs | Bible readings |
| BSTC (Zhang et al., 2020) | ZH→EN | 50hrs | simult. interpret. |
| HPN (Shi et al., 2021) | Hpn→Es | 36hrs | conversation (plant) |

# ASR

| Dataset | Languages | Duration | Domain |
| --- | --- | --- | --- |
| GigaSpeech (Cheng et al., 2021) | EN | 10,000hrs | diverse |
| Wenet (Zhang et al., 2021) | ZH | 10,000hrs | diverse |
| LibriSpeech (Panayotov et al., 2015) | EN | 1,000hrs | read audiobooks |
| TED-LIUM 3 (Hernandez et al., 2018) | EN | 452hrs | TED talks |
| Common Voice en_2181h_2020-12-11 (Ardila et al., 2019) | EN | 1,686hrs validated | |
| VoxForge EN | EN | 120hrs | |
| AISHELL-2 (Du et al., 2018) | ZH | 1,000hrs | smart house, industry, ... |
| BAAI Magic Chinese Data | ZH | 100hrs | real-life dialogues |

# MT

| Dataset | Domain |
| --- | --- |
| WMT2020 | news, europarl, common crawl, paracrawl, UN |
| OpenSubtitles2018 | movie subtitles |
| CCAligned (El-Kishky et al., 2020) | common crawl |
| TED 2020 | TED talks |