# Resources
This page lists links to toolkits and publicly available datasets.
# Toolkits
- NeurST (Zhao et al., 2020)
- ESPnet-ST (Inaguma et al., 2020)
- Fairseq S2T (Wang et al., 2020)
# Datasets
# ST
# ASR
Dataset | Languages | Duration | Domain |
---|---|---|---|
GigaSpeech (Chen et al., 2021) | EN | 10,000 hrs | diverse |
WenetSpeech (Zhang et al., 2021) | ZH | 10,000 hrs | diverse |
LibriSpeech (Panayotov et al., 2015) | EN | 1,000 hrs | read audiobooks |
TED-LIUM 3 (Hernandez et al., 2018) | EN | 452 hrs | TED talks |
Common Voice en_2181h_2020-12-11 (Ardila et al., 2019) | EN | 1,686 hrs validated | |
VoxForge EN | EN | 120 hrs | |
AISHELL-2 (Du et al., 2018) | ZH | 1,000 hrs | smart home, industry, ... |
BAAI Magic Chinese Data | ZH | 100 hrs | real-life dialogues |
# MT
Dataset | Domain |
---|---|
WMT2020 | news, Europarl, Common Crawl, ParaCrawl, UN |
OpenSubtitles2018 | movie subtitles |
CCAligned (El-Kishky et al., 2020) | Common Crawl |
TED 2020 | TED talks |