# Resources
This page links to toolkits and available datasets for speech translation (ST), automatic speech recognition (ASR), and machine translation (MT).
# Toolkits
- NeurST (Zhao et al., 2020)
- ESPnet-ST (Inaguma et al., 2020)
- Fairseq S2T (Wang et al., 2020)
# Datasets
# ST
# ASR
| Dataset | Languages | Duration | Domain |
|---|---|---|---|
| GigaSpeech (Cheng et al., 2021) | EN | 10,000 hrs | diverse |
| Wenet (Zhang et al., 2021) | ZH | 10,000 hrs | diverse |
| LibriSpeech (Panayotov et al., 2015) | EN | 1,000 hrs | read audiobooks |
| TED-LIUM 3 (Hernandez et al., 2018) | EN | 452 hrs | TED talks |
| Common Voice en_2181h_2020-12-11 (Ardila et al., 2019) | EN | 1,686 hrs (validated) | |
| VoxForge EN | EN | 120 hrs | |
| AISHELL-2 (Du et al., 2018) | ZH | 1,000 hrs | smart house, industry, ... |
| BAAI Magic Chinese Data | ZH | 100 hrs | real-life dialogues |
# MT
| Dataset | Domain |
|---|---|
| WMT2020 | news, europarl, common crawl, paracrawl, UN |
| OpenSubtitles2018 | movie subtitles |
| CCAligned (El-Kishky et al., 2020) | common crawl |
| TED 2020 | TED talks |