# Benchmarks
We conduct experiments on several benchmark ST datasets using NeurST and report performance alongside results from other toolkits and studies. We aim to make fair comparisons and to facilitate future research.
# End-to-End ST
We present BLEU scores (Papineni et al., 2002) for end-to-end ST models: either tokenized BLEU computed with tokenizer.perl + multi-bleu.perl, or detokenized BLEU computed with sacrebleu.
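As a concrete reference, below is a minimal sketch of detokenized BLEU scoring with the sacrebleu Python API; the hypothesis/reference strings are illustrative placeholders, not real system outputs. Tokenized BLEU is obtained differently, by first running outputs through Moses tokenizer.perl and then scoring with multi-bleu.perl.

```python
# Minimal sketch: detokenized BLEU with the sacrebleu Python API.
# The strings below are made-up placeholders, not real system outputs.
from sacrebleu.metrics import BLEU

hyps = ["le chat est assis sur le tapis."]    # detokenized system outputs
refs = [["le chat est assis sur le tapis."]]  # one inner list per reference stream

# Case-sensitive detokenized BLEU (sacrebleu's default settings).
print(BLEU().corpus_score(hyps, refs))

# Case-insensitive variant, as reported in the second libri-trans table below.
print(BLEU(lowercase=True).corpus_score(hyps, refs))
```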
# libri-trans
libri-trans is a small EN->FR ST corpus derived from the LibriSpeech corpus. It contains 236 hours of English speech aligned at the utterance level to French translations from e-books. Following most previous studies, the training data consists of the clean 100-hour portion plus the additional machine-translated references from Google Translate.
Both case-sensitive and case-insensitive BLEU scores, tokenized and detokenized, are reported below.
Model | external audio | external ASR | external MT | case-sensitive tok BLEU | case-sensitive detok BLEU |
---|---|---|---|---|---|
NeurST transf-s (Zhao et al., 2020) | × | × | × | 17.8 | 16.3 |
ST+AFS(t,f) transf-m (Zhang et al., 2020) | × | × | × | 18.6 | 17.2 |
Chimera (w2v2 transf-m) (Han et al., 2021) | √ | × | √ | - | 19.4 |
Model | external ASR | external MT | case-insensitive tok BLEU | case-insensitive detok BLEU |
---|---|---|---|---|
NeurST transf-s (Zhao et al., 2020) | × | × | 18.7 | 17.2 |
Espnet-ST transf-s (Inaguma et al., 2020) | × | × | - | 16.7 |
transf-s + KD (Liu et al., 2019) | × | × | 17.0 | - |
TCEN-LSTM (Wang et al., 2020) | × | × | - | 17.1 |
transf-s + curriculum pre-train (Wang et al., 2020) | × | × | 17.7 | - |
LUT (transf-m + bert KD + mtl) (Dong et al., 2021a) | × | × | 17.8 | - |
COSTT (Dong et al., 2021b) | × | × | 17.8 | - |
transf-m + curriculum pre-train (Wang et al., 2020) | √ | × | 18.0 | - |
LUT (transf-m + bert KD + mtl) (Dong et al., 2021a) | √ | × | 18.3 | - |
COSTT (Dong et al., 2021b) | × | √ | 18.2 | - |
SATE transf-s (Xu et al., 2021) | × | × | - | 18.3 |
SATE conformer-m (Xu et al., 2021) | √ | √ | - | 20.8 |
# MuST-C
MuST-C is a multilingual speech translation corpus whose size and quality facilitate the training of end-to-end systems for speech translation from English into several languages. For each target language, MuST-C comprises several hundred hours of audio recordings from English TED Talks, automatically aligned at the sentence level with their manual transcriptions and translations.
We report case-sensitive detokenized BLEU computed with the sacrebleu toolkit.
Model | external audio | external ASR | external MT | DE | ES | FR | IT | NL | PT | RO | RU |
---|---|---|---|---|---|---|---|---|---|---|---|
NeurST transf-s (Zhao et al., 2020) | × | × | × | 22.8 | 27.4 | 33.3 | 22.9 | 27.2 | 28.7 | 22.2 | 15.1 |
Espnet-ST transf-s (Inaguma et al., 2020) | × | × | × | 22.9 | 28.0 | 32.8 | 23.8 | 27.4 | 28.0 | 21.9 | 15.8 |
fairseq s2t transf-s (Wang et al., 2020) | × | × | × | 22.7 | 27.2 | 32.9 | 22.7 | 27.3 | 28.1 | 21.9 | 15.3 |
ST+AFS(t,f) transf-m (Zhang et al., 2020) | × | × | × | 22.4 | 26.9 | 31.6 | 23.0 | 24.9 | 26.3 | 21.0 | 14.7 |
Chimera (w2v2 transf-m) (Han et al., 2021) | √ | × | √ | 27.1 | 30.6 | 35.6 | 25.0 | 29.2 | 30.2 | 24.0 | 17.4 |
XSTNet (w2v2 transf-m mtl) (Ye et al., 2021) | √ | × | × | 25.5 | - | 36.0 | - | - | - | - | 16.9 |
XSTNet (w2v2 transf-m mtl) (Ye et al., 2021) | √ | × | √ | 27.1 | - | 38.0 | - | - | - | - | 18.4 |
SATE transf-s (Xu et al., 2021) | × | × | × | 25.2 | - | - | - | - | - | - | - |
SATE conformer-m (Xu et al., 2021) | × | √ | √ | 28.1 | - | - | - | - | - | - | - |