# Benchmarks
We conduct experiments on several benchmark ST datasets using NeurST and report performance alongside results from other toolkits and studies. We aim to make fair comparisons and to facilitate future research.
# End-to-End ST
We present BLEU scores (Papineni et al., 2002) for end-to-end ST models: either tokenized BLEU computed with tokenizer.perl + multi-bleu.perl, or detokenized BLEU computed with sacrebleu.
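As a concrete reference, below is a minimal sketch of detokenized BLEU scoring with the sacrebleu Python API; the hypothesis/reference strings are illustrative placeholders, not real system outputs. Tokenized BLEU is obtained differently, by first running outputs through Moses tokenizer.perl and then scoring with multi-bleu.perl.

```python
# Minimal sketch: detokenized BLEU with the sacrebleu Python API.
# The strings below are made-up placeholders, not real system outputs.
from sacrebleu.metrics import BLEU

hyps = ["le chat est assis sur le tapis."]    # detokenized system outputs
refs = [["le chat est assis sur le tapis."]]  # one inner list per reference stream

# Case-sensitive detokenized BLEU (sacrebleu's default settings).
print(BLEU().corpus_score(hyps, refs))

# Case-insensitive variant, as reported in the second libri-trans table below.
print(BLEU(lowercase=True).corpus_score(hyps, refs))
```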
# libri-trans
libri-trans is a small EN->FR ST corpus derived from the LibriSpeech corpus. It contains 236 hours of English speech aligned at the utterance level to French translations from e-books. Following most previous studies, the training data consists of the clean 100-hour portion plus the additional machine-translated references from Google Translate.
Both case-sensitive and case-insensitive BLEU scores, tokenized and detokenized, are reported below.
Model | external audio | external ASR | external MT | case-sensitive tok BLEU | case-sensitive detok BLEU |
---|---|---|---|---|---|
NeurST transf-s (Zhao et al., 2020) | × | × | × | 17.8 | 16.3 |
ST+AFS(t,f) transf-m (Zhang et al., 2020) | × | × | × | 18.6 | 17.2 |
Chimera (w2v2 transf-m) (Han et al., 2021) | √ | × | √ | - | 19.4 |
Model | external ASR | external MT | case-insensitive tok BLEU | case-insensitive detok BLEU |
---|---|---|---|---|
NeurST transf-s (Zhao et al., 2020) | × | × | 18.7 | 17.2 |
Espnet-ST transf-s (Inaguma et al., 2020) | × | × | - | 16.7 |
transf-s + KD (Liu et al., 2019) | × | × | 17.0 | - |
TCEN-LSTM (Wang et al., 2020) | × | × | - | 17.1 |
transf-s + curriculum pre-train (Wang et al., 2020) | × | × | 17.7 | - |
LUT (transf-m + bert KD + mtl) (Dong et al., 2021a) | × | × | 17.8 | - |
COSTT (Dong et al., 2021b) | × | × | 17.8 | - |
transf-m + curriculum pre-train (Wang et al., 2020) | √ | × | 18.0 | - |
LUT (transf-m + bert KD + mtl) (Dong et al., 2021a) | √ | × | 18.3 | - |
COSTT (Dong et al., 2021b) | × | √ | 18.2 | - |
SATE transf-s (Xu et al., 2021) | × | × | - | 18.3 |
SATE conformer-m (Xu et al., 2021) | √ | √ | - | 20.8 |
# MuST-C
MuST-C is a multilingual speech translation corpus whose size and quality facilitate the training of end-to-end systems for speech translation from English into several languages. For each target language, MuST-C comprises several hundred hours of audio recordings from English TED Talks, automatically aligned at the sentence level with their manual transcriptions and translations.
We report case-sensitive detokenized BLEU computed with the sacrebleu toolkit.
Model | external audio | external ASR | external MT | DE | ES | FR | IT | NL | PT | RO | RU |
---|---|---|---|---|---|---|---|---|---|---|---|
NeurST transf-s (Zhao et al., 2020) | × | × | × | 22.8 | 27.4 | 33.3 | 22.9 | 27.2 | 28.7 | 22.2 | 15.1 |
Espnet-ST transf-s (Inaguma et al., 2020) | × | × | × | 22.9 | 28.0 | 32.8 | 23.8 | 27.4 | 28.0 | 21.9 | 15.8 |
fairseq s2t transf-s (Wang et al., 2020) | × | × | × | 22.7 | 27.2 | 32.9 | 22.7 | 27.3 | 28.1 | 21.9 | 15.3 |
ST+AFS(t,f) transf-m (Zhang et al., 2020) | × | × | × | 22.4 | 26.9 | 31.6 | 23.0 | 24.9 | 26.3 | 21.0 | 14.7 |
Chimera (w2v2 transf-m) (Han et al., 2021) | √ | × | √ | 27.1 | 30.6 | 35.6 | 25.0 | 29.2 | 30.2 | 24.0 | 17.4 |
XSTNet (w2v2 transf-m mtl) (Ye et al., 2021) | √ | × | × | 25.5 | - | 36.0 | - | - | - | - | 16.9 |
XSTNet (w2v2 transf-m mtl) (Ye et al., 2021) | √ | × | √ | 27.1 | - | 38.0 | - | - | - | - | 18.4 |
SATE transf-s (Xu et al., 2021) | × | × | × | 25.2 | - | - | - | - | - | - | - |
SATE conformer-m (Xu et al., 2021) | × | √ | √ | 28.1 | - | - | - | - | - | - | - |