# GigaST
GigaST is a large-scale speech translation corpus created by translating the transcriptions in GigaSpeech, a multi-domain English speech recognition corpus with 10,000 hours of labeled audio. The training data was translated by a strong machine translation system, and the test data was produced by professional human translators.
# Download
The GigaST dataset can be downloaded from:
| Language | Version | Link |
|---|---|---|
| En-De | v1.0.0 | GigaST.de.json |
| En-Zh | v1.0.0 | GigaST.zh.json |
The corresponding audio recordings and transcriptions can be found in GigaSpeech.
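Once one of the files above has been downloaded, a quick way to get oriented is to inspect its top-level structure before writing any processing code. The snippet below is a minimal sketch that makes no assumption about the GigaST schema: it only reports whether the file holds a JSON object or array and shows a few keys. The helper name `summarize_json` and the local path `GigaST.de.json` are illustrative assumptions, not part of the official tooling. Note that `json.load` reads the whole file into memory.

```python
import json
from pathlib import Path


def summarize_json(path, max_items=5):
    """Return a short description of the top-level structure of a JSON file.

    Hypothetical helper for illustration; it does not assume any
    particular GigaST schema.
    """
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)  # loads the entire file into memory

    if isinstance(data, dict):
        # Report the number of keys and a small sample of them.
        return {"type": "dict", "size": len(data), "keys": list(data)[:max_items]}
    if isinstance(data, list):
        # Report the length and the type of the first element, if any.
        first = type(data[0]).__name__ if data else None
        return {"type": "list", "size": len(data), "first_item_type": first}
    return {"type": type(data).__name__}


if __name__ == "__main__":
    # Assumed location: the downloaded file sits in the working directory.
    path = Path("GigaST.de.json")
    if path.exists():
        print(summarize_json(path))
    else:
        print(f"{path} not found; download it first.")
```

From the reported keys, the nested entries can then be explored the same way and matched against the GigaSpeech audio and transcriptions.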
# Preparation Guidelines
See the instructions on GitHub.
# Citation
@Article{gigast,
author = {Ye, Rong and Zhao, Chengqi and Ko, Tom and Meng, Chutong and Wang, Tao and Wang, Mingxuan and Cao, Jun},
journal = {arXiv preprint arXiv:2204.03939},
title = {GigaST: A 10,000-hour Pseudo Speech Translation Corpus},
year = {2022},
}
# License
The GigaST dataset is available to download for non-commercial purposes under a Creative Commons Attribution-NonCommercial 4.0 International License.
# Acknowledgement
The GigaSpeech dataset was essential to the creation of GigaST. The authors are extremely grateful to the GigaSpeech contributors.