Advancing automatic speech recognition for low-resource Ghanaian languages: Audio datasets for Akan, Ewe, Dagbani, Dagaare, and Ikposo
Date
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Data in Brief
Abstract
Audio datasets are fundamental to the development of automatic speech recognition (ASR) systems. However, large audio corpora for low-resource languages (LRLs) remain scarce. This study addresses this gap by introducing audio speech datasets for five low-resource languages spoken in Ghana and parts of Togo. Specifically, it presents a 5000-hour speech corpus in Akan, Ewe, Dagbani, Dagaare, and Ikposo. Each language corpus includes 1000 h of validated audio speech recorded by indigenous speakers of that language. These audio recordings are spoken descriptions of 1000 culturally relevant images, collected using a custom Android mobile application. To enhance the dataset's utility in ASR and linguistic research, 10 % of the audio recordings for each language were randomly selected and transcribed, resulting in approximately 100 h of transcribed audio per language. This dataset represents a critical resource for preserving and documenting Ghanaian languages, and it holds the potential for advancing speech and language technologies in these languages. Creating this audio dataset is the first step towards bridging the technological gap between high- and low-resource languages. Ethical guidelines were strictly followed throughout the data collection process, and participants were given incentives for lending their voices to this study.
Description
Research Article
