The dataset

We will use the LJ Speech dataset for this task (https://keithito.com/LJ-Speech-Dataset/). It contains 13,100 .wav recordings with their corresponding transcripts. Each transcript is available in both raw and normalized form. In the normalized version, numbers are written out in full words.

All recordings were produced by a single speaker. The total length of the audio content is roughly 24 hours, with clips that last from 1 to 10 seconds. This dataset is in the public domain, and there are no restrictions on its use.

Note that the dataset occupies roughly 3.8 GB on disk once the archive, downloadable from the link above, has been extracted.

The dataset folder contains a CSV file named metadata.csv, a README file, and a folder, /wavs, that contains the .wav audio files. metadata.csv consists of three columns, and each row corresponds to one of the 13,100 recordings. The first column gives the name of the corresponding .wav file, and the other two columns give the raw and normalized transcripts, respectively.
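The three-column layout described above can be read with Python's standard csv module. The following is a minimal sketch rather than a definitive loader. It assumes that the fields are separated by a pipe character (|), that the file has no header row, and that the first column holds the file name without the .wav extension, as in the official release. The names load_metadata and wav_path are hypothetical helpers introduced for illustration.

```python
import csv
from pathlib import Path


def load_metadata(csv_path):
    """Return a list of (file_id, raw_text, normalized_text) tuples."""
    rows = []
    with open(csv_path, encoding="utf-8", newline="") as f:
        # Assumption: fields are pipe-separated and there is no header row.
        # QUOTE_NONE keeps quotation marks inside transcripts as literal text.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) != 3:
                continue  # skip lines that do not have exactly three columns
            file_id, raw_text, normalized_text = row
            rows.append((file_id, raw_text, normalized_text))
    return rows


def wav_path(dataset_dir, file_id):
    """Build the path of a recording inside the wavs folder."""
    # Assumption: the ID may or may not already carry the .wav extension.
    name = file_id if file_id.endswith(".wav") else f"{file_id}.wav"
    return Path(dataset_dir) / "wavs" / name
```

Calling load_metadata on the extracted metadata.csv should return one tuple per recording, and wav_path maps the first element of each tuple to its audio file.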

Once downloaded, extract the archive containing the data into the /data folder.
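The extraction step can also be scripted. The sketch below uses shutil.unpack_archive from the standard library, which infers the archive format from the file extension, so it works whether the download is a .zip or a .tar.bz2 file. The function name extract_dataset and the default target directory "data" are assumptions made for illustration; adjust the paths to match your project layout.

```python
import shutil
from pathlib import Path


def extract_dataset(archive_path, target_dir="data"):
    """Extract the downloaded archive and list what landed in target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # The format (zip, tar.bz2, ...) is detected from the file extension.
    shutil.unpack_archive(str(archive_path), str(target))
    # Return the top-level entries so the result can be checked quickly.
    return sorted(p.name for p in target.iterdir())
```

After extraction, the returned list should contain the dataset folder, which in turn holds metadata.csv, the README file, and the /wavs folder.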
