🤗 Datasets
We will create two audio 🤗 datasets: one for the preliminary evaluation and another for Whisper fine-tuning. We can share any dataset with anyone by creating a dataset repository on the Hugging Face Hub.
prEval dataset
prEval preview
| audio | description | AQ | CSl | IB | palo | quejios | region | sex | WER |
|---|---|---|---|---|---|---|---|---|---|
| | Venganza, Manolo Caracol | 1 | 0.00 | 0 | fandango | 0 | sevilla | male | ? |
| | De una mina de la Union, Camarón de la Isla | 5 | 0.00 | 1 | minera | 0.40 | cadiz | male | ? |
| | En lo alto del cerro de Palomares, Estrella Morente | 5 | 0.07 | 2 | tangos de granada | 0.26 | granada | female | ? |
Create
When creating our repository, we will make sure we are in an environment with the huggingface_hub CLI and the Datasets library installed (see the Working env section), and we will follow the instructions in the 🤗 Docs (all the commands below apply to my own setup):
- Log in using Hugging Face Hub credentials:
- Install Git-LFS, which will be needed if we manage large files, and create a new dataset repository:
- Clone our repository:
- Move our dataset files (for the moment, two empty files and one folder) to the repository directory, then commit and push our files:

Once created, our dataset structure looks like this:
Update
When updating prEval, we want the flamenco songs to be as diverse as possible, so that the selected features are balanced and the preliminary evaluation yields useful insights. For this purpose, it is useful to look at the dataset from time to time and check the distribution of each feature, to ensure the values are diverse enough and not unbalanced. [I do it with R (RStudio) like this]. If a feature has a skewed distribution, we should try to balance it with the next audio files we add to the dataset.
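The check described above is done in R in the original workflow. As a rough Python equivalent, the sketch below counts how often each value of a categorical feature appears. The helper name `feature_distribution` is an assumption made for illustration, and the sample rows simply reuse the three entries from the prEval preview table.

```python
from collections import Counter


def feature_distribution(rows, column):
    """Count how many samples share each value of one feature (hypothetical helper)."""
    return Counter(row[column] for row in rows)


# Three rows taken from the prEval preview; a real check would read metadata.csv
# with csv.DictReader instead of using a hard-coded list.
sample_rows = [
    {"palo": "fandango", "region": "sevilla", "sex": "male"},
    {"palo": "minera", "region": "cadiz", "sex": "male"},
    {"palo": "tangos de granada", "region": "granada", "sex": "female"},
]

for column in ("palo", "region", "sex"):
    print(column, dict(feature_distribution(sample_rows, column)))
```

A feature whose counts are dominated by a single value (for example, mostly `male` samples) is a candidate for balancing with the next additions.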
This is the workflow I used for adding new files to the prEval dataset:
Example
In these steps I will be using the song De una mina de la Union, by Camaron de la Isla ("Son tus ojos dos estrellas", 1971), available here.
- Activate the whisper env, go to the dataset path (`/home/jcalle/whisper/datasets/prEval`) and create the `extended_metadata.csv` file, manually writing the column names (first line):
- Generate a hash (ID) for the audio file using `sha1.py`.

```
$ python3 ../../scripts/sha1.py extended_metadata.csv
Enter the song name: De una mina de la Union
Enter the author name: Camaron de la Isla
Enter the album name: Son tus ojos dos estrellas
Enter the year: 1971
Enter the link: https://youtu.be/VnKJnIU-8iY
```
Prompts must be filled in without any Spanish accents or punctuation marks (except for links). Use uppercase letters where appropriate.
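To reduce typing mistakes when applying this rule, the values could be cleaned programmatically before being entered. The following is a minimal sketch, not part of the original scripts: `sanitize_prompt` is a hypothetical helper that strips accents and punctuation while keeping the casing. Note that it also turns "ñ" into "n", and it should not be applied to links.

```python
import string
import unicodedata


def sanitize_prompt(text):
    """Strip accents and punctuation while keeping the original casing."""
    # NFD splits accented letters into a base letter plus a combining mark,
    # e.g. "ó" becomes "o" followed by U+0301 (category "Mn").
    decomposed = unicodedata.normalize("NFD", text)
    without_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Drop ASCII punctuation plus Unicode punctuation such as "¡" or "¿".
    cleaned = "".join(
        ch for ch in without_accents
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith("P")
    )
    # Collapse any doubled spaces left behind by removed characters.
    return " ".join(cleaned.split())


print(sanitize_prompt("De una mina de la Unión"))  # De una mina de la Union
print(sanitize_prompt("Camarón de la Isla"))       # Camaron de la Isla
```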
sha1.py script and expected output
This Python script serves two functions:
- On one hand, it prompts you for information related to the audio sample and generates a unique ID: a truncated SHA-1 hash computed from song title + author.
- On the other hand, it takes the metadata CSV file (here `extended_metadata.csv`, which stores relevant information and features about the audio sample) as an argument when calling the script and automatically appends the provided information as a single line of comma-separated values.

```python
import hashlib
import sys

# Function to generate SHA-1 hash
def generate_sha1_hash(input_string):
    return hashlib.sha1(input_string.encode()).hexdigest()

# Prompt for input variables
song = input("Enter the song name: ")
author = input("Enter the author name: ")
album = input("Enter the album name: ")
year = input("Enter the year: ")
link = input("Enter the link: ")

# Generate hash from song + author
song_author = song + author
song_author_hash = generate_sha1_hash(song_author)

# Add prefix and suffix to hash (keep only the first 10 hex digits)
song_author_hash = "data/" + song_author_hash[:10] + ".mp3"

# Output variables in specific order
output = f"{song_author_hash},{song},{author},{album},{year},{link}"

# Append output to specified CSV file
csv_file = sys.argv[1]
with open(csv_file, "a") as file:
    file.write(output + "\n")
```
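| file_name | song | author | album | year | link | transcription | AQ_w | AQ_i | AQ_b | AQ_o | AQ | CSl | IB_guitar | IB_percussion | IB_jaleos | IB_others | IB | palo | quejios | region | sex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| data/d7f36ef3e7.mp3 | De una mina de la Union | Camaron de la Isla | Son tus ojos dos estrellas | 1971 | https://youtu.be/VnKJnIU-8iY | | | | | | | | | | | | | | | | |

The last fields are still empty; we will fill them in step 4.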
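One caveat of `sha1.py` is that it builds the row by plain string concatenation, so a field containing a comma would shift the columns. The sketch below shows one possible variant that delegates quoting to `csv.writer`. The function names `make_audio_id` and `append_metadata_row` are assumptions for illustration. The ID logic mirrors the script: the first 10 hex digits of SHA-1 over song + author.

```python
import csv
import hashlib


def make_audio_id(song, author):
    """Mirror sha1.py: first 10 hex digits of SHA-1(song + author)."""
    digest = hashlib.sha1((song + author).encode()).hexdigest()
    return "data/" + digest[:10] + ".mp3"


def append_metadata_row(csv_path, song, author, album, year, link):
    """Append one row, letting csv.writer quote fields that contain commas."""
    file_name = make_audio_id(song, author)
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow([file_name, song, author, album, year, link])
    return file_name
```

Because the ID depends only on song and author, calling `make_audio_id` twice with the same inputs returns the same file name, which also makes it easy to spot a song that was already added.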
- Obtain and save the audio file. I have two ways of doing this: (i) digitally recording a vinyl record or (ii) downloading it from YouTube using youtube-dl:
- Run pre-trained Whisper and obtain the original transcription. Since we ultimately have to run Whisper on each audio file anyway, I do it at this point because its output is a convenient starting point for manually transcribing the song. If there are words I cannot understand, I consult other sources to complete the transcription.
  - Run pre-trained Whisper:

```bash
nohup whisper data/d7f36ef3e7.mp3 --model medium --task transcribe --language es -f txt -o pre-whisper.out/d7f36ef3e7 &
```
  - Manually correct the transcription in a copy of the Whisper TXT output (`original_d7f36ef3e7.txt`):
  - Format the corrected transcription to remove punctuation marks and convert line breaks to white spaces (`outformat_whisper.sh`):

```bash
#!/bin/bash

# Check if a file name is provided
if [ $# -eq 0 ]; then
    echo "Please provide the file name as an argument."
    exit 1
fi

# Check if the file exists
if [ ! -f "$1" ]; then
    echo "File '$1' not found."
    exit 1
fi

# Process the file: join all lines with spaces, strip punctuation and
# lowercase everything (\L is a GNU sed extension)
temp_file="temp.txt"
sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g' -e 's/[[:punct:]]//g' -e 's/.*/\L&/' "$1" > "$temp_file"
mv "$temp_file" "$1"

echo "File '$1' has been processed and overwritten."
```
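The `\L` replacement used in the script above is specific to GNU sed, so it may not work on systems that ship a different sed. As a hedged alternative, the same three steps (join lines, strip punctuation, lowercase) can be sketched in Python. `normalize_transcription` is a hypothetical name, and its treatment of non-ASCII punctuation may differ slightly from sed depending on the locale.

```python
import string
import unicodedata


def normalize_transcription(text):
    """Join lines with spaces, strip punctuation, and lowercase the text."""
    joined = " ".join(text.splitlines())
    no_punct = "".join(
        ch for ch in joined
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith("P")
    )
    return no_punct.lower()


print(normalize_transcription("Ay, ay, ay, que ¡ay!\nLa Unión."))
# ay ay ay que ay la unión
```

Accented letters are kept on purpose here, since the transcription field of the example row retains them ("unión", "compañeros").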
- Manually extract features from the audio file. I use the Edit CSV VSCode extension to comfortably fill out the `extended_metadata.csv` file.
  - Transcription. Copy the transcription from `original_d7f36ef3e7.txt` to the transcription field of the `extended_metadata.csv` file.
  - Audio quality (AQ). Try to identify each type of sound imperfection one at a time, and put "1" for each one we detect (we can put "2" if the imperfection is exaggerated):
    - White noise (AQ_w): static background noise (example).
    - Impulse noise (AQ_i): unwanted instantaneous sharp sounds, like clicks and pops (example).
    - Background noise (AQ_b): any other sound besides the ones being monitored (we can include echo in this category) (example).
    - Oversaturation (AQ_o): vocals are so loud that the sound is distorted (example).
  - Cultural-specific lyrics (CSl). Use `word_count.sh` to determine the total number of unique words and manually scan the list for cultural-specific words. Divide the latter by the former to obtain the score.

```bash
#!/bin/bash

# Check if a file name is provided
if [ $# -eq 0 ]; then
    echo "Please provide the file name as an argument."
    exit 1
fi

# Check if the file exists
if [ ! -f "$1" ]; then
    echo "File '$1' not found."
    exit 1
fi

# Process the file: one word per line, lowercased, punctuation removed,
# then count the occurrences of each distinct word
word_count=$(tr -s '[:space:]' '\n' < "$1" | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | sort | uniq -c)

# Output the number of unique words
echo "Number of unique words: $(echo "$word_count" | wc -l)"

# Output the count of occurrences for each word
echo "Word occurrences:"
echo "$word_count"
```
There are a total of 24 unique words, none of them cultural-specific, so the score is:
\(\text{CSl} = \frac{\text{\# cultural-specific words}}{\text{\# unique words}} = \frac{0}{24} = 0\)
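The same calculation can be sketched in Python. This is only an illustration: `csl_score` is a hypothetical helper, the list of cultural-specific words still has to be chosen by hand, and the lyrics are the normalized transcription of the example song as stored in `metadata.csv`.

```python
def csl_score(transcription, cultural_words):
    """Return (unique word count, CSl) for a normalized transcription."""
    unique_words = set(transcription.lower().split())
    if not unique_words:
        return 0, 0.0
    # Only words that actually occur in the lyrics count towards the score.
    matches = unique_words & {word.lower() for word in cultural_words}
    return len(unique_words), len(matches) / len(unique_words)


lyrics = (
    "ay ay ay que ay la unión me van a hacer barrenero de las minas "
    "de la unión y entre todos mis compañeros ay me van a regalar "
    "un farol porque no tengo dinero"
)

# No word was judged cultural-specific for this song, hence the empty set.
print(csl_score(lyrics, cultural_words=set()))  # (24, 0.0)
```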
  - Instrumental background (IB). Check for different instrumental backgrounds sounding at the same time as the vocals, and add "1" for each one we detect (or "2" if it impairs understanding of the lyrics):
  - Palo. Indicates the subgenre of flamenco to which the song belongs.
  - Quejíos. Refers to the proportion of sorrowful moans in the audio sample. To measure it, we use an online tool with two simultaneous clocks: with one we measure the duration of the whole vocal part, and with the other the duration of the quejíos within the vocals. The feature is a dimensionless ratio, calculated as:
\(\text{quejíos} = \frac{\text{duration of quejíos (s)}}{\text{duration of vocals (s)}} = \frac{31\ \text{s}}{77\ \text{s}} \approx 0.40\)
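As a small worked check of this ratio, the sketch below computes it from the two stopwatch readings and rejects inconsistent inputs. The function name `quejios_ratio` and the rounding to two decimals are assumptions made to match the values shown in the metadata.

```python
def quejios_ratio(quejios_seconds, vocals_seconds):
    """Fraction of the vocal part taken up by quejíos (dimensionless)."""
    if vocals_seconds <= 0:
        raise ValueError("vocals_seconds must be positive")
    if not 0 <= quejios_seconds <= vocals_seconds:
        raise ValueError("quejios_seconds must lie within the vocal duration")
    return round(quejios_seconds / vocals_seconds, 2)


print(quejios_ratio(31, 77))  # 0.4
```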
  - Region and sex. Fill out these fields according to the region of origin and the sex of the author.
- Summarize metadata. Summarize the AQ and IB variables of the `extended_metadata.csv` file into `metadata.csv`, which is the file we will use with 🤗:

```python
import csv
import sys

# Function to combine AQ columns: start at 5 and subtract the imperfection flags
def combine_aq(aq_w, aq_i, aq_b, aq_o):
    aq_sum = aq_w + aq_i + aq_b + aq_o
    aq_combined = 5 - aq_sum
    aq_combined = max(aq_combined, 0)
    aq_combined = min(aq_combined, 5)
    return aq_combined

# Function to combine IB columns: sum the instrumental flags, capped at 5
def combine_ib(ib_guitar, ib_percussion, ib_jaleos, ib_others):
    ib_combined = ib_guitar + ib_percussion + ib_jaleos + ib_others
    ib_combined = min(ib_combined, 5)
    ib_combined = max(ib_combined, 0)
    return ib_combined

# Check if a file name is provided as a command-line argument
if len(sys.argv) < 2:
    print("Please provide the input CSV file as an argument.")
    sys.exit(1)

# Specify output file name
output_file = "metadata.csv"

# Read input file and combine columns
with open(sys.argv[1], "r") as csvfile_in, open(output_file, "w", newline="") as csvfile_out:
    reader = csv.DictReader(csvfile_in)
    fieldnames_out = ['file_name', 'song', 'author', 'album', 'year', 'link',
                      'transcription', 'AQ', 'CSl', 'IB', 'palo', 'quejios',
                      'region', 'sex']
    writer = csv.DictWriter(csvfile_out, fieldnames=fieldnames_out)
    writer.writeheader()

    for row in reader:
        aq_combined = combine_aq(float(row['AQ_w']), float(row['AQ_i']),
                                 float(row['AQ_b']), float(row['AQ_o']))
        ib_combined = combine_ib(float(row['IB_guitar']), float(row['IB_percussion']),
                                 float(row['IB_jaleos']), float(row['IB_others']))
        output_row = {
            'file_name': row['file_name'],
            'song': row['song'],
            'author': row['author'],
            'album': row['album'],
            'year': row['year'],
            'link': row['link'],
            'transcription': row['transcription'],
            'AQ': aq_combined,
            'CSl': row['CSl'],
            'IB': ib_combined,
            'palo': row['palo'],
            'quejios': row['quejios'],
            'region': row['region'],
            'sex': row['sex']
        }
        writer.writerow(output_row)

print("Output file has been created: " + output_file)
```

| file_name | song | author | album | year | link | transcription | AQ | CSl | IB | palo | quejios | region | sex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| data/d7f36ef3e7.mp3 | De una mina de la Union | Camaron de la Isla | Son tus ojos dos estrellas | 1971 | https://youtu.be/VnKJnIU-8iY | ay ay ay que ay la unión me van a hacer barrenero de las minas de la unión y entre todos mis compañeros ay me van a regalar un farol porque no tengo dinero | 5.0 | 0 | 1.0 | minera | 0.40 | Cadiz | male |
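To make the scoring rule in the summarizing script easier to follow, here is a compact restatement of its arithmetic: AQ is 5 minus the sum of the imperfection flags, IB is the sum of the instrumental flags, and both are clamped to the 0 to 5 range. The `clamp` helper is introduced only for this sketch, and the individual flag values are assumed, since only the resulting totals (AQ = 5, IB = 1) are shown for the example song.

```python
def clamp(value, low=0, high=5):
    """Keep a score inside the 0-5 range used by AQ and IB."""
    return max(low, min(high, value))


def combine_aq(aq_w, aq_i, aq_b, aq_o):
    """AQ starts at 5 and loses one point per imperfection flag."""
    return clamp(5 - (aq_w + aq_i + aq_b + aq_o))


def combine_ib(ib_guitar, ib_percussion, ib_jaleos, ib_others):
    """IB is the sum of the instrumental-background flags, capped at 5."""
    return clamp(ib_guitar + ib_percussion + ib_jaleos + ib_others)


# Flag values below are assumed for illustration only.
print(combine_aq(0, 0, 0, 0))  # 5  (no imperfections detected)
print(combine_ib(1, 0, 0, 0))  # 1  (a single instrumental background)
print(combine_aq(2, 2, 1, 1))  # 0  (5 - 6 is negative, so it is clamped to 0)
```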
- Push changes to the 🤗 repository.