This repository provides a set of audio processing pipelines designed to facilitate the creation of a music dataset. The tools included help in downloading audio files, creating prompts, and splitting audio into manageable segments.
Before you begin, you need to install the required dependencies. Run the following command to install them:
python3 -m pip install -r requirements.txt
This command installs all the necessary Python packages listed in the requirements.txt file. Make sure you have Python 3 and pip installed on your system.
Start the rotating Tor HTTP proxy with Docker:
sudo docker run -d --rm -it -p 3128:3128 -p 4444:4444 -e "TOR_INSTANCES=40" jourdelune/rotating-tor-http-proxy
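Once the container is running, a client can be pointed at port 3128, which the command above publishes. The following is a minimal sketch using only the standard library. The helper names `proxy_settings` and `build_proxy_opener` are hypothetical, and the sketch assumes the container runs on localhost and that port 3128 serves the HTTP proxy (suggested by the image name, not confirmed by this repository).

```python
import urllib.request


def proxy_settings(host="localhost", port=3128):
    """Return a proxies mapping that sends HTTP and HTTPS traffic to the proxy.

    Assumption: 3128 is the HTTP proxy port published by the container.
    """
    url = f"http://{host}:{port}"
    return {"http": url, "https": url}


def build_proxy_opener(host="localhost", port=3128):
    """Build a urllib opener that routes its requests through the proxy."""
    handler = urllib.request.ProxyHandler(proxy_settings(host, port))
    return urllib.request.build_opener(handler)


# Example usage (needs the container running and network access):
#   opener = build_proxy_opener()
#   with opener.open("https://example.com", timeout=30) as response:
#       print(response.status)
```

If the request in the usage comment succeeds, the proxy is reachable from the host. Whether the downloader picks up the proxy automatically is not stated here.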
The downloader module is responsible for fetching audio files from various sources. It ensures that the files are downloaded and stored in the appropriate directory for further processing.
python3 -m scripts.downloader --input_dataset WaveGenAI/youtube-cc-by-music --cache_dir PATH --max_files 50000 --shuffle
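The `--max_files` and `--shuffle` flags suggest that the downloader takes a capped, optionally randomized subset of the input dataset. The sketch below illustrates that idea only. The function `select_entries` and its `seed` parameter are hypothetical, and the actual script may apply the flags differently.

```python
import random


def select_entries(entries, max_files, shuffle=False, seed=None):
    """Return at most max_files entries, optionally shuffled first.

    Assumption: shuffling happens before the cap is applied, so a shuffled
    run samples from the whole dataset rather than from its first rows.
    """
    items = list(entries)
    if shuffle:
        # A dedicated Random instance keeps the global random state untouched.
        random.Random(seed).shuffle(items)
    return items[:max_files]


# Example: pick up to 3 of 10 hypothetical entries in random order.
subset = select_entries(range(10), max_files=3, shuffle=True, seed=0)
```

Passing a fixed `seed` makes the selection reproducible between runs, which can help when resuming an interrupted download.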
The prompt creation module generates prompts based on the descriptions of the audio.
python3 -m scripts.prompt_creator --input_dataset HUGGING_FACE_DS --use_cache --cache_dir DIR
The split audio module takes the downloaded audio files and splits them into smaller, manageable segments. This is useful for processing large audio files and making them easier to handle in subsequent steps.
python3 -m scripts.split_data --input_dir DIR --output_dir DIR --remove-original --chunk-size 30
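The splitting step can be pictured as computing sample ranges of fixed duration. The sketch below assumes that `--chunk-size 30` means 30 seconds. The function `chunk_bounds` and its `keep_remainder` option are hypothetical, and how the actual script treats a shorter final segment is not specified here.

```python
def chunk_bounds(total_samples, sample_rate, chunk_seconds=30, keep_remainder=True):
    """Return (start, end) sample index pairs covering the audio in fixed-size chunks.

    The end index is exclusive. The last chunk may be shorter than the others;
    it is dropped when keep_remainder is False.
    """
    if sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("sample_rate and chunk_seconds must be positive")

    chunk_len = int(sample_rate * chunk_seconds)
    bounds = []
    for start in range(0, total_samples, chunk_len):
        end = min(start + chunk_len, total_samples)
        if end - start < chunk_len and not keep_remainder:
            break
        bounds.append((start, end))
    return bounds


# Example: a 75-second file sampled at 44.1 kHz.
segments = chunk_bounds(75 * 44100, 44100, chunk_seconds=30)
```

With these inputs, `segments` holds two full 30-second ranges followed by one 15-second remainder.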
Push the dataset to Hugging Face for further processing.
python3 -m scripts.push_to_huggingface --input_dir DIR --output_dataset NAME
The codec conversion module converts audio files to DAC format. The converted files can then be used to train a transformer model.
python3 -m scripts.codec_generator --input_dataset DIR --output_dataset DIR --max_files 500000
To push to Hugging Face, use:
huggingface-cli upload-large-folder DATASET_NAME --repo-type=dataset DIR --num-workers=16