Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unlabelled Data Processors #87

Open
wants to merge 13 commits into
base: main
Choose a base branch
from
4 changes: 2 additions & 2 deletions dataset_configs/english/coraal/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -93,8 +93,8 @@ processors:
- {"pattern": '\baksing\b', "repl": "asking"}
- {"pattern": '\baksed\b', "repl": "asked"}
# removing unintelligible/redacted flags
- {"pattern": '/(?i)unintelligible/', "repl": ""}
- {"pattern": '/(?i)inaudible/', "repl": ""}
- {"pattern": '(?i)unintelligible/', "repl": ""}
- {"pattern": '(?i)inaudible/', "repl": ""}
- {"pattern": '/RD(.*?)/', "repl": ""}
- {"pattern": '/(\?)\1*/', "repl": ""}
# removing non-linguistic markers
Expand Down
53 changes: 53 additions & 0 deletions dataset_configs/multilingual/babel/config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
documentation: |
IARPA Babel Dataset
###################

This config is designed for the languages of the IARPA Babel Dataset available at https://catalog.ldc.upenn.edu.

It creates initial manifest for the specified data type and data split.
Further data processing steps should be performed based on the specific langauge.

**Required arguments**.

* **raw_data_dir**: specify path of the directory downloaded from LDC.
* **data_type**: should be "conversational" or "scripted".
* **resampled_audio_dir**: specify the directory path, where new processed audios should be located.
* **data_split**: should be "training", "untranscribed-training", "sub-train", "dev" or "eval".
* **output_manifest_file**: specify output manifest filepath.

**Output format**.

This config dumps the final manifest at ``${output_manifest_file}``.
The output manifest contains the following fields:

* **outputFn (str)**: initial audio filename.
* **sessID (str)**: session ID of the recording.
* **date (str)**: date of the recording.
* **time (str)**: time of the recording.
* **spkrCode (str)**: speaker ID.
* **lineType (str)**: type of the line (inline or outline).
* **dialect (str)**: dialect of the speaker.
* **gen (str)**: gender of the speaker.
* **envType (str)**: environment (i.e., home, office, etc.).
* **age (str)**: age of the speaker.
* **network (str)**: name of the telecommunications network.
* **phoneModel (str)**: model of the phone.
* **sampleCount (str)**: count of the sample.
* **sampleRate (str)**: original sample rate of the recording.
* **audio_filepath (str)**: path to the processed audio file.
* **duration (float)**: duration of the audio in seconds.

processors_to_run: all
workspace_dir: ???
data_type: scripted
resampled_audio_dir: ${workspace_dir}/processed/${data_type}/${data_split}
data_split: training
final_manifest: ${workspace_dir}/processed/${data_type}/${data_split}_manifest.json

processors:
- _target_: sdp.processors.CreateInitialManifestBabel
raw_data_dir: ${workspace_dir}
data_type: ${data_type}
data_split: ${data_split}
resampled_audio_dir: ${resampled_audio_dir}
output_manifest_file: ${final_manifest}
39 changes: 39 additions & 0 deletions dataset_configs/multilingual/voxpopuli/config_un.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
documentation: |
Voxpopuli unlabelled subset
###########################

This config can be used to prepare
`Voxpopuli dataset unlabelled subset <https://github.com/facebookresearch/voxpopuli/>`_
dataset in the NeMo format.

It creates initial manifest for the specified language.

**Required arguments**.

* **raw_data_dir**: specify the directory where the downloaded data will be/is saved.
* **language_id**: specify the language of the data you wish to be downloaded and/or processed.
* **resampled_audio_dir**: specify the directory path, where new processed audios should be located.
* **delete_raw_file**: specify if the initial raw audio files should be deleted or not.


**Output format**.

This config dumps the final manifest at ``${resampled_audio_dir}/${language_id}/manifest.json``.
The output manifest contains the following fields:

* **audio_filepath (str)**: path to the processed audio file.
* **duration (float)**: duration of the audio in seconds.

processors_to_run: all
workspace_dir: ???
language_id: hu_v2
resampled_data_dir: ${workspace_dir}/unlabelled_processed/
final_manifest: ${workspace_dir}/unlabelled_processed/${language_id}/manifest.json

processors:
- _target_: sdp.processors.CreateInitialManifestVoxpopuliUnlabelled
raw_data_dir: ${workspace_dir}
language_id: ${language_id}
resampled_data_dir: ${resampled_data_dir}
delete_raw_file: False
output_manifest_file: ${final_manifest}
35 changes: 35 additions & 0 deletions dataset_configs/multilingual/yodas/config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
processors_to_run: all
manifest: ???
resampled_audio_dir: ???
out_manifest: ???
char_rate: 10
min_duration: 1.5
max_duration: 40.1

processors:
- _target_: sdp.processors.GetAudioDuration
audio_filepath_key: audio_filepath
duration_key: duration
input_manifest_file: ${manifest}

- _target_: sdp.processors.RandomSegment
min_duration: ${min_duration}
max_duration: ${max_duration}
resampled_audio_dir: ${resampled_audio_dir}
audio_format: flac

- _target_: sdp.processors.ASRInference
pretrained_model: nvidia/parakeet-ctc-0.6b

- _target_: sdp.processors.DropHighLowCharrate
low_charrate_threshold: ${char_rate}
text_key: pred_text
high_charrate_threshold: 10000

- _target_: sdp.processors.KeepOnlySpecifiedFields
fields_to_keep:
- audio_filepath
- duration

- _target_: sdp.processors.DropCorrupted
output_manifest_file: ${out_manifest}
25 changes: 24 additions & 1 deletion docs/src/sdp/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,9 @@ VoxPopuli
.. autodata:: sdp.processors.CreateInitialManifestVoxpopuli
:annotation:

.. autodata:: sdp.processors.CreateInitialManifestVoxpopuliUnlabelled
:annotation:

.. autodata:: sdp.processors.NormalizeFromNonPCTextVoxpopuli
:annotation:

Expand All @@ -58,8 +61,13 @@ Librispeech

.. autodata:: sdp.processors.CreateInitialManifestLibrispeech
:annotation:


Babel
'''''''''''

.. autodata:: sdp.processors.CreateInitialManifestBabel
:annotation:

SLR83
'''''

Expand Down Expand Up @@ -158,6 +166,18 @@ Data modifications
.. autodata:: sdp.processors.InverseNormalizeText
:annotation:

.. autodata:: sdp.processors.RandomSegment
:annotation:

.. autodata:: sdp.processors.UntarAudios
:annotation:

.. autodata:: sdp.processors.ExtractFilesFromTar
:annotation:

.. autodata:: sdp.processors.RemoveEmojis
:annotation:

Data filtering
''''''''''''''

Expand Down Expand Up @@ -237,6 +257,9 @@ Data filtering
.. autodata:: sdp.processors.DropRepeatedFields
:annotation:

.. autodata:: sdp.processors.DropCorrupted
:annotation:


Miscellaneous
#############
Expand Down
20 changes: 20 additions & 0 deletions docs/src/sdp/existing_configs.rst
Original file line number Diff line number Diff line change
Expand Up @@ -92,12 +92,16 @@ VoxPopuli
* **Spanish**:
`config <https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/spanish_pc/voxpopuli/config.yaml>`__ |
:doc:`documentation <config-docs/spanish_pc/voxpopuli/config>`
* **Multilingual**:
`config <https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/multilingual/voxpopuli/config_un.yaml>`__ |
:doc:`documentation <config-docs/multilingual/voxpopuli/config_un>`

.. toctree::
:hidden:

config-docs/italian/voxpopuli/config
config-docs/spanish_pc/voxpopuli/config
config-docs/multilingual/voxpopuli/config_un

Fisher
~~~~~~
Expand Down Expand Up @@ -237,6 +241,22 @@ MTEDx

config-docs/portuguese/mtedx/config

Babel
~~~~~~

**Dataset link:** https://www.ldc.upenn.edu

**Supported configs**.

* **Multilingual**:
`config <https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/multilingual/babel/config.yaml>`__ |
:doc:`documentation <config-docs/multilingual/babel/config>`

.. toctree::
:hidden:

config-docs/multilingual/babel/config

Kazakh Speech Dataset (SLR140)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expand Down
5 changes: 4 additions & 1 deletion pytest.ini
Original file line number Diff line number Diff line change
@@ -1,2 +1,5 @@
[pytest]
addopts = --doctest-modules
addopts = --doctest-modules
markers =
dependency: mark a test as a dependent on the other mentioned test.
slow: marks tests as slow (deselect with '-m "not slow"').
4 changes: 2 additions & 2 deletions requirements/huggingface.txt
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
accelerate
transformers>=0.2.1
accelerate==0.34.2
transformers==4.39
huggingface_hub>=0.20.3,<0.24.0 # https://github.com/NVIDIA/NeMo/issues/9793
4 changes: 2 additions & 2 deletions requirements/main.txt
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@ ffmpeg
hydra-core
joblib
librosa>=0.10.0 # specify >=0.10.0 so that librosa.get_duration(path=...) will work
numpy
numpy==1.26.4
omegaconf
pandas
rarfile
Expand All @@ -13,7 +13,7 @@ sox
tqdm
webvtt-py
wget

pydub
# for some processers, additionally https://github.com/NVIDIA/NeMo is required
# for some processers, additionally nemo_text_processing is required
# for mcv: apt-get update && apt-get upgrade -y && apt-get install -y sox libsox-fmt-all
7 changes: 7 additions & 0 deletions sdp/processors/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,9 @@
)
from sdp.processors.datasets.voxpopuli.create_initial_manifest import (
CreateInitialManifestVoxpopuli,
CreateInitialManifestVoxpopuliUnlabelled,
)
from sdp.processors.datasets.babel.create_initial_manifest import CreateInitialManifestBabel
from sdp.processors.datasets.voxpopuli.normalize_from_non_pc_text import (
NormalizeFromNonPCTextVoxpopuli,
)
Expand Down Expand Up @@ -80,6 +82,10 @@
SubIfASRSubstitution,
SubMakeLowercase,
SubRegex,
ExtractFilesFromTar,
RandomSegment,
UntarAudios,
RemoveEmojis
)
from sdp.processors.modify_manifest.data_to_dropbool import (
DropASRError,
Expand All @@ -97,6 +103,7 @@
DropOnAttribute,
PreserveByValue,
DropRepeatedFields,
DropCorrupted,
)
from sdp.processors.modify_manifest.make_letters_uppercase_after_period import (
MakeLettersUppercaseAfterPeriod,
Expand Down
Loading
Loading