New extended prompt format for Canary, short utterances inference fix, and training micro-optimizations #11058
Conversation
Signed-off-by: Piotr Żelasko <[email protected]>
So I'm a bit hesitant to create a separate Canary2 prompt class. I think it sets a bad precedent for future Canary releases that could lead to bloat. Is there any compelling reason the original Canary tokenizer class can't just be redefined, with a separate conversion script that converts the old Canary model into the new version? There's only one such model anyhow, so it's not like it's going to create absurd cascade problems for users.
any_special_token_present = False
lang_dict_compat = {"en": "en-US", "es": "es-ES", "fr": "fr-FR", "de": "de-DE"}
es-US
f"Please ensure that every utterance in the input manifests contains these keys." | ||
) | ||
|
||
optional_slots = { |
Is there a check to maintain consistency across slots, or are you allowing variable slot sizes (e.g. sample 1 has itn but sample 2 does not have that slot)?
We are combining different datasets annotated for different things, so the flexibility is required at least at the level of each manifest, if not per sample.
Hmm, most of these are binary prompts. What about just having default values for the prompt slots instead? Otherwise I believe y'all are adding the extra difficulty of the model needing to learn variable-length prompt inputs.
There is no variable length (except if you add decoder context). We indeed have default values for slots here.
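To make that concrete, here is a minimal sketch (not the NeMo implementation; slot names and default values are illustrative) of how optional slots with defaults keep the formatted prompt at a fixed length even when different manifests annotate different things:

REQUIRED_SLOTS = {"source_lang", "target_lang"}
OPTIONAL_SLOT_DEFAULTS = {"pnc": "yes", "itn": "no", "timestamp": "no", "diarize": "no"}

def resolve_slots(utterance: dict) -> dict:
    """Validate required slots and fill any missing optional slots with their defaults."""
    missing = REQUIRED_SLOTS - utterance.keys()
    if missing:
        raise KeyError(f"utterance is missing required slots: {sorted(missing)}")
    slots = dict(OPTIONAL_SLOT_DEFAULTS)
    slots.update({k: v for k, v in utterance.items() if k in REQUIRED_SLOTS or k in OPTIONAL_SLOT_DEFAULTS})
    return slots

# A manifest entry without 'itn' still yields the full, fixed-size slot set:
print(resolve_slots({"source_lang": "en", "target_lang": "en", "pnc": "no"}))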
@@ -0,0 +1,40 @@
#!/usr/bin/env python
I don't believe this script is necessary. It can just be an example in the Canary tutorial instead.
The people building the new tokenizer for Canary 3 or community Canary forks will have an easier time and won't need to guess how the previous special tokenizers were built :)
Those people will be using the tutorial anyhow, no? From my understanding, Canary usage hasn't picked up that much this year, so there's not a significant community. It's more likely you'll see greater growth with the current model, so providing backend support may not be crucial.
I'm re-doing the Canary 2 tokenizer right now and am grateful I put these scripts in here. Not removing them :)
@@ -0,0 +1,111 @@
#!/usr/bin/env python
ditto
# User prompt.
# This role is used for emotion / LID inference only - use it for just two decoder inference steps
# to retrieve the recognized emotion and language tokens.
"user_partial": {
Leaving a note that this doesn't cover cases like "identify language and transcribe with ITN" (which is fine for this version). We should put sufficient guards on what users are allowed to request in the transcribe fn.
That will be complicated to implement in the transcribe fn in general. We should think about whether a new API specifically for Canary would be cleaner.
+1 to a new API; even now the prompt setup can be a bit bulky. Putting thought into API changes before it's an actual problem could save some heartache.
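Purely for illustration, a dedicated prompt-config object is one shape such an API could take. Nothing below exists in NeMo or in this PR; the class, its fields, and the guard are invented, with the guard mirroring the "identify language and transcribe with ITN" case mentioned above.

from dataclasses import dataclass
from typing import Optional

@dataclass
class CanaryPromptConfig:
    target_lang: str
    source_lang: Optional[str] = None  # None = ask the model to identify the language
    pnc: bool = True
    itn: bool = False

    def validate(self) -> None:
        # One place to reject request combinations the model was not trained for,
        # e.g. language identification combined with ITN, per the note above.
        if self.source_lang is None and self.itn:
            raise ValueError("cannot request ITN while also asking the model to identify the language")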
)
# first, validate the utterance
expected_slots = {"source_lang", "target_lang", "pnc"}
How about expected_slots = formatter.expected_slots, and the same for optional_slots? Should we make all the slots expected, i.e. source_lang, target_lang, pnc, itn, timestamp, diarize, and set the default tags in the data config YAML if something is missing in the manifest file?
It will require rebuilding manifests (or at least data configs) for existing data, which is a bit of extra headache that I'm not sure is worth the time. WDYT?
It shouldn't require us to rebuild manifests; it should be doable using the tags we have in the train_ds YAML config. Or at least maybe the first part, i.e. expected_slots = formatter.expected_slots?
Discussed offline - keeping the current code to hide the extra complexity of the new Canary 2 slots from users that are not interested in them.
After further conversation - made pnc optional too.
if text.startswith(CANARY2_BOCTX):
    # Canary 2 prompt format. It starts with decoder context, which should be tokenized using
    # a different tokenizer than spl_tokens. We don't really know what it is, so we'll use the
    # following HACK solution: look up 5th token which is target_lang and tokenize this part
Limitation: the hack will error out if we are decoding with unk_lang as target_lang.
Yes. Better to use a single tokenizer in these cases...
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: pzelasko <[email protected]>
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
Signed-off-by: Piotr Żelasko <[email protected]>
from lhotse.testing.dummies import DummyManifest
from lhotse import CutSet, MonoCut, SupervisionSegment
from lhotse.testing.dummies import DummyManifest, dummy_cut
from lhotse.testing.random import deterministic_rng
Code scanning / CodeQL notice: Unused import (test)
…ISO lang codes Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
@@ -58,27 +58,7 @@ def _build_pos_enc(self, hidden_size, max_sequence_length, device=None):
        self.register_buffer('pos_enc', pos_enc)

    def forward(self, position_ids):
        max_pos_id = position_ids.max()
        # update positional encoding if needed
        if max_pos_id >= self._max_sequence_length:
This check is super costly as it triggers a DtoH transfer and CUDA sync on every call to transformer decoder forward, and the proposed solution doesn't work anyway (bad results instead of a crash).
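A minimal sketch of why (not the NeMo code; the sizes are arbitrary): comparing a CUDA tensor's max against a Python int inside an if forces the value onto the host, which means a device-to-host copy plus a stream synchronization on every decoder forward.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
position_ids = torch.arange(512, device=device)

max_pos_id = position_ids.max()      # GPU reduction kernel
if max_pos_id >= 2048:               # implicit .item(): DtoH transfer + sync on every call
    pass                             # (the old code rebuilt pos_enc here, which also gave bad results)

# The cheaper pattern this PR relies on: size `pos_enc` for the maximum sequence length once
# at construction time, so the hot path never has to inspect tensor values at all.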
@@ -67,7 +67,8 @@

def lens_to_mask(lens, max_length):
    batch_size = lens.shape[0]
    mask = torch.arange(max_length).repeat(batch_size, 1).to(lens.device) < lens[:, None]
    arange = torch.arange(max_length, device=lens.device)
    mask = arange.expand(batch_size, max_length) < lens.unsqueeze(1)
Micro-optimization: removes some copies and memory movement.
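For reference, a self-contained version of the change reproduced from the diff above (same behavior): the new form allocates the arange directly on the right device and compares through a broadcasted view instead of materializing the repeated tensor.

import torch

def lens_to_mask_old(lens: torch.Tensor, max_length: int) -> torch.Tensor:
    batch_size = lens.shape[0]
    return torch.arange(max_length).repeat(batch_size, 1).to(lens.device) < lens[:, None]

def lens_to_mask_new(lens: torch.Tensor, max_length: int) -> torch.Tensor:
    batch_size = lens.shape[0]
    arange = torch.arange(max_length, device=lens.device)             # allocated on the right device up front
    return arange.expand(batch_size, max_length) < lens.unsqueeze(1)  # expand() is a view, no copy

lens = torch.tensor([2, 5, 3])
assert torch.equal(lens_to_mask_old(lens, 6), lens_to_mask_new(lens, 6))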
Signed-off-by: Piotr Żelasko <[email protected]>
@@ -673,24 +674,33 @@ def training_step(self, batch: PromptedAudioToTextMiniBatch, batch_nb):
    return torch.tensor([0.0])

    input_ids, labels = batch.get_decoder_inputs_outputs()
    input_ids_lens = batch.prompted_transcript_lens - 1
Fixing an off-by-one issue that included an extra padding frame in the decoder masks.
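A toy example of the off-by-one, assuming get_decoder_inputs_outputs performs the usual teacher-forcing shift suggested by the diff above (token values are made up):

import torch

# One prompted transcript [BOS, p1, p2, t1, t2, EOS] of length 6.
prompted_transcript = torch.tensor([[1, 7, 8, 21, 22, 2]])
prompted_transcript_lens = torch.tensor([6])

input_ids = prompted_transcript[:, :-1]          # decoder sees tokens[:-1]
labels = prompted_transcript[:, 1:]              # and predicts tokens[1:]
input_ids_lens = prompted_transcript_lens - 1    # 5, not 6: otherwise the mask keeps an extra padding frame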
num_frames = batch.audio_lens.sum().float()
num_tokens = batch.prompted_transcript_lens.sum().float()
tot_frames = torch.as_tensor(batch.audio.numel(), device=num_frames.device, dtype=torch.float)
tot_tokens = torch.as_tensor(batch.prompted_transcript.numel(), device=num_frames.device, dtype=torch.float)
micro-optimizations
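For context, a small sketch with toy shapes (not the real batch object) of why the numel-based totals are cheap: numel() is computed from shape metadata on the host, so wrapping it with torch.as_tensor launches no GPU kernel, unlike the length sums.

import torch

audio = torch.zeros(4, 16000)                           # padded batch (B, T)
audio_lens = torch.tensor([16000, 12000, 8000, 16000])

num_frames = audio_lens.sum().float()                   # unpadded total: a reduction kernel on the device
tot_frames = torch.as_tensor(audio.numel(),             # padded total B*T: pure Python metadata, no kernel
                             device=num_frames.device, dtype=torch.float)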
'learning_rate': self._optimizer.param_groups[0]['lr'],
'batch_size': batch.audio.shape[0],
'learning_rate': torch.as_tensor(self._optimizer.param_groups[0]['lr']),
'batch_size': torch.as_tensor(batch.audio.shape[0]),
micro-optimizations (the PTL logger turned out to have an inefficient way of converting scalars to tensors)
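A minimal sketch of the pattern (attribute names mirror the diff above; this is not the full NeMo training_step): hand Lightning tensors instead of raw Python scalars so its slower scalar-to-tensor conversion never kicks in.

import torch

def _log_training_stats(self, batch):
    # `self` is assumed to be a LightningModule with `_optimizer` set, as in the diff above.
    self.log_dict(
        {
            'learning_rate': torch.as_tensor(self._optimizer.param_groups[0]['lr']),
            'batch_size': torch.as_tensor(batch.audio.shape[0]),
        },
        on_step=True,
    )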
Signed-off-by: Piotr Żelasko <[email protected]>
LGTM overall; left a minor comment.
Great work!
Signed-off-by: Piotr Żelasko <[email protected]>
LGTM
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base. Your code was analyzed with PyLint. The following annotations have been identified:
Thank you for improving NeMo's documentation!
[🤖]: Hi @pzelasko 👋, we wanted to let you know that a CICD pipeline for this PR just finished successfully, so it might be time to merge this PR or get some approvals. I'm just a bot, so I'll leave it to you what to do next. //cc @pablo-garay @ko3n1g
LGTM
What does this PR do?
Collection: ASR
Changelog
Usage
# Add a code snippet demonstrating how to use this
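The usage placeholder above is left empty in the PR. Purely as a hypothetical sketch, and assuming the prompt slots are exposed as transcribe() keyword arguments (the argument names and the checkpoint name below are illustrative, not confirmed by this PR):

from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")  # placeholder checkpoint name
hyps = model.transcribe(
    ["sample.wav"],
    source_lang="en",
    target_lang="en",
    pnc="yes",        # existing slot
    itn="no",         # new optional slots discussed in this PR (hypothetical kwargs)
    timestamp="no",
)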
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information