New extended prompt format for Canary, short utterances inference fix, and training micro-optimizations #11058
Conversation
Signed-off-by: Piotr Żelasko <[email protected]>
So I'm a bit hesitant to create a separate Canary2 prompt class. I think it sets a bad precedent for future Canary releases that could lead to bloat. Is there any compelling reason the original Canary tokenizer class can't just be redefined, with a separate conversion script that converts the old Canary model into the new version? There's only one such model anyhow, so it's not like it's going to create absurd cascade problems for users.
any_special_token_present = False
lang_dict_compat = {"en": "en-US", "es": "es-ES", "fr": "fr-FR", "de": "de-DE"}
es-US
f"Please ensure that every utterance in the input manifests contains these keys." | ||
) | ||
|
||
optional_slots = { |
Is there a check to maintain consistency across slots, or are you allowing variable slot sizes (e.g. sample 1 has itn but sample 2 does not have that slot)?
We are combining different datasets annotated for different things, so the flexibility is required at least at the level of each manifest, if not per sample.
Hmm, most of these are binary prompts. What about just having default values for the prompt slots instead? Otherwise I believe y'all are adding the extra difficulty of the model needing to learn variable-length prompt inputs.
There is no variable length (except if you add decoder context). We indeed have default values for slots here.
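To make that concrete, here is a minimal sketch (not the NeMo implementation; slot names and default values are illustrative) of how optional slots with defaults keep the formatted prompt at a fixed length even when different manifests annotate different things:

REQUIRED_SLOTS = {"source_lang", "target_lang"}
OPTIONAL_SLOT_DEFAULTS = {"pnc": "yes", "itn": "no", "timestamp": "no", "diarize": "no"}

def resolve_slots(utterance: dict) -> dict:
    """Validate required slots and fill any missing optional slots with their defaults."""
    missing = REQUIRED_SLOTS - utterance.keys()
    if missing:
        raise KeyError(f"utterance is missing required slots: {sorted(missing)}")
    slots = dict(OPTIONAL_SLOT_DEFAULTS)
    slots.update({k: v for k, v in utterance.items() if k in REQUIRED_SLOTS or k in OPTIONAL_SLOT_DEFAULTS})
    return slots

# A manifest entry without 'itn' still yields the full, fixed-size slot set:
print(resolve_slots({"source_lang": "en", "target_lang": "en", "pnc": "no"}))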
@@ -0,0 +1,40 @@
#!/usr/bin/env python
I don't believe this script is necessary. It can just be an example in the Canary tutorial instead.
The people building the new tokenizer for Canary 3 or community Canary forks will have an easier time and won't need to guess how the previous special tokenizers were built :)
Those people will be using the tutorial anyhow, no? From my understanding, Canary usage hasn't picked up that much this year, so there's not a significant community. It's more likely you'll see greater growth with the current model, so providing backend support may not be crucial.
I'm re-doing the Canary 2 tokenizer right now and am grateful I put these scripts in here. Not removing them :)
@@ -0,0 +1,111 @@
#!/usr/bin/env python
ditto
# User prompt.
# This role is used for emotion / LID inference only - use it for just two decoder inference steps
# to retrieve the recognized emotion and language tokens.
"user_partial": {
Leaving a note that this doesn't cover cases like "identify language and transcribe with ITN" (which is fine for this version). We should put sufficient guards on what users are allowed to request in the transcribe fn.
That will be complicated to implement in the transcribe fn in general. We should think about whether a new API specifically for Canary would be cleaner.
+1 to a new API; even now the prompt setup can be a bit bulky. Putting thought into API changes before it's an actual problem could save some heartache.
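Purely for illustration, a dedicated prompt-config object is one shape such an API could take. Nothing below exists in NeMo or in this PR; the class, its fields, and the guard are invented, with the guard mirroring the "identify language and transcribe with ITN" case mentioned above.

from dataclasses import dataclass
from typing import Optional

@dataclass
class CanaryPromptConfig:
    target_lang: str
    source_lang: Optional[str] = None  # None = ask the model to identify the language
    pnc: bool = True
    itn: bool = False

    def validate(self) -> None:
        # One place to reject request combinations the model was not trained for,
        # e.g. language identification combined with ITN, per the note above.
        if self.source_lang is None and self.itn:
            raise ValueError("cannot request ITN while also asking the model to identify the language")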
)
# first, validate the utterance
expected_slots = {"source_lang", "target_lang", "pnc"}
How about expected_slots = formatter.expected_slots, and the same for optional_slots? Should we make all the slots expected, i.e. source_lang, target_lang, pnc, itn, timestamp, diarize, and set the default tags in the data config YAML if something is missing in the manifest file?
It will require rebuilding manifests (or at least data configs) for existing data, which is a bit of extra headache that I'm not sure is worth the time. WDYT?
It shouldn't require us to rebuild manifests; it should be doable using the tags we have in the train_ds YAML config. Or at least maybe the first part, i.e. expected_slots = formatter.expected_slots?
Discussed offline - keeping the current code to hide the extra complexity of the new Canary 2 slots from users that are not interested in them.
After further conversation - made pnc optional too.
if text.startswith(CANARY2_BOCTX):
    # Canary 2 prompt format. It starts with decoder context, which should be tokenized using
    # a different tokenizer than spl_tokens. We don't really know what it is, so we'll use the
    # following HACK solution: look up 5th token which is target_lang and tokenize this part
Limitation: the hack will error out if we are decoding with unk_lang as target_lang.
Yes. Better to use a single tokenizer in these cases...
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: pzelasko <[email protected]>
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
Signed-off-by: Piotr Żelasko <[email protected]>
from lhotse.testing.dummies import DummyManifest
from lhotse import CutSet, MonoCut, SupervisionSegment
from lhotse.testing.dummies import DummyManifest, dummy_cut
from lhotse.testing.random import deterministic_rng
Code scanning / CodeQL notice: Unused import (test)
…ISO lang codes Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
@@ -58,27 +58,7 @@ def _build_pos_enc(self, hidden_size, max_sequence_length, device=None):
        self.register_buffer('pos_enc', pos_enc)

    def forward(self, position_ids):
        max_pos_id = position_ids.max()
        # update positional encoding if needed
        if max_pos_id >= self._max_sequence_length:
This check is super costly as it triggers a DtoH transfer and CUDA sync on every call to transformer decoder forward, and the proposed solution doesn't work anyway (bad results instead of a crash).
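A minimal sketch of why (not the NeMo code; the sizes are arbitrary): comparing a CUDA tensor's max against a Python int inside an if forces the value onto the host, which means a device-to-host copy plus a stream synchronization on every decoder forward.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
position_ids = torch.arange(512, device=device)

max_pos_id = position_ids.max()      # GPU reduction kernel
if max_pos_id >= 2048:               # implicit .item(): DtoH transfer + sync on every call
    pass                             # (the old code rebuilt pos_enc here, which also gave bad results)

# The cheaper pattern this PR relies on: size `pos_enc` for the maximum sequence length once
# at construction time, so the hot path never has to inspect tensor values at all.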
@@ -67,7 +67,8 @@

def lens_to_mask(lens, max_length):
    batch_size = lens.shape[0]
    mask = torch.arange(max_length).repeat(batch_size, 1).to(lens.device) < lens[:, None]
    arange = torch.arange(max_length, device=lens.device)
    mask = arange.expand(batch_size, max_length) < lens.unsqueeze(1)
Micro-optimization: removes some copies and memory movement.
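For reference, a self-contained version of the change reproduced from the diff above (same behavior): the new form allocates the arange directly on the right device and compares through a broadcasted view instead of materializing the repeated tensor.

import torch

def lens_to_mask_old(lens: torch.Tensor, max_length: int) -> torch.Tensor:
    batch_size = lens.shape[0]
    return torch.arange(max_length).repeat(batch_size, 1).to(lens.device) < lens[:, None]

def lens_to_mask_new(lens: torch.Tensor, max_length: int) -> torch.Tensor:
    batch_size = lens.shape[0]
    arange = torch.arange(max_length, device=lens.device)             # allocated on the right device up front
    return arange.expand(batch_size, max_length) < lens.unsqueeze(1)  # expand() is a view, no copy

lens = torch.tensor([2, 5, 3])
assert torch.equal(lens_to_mask_old(lens, 6), lens_to_mask_new(lens, 6))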
Signed-off-by: Piotr Żelasko <[email protected]>
@@ -673,24 +674,33 @@ def training_step(self, batch: PromptedAudioToTextMiniBatch, batch_nb):
    return torch.tensor([0.0])

    input_ids, labels = batch.get_decoder_inputs_outputs()
    input_ids_lens = batch.prompted_transcript_lens - 1
Fixing an off-by-one issue that included an extra padding frame in the decoder masks.
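A toy example of the off-by-one, assuming get_decoder_inputs_outputs performs the usual teacher-forcing shift suggested by the diff above (token values are made up):

import torch

# One prompted transcript [BOS, p1, p2, t1, t2, EOS] of length 6.
prompted_transcript = torch.tensor([[1, 7, 8, 21, 22, 2]])
prompted_transcript_lens = torch.tensor([6])

input_ids = prompted_transcript[:, :-1]          # decoder sees tokens[:-1]
labels = prompted_transcript[:, 1:]              # and predicts tokens[1:]
input_ids_lens = prompted_transcript_lens - 1    # 5, not 6: otherwise the mask keeps an extra padding frame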
num_frames = batch.audio_lens.sum().float()
num_tokens = batch.prompted_transcript_lens.sum().float()
tot_frames = torch.as_tensor(batch.audio.numel(), device=num_frames.device, dtype=torch.float)
tot_tokens = torch.as_tensor(batch.prompted_transcript.numel(), device=num_frames.device, dtype=torch.float)
micro-optimizations
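For context, a small sketch with toy shapes (not the real batch object) of why the numel-based totals are cheap: numel() is computed from shape metadata on the host, so wrapping it with torch.as_tensor launches no GPU kernel, unlike the length sums.

import torch

audio = torch.zeros(4, 16000)                           # padded batch (B, T)
audio_lens = torch.tensor([16000, 12000, 8000, 16000])

num_frames = audio_lens.sum().float()                   # unpadded total: a reduction kernel on the device
tot_frames = torch.as_tensor(audio.numel(),             # padded total B*T: pure Python metadata, no kernel
                             device=num_frames.device, dtype=torch.float)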
'learning_rate': self._optimizer.param_groups[0]['lr'],
'batch_size': batch.audio.shape[0],
'learning_rate': torch.as_tensor(self._optimizer.param_groups[0]['lr']),
'batch_size': torch.as_tensor(batch.audio.shape[0]),
micro-optimizations (the PTL logger turned out to have an inefficient way of converting scalars to tensors)
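A minimal sketch of the pattern (attribute names mirror the diff above; this is not the full NeMo training_step): hand Lightning tensors instead of raw Python scalars so its slower scalar-to-tensor conversion never kicks in.

import torch

def _log_training_stats(self, batch):
    # `self` is assumed to be a LightningModule with `_optimizer` set, as in the diff above.
    self.log_dict(
        {
            'learning_rate': torch.as_tensor(self._optimizer.param_groups[0]['lr']),
            'batch_size': torch.as_tensor(batch.audio.shape[0]),
        },
        on_step=True,
    )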
Signed-off-by: Piotr Żelasko <[email protected]>
LGTM overall; left a minor comment.
Great work!
Signed-off-by: Piotr Żelasko <[email protected]>
LGTM
Signed-off-by: Piotr Żelasko <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base. Your code was analyzed with PyLint. The following annotations have been identified:
Thank you for improving NeMo's documentation!
[🤖]: Hi @pzelasko 👋, we wanted to let you know that a CICD pipeline for this PR just finished successfully, so it might be time to merge this PR or get some approvals. I'm just a bot, so I'll leave it to you what to do next. //cc @pablo-garay @ko3n1g
LGTM
What does this PR do?
Collection: ASR
Changelog
Usage
# Add a code snippet demonstrating how to use this
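The usage placeholder above is left empty in the PR. Purely as a hypothetical sketch, and assuming the prompt slots are exposed as transcribe() keyword arguments (the argument names and the checkpoint name below are illustrative, not confirmed by this PR):

from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")  # placeholder checkpoint name
hyps = model.transcribe(
    ["sample.wav"],
    source_lang="en",
    target_lang="en",
    pnc="yes",        # existing slot
    itn="no",         # new optional slots discussed in this PR (hypothetical kwargs)
    timestamp="no",
)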
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information