Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes issue 228 #234

Merged
merged 12 commits into from
Oct 23, 2024
Merged

Fixes issue 228 #234

merged 12 commits into from
Oct 23, 2024

Conversation

zoobereq
Copy link
Collaborator

@zoobereq zoobereq commented Oct 4, 2024

What does this PR do ?

The fix addresses one of the issues reported in #228, in particular:

text: Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.
norm_text:Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.
expected output: Hier zoome ich auf die Läsion. Wir befinden uns also auf der Zwei-D-Mammographie. (not sure)

The updated system correctly transduces these common hyphenated nominal compounds and can be easily expanded to include others.

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unittests finish successfully before sending PR?
    1. pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box) ?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py.

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items you can still open "Draft" PR.

@zoobereq zoobereq marked this pull request as ready for review October 5, 2024 02:45
@zoobereq zoobereq requested a review from tbartley94 October 5, 2024 02:45
@zoobereq zoobereq self-assigned this Oct 5, 2024
Copy link
Collaborator

@tbartley94 tbartley94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can make this change a bit more powerful if you don't see any issue with suggestions.

Also, move this to whitelist class, not electronic since your current use cases are more whitelist styled.

@@ -63,14 +65,37 @@ def add_space_after_char():
domain = convert_defaults + pynini.closure(insert_space + convert_defaults)
domain @= verbalize_characters

# Vebalizes common hyphenated nominal compounds (e.g. 3D-Drucker)
verbalized_abbreviations = pynini.project(abbreviations, "output")
DE_CHARS = pynini.union(*"äöüß")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move to DE graph and import

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.

graph_abbreviation = pynini.string_file(get_abs_path("data/electronic/abbreviations.tsv"))
hyphen_accep = pynini.accep("-")
graph_compound_a = (
pynutil.insert("fragment_id:")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this valid for sparrowhawk?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It's a valid field in SP's Electronic semiotic class implementation, but since it's better to move everything to Whitelist, then I suppose it doesn't matter?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm, what are our options with whitelist sparrowhawk wise?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing how the SP proto doesn't even list Whitelist as a semiotic class -- rightfully so IMO -- the options are endless? The current implementation of Whitelist doesn't do much besides case-folding and deterministic routing. Moving the logic there should work, however it will most definitely give precedence to cases that we may not want to prioritize (e.g. DGX-1 not being ELECTRONIC is probably more acceptable than 2-Liter not being MEASURE). On the other hand, leaving the current setup with the limitations imposed by SP's limitations lets you keep a tight lid on what "electronic" things go in it by simply editing the .tsv file.

Perhaps, the best of two worlds would be to leave the above in place, and expand Whitelist to match everything ABC-### and ###-ABC? That way all electronics stay Electronicand we have aWhitelistthat capturesG-9, COVID-19`, etc.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to reiterate: we add the functionality to both whitelist and electronic, but whitelist is more general while elctronic captures specific whitelist?

I think that can work. But we can also just add negative weights within whitelist to take care of precedence cases.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As extra as that sounds -- yes. It may not matter for the final output, but I think it would be useful to separate evidently ELECTRONIC things from WHITELIST during tagging.

If you don't think it matters all that much (and it may not), I'll add the changes to WHITELIST and we push like that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm, redundancy becomes a pain on debugging. Just move everything to whitelist, but make the key electronic terms in the actual whitelist while the more liberal expansion a few lines in the tagger graph. Make the weight for the whitelist file have highest priority. If we get annoyed by the liberal expansion, we comment out and work on later. How does that sound?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds very sensible. Let me implement that and see how well it meshes with the rest of the the grammar.

@@ -72,7 +78,28 @@ def __init__(self, deterministic: bool = True):
protocol = pynutil.insert('protocol: "') + protocol + pynutil.insert('"')
url = protocol + insert_space + (domain_graph)

graph = url | domain_graph | email | tag
# Implements a graph for commonly-used hyphenated compound nouns (e.g. 3D-Drucker, 2D-Mammogram)
graph_abbreviation = pynini.string_file(get_abs_path("data/electronic/abbreviations.tsv"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm, thoughts on doing a more comprehensive pattern? Technically we can extend to any use of [0-9]+-[A-Za-z] Can you see potential conflict with that?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also reverse, would allow capturing things like G-9 summit, COVID-19.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this will make it so that Electronic captures everything that's ###-ABC or ABC-###. This of course won't matter if all this is moved to Whitelist. Besides that, I don't think I'm seeing any immediate problems, which doesn't mean that they won't emerge once this is implemented.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm, my vote is trying out but make sure it's a change that can be commented out. That way the fix can be quick if we break anything.

zoobereq and others added 6 commits October 18, 2024 13:38
@zoobereq
Copy link
Collaborator Author

I added the explicitly targeted strings to WHITELIST. It turns out that the pattern [0-9]+-[A-Za-z] and its reverse is already matched by the MEASURE graph, so strings such as COVID-19 or G-9 will be tagged and verbalized accordingly. Since it's already there, I didn't reproduce it b/c redundancy.

@tbartley94
Copy link
Collaborator

Oh go figure. Can you do a quick check to see if that's consistent across the rest of the repo?

Will merge once import checks are fixed.

@zoobereq zoobereq dismissed tbartley94’s stale review October 21, 2024 18:08

The merge-base changed after approval.

tbartley94
tbartley94 previously approved these changes Oct 21, 2024
Copy link
Collaborator

@tbartley94 tbartley94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zoobereq zoobereq requested a review from tbartley94 October 22, 2024 15:42
tbartley94
tbartley94 previously approved these changes Oct 22, 2024
Copy link
Collaborator

@tbartley94 tbartley94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@zoobereq zoobereq dismissed tbartley94’s stale review October 22, 2024 17:37

The merge-base changed after approval.

tbartley94
tbartley94 previously approved these changes Oct 22, 2024
@tbartley94
Copy link
Collaborator

@zoobereq since i tried to push last time I'm not allowed for final review. Ping Marianna for a quick stamp.

@zoobereq zoobereq dismissed tbartley94’s stale review October 22, 2024 17:43

The merge-base changed after approval.

@zoobereq zoobereq requested a review from mgrafu October 22, 2024 17:44
mgrafu
mgrafu previously approved these changes Oct 23, 2024
@zoobereq zoobereq dismissed mgrafu’s stale review October 23, 2024 16:19

The merge-base changed after approval.

mgrafu
mgrafu previously approved these changes Oct 23, 2024
@zoobereq zoobereq closed this Oct 23, 2024
@zoobereq zoobereq reopened this Oct 23, 2024
@zoobereq zoobereq dismissed mgrafu’s stale review October 23, 2024 16:24

The merge-base changed after approval.

@zoobereq zoobereq requested a review from mgrafu October 23, 2024 16:27
Signed-off-by: Simon Zuberek <[email protected]>
@zoobereq zoobereq merged commit 3b3c3a3 into main Oct 23, 2024
5 checks passed
ankitnv pushed a commit to ankitnv/NeMo-text-processing that referenced this pull request Oct 24, 2024
* Fixes issue 228

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes issue 228

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC

Signed-off-by: Simon Zuberek <[email protected]>

* Updates the cache

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removes unused imports and variables

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removes redundant abbreviation mappings

Signed-off-by: Simon Zuberek <[email protected]>

* Updates the cache

Signed-off-by: Simon Zuberek <[email protected]>

---------

Signed-off-by: Simon Zuberek <[email protected]>
Co-authored-by: Simon Zuberek <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Ankit Narwade <[email protected]>
ankitnv pushed a commit to ankitnv/NeMo-text-processing that referenced this pull request Oct 24, 2024
* Fixes issue 228

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes issue 228

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC

Signed-off-by: Simon Zuberek <[email protected]>

* Updates the cache

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removes unused imports and variables

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removes redundant abbreviation mappings

Signed-off-by: Simon Zuberek <[email protected]>

* Updates the cache

Signed-off-by: Simon Zuberek <[email protected]>

---------

Signed-off-by: Simon Zuberek <[email protected]>
Signed-off-by: Ankit Narwade <[email protected]>
ankitnv pushed a commit to ankitnv/NeMo-text-processing that referenced this pull request Oct 28, 2024
* Fixes issue 228

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes issue 228

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC

Signed-off-by: Simon Zuberek <[email protected]>

* Updates the cache

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removes unused imports and variables

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removes redundant abbreviation mappings

Signed-off-by: Simon Zuberek <[email protected]>

* Updates the cache

Signed-off-by: Simon Zuberek <[email protected]>

---------

Signed-off-by: Simon Zuberek <[email protected]>
Signed-off-by: Ankit Narwade <[email protected]>
ngachchi pushed a commit to ngachchi/NeMo-text-processing that referenced this pull request Dec 6, 2024
* Fixes issue 228

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes issue 228

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC

Signed-off-by: Simon Zuberek <[email protected]>

* Updates the cache

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removes unused imports and variables

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removes redundant abbreviation mappings

Signed-off-by: Simon Zuberek <[email protected]>

* Updates the cache

Signed-off-by: Simon Zuberek <[email protected]>

---------

Signed-off-by: Simon Zuberek <[email protected]>
Co-authored-by: Simon Zuberek <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Namrata Gachchi <[email protected]>
ngachchi pushed a commit to ngachchi/NeMo-text-processing that referenced this pull request Dec 17, 2024
* Fixes issue 228

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixes issue 228

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC

Signed-off-by: Simon Zuberek <[email protected]>

* Updates the cache

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removes unused imports and variables

Signed-off-by: Simon Zuberek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Removes redundant abbreviation mappings

Signed-off-by: Simon Zuberek <[email protected]>

* Updates the cache

Signed-off-by: Simon Zuberek <[email protected]>

---------

Signed-off-by: Simon Zuberek <[email protected]>
Co-authored-by: Simon Zuberek <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Namrata Gachchi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants