Fixes issue 228 #234

zoobereq · 2024-10-04T22:22:02Z

What does this PR do?

The fix addresses one of the issues reported in #228, in particular:

text: Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.
norm_text: Hier zoome ich auf die Läsion. Wir befinden uns also auf der 2D-Mammographie.
expected output: Hier zoome ich auf die Läsion. Wir befinden uns also auf der Zwei-D-Mammographie. (not sure)

The updated system correctly transduces these common hyphenated nominal compounds and can be easily expanded to include others.
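
For illustration only, here is a minimal sketch of the intended mapping in plain Python rather than the pynini grammar this PR touches. The `ABBREVIATIONS` table and the `verbalize_compounds` name are hypothetical, and the "Drei-D" entry is an assumption modeled on the tentative "Zwei-D" output above; the real entries live in the PR's data files.

```python
import re

# Hypothetical abbreviation table (assumption, not the PR's actual .tsv content).
ABBREVIATIONS = {
    "2D": "Zwei-D",
    "3D": "Drei-D",
}

# Match a known abbreviation that is followed by a hyphen and a word,
# e.g. the "2D-" in "2D-Mammographie".
_COMPOUND = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in ABBREVIATIONS) + r")-(?=\w)"
)


def verbalize_compounds(text: str) -> str:
    """Replace the abbreviation part of hyphenated compounds with its spoken form."""
    return _COMPOUND.sub(lambda m: ABBREVIATIONS[m.group(1)] + "-", text)


print(verbalize_compounds("Wir befinden uns also auf der 2D-Mammographie."))
```

Adding another compound prefix in this sketch only requires a new table entry, which mirrors the "easily expanded" property described above.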

tbartley94

I think we can make this change a bit more powerful, if you don't see any issue with the suggestions below.

Also, move this to the whitelist class rather than electronic, since your current use cases are more whitelist-styled.

tbartley94 · 2024-10-07T15:42:21Z

nemo_text_processing/text_normalization/de/verbalizers/electronic.py

@@ -63,14 +65,37 @@ def add_space_after_char():
        domain = convert_defaults + pynini.closure(insert_space + convert_defaults)
        domain @= verbalize_characters

+        # Verbalizes common hyphenated nominal compounds (e.g. 3D-Drucker)
+        verbalized_abbreviations = pynini.project(abbreviations, "output")
+        DE_CHARS = pynini.union(*"äöüß")


Move to DE graph and import

tbartley94 · 2024-10-07T15:42:52Z

nemo_text_processing/text_normalization/de/taggers/electronic.py

+        graph_abbreviation = pynini.string_file(get_abs_path("data/electronic/abbreviations.tsv"))
+        hyphen_accep = pynini.accep("-")
+        graph_compound_a = (
+            pynutil.insert("fragment_id:")


Is this valid for Sparrowhawk?

Yes. It's a valid field in SP's Electronic semiotic class implementation, but since it's better to move everything to Whitelist, then I suppose it doesn't matter?

Mmm, what are our options with whitelist, Sparrowhawk-wise?

Seeing how the SP proto doesn't even list Whitelist as a semiotic class -- rightfully so IMO -- the options are endless? The current implementation of Whitelist doesn't do much besides case-folding and deterministic routing. Moving the logic there should work; however, it will most definitely give precedence to cases that we may not want to prioritize (e.g. DGX-1 not being ELECTRONIC is probably more acceptable than 2-Liter not being MEASURE). On the other hand, leaving the current setup in place, with the limitations imposed by SP, lets you keep a tight lid on what "electronic" things go in it by simply editing the .tsv file.

Perhaps the best of both worlds would be to leave the above in place, and expand Whitelist to match everything `ABC-###` and `###-ABC`? That way all electronics stay `Electronic` and we have a `Whitelist` that captures `G-9`, `COVID-19`, etc.

So to reiterate: we add the functionality to both whitelist and electronic, but whitelist is more general while electronic captures a specific subset of whitelist?

I think that can work. But we can also just add negative weights within whitelist to take care of precedence cases.

As extra as that sounds -- yes. It may not matter for the final output, but I think it would be useful to separate evidently ELECTRONIC things from WHITELIST during tagging.

If you don't think it matters all that much (and it may not), I'll add the changes to WHITELIST and we push like that.

Mmm, redundancy becomes a pain when debugging. Just move everything to whitelist, but keep the key electronic terms in the actual whitelist file and put the more liberal expansion in a few lines of the tagger graph. Give the whitelist file the highest priority via its weight. If we get annoyed by the liberal expansion, we comment it out and work on it later. How does that sound?

Sounds very sensible. Let me implement that and see how well it meshes with the rest of the grammar.
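
The precedence idea agreed on in this thread can be sketched roughly as follows. This is plain Python, not pynini: it assumes a "lowest weight wins" convention similar to shortest-path selection in a weighted grammar, and the rule names, weights, and whitelist entries (including the "COVID neunzehn" verbalization) are all hypothetical.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical explicit whitelist entries (assumption, not the repo's data).
WHITELIST = {
    "3D-Drucker": "Drei-D-Drucker",
    "COVID-19": "COVID neunzehn",
}

# Liberal fallback: digits-hyphen-letters or letters-hyphen-digits.
LIBERAL = re.compile(r"[0-9]+-[A-Za-z]+|[A-Za-z]+-[0-9]+")


def match_whitelist(token: str) -> Optional[str]:
    return WHITELIST.get(token)


def match_liberal(token: str) -> Optional[str]:
    # The fallback only recognizes the token; it does not verbalize it.
    return token if LIBERAL.fullmatch(token) else None


# (weight, rule name, matcher). Lower weight means higher priority.
RULES: List[Tuple[float, str, Callable[[str], Optional[str]]]] = [
    (1.0, "whitelist_file", match_whitelist),
    (100.0, "liberal_pattern", match_liberal),
]


def tag(token: str) -> Optional[Tuple[str, str]]:
    """Return (rule name, output) from the lowest-weight rule that matches."""
    candidates = []
    for weight, name, matcher in RULES:
        output = matcher(token)
        if output is not None:
            candidates.append((weight, name, output))
    if not candidates:
        return None
    _, name, output = min(candidates, key=lambda c: c[0])
    return name, output


print(tag("COVID-19"))  # matched by both rules; the whitelist entry wins
print(tag("G-9"))       # only the liberal pattern matches
```

Dropping the `liberal_pattern` row from `RULES` disables the fallback without touching the explicit entries, which is the "comment out and work on later" escape hatch mentioned above.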

tbartley94 · 2024-10-07T15:45:35Z

nemo_text_processing/text_normalization/de/taggers/electronic.py

@@ -72,7 +78,28 @@ def __init__(self, deterministic: bool = True):
        protocol = pynutil.insert('protocol: "') + protocol + pynutil.insert('"')
        url = protocol + insert_space + (domain_graph)

-        graph = url | domain_graph | email | tag
+        # Implements a graph for commonly-used hyphenated compound nouns (e.g. 3D-Drucker, 2D-Mammogram)
+        graph_abbreviation = pynini.string_file(get_abs_path("data/electronic/abbreviations.tsv"))


Mmm, thoughts on doing a more comprehensive pattern? Technically we can extend to any use of [0-9]+-[A-Za-z]. Can you see a potential conflict with that?

The reverse pattern would also allow capturing things like G-9 summit, COVID-19.

Well, this will make it so that Electronic captures everything that's ###-ABC or ABC-###. This of course won't matter if all this is moved to Whitelist. Besides that, I don't think I'm seeing any immediate problems, which doesn't mean that they won't emerge once this is implemented.

Mmm, my vote is to try it out, but make sure it's a change that can be commented out. That way the fix can be quick if we break anything.
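
To make the scope of the proposed pattern concrete, here is a small regex sketch of `[0-9]+-[A-Za-z]` and its reverse. It is plain Python with hypothetical names, not the tagger's pynini graph, and the `enabled` flag merely stands in for "a change that can be commented out".

```python
import re
from typing import List

# Digits, hyphen, then a word starting with a letter (e.g. "2-Liter").
DIGITS_FIRST = re.compile(r"\b[0-9]+-[A-Za-z]\w*")
# Letters, hyphen, then digits (e.g. "G-9", "COVID-19").
LETTERS_FIRST = re.compile(r"\b[A-Za-z]+-[0-9]+\b")


def find_hyphenated(text: str, enabled: bool = True) -> List[str]:
    """Collect tokens matched by either pattern; return nothing when disabled."""
    if not enabled:
        return []
    return DIGITS_FIRST.findall(text) + LETTERS_FIRST.findall(text)


print(find_hyphenated("Der G-9 Gipfel, COVID-19 und 2-Liter Flaschen."))
```

Note that "2-Liter" is caught by the digits-first pattern, which illustrates the kind of overlap with MEASURE raised in the other thread.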

zoobereq · 2024-10-18T20:52:16Z

I added the explicitly targeted strings to WHITELIST. It turns out that the pattern [0-9]+-[A-Za-z] and its reverse are already matched by the MEASURE graph, so strings such as COVID-19 or G-9 will be tagged and verbalized accordingly. Since it's already there, I didn't reproduce it, to avoid redundancy.

tbartley94 · 2024-10-21T17:52:42Z

Oh go figure. Can you do a quick check to see if that's consistent across the rest of the repo?

Will merge once import checks are fixed.

The merge-base changed after approval.

tbartley94

LGTM

tbartley94

lgtm

tbartley94 · 2024-10-22T17:38:14Z

@zoobereq since I tried to push last time, I'm not allowed to do the final review. Ping Marianna for a quick stamp.

* Fixes issue 228
  Signed-off-by: Simon Zuberek <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Fixes issue 228
  Signed-off-by: Simon Zuberek <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC
  Signed-off-by: Simon Zuberek <[email protected]>
* Updates the cache
  Signed-off-by: Simon Zuberek <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Removes unused imports and variables
  Signed-off-by: Simon Zuberek <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Removes redundant abbreviation mappings
  Signed-off-by: Simon Zuberek <[email protected]>
* Updates the cache
  Signed-off-by: Simon Zuberek <[email protected]>

---------
Signed-off-by: Simon Zuberek <[email protected]>
Co-authored-by: Simon Zuberek <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Ankit Narwade <[email protected]>

zoobereq and others added 2 commits October 4, 2024 18:15

Fixes issue 228

8e4fa72

Signed-off-by: Simon Zuberek <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

7634e0b

for more information, see https://pre-commit.ci

zoobereq marked this pull request as ready for review October 5, 2024 02:45

zoobereq requested a review from tbartley94 October 5, 2024 02:45

zoobereq self-assigned this Oct 5, 2024

tbartley94 requested changes Oct 7, 2024

zoobereq and others added 6 commits October 18, 2024 13:38

Fixes issue 228

a50a8b6

Signed-off-by: Simon Zuberek <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

a3dca5f

for more information, see https://pre-commit.ci

Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC

b86da7d

Signed-off-by: Simon Zuberek <[email protected]>

Updates the cache

3d90ae6

Signed-off-by: Simon Zuberek <[email protected]>

Finalizes the fix

d75d338

Signed-off-by: Simon Zuberek <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

6e9392f

for more information, see https://pre-commit.ci

github-advanced-security bot found potential problems Oct 18, 2024

nemo_text_processing/text_normalization/de/taggers/electronic.py (fixed)

nemo_text_processing/text_normalization/de/verbalizers/electronic.py (fixed)

zoobereq and others added 2 commits October 21, 2024 13:03

Removes unused imports and variables

23a7d34

Signed-off-by: Simon Zuberek <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

84b086e

for more information, see https://pre-commit.ci

tbartley94 previously approved these changes Oct 21, 2024

zoobereq requested a review from tbartley94 October 22, 2024 15:42

tbartley94 previously approved these changes Oct 22, 2024

zoobereq requested a review from mgrafu October 22, 2024 17:44

Removes redundant abbreviation mappings

bdc4514

Signed-off-by: Simon Zuberek <[email protected]>

mgrafu previously approved these changes Oct 23, 2024

zoobereq closed this Oct 23, 2024

zoobereq reopened this Oct 23, 2024

zoobereq requested a review from mgrafu October 23, 2024 16:27

Updates the cache

40db609

Signed-off-by: Simon Zuberek <[email protected]>

mgrafu approved these changes Oct 23, 2024

zoobereq merged commit 3b3c3a3 into main Oct 23, 2024
5 checks passed
