-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes issue 228 #234
Fixes issue 228 #234
Conversation
Signed-off-by: Simon Zuberek <[email protected]>
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can make this change a bit more powerful if you don't see any issue with suggestions.
Also, move this to whitelist class, not electronic since your current use cases are more whitelist styled.
@@ -63,14 +65,37 @@ def add_space_after_char(): | |||
domain = convert_defaults + pynini.closure(insert_space + convert_defaults) | |||
domain @= verbalize_characters | |||
|
|||
# Vebalizes common hyphenated nominal compounds (e.g. 3D-Drucker) | |||
verbalized_abbreviations = pynini.project(abbreviations, "output") | |||
DE_CHARS = pynini.union(*"äöüß") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move to DE graph and import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea.
graph_abbreviation = pynini.string_file(get_abs_path("data/electronic/abbreviations.tsv")) | ||
hyphen_accep = pynini.accep("-") | ||
graph_compound_a = ( | ||
pynutil.insert("fragment_id:") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this valid for sparrowhawk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It's a valid field in SP's Electronic
semiotic class implementation, but since it's better to move everything to Whitelist, then I suppose it doesn't matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm, what are our options with whitelist sparrowhawk wise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing how the SP proto doesn't even list Whitelist
as a semiotic class -- rightfully so IMO -- the options are endless? The current implementation of Whitelist
doesn't do much besides case-folding and deterministic routing. Moving the logic there should work, however it will most definitely give precedence to cases that we may not want to prioritize (e.g. DGX-1 not being ELECTRONIC
is probably more acceptable than 2-Liter
not being MEASURE
). On the other hand, leaving the current setup with the limitations imposed by SP's limitations lets you keep a tight lid on what "electronic" things go in it by simply editing the .tsv
file.
Perhaps, the best of two worlds would be to leave the above in place, and expand Whitelist
to match everything ABC-### and
###-ABC? That way all electronics stay
Electronicand we have a
Whitelistthat captures
G-9,
COVID-19`, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So to reiterate: we add the functionality to both whitelist and electronic, but whitelist is more general while elctronic captures specific whitelist?
I think that can work. But we can also just add negative weights within whitelist to take care of precedence cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As extra as that sounds -- yes. It may not matter for the final output, but I think it would be useful to separate evidently ELECTRONIC
things from WHITELIST
during tagging.
If you don't think it matters all that much (and it may not), I'll add the changes to WHITELIST
and we push like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm, redundancy becomes a pain on debugging. Just move everything to whitelist, but make the key electronic terms in the actual whitelist while the more liberal expansion a few lines in the tagger graph. Make the weight for the whitelist file have highest priority. If we get annoyed by the liberal expansion, we comment out and work on later. How does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds very sensible. Let me implement that and see how well it meshes with the rest of the the grammar.
@@ -72,7 +78,28 @@ def __init__(self, deterministic: bool = True): | |||
protocol = pynutil.insert('protocol: "') + protocol + pynutil.insert('"') | |||
url = protocol + insert_space + (domain_graph) | |||
|
|||
graph = url | domain_graph | email | tag | |||
# Implements a graph for commonly-used hyphenated compound nouns (e.g. 3D-Drucker, 2D-Mammogram) | |||
graph_abbreviation = pynini.string_file(get_abs_path("data/electronic/abbreviations.tsv")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm, thoughts on doing a more comprehensive pattern? Technically we can extend to any use of [0-9]+-[A-Za-z]
Can you see potential conflict with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also reverse, would allow capturing things like G-9 summit, COVID-19.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this will make it so that Electronic captures everything that's ###-ABC
or ABC-###
. This of course won't matter if all this is moved to Whitelist
. Besides that, I don't think I'm seeing any immediate problems, which doesn't mean that they won't emerge once this is implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm, my vote is trying out but make sure it's a change that can be commented out. That way the fix can be quick if we break anything.
Signed-off-by: Simon Zuberek <[email protected]>
for more information, see https://pre-commit.ci
…se) pattern matching from ELECTRONIC Signed-off-by: Simon Zuberek <[email protected]>
Signed-off-by: Simon Zuberek <[email protected]>
Signed-off-by: Simon Zuberek <[email protected]>
for more information, see https://pre-commit.ci
nemo_text_processing/text_normalization/de/verbalizers/electronic.py
Fixed
Show resolved
Hide resolved
I added the explicitly targeted strings to |
Oh go figure. Can you do a quick check to see if that's consistent across the rest of the repo? Will merge once import checks are fixed. |
Signed-off-by: Simon Zuberek <[email protected]>
for more information, see https://pre-commit.ci
The merge-base changed after approval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
The merge-base changed after approval.
@zoobereq since i tried to push last time I'm not allowed for final review. Ping Marianna for a quick stamp. |
The merge-base changed after approval.
Signed-off-by: Simon Zuberek <[email protected]>
The merge-base changed after approval.
The merge-base changed after approval.
Signed-off-by: Simon Zuberek <[email protected]>
* Fixes issue 228 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes issue 228 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC Signed-off-by: Simon Zuberek <[email protected]> * Updates the cache Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes unused imports and variables Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes redundant abbreviation mappings Signed-off-by: Simon Zuberek <[email protected]> * Updates the cache Signed-off-by: Simon Zuberek <[email protected]> --------- Signed-off-by: Simon Zuberek <[email protected]> Co-authored-by: Simon Zuberek <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Ankit Narwade <[email protected]>
* Fixes issue 228 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes issue 228 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC Signed-off-by: Simon Zuberek <[email protected]> * Updates the cache Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes unused imports and variables Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes redundant abbreviation mappings Signed-off-by: Simon Zuberek <[email protected]> * Updates the cache Signed-off-by: Simon Zuberek <[email protected]> --------- Signed-off-by: Simon Zuberek <[email protected]> Signed-off-by: Ankit Narwade <[email protected]>
* Fixes issue 228 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes issue 228 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC Signed-off-by: Simon Zuberek <[email protected]> * Updates the cache Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes unused imports and variables Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes redundant abbreviation mappings Signed-off-by: Simon Zuberek <[email protected]> * Updates the cache Signed-off-by: Simon Zuberek <[email protected]> --------- Signed-off-by: Simon Zuberek <[email protected]> Signed-off-by: Ankit Narwade <[email protected]>
* Fixes issue 228 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes issue 228 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC Signed-off-by: Simon Zuberek <[email protected]> * Updates the cache Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes unused imports and variables Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes redundant abbreviation mappings Signed-off-by: Simon Zuberek <[email protected]> * Updates the cache Signed-off-by: Simon Zuberek <[email protected]> --------- Signed-off-by: Simon Zuberek <[email protected]> Co-authored-by: Simon Zuberek <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Namrata Gachchi <[email protected]>
* Fixes issue 228 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes issue 228 Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Expands Whitelist and removes redundant [0-9]+-[A-Za-z] (and in reverse) pattern matching from ELECTRONIC Signed-off-by: Simon Zuberek <[email protected]> * Updates the cache Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes unused imports and variables Signed-off-by: Simon Zuberek <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removes redundant abbreviation mappings Signed-off-by: Simon Zuberek <[email protected]> * Updates the cache Signed-off-by: Simon Zuberek <[email protected]> --------- Signed-off-by: Simon Zuberek <[email protected]> Co-authored-by: Simon Zuberek <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Namrata Gachchi <[email protected]>
What does this PR do ?
The fix addresses one of the issues reported in #228, in particular:
The updated system correctly transduces these common hyphenated nominal compounds and can be easily expanded to include others.
Before your PR is "Ready for review"
Pre checks:
git commit -s
to sign.pytest
or (if your machine does not have GPU)pytest --cpu
from the root folder (given you marked your test cases accordingly@pytest.mark.run_only_on('CPU')
).bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
pytest
and Sparrowhawk here.__init__.py
for every folder and subfolder, includingdata
folder which has .TSV files?Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
to all newly added Python files?Copyright 2015 and onwards Google, Inc.
. See an example here.try import: ... except: ...
) if not already done.PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.