Filter out numerals, acronyms etc. from word list for pattern hyphenation (Bugzilla Bug 2537) #35

Open
albbas opened this issue Jan 25, 2019 · 3 comments
Labels
bug Something isn't working

Comments

@albbas
Contributor

albbas commented Jan 25, 2019

This issue was created automatically with bugzilla2github

Bugzilla Bug 2537

Date: 2019-01-25T09:31:29+01:00
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
To: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>
CC: borre.gaup, chiara.argese

Last updated: 2019-01-25T09:42:51+01:00

@albbas
Contributor Author

albbas commented Jan 25, 2019

Comment 13127

Date: 2019-01-25 09:31:29 +0100
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

The way the word list is generated now (random output from the fst) makes it contain all sorts of noise: overgeneration patterns that are usually harmless but turn out to be really harmful in this context.

@albbas
Contributor Author

albbas commented Jan 25, 2019

Comment 13128

Date: 2019-01-25 09:39:27 +0100
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

Use the weighted fst (do not convert it to unweighted), add heavy weights to the tags for all unwanted strings, then filter the output by weight (i.e. only output with a weight below a threshold should survive).

This requires that the word list is printed with weights, or that we remove such paths from the fst first, whichever is easier to implement.
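
A minimal sketch of the weight-based filtering step, assuming the word list is printed one entry per line as `word<TAB>weight`. That line format, the function name `filter_by_weight`, and the threshold value are assumptions made for illustration; they are not taken from the existing build setup.

```python
from typing import Iterable, Iterator


def filter_by_weight(lines: Iterable[str], threshold: float) -> Iterator[str]:
    """Yield the words whose weight is strictly below `threshold`.

    Assumed input format (hypothetical): one `word<TAB>weight` pair per line.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Split on the last tab so that the word itself may contain tabs.
        word, sep, weight_text = line.rpartition("\t")
        if not sep:
            # No tab found (this includes empty lines): no weight, so skip.
            continue
        try:
            weight = float(weight_text)
        except ValueError:
            # The last column is not a number: skip the malformed line.
            continue
        if weight < threshold:
            yield word


if __name__ == "__main__":
    # Invented sample data: unwanted strings carry a heavy weight.
    sample = ["guolli\t0.0", "NRK\t1000.0", "123\t1000.0", "viessu\t2.5"]
    for word in filter_by_weight(sample, threshold=100.0):
        print(word)
```

The threshold only has to sit between the largest weight of a wanted word and the heavy weight put on the unwanted tags, so the exact value is not critical as long as the heavy weight is clearly larger than anything the lexicon produces on its own.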

@albbas
Copy link
Contributor Author

albbas commented Jan 25, 2019

Comment 13129

Date: 2019-01-25 09:42:51 +0100
From: Sjur Nørstebø Moshagen <<sjur.n.moshagen>>

Another alternative: add more paths to be removed from the lexicon - we don't need acronyms and abbreviations in the hyphenator lexicon (they will be covered by the rule component). The same goes for numbers.

We already do this, so this is definitely the easiest way forward.
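
For comparison, a rough sketch of what the same filtering would look like if it were done on the printed word list rather than by removing paths from the fst. This is not the approach proposed above, only an illustration of which strings are meant. The heuristics and the names `is_unwanted` and `filter_wordlist` are assumptions: any digit marks a numeral, a period marks an abbreviation, and two or more letters that are all upper case mark an acronym.

```python
from typing import Iterable, List


def is_unwanted(word: str) -> bool:
    """Heuristically flag numerals, abbreviations and acronyms.

    These rules are illustrative assumptions, not the tags used in the lexicon.
    """
    # Numerals, and mixed strings such as "A4".
    if any(ch.isdigit() for ch in word):
        return True
    # Abbreviations written with a period, such as "jna.".
    if "." in word:
        return True
    # Acronyms: at least two letters, every letter upper case.
    letters = [ch for ch in word if ch.isalpha()]
    if len(letters) >= 2 and all(ch.isupper() for ch in letters):
        return True
    return False


def filter_wordlist(words: Iterable[str]) -> List[str]:
    """Return the words that should stay in the hyphenation word list."""
    return [word for word in words if not is_unwanted(word)]


if __name__ == "__main__":
    # Invented sample data.
    print(filter_wordlist(["guolli", "NRK", "2019", "jna.", "Sámi", "A4"]))
```

Surface heuristics like these miss cases that the lexicon tags identify reliably, for example inflected acronyms with lower-case endings, which is one reason to do the removal in the fst instead.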

@albbas albbas transferred this issue from giellalt/bugzilla-dummy Sep 10, 2024