tag_test and escaped xfst tags #71

Open
trondtynnol opened this issue Feb 1, 2025 · 3 comments

Comments

@trondtynnol
Contributor

tag_test is failing for me with the message:

FAIL: tag_test.sh
=================

grep: (standard input): binary file matches
FAIL: Have a look at these:
+Use%/GC
FAIL tag_test.sh (exit status: 1)

I found the offending tag +Use%/GC in shared-smi:

$ rg "\+Use%/GC" shared-*
shared-smi/src/fst/stems/arabic_roman_digits.lexc
205:< [1|2|3|4|5|6|7|8|9|%0] %+Use%/GC:0 >        MEASUREMENTS     ; ! gc needs measurements after arabic loops

There it appears inside embedded xfst in lexc, where % escapes the following character, so the actual tag is +Use/GC. Could tag_test be adapted to handle this?
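To illustrate the escaping involved, here is a minimal sketch of how a tag taken from embedded xfst could be normalised before it is compared with the declared tags. The function name `unescape_lexc` is hypothetical and not part of tag_test.sh; the sketch only assumes that `%` escapes exactly the one character that follows it.

```python
import re


def unescape_lexc(symbol: str) -> str:
    """Drop lexc/xfst escapes: '%X' becomes 'X' for any single character X.

    Hypothetical helper, not taken from tag_test.sh.
    """
    return re.sub(r"%(.)", r"\1", symbol)


# The escaped form from arabic_roman_digits.lexc and the tag it denotes.
print(unescape_lexc("%+Use%/GC"))  # +Use/GC
```

With a step like this, +Use%/GC and +Use/GC would compare as the same tag.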

@snomos
Member

snomos commented Feb 1, 2025

The tag test is too fragile in two different ways:

  1. It uses the tags declared in root.lexc instead of extracting all tags from the compiled lexical fst.
  2. It uses a simple diff, although the only relevant difference is tags that are in use but not defined. The other way around, defined tags that are NOT in use, is not relevant.

Fixing either of these would solve your issue, and fixing both would make the tag test much more robust.
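The second point amounts to a one-directional set difference rather than a symmetric diff. A minimal sketch, assuming the declared and used tags have already been extracted into two collections; the function name `undeclared_tags` and the example tags are made up for illustration:

```python
def undeclared_tags(declared, used):
    """Return tags that are in use but not declared, sorted.

    Declared tags that are never used are deliberately not reported.
    """
    return sorted(set(used) - set(declared))


# Illustrative data only: +Err/Orth is declared but unused, +Use/CG is a typo.
declared = {"+N", "+Sg", "+Use/GC", "+Err/Orth"}
used = {"+N", "+Sg", "+Use/GC", "+Use/CG"}

print(undeclared_tags(declared, used))  # ['+Use/CG']
```

Only the mistyped +Use/CG is reported; the unused +Err/Orth does not make the test fail.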

@flammie could you have a look?

@flammie
Contributor

flammie commented Feb 3, 2025

Yeah, tag_test.sh and the related extract scripts are quite hacky. I've patched in a few of the escapes now.

The problem is that undeclared and typoed multichars compile into paths with one arc per byte, and that cannot really be figured out from the binary fst. It's a design failure in lexc that can ultimately only be fixed by rethinking the alphabet handling across all tools. The best that can be done finding misspelt and undeclared tags from lexc entries is by guessing that +anything is a tag by convention.
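As a rough sketch of that convention-based guess (not the actual extract script): strip the comment, drop the % escapes, and treat every + followed by non-delimiter characters as a tag. The name `guess_tags` and the exact set of delimiters are assumptions made for this example.

```python
import re

# Assumed delimiter set: a tag runs from '+' up to whitespace, another '+',
# or one of the lexc/xfst punctuation characters listed here.
TAG_RE = re.compile(r"\+[^\s+:;<>\[\]|!\"#]+")


def guess_tags(lexc_line: str) -> list[str]:
    """Guess tags in one lexc line using the '+anything is a tag' convention."""
    # Cut off the comment, which starts at the first unescaped '!'.
    code = re.split(r"(?<!%)!", lexc_line, maxsplit=1)[0]
    # Drop escapes so that %+Use%/GC is read as +Use/GC.
    code = re.sub(r"%(.)", r"\1", code)
    return TAG_RE.findall(code)


line = (
    "< [1|2|3|4|5|6|7|8|9|%0] %+Use%/GC:0 >        MEASUREMENTS     ; "
    "! gc needs measurements after arabic loops"
)
print(guess_tags(line))  # ['+Use/GC']
```

Being a heuristic, this also reports any literal plus sign that is followed by other characters, so its output still has to be checked against the declared tags.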

@snomos
Member

snomos commented Feb 3, 2025

> The best that can be done finding misspelt and undeclared tags from lexc entries is by guessing that +anything is a tag by convention.

Yes, and that is essentially what we already do. We can still improve the non-guessing part of the tag test, and that is what I suggest we do.
