Loading the confusables file #19
Yes, it is.
Good and valid questions indeed. First, I should say that I don't think this confusable weighting functionality has really been used in practice yet, so there has been no proper evaluation. Though I implemented it, we never used it in the Golden Agents projects for which analiticcl was developed. I can, of course, explain how it is implemented: after all variants are scored in the regular way using the distance metrics and possibly frequency information (a log-linear combination of various components), an extra rescoring is performed if a confusable list is provided. This rescoring is meant to give slight bonuses or penalties to the scores whenever certain confusables occur (each with a certain confusable weight). In the documentation I write about this:
It is a bit hard to predict how this plays out in actual use cases. The challenge is always in tweaking the weights so that there is a balance between the confusable weights and the weights in the main score function (the confusable weights are not part of that function; they are applied after the fact to the score as a whole). The only way to find out is to experiment with it. There is one relevant option which is not properly documented yet, there is a
The confusable lists and weights are a more refined mechanism and can express various things that the alphabet can't (like context information and variable weights), but they do introduce an extra level of complexity. The alphabet file is much cruder, but if your confusables are unambiguous enough to fit in there, then that might indeed be the preferred option. If it only causes more ambiguity, though, then it's probably not a good idea.
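To make the two-stage scoring described above more concrete, here is a minimal sketch in Python. It is not analiticcl's actual code: the function names, the component names, and the assumption that a confusable weight simply multiplies the finished score are all hypothetical and serve only to illustrate the idea of a rescoring applied after the main score function.

```python
import math

def base_score(components, weights):
    """Hypothetical main score: a log-linear combination of components.

    Each component value is assumed to lie in (0, 1]; the weighted sum of
    their logarithms is exponentiated again to obtain the score.
    """
    return math.exp(sum(weights[name] * math.log(value)
                        for name, value in components.items()))

def rescore(score, confusable_weights):
    """Hypothetical after-the-fact rescoring (assumed to be multiplicative).

    A weight above 1.0 acts as a bonus, a weight below 1.0 as a penalty.
    """
    for weight in confusable_weights:
        score *= weight
    return score

# Illustrative numbers only, not taken from any real model.
weights = {"distance": 1.0, "frequency": 1.0}
candidate_a = base_score({"distance": 0.8, "frequency": 0.5}, weights)
candidate_b = base_score({"distance": 0.7, "frequency": 0.6}, weights)

# Candidate A matches a confusable with weight 1.1, candidate B matches none.
rescored_a = rescore(candidate_a, [1.1])
rescored_b = rescore(candidate_b, [])
```

Under these assumptions candidate B ranks slightly above candidate A before rescoring, and the order flips afterwards. That is the kind of small nudge the confusable weights are meant to give, and it also shows why a weight far from 1.0 can easily dominate the main score.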
I tried to find this parameter on the Python objects, but so far without success. Is it available?
Good point, I think it's not propagated to the Python binding yet. I'll add it.
…the --early-confusables parameter Ref: #19
This should now be fixed in v0.4.6, call |
Thanks! I'd like to use this parameter to achieve e.g. the following. So then if using this method, a
How should this be represented in the confusables file, e.g. similar to the below? Or rather without the preceding (and trailing) context, which is not generic enough? (sorry for the multiple edits)
... and is there a way to make patterns behave symmetrically, so that they apply to the counterpart cases as well?
I think I have figured it out, so e.g. this works well: and the score depends on how the other scores are set, I guess. But 1.5 seems to return the desired lexemes well for my use case. Thanks a lot for the implementation!
Great, I see you already figured it out! That indeed seems like the proper syntax; you do need both patterns explicitly. It will give a higher score to variants that had жд and lost the д, and to variants that have ж and add д, relative to the weighting of the other variants that do not exhibit such a pattern. Finding the proper score is always a bit of trial and error; 1.5 might even be a bit on the large side, as the weights are best kept fairly close to 1.0 so that they do not have too big an influence.
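The need for one pattern per direction can be illustrated with a small Python sketch. This does not use analiticcl's confusable syntax or API: the helper names, the rule list and the word pair are hypothetical, and the weight is again assumed to be a plain multiplier.

```python
def loses_d_after_zh(source, variant):
    """True if `variant` equals `source` with one 'жд' reduced to 'ж'."""
    for i in range(len(source) - 1):
        if source[i:i + 2] == "жд" and source[:i + 1] + source[i + 2:] == variant:
            return True
    return False

def gains_d_after_zh(source, variant):
    """The mirror case: `variant` has a 'д' inserted after some 'ж'."""
    return loses_d_after_zh(variant, source)

# Two explicit rules, one per direction, both with the weight from the thread.
RULES = [(loses_d_after_zh, 1.5), (gains_d_after_zh, 1.5)]

def confusable_weight(source, variant, rules=RULES):
    """Multiply together the weights of all rules that match the pair."""
    weight = 1.0
    for matches, rule_weight in rules:
        if matches(source, variant):
            weight *= rule_weight
    return weight

both_directions = (confusable_weight("надежда", "надежа"),
                   confusable_weight("надежа", "надежда"))
# With only the first rule, the reverse direction gets no bonus.
one_rule_only = confusable_weight("надежа", "надежда", RULES[:1])
```

With both rules present, each direction receives the bonus; with only the first rule, the pair in the opposite direction is left at the neutral weight of 1.0.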
I wonder if this is the right way to load the confusables file:
It would be brilliant to have an example of how the confusables list impacts the ranking of error candidates, or how the confusable penalty or promotion works (i.e. what we gain from these).
In particular: what would happen in analiticcl (apart from the semantic heterogeneity) if confusables were listed in the alphabet file? Many thanks!