old russian / church slavonic glyphs? #24

Open
yurytch opened this issue Mar 29, 2018 · 13 comments

@yurytch

yurytch commented Mar 29, 2018

Is it possible to add support for the Old Russian / Church Slavonic glyphs, at least for 'yat' (U+0462, U+0463), 'fita' (U+0472, U+0473), and 'izhitsa' (U+0474, U+0475)?
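
For reference, the six code points named in the request can be listed with Python's standard `unicodedata` module. This is only an illustrative sketch; the names `OLD_CYRILLIC` and `describe` are made up for this example.

```python
import unicodedata

# Code points quoted in the request: yat, fita, izhitsa (capital and small forms).
OLD_CYRILLIC = [0x0462, 0x0463, 0x0472, 0x0473, 0x0474, 0x0475]

def describe(code_points):
    """Return (U+XXXX label, character, Unicode name) for each code point."""
    return [(f"U+{cp:04X}", chr(cp), unicodedata.name(chr(cp))) for cp in code_points]

for label, char, name in describe(OLD_CYRILLIC):
    print(label, char, name)
```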

@Shreeshrii
Contributor

Which language traineddata are you using currently?

@yurytch
Author

yurytch commented Mar 29, 2018

I'm using 'rus' from tessdata_best. Tried adding 'bul' and 'srp', to no avail.
It would be great if there were an additional data file just for recognizing those glyphs, including the cursive forms (yat!). Does Tesseract work like this?
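
For context, several traineddata files are combined by joining their names with `+` in the `-l` option. The sketch below only shows how such an invocation is assembled; it does not by itself add glyphs that none of the loaded models were trained on. The file names `page.png` and `page` and the helper `tesseract_command` are hypothetical.

```python
import shlex

def tesseract_command(image, output_base, languages, tessdata_dir=None):
    """Build a tesseract CLI invocation that loads several models at once."""
    # Models are combined with '+', e.g. "-l rus+bul+srp".
    cmd = ["tesseract", image, output_base, "-l", "+".join(languages)]
    if tessdata_dir is not None:
        # Optional: point tesseract at a specific tessdata directory.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd

# Print the command instead of running it, so the sketch works without tesseract installed.
print(shlex.join(tesseract_command("page.png", "page", ["rus", "bul", "srp"])))
```

The returned list can be passed to `subprocess.run` once Tesseract and the named models are installed.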

@Shreeshrii
Contributor

  1. Try with 'rus' from tessdata_fast and see if that is better.

  2. Try the 'pluschar' training using 'rus' from tessdata_best as the continue_from model. Add at least 15 occurrences of each Old Russian / Church Slavonic glyph that you want to add, so that they get picked up in the unicharset.

  3. Also try with script/Cyrillic (or another script model appropriate for Russian).

  4. Please share about 150 lines of training text which has the added glyphs for testing.
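
The "at least 15 occurrences" suggestion in point 2 can be checked on a candidate training text before any training is started. The following is a minimal sketch; the threshold constant and the function names are illustrative and not part of any Tesseract tool.

```python
from collections import Counter

# Capital and small yat, fita, izhitsa (U+0462/3, U+0472/3, U+0474/5).
TARGET_GLYPHS = "\u0462\u0463\u0472\u0473\u0474\u0475"
MIN_OCCURRENCES = 15  # threshold suggested in point 2

def glyph_counts(text, glyphs=TARGET_GLYPHS):
    """Count how often each target glyph occurs in the text."""
    counts = Counter(ch for ch in text if ch in glyphs)
    return {g: counts.get(g, 0) for g in glyphs}

def underrepresented(text, minimum=MIN_OCCURRENCES, glyphs=TARGET_GLYPHS):
    """Return the target glyphs that occur fewer than `minimum` times."""
    return [g for g, n in glyph_counts(text, glyphs).items() if n < minimum]

# Toy text: only the small yat reaches the threshold, so the other five are reported.
sample = "в\u0463ра " * 15
print(underrepresented(sample))
```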

@yurytch
Author

yurytch commented Mar 30, 2018

@Shreeshrii While I'm trying to make sense of that plus-training procedure (your point 2):
your point 1 doesn't work (more OCR errors with 'rus' from tessdata_fast).
I don't understand your point 3: 'rus' is Cyrillic anyway, and 'yat' etc. are Cyrillic, too.
Regarding point 4: do you mean training text, as for inclusion in the 'rus' training dataset? But wouldn't you want images with real typeset glyphs for that, too?

@Shreeshrii
Contributor

Ray has trained models for languages (e.g. eng, rus) and also for scripts in which various languages are written (e.g. the Latin script for English, French, German, etc.).

My suggestion was to use script/Cyrillic and compare the results with rus. If the letters you want to add occur in one of the other languages, they might be recognised.

Re. 4: yes, along with the training text you also need a font that renders those glyphs correctly.

@Shreeshrii
Contributor

Please review the following files:

https://github.com/tesseract-ocr/langdata/tree/master/rus
https://github.com/tesseract-ocr/langdata/blob/master/rus/desired_characters

https://github.com/tesseract-ocr/langdata/blob/master/Cyrillic.unicharset

Adding these glyphs will require changes in the langdata repo for rus, e.g. adding these glyphs to the desired_characters file.
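
One way to review a unicharset file for the requested glyphs is to compare its entries against the target characters. This is a rough sketch that assumes the first line of a unicharset holds the entry count and that every later line starts with the symbol followed by a space; that layout should be verified against the actual file. The helper names are hypothetical, and the text is expected to come from a local copy of the file.

```python
def unicharset_symbols(unicharset_text):
    """Collect the first space-separated field of every entry line."""
    lines = unicharset_text.splitlines()
    # Assumed layout: line 0 is the entry count, later lines are "<symbol> <properties...>".
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}

def missing_glyphs(unicharset_text, glyphs):
    """Return the glyphs that do not appear as a symbol in the unicharset text."""
    present = unicharset_symbols(unicharset_text)
    return [g for g in glyphs if g not in present]

# Toy unicharset with simplified property columns, for illustration only.
sample = "4\nNULL 0 NULL 0\n\u0430 3 0\n\u0463 3 0\n\u044a 3 0\n"
print(missing_glyphs(sample, "\u0462\u0463\u0472"))
```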

@maxirmx

maxirmx commented Oct 8, 2020

Does anybody know of any progress on this subject, i.e. Old Russian support for Tesseract?

@stweil
Member

stweil commented Oct 8, 2020

@maxirmx, maybe you can contribute by reviewing the files named above?

@maxirmx

maxirmx commented Oct 27, 2020

@stweil, thank you.
https://github.com/tesseract-ocr/langdata/tree/master/rus is 'modern Russian'.
I asked about older Russian, which included three letters that were made obsolete in 1917/1918. They were mentioned at the start of this thread: 'yat' (U+0462, U+0463), 'fita' (U+0472, U+0473), and 'izhitsa' (U+0474, U+0475).
I would imagine additional complications as well, such as a different paragraph sign and the different fonts used at that time.

It is somewhat clear what to do, but I do not want to repeat others' work that might already have been done.

@stweil
Member

stweil commented Oct 27, 2020

Okay, "Ѣ" and maybe the other older glyphs are also missing in https://github.com/tesseract-ocr/langdata_lstm/blob/master/script/Cyrillic/Cyrillic.unicharset.

So you will need ground truth data to train a new model based on rus.traineddata or Cyrillic.traineddata, but with the additional glyphs. As soon as you have line images with text transcriptions, this process is supported pretty well with tesstrain.
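
Before starting a training run, it can help to confirm that every line image has a transcription next to it. The sketch below assumes the naming convention of one `<name>.gt.txt` file per `<name>.png` or `<name>.tif` line image; that convention and the helper name `unpaired_images` are assumptions to check against the tesstrain README.

```python
from pathlib import PurePath

IMAGE_SUFFIXES = {".png", ".tif"}

def unpaired_images(filenames):
    """Return image names that have no '<stem>.gt.txt' transcription next to them."""
    names = set(filenames)
    missing = []
    for name in sorted(names):
        path = PurePath(name)
        if path.suffix.lower() in IMAGE_SUFFIXES and f"{path.stem}.gt.txt" not in names:
            missing.append(name)
    return missing

# Toy listing: 'b.tif' has no transcription yet.
print(unpaired_images(["a.png", "a.gt.txt", "b.tif"]))
```

For a real directory, the list of names can come from `os.listdir` on the ground truth folder.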

@stweil
Member

stweil commented Oct 27, 2020

See also issue tesseract-ocr/langdata_lstm#3 which looks like a duplicate. Maybe you can join efforts.

@dvrogozh

@stweil are there any requirements for the training words/text (apart from the aforementioned 150 lines)? For example, how many times should each new character occur in the training set? Should there be at least one capital and one lowercase form of each letter, or something like that?

Arbitrary texts in Old Russian can be obtained from ru.wikisource.org, for example: https://ru.wikisource.org/wiki/%D0%91%D0%BE%D0%B6%D0%B5%D1%81%D1%82%D0%B2%D0%B5%D0%BD%D0%BD%D0%B0%D1%8F_%D0%BA%D0%BE%D0%BC%D0%B5%D0%B4%D0%B8%D1%8F_(%D0%94%D0%B0%D0%BD%D1%82%D0%B5;_%D0%9C%D0%B8%D0%BD)/%D0%94%D0%9E

@yurytch
Author

yurytch commented Dec 25, 2020

I still fail to comprehend the process well enough.
But I guess I understand why a glyph can't simply be 'added' to an existing dataset: because of how deep learning works, right?
But retraining the complete set is rather beyond my resources, in terms of computing power and time.

Here's a thought/question:
would it be useful to train a separate (small) set consisting of those missing glyphs and glyphs that look like the missing ones, i.e. consisting of 'YAT's and 'HARD SIGN's?
Then one could use it in a set of languages:
rus+yat
Would this work at all?

5 participants