Difference between Fraktur_5000000_0.*.traineddata and frak2021_*.traineddata #5
Comments
Generally, color or grayscale images are better for OCR than binarized ones like the one above (that was different with older OCR software). If you have grayscale scans, I suggest retrying OCR with those.
With model frak2021_1.069 I get this text:
Generally we try to improve the models over time, so newer ones should ideally be better.
How does the naming scheme work for the models? How can I know which one is the newest?
frak2021_0.905_1587027_9141630.traineddata and frak2021_1.069_755545_3685930.traineddata, for example, are from the same training process. "0.905" and "1.069" are indicators of the accuracy, expressed as an error rate, so lower is better. That value decreases during training. The smallest value is the last one produced, but not necessarily the best one, because the training can overfit. So one of the smaller values is usually best, and you have to try which one fits your case. GT4HistOCR/ and Fraktur_5000000/ are older training results, frak2021/ and frak2021_09 are newer ones. See https://github.com/tesseract-ocr/tesstrain/wiki/Training-Fraktur and https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR for some details on the training process.
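The naming scheme described above can be sketched in Python. This is only an illustration: the helper name `parse_model_name` is hypothetical, the pattern is inferred from the file names quoted in this thread, and the two trailing integers are kept as opaque counters because their meaning is not explained here.

```python
import re
from typing import NamedTuple, Tuple


class ModelName(NamedTuple):
    base: str                  # e.g. "frak2021" or "Fraktur_5000000"
    error: float               # the value embedded in the name; lower is better
    counters: Tuple[int, ...]  # trailing integers, meaning not assumed here


# Assumed layout: <base>_<error>[_<int>_<int>].traineddata
_PATTERN = re.compile(
    r"^(?P<base>.+?)_(?P<error>\d+\.\d+)"
    r"(?:_(?P<a>\d+)_(?P<b>\d+))?\.traineddata$"
)


def parse_model_name(filename: str) -> ModelName:
    """Split a model file name into base name, error value and counters."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized model file name: {filename}")
    counters = tuple(int(match[g]) for g in ("a", "b") if match[g] is not None)
    return ModelName(match["base"], float(match["error"]), counters)


names = [
    "frak2021_1.069_755545_3685930.traineddata",
    "frak2021_0.905_1587027_9141630.traineddata",
    "Fraktur_5000000_0.466.traineddata",
]

# Sort candidates of one training run by the embedded error value.
frak2021 = sorted(
    (m for m in map(parse_model_name, names) if m.base == "frak2021"),
    key=lambda m: m.error,
)
for model in frak2021:
    print(model.base, model.error, model.counters)
```

Sorting only produces a shortlist. As noted above, the smallest value can come from an overfitted model, so the candidates still have to be tried on your own scans.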
Thank you for the explanation, this is a great source of information. frak2021-09.traineddata produced the following text after I replaced every occurrence of ſ with s:
This result looks good enough that I would be able to use it. Thank you very much for your help.
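The ſ-to-s replacement mentioned in this comment can be done in one line of Python. A minimal sketch, assuming the OCR output is already available as a string; the function name `modernize_long_s` is hypothetical.

```python
LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S (ſ)


def modernize_long_s(text: str) -> str:
    """Replace every long s with a modern round s."""
    return text.replace(LONG_S, "s")


# Hand-typed sample word, not actual model output.
print(modernize_long_s("Ge\u017fell\u017fchaft"))  # prints "Gesellschaft"
```

The replacement is lossy: once applied, the original long-s spelling cannot be recovered, so it may be worth keeping the raw OCR output as well.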
frak2021_1.069 might even be a little better. This is a case where the newer training did not improve the model.
It did work slightly better, fixing previous mistakes but introducing a few new ones as well. Here is the new text:
This new text has 17 differences. 10 of those were previously wrong and seem to be fixed now, so I would say it is indeed better :D I will continue to test other models and see if any improve the text even more.
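Comparing the output of two models, as done in this comment, can be partly automated. The following is a sketch using Python's standard `difflib` module at word level; the function name `word_differences` is hypothetical, the sample strings are hand-typed for illustration rather than real model output, and the thread does not say how the 17 differences were actually counted.

```python
import difflib
from typing import List, Tuple


def word_differences(old_text: str, new_text: str) -> List[Tuple[str, List[str], List[str]]]:
    """Return the word-level spans that differ between two OCR outputs."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    return [
        (tag, old_words[i1:i2], new_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hand-typed illustration only.
old = "im Englifden Haufe die erfte"
new = "im Englischen Hause die erste"

for tag, before, after in word_differences(old, new):
    print(tag, before, "->", after)
```

Adjacent changed words are merged into a single span, so the number of spans depends on the granularity chosen. Deciding which of the two readings is correct still requires checking against the scan.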
I own a copy of every issue of the Paderborner Volksblatt newspaper from 1849, and I was looking for a way to digitize them using OCR with Tesseract. During my research I found this project. However, looking at the models provided online by UB-Mannheim, I found multiple versions with no clear indication of which one I should use.
For reference, this is the test image I use:
![test](https://user-images.githubusercontent.com/39961794/202708600-64e76a77-441c-45b4-b04c-39da0e656c9e.png)
This is the OCR result using Fraktur_5000000_0.466.traineddata:
Am 17. Jan. fand in Berlin im Englifden Haufe die erfte
Generalverfammliung der berliner gemeinnübigen Baugefell fait
Gtatt. Eroÿ der megen der jebigen Borwabhiverfammiungen nidt
günftigen Seit, war Dod der Gual anjebnlid) gefüllt. Jadbem der
provijorije Borfisende Des Romités, LRandhaumeifter Goff:
manu, femme Sreude Dariber ausgefproden batte, bab das Bo,
trob der ungünitigen Scitumftände, glüctlid bis bicrher gedieben fei, -
ergriff der bisberige Syndifus der Gefelliduft, Rammergeridtss
Ajfetfor Dr, Gaebler, das Bot, um in Furgen Sügen Die
bisberige IBirfiamfeit des Somités angudeuten. Derfelbe nabm
bierbei Gcelegeubeit, wicderbolt die grobe, fittlige Sdee bervorgu:
beben, welhe dem Unternebmen gum Grande liegt, und Die dem:
fetben cine viel tiefere Bedeutung gibt, al8 der Name der Gejele
jaft bein erften “nfeben vermutben laffen follte. Siernad) mu
man in der That anncbmen, daÿ der Plan des Gangen gecignet
ift, eine grobe, bisber dunfle Tartie unfers fogialen Lebens aufsubellen
und erfreulicher su macen. Der ,fleine Mann“ fol moralij ges
fraftigt, und burd) Den ibm in Ausficbt geftellten Grundbefig, re.
durd) Die au ermartenden Rapitals-Abfindungen, gu der fiheren und
feften Haltuns emporgeboben werden, den ein veblid) und burd
rbeit erworbener Beñh immer gemübrt. Diefer fonfervative
Gbarafter des Statuts, im edelffen @inne des Bortes, stebt fi
Dur daÿ gange Gtatut bindurd, und verbreitet bei Das gefaumte
Unternebmen den Geift der Gittlifeit und der Goliditat, Dem
Redner erfdien c8 nidt gmeifelbaft, da wenn bei der IBabl der
Micther mit Borfidt und GOewiffenbaitigfeit verfabren, und auf
die Aufredrhaitung des Statuts mit Strenge gemadt merde, für
jeben Bewobner der Gefellfhañtsbäufer die Meinung eines ordent:
lien, folidben Gefbaftmannes ermedt merden mürde. ,C8 mu
dabin fommen”, fprad berfelbe, ,das jedem fleineren Genverb:
fletbenden bei ben Gabrifanten und Grofbändlern ein offener
Rrebdit su Gebote ftebt, fobald er nadmeift, daB er Mietber der
gemeinnübigen Baugelelihaft ift!* Sntereflant war e8 guglei,
aus dem Bortrage des Gerrn Gacbler gu erfabren, daÿ nidt
allein im übrigen Deutfhland, fondern aud in der belgiféen,
franaüfifhen und italientfhen Preffe das Statut der Gefellfaft
Die wärmfte Anerfennung gefuuden bat.
The model seems to have issues with the long s (ſ), which looks like an f in Fraktur, and with some other letters as well. Is there a specific model that would fix this issue? Or would I have to train my own model for this use case?