Missing white spaces #5448

Closed
esraa-abdelmaksoud opened this issue Feb 11, 2022 · 19 comments

@esraa-abdelmaksoud

System Environment: Windows 10
Version: v2.4
Related components: PP-OCR
Command Code:
PaddleOCR(use_angle_cls=True, lang='en', ocr_version='PP-OCR', use_space_char=True)

Hello,

Before anything, I'd like to say thank you for the great effort you exerted in the creation of this work.

My problem is that the OCR engine often misses white spaces even though I set use_space_char to True. Below are some of the images with their output.

In the following image, all spaces in the first line are missing:
[image: draw-20210831_143019-4]

In this image, lines 4, 6, and 7 are missing spaces.
[image: draw-20210831_171152-4]

Is there any configuration I can change to overcome this issue?

@andyjiang1116
Collaborator

You can try decreasing the unclip_ratio parameter (det_db_unclip_ratio).
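
For anyone who wants to try several values in one go, here is a minimal sketch of a sweep over det_db_unclip_ratio. It assumes the paddleocr 2.x Python package, where the engine exposes `ocr(path, cls=True)` and each recognized line looks like `[box, (text, score)]`. The helper names (`default_engine_factory`, `extract_texts`, `sweep_unclip_ratio`) are made up for illustration and are not part of PaddleOCR.

```python
from typing import Callable, Dict, Iterable, List


def default_engine_factory(unclip_ratio: float):
    """Build one PaddleOCR engine per ratio (assumes the paddleocr 2.x package)."""
    from paddleocr import PaddleOCR  # imported lazily; only needed for real runs

    return PaddleOCR(
        use_angle_cls=True,
        lang="en",
        use_space_char=True,
        det_db_unclip_ratio=unclip_ratio,
    )


def _is_line(item) -> bool:
    """True if item looks like one recognized line: [box, (text, score)]."""
    return (
        isinstance(item, (list, tuple))
        and len(item) == 2
        and isinstance(item[1], (list, tuple))
        and len(item[1]) == 2
        and isinstance(item[1][0], str)
    )


def extract_texts(result) -> List[str]:
    """Collect recognized strings from an ocr() result.

    Accepts a flat list of lines as well as the same list wrapped once per
    image, because the nesting is not the same in every release.
    """
    texts: List[str] = []
    for item in result or []:
        if item is None:  # some releases return None for an empty page
            continue
        if _is_line(item):
            texts.append(item[1][0])
        else:
            texts.extend(line[1][0] for line in item if _is_line(line))
    return texts


def sweep_unclip_ratio(
    image_path: str,
    ratios: Iterable[float],
    engine_factory: Callable = default_engine_factory,
) -> Dict[float, List[str]]:
    """Run OCR once per ratio and return the recognized texts keyed by ratio."""
    outputs: Dict[float, List[str]] = {}
    for ratio in ratios:
        engine = engine_factory(ratio)
        outputs[ratio] = extract_texts(engine.ocr(image_path, cls=True))
    return outputs
```

A call such as `sweep_unclip_ratio("box.jpg", [0.8, 1.0, 1.2, 1.5, 2.0])` (the file name is a placeholder) returns one list of strings per ratio, which makes it easier to see where spaces appear or disappear.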

@esraa-abdelmaksoud
Author

esraa-abdelmaksoud commented Feb 11, 2022

I've tried 0.8, 1, 1.2, 1.8, 2 and the default 1.5 for det_db_unclip_ratio. The problem is that this works for some images and spoils the output for others. Below are some examples for the following clean image:

[image: IMG-20210831-WA0005]

0.8:

[[[[668.0, 471.0], [699.0, 471.0], [699.0, 690.0], [668.0, 690.0]], ('Budelizer', 0.9989222)]]
[[[61.0, 113.0], [263.0, 118.0], [262.0, 142.0], [60.0, 137.0]], ('Budelizer', 0.970733)]
[[[53.0, 161.0], [300.0, 165.0], [299.0, 179.0], [53.0, 175.0]], (**'Budesonide micronized400 mco'**, 0.9597847)]
[[[53.0, 233.0], [293.0, 234.0], [293.0, 249.0], [53.0, 248.0]], (**'Hard capsules containingdr'**, 0.9630118)]
[[[53.0, 262.0], [272.0, 261.0], [272.0, 275.0], [53.0, 276.0]], (**'powder For oralinhalatior'**, 0.9352509)]
[[[524.0, 372.0], [593.0, 372.0], [593.0, 388.0], [524.0, 388.0]], ('inhale', 0.9461186)]
[[[52.0, 415.0], [138.0, 415.0], [138.0, 434.0], [52.0, 434.0]], ('EUROPFAN', 0.9215055)]
[[[510.0, 412.0], [603.0, 412.0], [603.0, 427.0], [510.0, 427.0]], ('capsule', 0.8058552)]

1:

[[[[666.0, 470.0], [701.0, 470.0], [701.0, 691.0], [666.0, 691.0]], ('Budelizer', 0.9989196)]]
[[[60.0, 111.0], [264.0, 116.0], [263.0, 144.0], [59.0, 139.0]], ('Budelizer', 0.97098225)]
[[[53.0, 161.0], [300.0, 165.0], [299.0, 179.0], [53.0, 175.0]], (**'Budesonide micronized400 mco'**, 0.9597847)]
[[[52.0, 232.0], [294.0, 233.0], [294.0, 250.0], [52.0, 249.0]], ('Hard capsules containing dry', 0.9746936)]
[[[52.0, 261.0], [273.0, 260.0], [273.0, 276.0], [52.0, 277.0]], (**'powder For oralinhalatior'**, 0.96679384)]
[[[523.0, 371.0], [594.0, 371.0], [594.0, 389.0], [523.0, 389.0]], ('inhale', 0.9650893)]

1.2:

[[[[664.0, 468.0], [702.0, 468.0], [702.0, 693.0], [664.0, 693.0]], ('Budelizer', 0.9995226)]]
[[[58.0, 109.0], [266.0, 115.0], [265.0, 145.0], [57.0, 140.0]], ('Budelizer', 0.99847627)]
[[[52.0, 160.0], [301.0, 164.0], [300.0, 180.0], [52.0, 176.0]], (**'Budesonide micronized400 mco'**, 0.9571237)]
[[[52.0, 232.0], [294.0, 233.0], [294.0, 250.0], [52.0, 249.0]], ('Hard capsules containing dry', 0.9746936)]
[[[52.0, 260.0], [273.0, 259.0], [273.0, 277.0], [52.0, 278.0]], (**'powder for oralinhalatior'**, 0.9528579)]
[[[522.0, 370.0], [595.0, 370.0], [595.0, 390.0], [522.0, 390.0]], (**'1inhale'**, 0.89756185)]
[[[50.0, 413.0], [140.0, 413.0], [140.0, 436.0], [50.0, 436.0]], ('LUROPEAN', 0.92258036)]

1.5:

[[[[661.0, 465.0], [706.0, 465.0], [706.0, 696.0], [661.0, 696.0]], ('Budelizer', 0.99889064)]]
[[[56.0, 107.0], [268.0, 113.0], [267.0, 147.0], [55.0, 142.0]], ('Budelizer', 0.9836219)]
[[[51.0, 159.0], [302.0, 163.0], [301.0, 181.0], [51.0, 177.0]], (**'Budesonide micronized400 mco'**, 0.9659994)]
[[[51.0, 231.0], [295.0, 232.0], [295.0, 251.0], [51.0, 250.0]], ('Hard capsules containing dry', 0.9827056)]
[[[50.0, 259.0], [275.0, 258.0], [275.0, 278.0], [50.0, 279.0]], ('powder for oral inhalation', 0.9851054)]
[[[521.0, 369.0], [596.0, 369.0], [596.0, 391.0], [521.0, 391.0]], ('inhaler', 0.8621807)]
[[[49.0, 412.0], [141.0, 412.0], [141.0, 437.0], [49.0, 437.0]], ('EUROPEAN', 0.9411056)]

2:

[[[[658.0, 461.0], [709.0, 461.0], [709.0, 700.0], [658.0, 700.0]], ('Budelizer', 0.99946606)]]
[[[53.0, 104.0], [271.0, 110.0], [270.0, 150.0], [52.0, 145.0]], ('Budelizer', 0.9977702)]
[[[49.0, 157.0], [304.0, 161.0], [303.0, 183.0], [49.0, 179.0]], (**'Budesonidemicronized400mcg'**, 0.99825233)]
[[[49.0, 229.0], [298.0, 230.0], [297.0, 253.0], [49.0, 252.0]], (**'Hard capsules containingdry'**, 0.9628299)]
[[[49.0, 257.0], [276.0, 256.0], [276.0, 280.0], [49.0, 281.0]], ('Powder for oral inhalation', 0.9759482)]
[[[518.0, 366.0], [599.0, 366.0], [599.0, 394.0], [518.0, 394.0]], ('1E innater', 0.8560789)]
[[[46.0, 409.0], [144.0, 409.0], [144.0, 440.0], [46.0, 440.0]], ('LUROFEAN', 0.9025044)]
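
One rough way to compare runs like these is to count the spaces recognized at each setting. The sketch below does that for three of the text lines quoted above. `rank_by_spaces` is a hypothetical helper, and the space count is only a crude proxy: it says nothing about whether the characters themselves are right (the 1.5 run still reads 'mco' instead of 'mcg').

```python
from typing import Dict, List, Tuple


def rank_by_spaces(outputs: Dict[float, List[str]]) -> List[Tuple[float, int]]:
    """Return (ratio, total_spaces) pairs, highest space count first."""
    counts = [
        (ratio, sum(text.count(" ") for text in texts))
        for ratio, texts in outputs.items()
    ]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Three lines copied from the outputs above, keyed by det_db_unclip_ratio.
outputs = {
    0.8: [
        "Budesonide micronized400 mco",
        "Hard capsules containingdr",
        "powder For oralinhalatior",
    ],
    1.5: [
        "Budesonide micronized400 mco",
        "Hard capsules containing dry",
        "powder for oral inhalation",
    ],
    2.0: [
        "Budesonidemicronized400mcg",
        "Hard capsules containingdry",
        "Powder for oral inhalation",
    ],
}

ranking = rank_by_spaces(outputs)
print(ranking)  # [(1.5, 8), (0.8, 6), (2.0, 5)]
```

On these three lines the default 1.5 keeps the most spaces, which matches the impression from reading the outputs by eye.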

@tink2123
Collaborator

tink2123 commented Mar 4, 2022

@esraa-abdelmaksoud Thanks for using our products and providing valuable feedback. The space problem is also something we hope to focus on in the next version.

You can also try setting the parameter use_dilation=True to expand the border of the detection box a little, which helps with space recognition.

If that still doesn't solve your problem, please look forward to our next version upgrade.

@esraa-abdelmaksoud
Author

Unfortunately, it didn't help enough and removed a good deal of the text. However, I would like to thank you for your great work. I'm looking forward to trying your next version. Thanks! :)

@Ankur-singh

The new version (2.5) does a much better job at handling white spaces, but the issue is still not resolved completely. There are situations where white spaces are missing. I have tried different values of det_db_unclip_ratio and use_dilation=True. Is there anything else that can be done to further improve the performance?

@joel1895

joel1895 commented Jul 3, 2023

Hi, I am facing the same issue. Using det_db_unclip_ratio and use_dilation=True helps to some extent. However, it affects other results. Is there any way to improve it?

@asif-ca
Contributor

asif-ca commented Jul 20, 2023

@tink2123 I am facing the same issue with PaddlePaddle 2.5.0: many white spaces are ignored and words are merged together!

Using use_dilation=True restores some of the spaces (not all), but then many words are not recognized at all.

@MosbehBarhoumiRAI

I used PaddleOCR with the parameters below and got good results on space detection, even in more challenging cases.

PaddleOCR(use_angle_cls=True, lang='en', ocr_version='PP-OCRv4', use_space_char=True)

@asif-ca
Contributor

asif-ca commented Feb 2, 2024

I solved the issue by adding more data with white spaces, such as images with multiple words containing white spaces between them. Then, I fine-tuned the model and now it works great.

@nam-leduc

> I solved the issue by adding more data with white spaces, such as images with multiple words containing white spaces between them. Then, I fine-tuned the model and now it works great.

That is great!
@asif-ca I want to try fine-tuning, but it seems we must use the original dataset, as mentioned here: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/doc/doc_en/finetune_en.md#:~:text=add%20it%20to%20the%20original%20dataset%20and%20use%20a%20small%20learning%20rate%20for%20fine%2Dtuning

Is that really the case? Is it necessary to use the whole original dataset for fine-tuning when we only want to improve recognition of a few characters?

@asif-ca
Contributor

asif-ca commented Feb 16, 2024

Hello @nam-leduc

Adding original data can be very useful in fine-tuning your model. This is because it enables the model to learn from various text variations, such as different fonts, text sizes, and image effects like blurriness. By doing so, your model's accuracy can be significantly improved when fine-tuning it.

However, if you are fine-tuning your model for a specific use case, such as a particular data format or font, you can leave out the original dataset. The model will then be more accurate on data similar to what it was fine-tuned on, because it learned from a narrow set of examples.

In summary, while adding original data can be extremely helpful, it is not a must if you are fine-tuning the model for a specific use case.
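
A middle ground between the two options is to keep only a fraction of the original label lines alongside the new ones. The sketch below is one way to build such a mixed list. It assumes recognition labels are plain text lines (one sample per line); `mix_label_lines` and the 30% figure in the usage note are illustrative choices, not something from the PaddleOCR docs.

```python
import random
from typing import List, Sequence


def mix_label_lines(
    original: Sequence[str],
    new: Sequence[str],
    original_ratio: float,
    seed: int = 0,
) -> List[str]:
    """Return all new lines plus a random share of the original lines.

    original_ratio=1.0 keeps the whole original set, 0.0 drops it entirely.
    """
    if not 0.0 <= original_ratio <= 1.0:
        raise ValueError("original_ratio must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    keep = round(len(original) * original_ratio)
    mixed = list(new) + rng.sample(list(original), keep)
    rng.shuffle(mixed)
    return mixed
```

For example, `mix_label_lines(original_lines, new_lines, 0.3)` keeps every new sample and about 30% of the original ones, which can then be written out as a single label file for fine-tuning.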

@phuchm

phuchm commented Mar 6, 2024

Hello @asif-ca
Which model did you choose to fine-tune: det, rec, or both? I also want to fine-tune, but I'm not sure which model to use.

@asif-ca
Contributor

asif-ca commented Mar 7, 2024

@phuchm It is important to note that white space issues typically occur during the recognition phase rather than in text detection. To address this, it is recommended to fine-tune the recognizer model with additional data that includes white spaces.

For example, using images containing white spaces between words can improve the model's ability to detect and recognize white spaces.
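
Before fine-tuning the recognizer, it can help to check how many training labels contain a space at all. The sketch below assumes the tab-separated `image_path<TAB>label` layout used for recognition label files. `space_coverage` is a hypothetical helper name.

```python
from typing import Iterable, Tuple


def space_coverage(label_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (labels_containing_a_space, total_valid_labels)."""
    with_space = 0
    total = 0
    for raw in label_lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        _path, sep, text = line.partition("\t")
        if not sep:
            continue  # skip malformed lines that have no tab separator
        total += 1
        if " " in text.strip():
            with_space += 1
    return with_space, total
```

Reading a label file with `open(path, encoding="utf-8")` and passing the file object to `space_coverage` gives the two counts. A very low share of multi-word labels would suggest that the recognizer sees few examples of spaces during training.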

@kilimchoi

@asif-ca would you be able to share some of the datasets you used to fine-tune?

@Ahmed-Mks

PaddleOCR(use_angle_cls=True, lang='en', ocr_version='PP-OCRv4', use_space_char=True)

Thank you, this works well.

@lucasjinreal

For interleaved Chinese and English text, the spaces are still missing...

@naveen-164

The spacing issue is still not solved.

@msciancalepore98

When will PP-OCRv4 be available for "latin" (multilingual)? The spacing situation is very limiting.

@GoldenBigAnt

I trained a Thai language model, but PaddleOCR doesn't support Thai yet. How can I use my trained Thai model directly?
Please help me.
