Missing white spaces #5448

esraa-abdelmaksoud · 2022-02-11T05:01:13Z

System Environment: Windows 10
Version: v2.4
Related components: PP-OCR
Command Code:
PaddleOCR(use_angle_cls=True, lang='en', ocr_version='PP-OCR', use_space_char=True)

Hello,

Before anything, I'd like to say thank you for the great effort you exerted in the creation of this work.

The OCR engine often misses white spaces even though I'm setting use_space_char to True. Below are some of the images with their output.

In the following image, all spaces in the first line are missing:

In this image, lines 4, 6, and 7 are missing spaces.

Is there any configuration I can do to overcome this issue?


andyjiang1116 · 2022-02-11T08:25:11Z

You can try decreasing the unclip_ratio parameter (det_db_unclip_ratio).

esraa-abdelmaksoud · 2022-02-11T23:36:51Z

I've tried 0.8, 1, 1.2, 1.8, 2 and the default 1.5 for det_db_unclip_ratio. The problem is that this works for some images and spoils the output for others. Below are some examples for the following clean image:

0.8:

[[[[668.0, 471.0], [699.0, 471.0], [699.0, 690.0], [668.0, 690.0]], ('Budelizer', 0.9989222)]]
[[[61.0, 113.0], [263.0, 118.0], [262.0, 142.0], [60.0, 137.0]], ('Budelizer', 0.970733)]
[[[53.0, 161.0], [300.0, 165.0], [299.0, 179.0], [53.0, 175.0]], (**'Budesonide micronized400 mco'**, 0.9597847)]
[[[53.0, 233.0], [293.0, 234.0], [293.0, 249.0], [53.0, 248.0]], (**'Hard capsules containingdr'**, 0.9630118)]
[[[53.0, 262.0], [272.0, 261.0], [272.0, 275.0], [53.0, 276.0]], (**'powder For oralinhalatior'**, 0.9352509)]
[[[524.0, 372.0], [593.0, 372.0], [593.0, 388.0], [524.0, 388.0]], ('inhale', 0.9461186)]
[[[52.0, 415.0], [138.0, 415.0], [138.0, 434.0], [52.0, 434.0]], ('EUROPFAN', 0.9215055)]
[[[510.0, 412.0], [603.0, 412.0], [603.0, 427.0], [510.0, 427.0]], ('capsule', 0.8058552)]

1:

[[[[666.0, 470.0], [701.0, 470.0], [701.0, 691.0], [666.0, 691.0]], ('Budelizer', 0.9989196)]]
[[[60.0, 111.0], [264.0, 116.0], [263.0, 144.0], [59.0, 139.0]], ('Budelizer', 0.97098225)]
[[[53.0, 161.0], [300.0, 165.0], [299.0, 179.0], [53.0, 175.0]], (**'Budesonide micronized400 mco'**, 0.9597847)]
[[[52.0, 232.0], [294.0, 233.0], [294.0, 250.0], [52.0, 249.0]], ('Hard capsules containing dry', 0.9746936)]
[[[52.0, 261.0], [273.0, 260.0], [273.0, 276.0], [52.0, 277.0]], (**'powder For oralinhalatior'**, 0.96679384)]
[[[523.0, 371.0], [594.0, 371.0], [594.0, 389.0], [523.0, 389.0]], ('inhale', 0.9650893)]

1.2:

[[[[664.0, 468.0], [702.0, 468.0], [702.0, 693.0], [664.0, 693.0]], ('Budelizer', 0.9995226)]]
[[[58.0, 109.0], [266.0, 115.0], [265.0, 145.0], [57.0, 140.0]], ('Budelizer', 0.99847627)]
[[[52.0, 160.0], [301.0, 164.0], [300.0, 180.0], [52.0, 176.0]], (**'Budesonide micronized400 mco'**, 0.9571237)]
[[[52.0, 232.0], [294.0, 233.0], [294.0, 250.0], [52.0, 249.0]], ('Hard capsules containing dry', 0.9746936)]
[[[52.0, 260.0], [273.0, 259.0], [273.0, 277.0], [52.0, 278.0]], (**'powder for oralinhalatior'**, 0.9528579)]
[[[522.0, 370.0], [595.0, 370.0], [595.0, 390.0], [522.0, 390.0]], (**'1inhale'**, 0.89756185)]
[[[50.0, 413.0], [140.0, 413.0], [140.0, 436.0], [50.0, 436.0]], ('LUROPEAN', 0.92258036)]

1.5:

[[[[661.0, 465.0], [706.0, 465.0], [706.0, 696.0], [661.0, 696.0]], ('Budelizer', 0.99889064)]]
[[[56.0, 107.0], [268.0, 113.0], [267.0, 147.0], [55.0, 142.0]], ('Budelizer', 0.9836219)]
[[[51.0, 159.0], [302.0, 163.0], [301.0, 181.0], [51.0, 177.0]], (**'Budesonide micronized400 mco'**, 0.9659994)]
[[[51.0, 231.0], [295.0, 232.0], [295.0, 251.0], [51.0, 250.0]], ('Hard capsules containing dry', 0.9827056)]
[[[50.0, 259.0], [275.0, 258.0], [275.0, 278.0], [50.0, 279.0]], ('powder for oral inhalation', 0.9851054)]
[[[521.0, 369.0], [596.0, 369.0], [596.0, 391.0], [521.0, 391.0]], ('inhaler', 0.8621807)]
[[[49.0, 412.0], [141.0, 412.0], [141.0, 437.0], [49.0, 437.0]], ('EUROPEAN', 0.9411056)]

2:

[[[[658.0, 461.0], [709.0, 461.0], [709.0, 700.0], [658.0, 700.0]], ('Budelizer', 0.99946606)]]
[[[53.0, 104.0], [271.0, 110.0], [270.0, 150.0], [52.0, 145.0]], ('Budelizer', 0.9977702)]
[[[49.0, 157.0], [304.0, 161.0], [303.0, 183.0], [49.0, 179.0]], (**'Budesonidemicronized400mcg'**, 0.99825233)]
[[[49.0, 229.0], [298.0, 230.0], [297.0, 253.0], [49.0, 252.0]], (**'Hard capsules containingdry'**, 0.9628299)]
[[[49.0, 257.0], [276.0, 256.0], [276.0, 280.0], [49.0, 281.0]], ('Powder for oral inhalation', 0.9759482)]
[[[518.0, 366.0], [599.0, 366.0], [599.0, 394.0], [518.0, 394.0]], ('1E innater', 0.8560789)]
[[[46.0, 409.0], [144.0, 409.0], [144.0, 440.0], [46.0, 440.0]], ('LUROFEAN', 0.9025044)]
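One way to compare these settings side by side is to run the same image through several det_db_unclip_ratio values and line up the recognized text per value. The following is only a minimal sketch: `texts_from_result`, `sweep_unclip_ratio`, and `paddle_recognize` are hypothetical helper names, not part of PaddleOCR, and the result layout (entries of the form `[box, (text, score)]`, possibly wrapped in one extra per-image list) is assumed from the output printed above and from the PaddleOCR 2.x `ocr()` method.

```python
def _is_entry(item):
    """True if item looks like one OCR line: [box, (text, score)]."""
    return (
        isinstance(item, (list, tuple))
        and len(item) == 2
        and isinstance(item[1], (list, tuple))
        and len(item[1]) == 2
        and isinstance(item[1][0], str)
    )


def texts_from_result(result):
    """Pull the recognized strings out of a PaddleOCR-style result.

    Assumption: some releases return the entries directly, others wrap them
    in one per-image list (which may be None when nothing was detected).
    """
    if not result:
        return []
    if not _is_entry(result[0]):
        result = result[0] or []
    return [entry[1][0] for entry in result]


def sweep_unclip_ratio(image_path, ratios, recognize):
    """Map each ratio to the list of texts recognized with that ratio.

    `recognize(image_path, ratio)` is injected so the comparison logic can be
    exercised without PaddleOCR installed.
    """
    return {ratio: texts_from_result(recognize(image_path, ratio)) for ratio in ratios}


def paddle_recognize(image_path, ratio):
    """Run PaddleOCR once with the given det_db_unclip_ratio (assumes the 2.x API)."""
    # Imported lazily so the helpers above stay usable without the package.
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(
        use_angle_cls=True,
        lang="en",
        use_space_char=True,
        det_db_unclip_ratio=ratio,
    )
    return ocr.ocr(image_path, cls=True)
```

With PaddleOCR installed, something like `sweep_unclip_ratio("label.png", [0.8, 1.0, 1.2, 1.5, 2.0], paddle_recognize)` (where `label.png` is a placeholder path) would give one list of strings per ratio. Building a new engine per ratio is slow but keeps each run independent.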

tink2123 · 2022-03-04T03:45:20Z

@esraa-abdelmaksoud Thanks for using our products and providing valuable feedback. The space problem is also something we hope to focus on in the next version.

You can also try setting the parameter use_dilation=True to expand the border of the detection box a little, which helps with space recognition.

If that still doesn't solve your problem, please look out for our next version upgrade.

esraa-abdelmaksoud · 2022-03-04T13:36:04Z

Unfortunately, it didn't help enough and removed a good deal of the text. However, I would like to thank you for your great work. I'm looking forward to trying your next version. Thanks! :)

Ankur-singh · 2022-06-14T01:40:53Z

The new version (2.5) does a much better job at handling white spaces, but the issue is still not completely resolved. There are situations where white spaces are missing. I have tried different values of det_db_unclip_ratio and use_dilation=True. Is there anything else that can be done to further improve the performance?

joel1895 · 2023-07-03T12:52:26Z

Hi, I am facing the same issue. Using det_db_unclip_ratio and use_dilation=True helps to some extent. However, it affects other results. Is there any way to improve it?

asif-ca · 2023-07-20T10:42:56Z

@tink2123 I am facing the same issue with PaddlePaddle 2.5.0: many white spaces are ignored and words are merged!

Using use_dilation=True adds spaces (to some extent, not all), but recognition is then missing for many words.

MosbehBarhoumiRAI · 2024-02-01T14:33:00Z

I used PaddleOCR with these parameters, yielding favorable outcomes in space detection, even for more challenging cases.

PaddleOCR(use_angle_cls=True, lang='en', ocr_version='PP-OCRv4', use_space_char=True)

asif-ca · 2024-02-02T12:34:19Z

I solved the issue by adding more data with white spaces, such as images with multiple words containing white spaces between them. Then, I fine-tuned the model and now it works great.
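Before fine-tuning along these lines, it can help to check how many labels in a recognition training set actually contain spaces. The sketch below assumes the tab-separated `image_path<TAB>text` label format used for PaddleOCR recognition training; `space_coverage` is a hypothetical helper name, not a PaddleOCR function.

```python
def space_coverage(label_lines):
    """Return (labels_with_spaces, total_labels).

    Each line is expected to look like 'path/to/img.jpg<TAB>label text'.
    Blank or malformed lines (no tab) are skipped.
    """
    with_spaces = 0
    total = 0
    for line in label_lines:
        line = line.rstrip("\r\n")
        if "\t" not in line:
            continue
        _path, text = line.split("\t", 1)
        total += 1
        # Only interior spaces count; leading/trailing ones are stripped first.
        if " " in text.strip():
            with_spaces += 1
    return with_spaces, total
```

A low ratio of space-containing labels would be one hint that the recognizer has seen few multi-word lines during training.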

nam-leduc · 2024-02-16T04:42:20Z

> I solved the issue by adding more data with white spaces, such as images with multiple words containing white spaces between them. Then, I fine-tuned the model and now it works great.

That is great!
@asif-ca I want to try fine-tuning, but it seems we must use the original dataset, as mentioned here: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/doc/doc_en/finetune_en.md#:~:text=add%20it%20to%20the%20original%20dataset%20and%20use%20a%20small%20learning%20rate%20for%20fine%2Dtuning

My concern is whether that is true: is it really necessary to use the whole original dataset for fine-tuning when we just want to improve a few characters?

asif-ca · 2024-02-16T04:52:18Z

Hello @nam-leduc

Adding original data can be very useful in fine-tuning your model. This is because it enables the model to learn from various text variations, such as different fonts, text sizes, and image effects like blurriness. By doing so, your model's accuracy can be significantly improved when fine-tuning it.

However, if you are fine-tuning your model for a specific use case, such as a particular data format or font, you can ignore the original dataset. The model will then be more accurate on data similar to what it was fine-tuned on, because it learns only from that narrow set.

In summary, while adding original data can be extremely helpful, it is not a must if you are fine-tuning the model for a specific use case.
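The trade-off described above can be expressed as a single knob: how much of the original data to keep alongside the new samples. The following is a minimal sketch that works on lists of label lines only; `mix_labels` and `original_fraction` are hypothetical names introduced for illustration, and this is not how PaddleOCR itself combines datasets.

```python
import random


def mix_labels(original, new, original_fraction, seed=0):
    """Combine all new label lines with a random fraction of the original ones.

    original_fraction=0.0 corresponds to fine-tuning on the new data only;
    1.0 keeps the whole original dataset.
    """
    if not 0.0 <= original_fraction <= 1.0:
        raise ValueError("original_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    keep = round(len(original) * original_fraction)
    mixed = list(new) + rng.sample(list(original), keep)
    rng.shuffle(mixed)
    return mixed
```

Trying a few fractions and evaluating each fine-tuned model on both the new and the original kind of data is one way to see whether dropping the original dataset costs accuracy elsewhere.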

phuchm · 2024-03-06T08:01:08Z

Hello @asif-ca
Which model did you choose to fine-tune: det, rec, or both? I also want to fine-tune, but I'm not sure which model should be used.

asif-ca · 2024-03-07T05:01:45Z

@phuchm It is important to note that white space issues typically occur during the recognition phase rather than in text detection. To address this, it is recommended to fine-tune the recognizer model with additional data that includes white spaces.

For example, using images containing white spaces between words can improve the model's ability to detect and recognize white spaces.
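As a rough illustration of preparing such data, the sketch below only generates multi-word label strings from a word list. Rendering those strings into images is a separate step that is not shown here. `make_spaced_phrases` is a hypothetical helper, and the word list in the usage note is taken from the sample output earlier in this thread.

```python
import random


def make_spaced_phrases(vocabulary, count, min_words=2, max_words=4, seed=0):
    """Build `count` phrases of min_words..max_words words joined by single spaces."""
    if not vocabulary:
        raise ValueError("vocabulary must not be empty")
    if min_words < 1 or max_words < min_words:
        raise ValueError("need 1 <= min_words <= max_words")
    rng = random.Random(seed)  # fixed seed so the phrases are reproducible
    phrases = []
    for _ in range(count):
        n_words = rng.randint(min_words, max_words)
        phrases.append(" ".join(rng.choice(vocabulary) for _ in range(n_words)))
    return phrases
```

For example, `make_spaced_phrases(["Hard", "capsules", "containing", "dry", "powder"], 100)` yields 100 phrases that each contain at least one space, which could then serve as labels for rendered training images.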

kilimchoi · 2024-03-19T00:58:19Z

@asif-ca would you be able to share some of the datasets you used to fine-tune?

Ahmed-Mks · 2024-04-22T19:32:43Z

PaddleOCR(use_angle_cls=True, lang='en', ocr_version='PP-OCRv4', use_space_char=True)

Thank you, this works well.

lucasjinreal · 2024-06-26T15:32:24Z

For interleaved Chinese and English text, the spaces are still missing...

naveen-164 · 2024-11-25T16:22:33Z

The spacing issue is still not solved.

msciancalepore98 · 2024-12-19T09:17:16Z

When will PP-OCRv4 be available for "latin" (multilingual)? The spacing situation is very limiting.

GoldenBigAnt · 2024-12-30T20:30:14Z

I trained a Thai language model, but PaddleOCR doesn't support Thai yet. How can I use my trained Thai model directly?
Please help me.

paddle-bot-old bot assigned andyjiang1116 Feb 11, 2022

andyjiang1116 added the help wanted label Feb 17, 2022

Evezerest assigned tink2123 and Evezerest Mar 4, 2022

MissPenguin added the bad case label Mar 4, 2022

paddle-bot-old bot closed this as completed Sep 13, 2022

paddle-bot bot added the status/close label Sep 13, 2022

cainmagi mentioned this issue Jan 9, 2025

Bug: The PaddleOCRv4 (English) text recognition model mentioned in the documentation has not been uploaded to HuggingFace RapidAI/RapidOCR#319

Closed
