adding an Arabic vocab file #514

mzeidhassan · 2021-09-29T23:15:28Z

I have added the Arabic alphabet plus some Farsi letters, because they appear in many Arabic texts. I also added "Hindi" numerals and Arabic diacritics. Arabic uses connected shaping, so characters are not isolated as they are in Latin-based languages. I hope this won't be an issue. Please let me know if you have any questions.

charlesmindee

Hi, thanks for the PR!

Could you please move this to doctr/datasets/vocabs.py ?

And as you mentioned, you are aggregating Hindi + Arabic + Farsi characters, if I understood correctly. Would it be possible to split those entries in the dictionary right there (below ancient greek):

VOCABS: Dict[str, str] = {
    'digits': string.digits,
    'ascii_letters': string.ascii_letters,
    'punctuation': string.punctuation,
    'currency': '£€¥¢฿',
    'ancient_greek': 'αβγδεζηθικλμνξοπρστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ',
}

For instance: 'hindi_letters', 'arabic_diacritics', etc. And try to be as exhaustive as possible for each subclass. Then you can add your full vocab below the dictionary (you can follow the example of the European vocabs):

VOCABS['arabic'] = VOCABS['hindi_letters'] + VOCABS['farsi'] + VOCABS['arabic_diacritics'] + ...

Thanks for that ! 🙏
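A minimal sketch of the split suggested above, written as a standalone snippet. The key names (`arabic_letters`, `arabic_digits`, `persian_letters`), the helper `char_range`, and the chosen Unicode ranges are assumptions for illustration; they are not the entries that ended up in doctr/datasets/vocabs.py.

```python
import string
from typing import Dict


def char_range(first: int, last: int) -> str:
    """Return the characters whose code points lie in [first, last] (hypothetical helper)."""
    return ''.join(chr(cp) for cp in range(first, last + 1))


VOCABS: Dict[str, str] = {
    'digits': string.digits,
    'punctuation': string.punctuation,
    # Assumed sub-vocabularies, built from Unicode code points:
    # basic Arabic letters, U+0621..U+063A and U+0641..U+064A
    'arabic_letters': char_range(0x0621, 0x063A) + char_range(0x0641, 0x064A),
    # diacritics from fathatan (U+064B) to sukun (U+0652)
    'arabic_diacritics': char_range(0x064B, 0x0652),
    # Arabic-Indic digits, U+0660..U+0669
    'arabic_digits': char_range(0x0660, 0x0669),
    # four letters used in Persian: peh, tcheh, jeh, gaf
    'persian_letters': '\u067E\u0686\u0698\u06AF',
}

# The full vocab is composed from the sub-entries, wrapped in parentheses
# so that no single line grows too long.
VOCABS['arabic'] = (
    VOCABS['digits'] + VOCABS['arabic_digits'] + VOCABS['arabic_letters']
    + VOCABS['persian_letters'] + VOCABS['arabic_diacritics'] + VOCABS['punctuation']
)
```

Keeping each script or character class under its own key means another vocab can reuse, say, only the diacritics without duplicating the string.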

mzeidhassan · 2021-09-30T18:35:24Z

Thanks @charlesmindee for your help and guidance. I appreciate it. I am new to PR pushing, so I hope I am doing it right.

Please find the changes I made here
3faad13

Instead of creating a new item "arabic_numbers", I am using "VOCABS['digits']".

Please let me know if there is anything else that I need to do at my end.

Thanks,
Mohamed

fg-mindee

Thanks for the edits! Just a small modification and it looks like we're good to merge!

arabic_vocab

charlesmindee

Thanks for the edit. You only need to split the last line into two lines to pass the flake8 test (code style enforcement: each line must be < 120 characters), and then we are good to merge (all the other tests are OK)!
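To illustrate the line-length point, here is a rough sketch that mimics a line-length check on two versions of the same assignment. It assumes a limit of 120 characters, as quoted above, and flags lines longer than that; the `part_N` keys and the helper `too_long_lines` are hypothetical, and the project's actual flake8 configuration is not shown in this thread.

```python
from typing import List

MAX_LINE_LENGTH = 120  # assumed limit, taken from the review comment


def too_long_lines(source: str, limit: int = MAX_LINE_LENGTH) -> List[int]:
    """Return the 1-based numbers of the lines longer than `limit` characters."""
    return [i for i, line in enumerate(source.splitlines(), start=1) if len(line) > limit]


# Eight placeholder sub-vocab lookups, split into two halves.
first_half = " + ".join(f"VOCABS['part_{i}']" for i in range(4))
second_half = " + ".join(f"VOCABS['part_{i}']" for i in range(4, 8))

# Everything on one line: this exceeds the limit.
one_line = "VOCABS['arabic'] = " + first_half + " + " + second_half

# Same expression wrapped in parentheses and split across two lines.
wrapped = "\n".join([
    "VOCABS['arabic'] = (",
    "    " + first_half,
    "    + " + second_half,
    ")",
])

print(too_long_lines(one_line))
print(too_long_lines(wrapped))
```

Wrapping the expression in parentheses lets Python continue it across lines without backslashes, which is the usual way to satisfy a line-length rule for a long concatenation.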

charlesmindee · 2021-10-01T07:26:05Z

closes #490

codecov · 2021-10-01T07:27:12Z

Codecov Report

Merging #514 (9fe9594) into main (14b376e) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #514   +/-   ##
=======================================
  Coverage   95.38%   95.38%           
=======================================
  Files         109      109           
  Lines        4183     4184    +1     
=======================================
+ Hits         3990     3991    +1     
  Misses        193      193
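As a rough check of the figures in the diff above, the sketch below recomputes coverage as hits divided by lines, using only the Hits and Lines values from the report. The function name `coverage_percent` is made up for this example, and how Codecov rounds the displayed percentages is not assumed here.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage: covered lines over total lines."""
    return 100.0 * hits / lines


# Figures from the report: main (14b376e) versus #514 (9fe9594).
before = coverage_percent(3990, 4183)
after = coverage_percent(3991, 4184)
delta = after - before

# One extra line that is also covered raises coverage only marginally,
# which is why the report shows the same percentage on both sides.
print(f"{before:.4f}% -> {after:.4f}% (delta {delta:+.4f} points)")
```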

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `95.38% <100.00%> (+<0.01%)` | ⬆️ |

| Impacted Files | Coverage Δ |
|---|---|
| doctr/datasets/vocabs.py | `100.00% <100.00%> (ø)` |
| doctr/models/recognition/master/tensorflow.py | `96.47% <0.00%> (ø)` |

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 14b376e...9fe9594.

fg-mindee

Thanks again, I added some suggestions to fix flake8 and improve naming 👌

doctr/datasets/vocabs.py

charlesmindee

Thanks for the updates!

fg-mindee

Thanks for the edits, looks good to me!

adding an Arabic vocab file

a44535a

mzeidhassan mentioned this pull request Sep 29, 2021

Arabic data detection and recognition #490

Closed

charlesmindee self-assigned this Sep 30, 2021

charlesmindee added the labels type: enhancement, module: datasets, and topic: text recognition on Sep 30, 2021

charlesmindee requested changes Sep 30, 2021

fg-mindee reviewed Sep 30, 2021

mzeidhassan force-pushed the arabic_vocab branch from 3faad13 to 986732d Compare October 1, 2021 04:47

charlesmindee requested changes Oct 1, 2021

fg-mindee suggested changes Oct 1, 2021

Updating vocabs.py to add Arabic vocab

aaae04c

mzeidhassan force-pushed the arabic_vocab branch from 986732d to aaae04c Compare October 1, 2021 17:56

mzeidhassan added 3 commits October 1, 2021 11:59

Updating vocabs.py to be flake8-compliant

1235f11

make it flake-8 compliant

f311464

made it flake8 compliant and updated Arabic/Farsi chrs.

9fe9594

charlesmindee approved these changes Oct 4, 2021

charlesmindee requested review from fg-mindee and removed request for fg-mindee October 4, 2021 08:20

fg-mindee approved these changes Oct 4, 2021

fg-mindee merged commit 40678ac into mindee:main Oct 4, 2021
