Add multiline Japanese strings support to HocrVisualParser() to fix #534 and redo #537 #542

YasushiMiyata · 2021-05-11T00:24:53Z

Description of the problems or issues

Is your pull request related to a problem? Please describe.
See #534.
This PR redoes #537, which depended on #538 being fixed first (done in #539).

Does your pull request fix any issue?
See #534.

Description of the proposed changes

For a multiline Japanese string such as 'AAAA\nBBBB', spacy[ja] sometimes generates tokens that span the line break, e.g. ['AAA', 'AB', 'B', 'BB']. This PR defines the bbox of 'AB' as that of a multiline word: left is the minimum left of ['A', 'B'], top is the top of 'A', right is the maximum right of ['A', 'B'], and bottom is the bottom of 'B'.
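The bbox rule above can be sketched roughly as follows. This is only an illustration of the min/max logic, not the actual code in HocrVisualParser(); the `Box` type and the `merge_char_boxes` helper are hypothetical names, and the coordinates are made up.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Hypothetical bounding box type (not part of Fonduer)."""

    left: int
    top: int
    right: int
    bottom: int


def merge_char_boxes(char_boxes: List[Box]) -> Box:
    """Merge per-character boxes, given in reading order, into one word box.

    Hypothetical helper: left/right take the extremes over all characters,
    top comes from the first character and bottom from the last one, so a
    token that wraps onto the next line gets a box covering both lines.
    """
    if not char_boxes:
        raise ValueError("char_boxes must not be empty")
    first, last = char_boxes[0], char_boxes[-1]
    return Box(
        left=min(b.left for b in char_boxes),
        top=first.top,
        right=max(b.right for b in char_boxes),
        bottom=last.bottom,
    )


# Made-up coordinates: 'A' ends the first line, 'B' starts the second line.
a = Box(left=90, top=10, right=100, bottom=20)
b = Box(left=10, top=30, right=20, bottom=40)
word_box = merge_char_boxes([a, b])
print(word_box)  # Box(left=10, top=10, right=100, bottom=40)
```

For a token that stays on a single line, the same helper reduces to the usual box spanning the first through the last character.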

Test plan

This behavior is caused by Japanese morphological analysis, so I have added Japanese test data to 'tests/data/hocr_simple/japan.hocr' and test code to 'tests/parser/test_parser.py::test_parse_hocr'.

Checklist

I have updated the documentation accordingly.
I have added tests to cover my changes.
All new and existing tests passed.
I have updated the CHANGELOG.rst accordingly.

codecov-commenter · 2021-05-11T02:01:08Z

Codecov Report

Merging #542 (d76bd76) into master (5ab8e9c) will increase coverage by 0.04%.
The diff coverage is 100.00%.

❗ Current head d76bd76 differs from pull request most recent head 5ca10e8. Consider uploading reports for the commit 5ca10e8 to get more accurate results

@@            Coverage Diff             @@
##           master     #542      +/-   ##
==========================================
+ Coverage   86.02%   86.07%   +0.04%     
==========================================
  Files          92       92              
  Lines        4773     4775       +2     
  Branches      899      899              
==========================================
+ Hits         4106     4110       +4     
+ Misses        476      475       -1     
+ Partials      191      190       -1

Flag	Coverage Δ
unittests	`86.07% <100.00%> (+0.04%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
...fonduer/parser/visual_parser/hocr_visual_parser.py	`97.64% <100.00%> (+2.46%)`	⬆️

lukehsiao · 2021-05-11T17:00:14Z

To avoid the need for you to rebase this again, I'm going to squash these commits and merge. But please update your other PR to be based off master if possible.

YasushiMiyata added 5 commits May 11, 2021 09:11

Add multiline Japanese strings support to HocrVisualParser() to fix #534

fde94fd

Add test data and code for multiline Japanese strings support of HocrVisualParser() to fix #534

c2c07a1

Add #542 log to CHANGELOG

c163dca

Specify sphinx version to avoid import error above v4.0.1

d76bd76

Revert "Specify sphinx version to avoid import error above v4.0.1"

5ca10e8

This reverts commit d76bd76.

Merge remote-tracking branch 'upstream/master'

a27bc90

YasushiMiyata marked this pull request as ready for review May 11, 2021 08:00

lukehsiao added this to the v0.8.4 milestone May 11, 2021

lukehsiao added the enhancement New feature or request label May 11, 2021

lukehsiao approved these changes May 11, 2021

lukehsiao merged commit b0154c3 into HazyResearch:master May 11, 2021
