v0.4.0
Features
3 new table structure recognition options!
- Added TabledFormatter, with support for the fantastic new Tabled library from VikParuchuri. Check out the demo notebook for a quick example.
- Added HistogramFormatter, a super-fast and decently accurate algorithmic option for table structure recognition. The algorithm uses word bboxes to detect separating lines between text. Check out the demo notebook for a quick example.
- Added DITRFormatter. This formatter is a blend of TATRFormatter and HistogramFormatter, being trained to recognize table separating lines rather than cells. It fine-tunes microsoft/table-transformer-structure-recognition-v1.1-all on PubTables-1M for 15 epochs. Its main draw is mixing and matching deep and algorithmic separating line detection. Check out the demo notebook for a quick example.
These formatters can all be used in combination with any detector (like TATRDetector).
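
To give a rough idea of how word bboxes can yield separating lines, here is a minimal sketch of a projection histogram. It is not gmft's actual implementation: the function names, the `(x0, y0, x1, y1)` bbox layout, and the unit-wide bins are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed bbox layout for this sketch: (x0, y0, x1, y1) in page units.
BBox = Tuple[float, float, float, float]


def projection_histogram(bboxes: List[BBox], axis: int, size: int) -> List[int]:
    """Count how many word boxes cover each unit-wide bin along one axis.

    axis=0 projects onto x (for column separators),
    axis=1 projects onto y (for row separators).
    """
    hist = [0] * size
    for box in bboxes:
        lo, hi = box[axis], box[axis + 2]
        for i in range(max(0, int(lo)), min(size, int(hi))):
            hist[i] += 1
    return hist


def find_separators(hist: List[int], max_count: int = 0) -> List[float]:
    """Return the midpoint of every interior run of (nearly) empty bins.

    Runs touching the start or end of the histogram are margins, not
    separators, so they are skipped.
    """
    separators = []
    run_start = None
    for i, count in enumerate(hist):
        if count <= max_count:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and run_start > 0:
                separators.append((run_start + i) / 2)
            run_start = None
    return separators


# A hypothetical 2x2 table: two words per row, two rows.
words = [
    (0, 0, 40, 10),   # "Name"
    (60, 0, 80, 10),  # "Age"
    (0, 20, 45, 30),  # "Alice"
    (60, 20, 75, 30), # "30"
]

row_seps = find_separators(projection_histogram(words, axis=1, size=40))
col_seps = find_separators(projection_histogram(words, axis=0, size=100))
print(row_seps)  # [15.0]
print(col_seps)  # [52.5]
```

Raising `max_count` above zero would tolerate a few words crossing a candidate line, which is one way such an approach could cope with spanning cells; whether HistogramFormatter does this is not stated here.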
A visual to explain HistogramFormatter:
Bugfixes
- Tweaked spanning cell merging
- Fixed bug where it would overwrite data
- Give warning when importing from gmft directly (use gmft.auto instead)
- Merged PR #32, thanks!