v0.4.0

@conjuncts released this 30 Oct 00:11

Features

Three new table structure recognition options!

  • Added TabledFormatter, which supports the fantastic new Tabled library from VikParuchuri. Check out the demo notebook for a quick example.
  • Added HistogramFormatter, a super-fast and decently accurate algorithmic option for table structure recognition. The algorithm uses word bounding boxes (bboxes) to detect the separating lines between pieces of text. Check out the demo notebook for a quick example.
  • Added DITRFormatter. This formatter is a blend of TATRFormatter and HistogramFormatter: it is trained to recognize table separating lines rather than cells. The model is fine-tuned from microsoft/table-transformer-structure-recognition-v1.1-all on PubTables-1M for 15 epochs. Its main draw is that it lets you mix and match deep-learning and algorithmic separating-line detection. Check out the demo notebook for a quick example.

These formatters can all be used in combination with any detector (like TATRDetector).
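
To give a feel for the idea behind HistogramFormatter, here is a minimal sketch of separating-line detection from word bboxes. It is not gmft's actual implementation: the function name `separator_positions`, its parameters, and the sample boxes are hypothetical, and the real formatter may differ substantially. The sketch counts how many word boxes cover each position along one axis, then places a separator in the middle of every empty gap that lies between two regions of text.

```python
from typing import List, Sequence, Tuple

# A word bounding box as (x0, y0, x1, y1). Illustrative type, not gmft's.
BBox = Tuple[float, float, float, float]


def separator_positions(
    bboxes: Sequence[BBox], axis: int, size: float, bin_width: float = 1.0
) -> List[float]:
    """Return midpoints of empty gaps between text along one axis.

    axis=0 projects boxes onto x (gaps suggest column separators);
    axis=1 projects boxes onto y (gaps suggest row separators).
    `size` is the table extent along that axis.
    """
    n_bins = int(size // bin_width) + 1
    hist = [0] * n_bins

    # Build the histogram: each box adds 1 to every bin it touches.
    for box in bboxes:
        lo, hi = box[axis], box[axis + 2]
        first = max(0, int(lo // bin_width))
        last = min(n_bins - 1, int(hi // bin_width))
        for i in range(first, last + 1):
            hist[i] += 1

    # Scan for runs of empty bins that have text on both sides.
    separators: List[float] = []
    run_start = None
    seen_text = False
    for i, count in enumerate(hist):
        if count == 0:
            if seen_text and run_start is None:
                run_start = i
        else:
            if run_start is not None:
                # The empty run covers [run_start, i) in bin units.
                separators.append((run_start + i) * bin_width / 2)
                run_start = None
            seen_text = True
    return separators


# Hypothetical 2x2 table: two words per row, two rows.
words: List[BBox] = [
    (10, 10, 40, 20), (60, 10, 90, 20),
    (12, 30, 38, 40), (62, 30, 88, 40),
]
column_seps = separator_positions(words, axis=0, size=100)
row_seps = separator_positions(words, axis=1, size=50)
print(column_seps, row_seps)
```

Leading and trailing whitespace is ignored on purpose, so only gaps enclosed by text produce a separator. A practical version would likely also need a minimum gap width or a count threshold to cope with words that slightly overlap a column boundary.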

A visual to explain HistogramFormatter:

Bugfixes

  • Tweaked spanning cell merging
    • Fixed a bug where merging would overwrite data
  • Added a warning when importing from gmft directly (use gmft.auto instead)
  • Merged PR #32, thanks!