Cannot extract headers properly #20

wassim · 2024-09-09T15:21:53Z

First, thank you for making this lib!

However, I'm unable to extract headers properly, and your help would be much appreciated. In this example, the first data row is always treated as the header. Am I doing something wrong?

Example pdf: synthetic_bank_statement.pdf

from gmft import CroppedTable, TableDetector, AutoTableFormatter
from gmft.pdf_bindings import PyPDFium2Document
import pandas as pd

detector = TableDetector()
formatter = AutoTableFormatter()

def ingest_pdf(pdf_path): # returns (list[CroppedTable], doc); the caller must close doc
    doc = PyPDFium2Document(pdf_path)
    tables = []
    for page in doc:
        tables += detector.extract(page)
    return tables, doc

def print_table_data(table, index):
    print(f"\n{'='*50}")
    print(f"Table {index + 1}:")
    print(f"Confidence Score: {table.confidence_score}")

    ft = formatter.extract(table)
    df = ft.df()

    print("\nTable Headers:")
    print(df.columns.tolist())

    print("\nTable Data:")
    with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.width', 1000):
        print(df.to_string(index=False))

    print(f"{'='*50}\n")

pdf_path = "synthetic_bank_statement.pdf"  # the example PDF attached above
tables, doc = ingest_pdf(pdf_path)

print(f"Extracted {len(tables)} tables from {pdf_path}")

for i, table in enumerate(tables):
    print_table_data(table, i)

doc.close()

conjuncts · 2024-09-15T04:35:19Z

Okay, this problem is tricky.

fts = [formatter.extract(table) for table in tables]
tables[0].visualize()
fts[0].visualize(margin='auto')

First, the detection model does not include the desired headers in the table's bounding box.

My first thought (passing the formatter more margin) doesn't work. What does work is nudging the top of the table bbox upward:

for table in tables:
    table.bbox[1] -= 15
fts = [formatter.extract(table) for table in tables]

The downside is that this requires you to know in advance that the model will consistently miss the header. In that case, it might be better to set y0 to a known location instead.
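A minimal sketch of pinning y0 to a known location, instead of nudging it by a fixed offset. It assumes, as the snippet above implies, that `table.bbox` is a mutable `[x0, y0, x1, y1]` sequence with y0 as the top edge. The helper name `pin_table_top` and the value passed as `known_y0` are hypothetical, not part of gmft.

```python
def pin_table_top(table, known_y0):
    """Set the top edge (y0) of table.bbox to a fixed, known coordinate.

    Assumes table.bbox is a mutable [x0, y0, x1, y1] sequence,
    mirroring the `table.bbox[1] -= 15` usage above.
    """
    y1 = table.bbox[3]
    if known_y0 >= y1:
        # A top edge at or below the bottom edge would give an empty box.
        raise ValueError(f"known_y0={known_y0} must be above y1={y1}")
    table.bbox[1] = known_y0
    return table
```

You would call this on each detected table before `formatter.extract(table)`, in place of the `-= 15` loop. The right `known_y0` has to come from inspecting your own documents.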

Alternatively, if you know that the headers will always be the same across all the tables you're extracting, you can simply hardcode the headers into the pandas dataframe.
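A rough sketch of the hardcoding approach, using only pandas. Since the first data row ends up as the column labels in this issue, the sketch moves those labels back into the data before applying the known headers. The helper name `apply_known_headers` and the header names in the usage note are hypothetical placeholders, not taken from the example PDF.

```python
import pandas as pd

def apply_known_headers(df, headers):
    """Return a new DataFrame that uses `headers` as its column names.

    The existing column labels are assumed to be a data row that was
    mistaken for the header, so they are kept as the first data row.
    """
    if len(headers) != df.shape[1]:
        raise ValueError(
            f"expected {df.shape[1]} headers, got {len(headers)}"
        )
    # The mis-detected "header" becomes the first row of data.
    first_row = pd.DataFrame([list(df.columns)], columns=headers)
    body = df.copy()
    body.columns = headers
    return pd.concat([first_row, body], ignore_index=True)
```

For example, `apply_known_headers(ft.df(), ["Date", "Description", "Amount"])` would relabel a three-column table, assuming those are the real headers. If the detected header row is not actually data, set `df.columns = headers` directly instead.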

wassim · 2024-09-22T11:23:54Z

@conjuncts thank you. Can we use an LLM as a fallback to detect the headers in this case?

conjuncts · 2024-10-04T04:24:08Z

Sure, it sounds like if you're able to get header information you can use that to update the table -- akin to #25

conjuncts added the "detection accuracy" (issue related to detection accuracy) and "structure accuracy" (issue related to recognizing table structure, "format") labels · Sep 15, 2024
