Cannot extract headers properly #20

Open
wassim opened this issue Sep 9, 2024 · 3 comments
Labels
detection accuracy: issue related to detection accuracy
structure accuracy: issue related to recognizing table structure ("format")

Comments

@wassim

wassim commented Sep 9, 2024

First, thank you for making this lib!

I'm unable to extract headers properly, and your help would be much appreciated. In this example, the first data row is always treated as the header. Am I doing something wrong?

Example pdf: synthetic_bank_statement.pdf

from gmft import CroppedTable, TableDetector, AutoTableFormatter
from gmft.pdf_bindings import PyPDFium2Document
import pandas as pd

detector = TableDetector()
formatter = AutoTableFormatter()

def ingest_pdf(pdf_path): # returns (list[CroppedTable], doc)
    doc = PyPDFium2Document(pdf_path)
    tables = []
    for page in doc:
        tables += detector.extract(page)
    return tables, doc

def print_table_data(table, index):
    print(f"\n{'='*50}")
    print(f"Table {index + 1}:")
    print(f"Confidence Score: {table.confidence_score}")

    ft = formatter.extract(table)
    df = ft.df()

    print("\nTable Headers:")
    print(df.columns.tolist())

    print("\nTable Data:")
    with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.width', 1000):
        print(df.to_string(index=False))

    print(f"{'='*50}\n")

pdf_path = "synthetic_bank_statement.pdf"  # the example PDF attached above
tables, doc = ingest_pdf(pdf_path)

print(f"Extracted {len(tables)} tables from {pdf_path}")

for i, table in enumerate(tables):
    print_table_data(table, i)

doc.close()

conjuncts added the "detection accuracy" and "structure accuracy" labels Sep 15, 2024

@conjuncts
Owner

Okay, this problem is tricky.

tables[0].visualize()
fts = [formatter.extract(table) for table in tables]
fts[0].visualize(margin='auto')

First, the detection model does not include the desired headers:

[image: visualization of the detected table region]

My first thought, passing the formatter more margin, doesn't work. What does work is nudging the table bbox so that it covers the header:

for table in tables:
    table.bbox[1] -= 15
fts = [formatter.extract(table) for table in tables]

The downside is that this requires knowing in advance that the model will consistently miss the header. In that case, it might be better to set y0 to a known location instead.
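Setting y0 to a fixed value could look roughly like the sketch below. It assumes the bbox is laid out as (x0, y0, x1, y1) with y increasing downward; the helper `set_top` and the coordinates are made up for illustration and are not part of gmft.

```python
def set_top(bbox, y0):
    """Return a copy of an (x0, y0, x1, y1) bbox with its top edge moved to y0."""
    x0, _, x1, y1 = bbox
    if y0 >= y1:
        # Assumption: y grows downward, so the top edge must be smaller than the bottom.
        raise ValueError("top edge must lie above the bottom edge")
    return [x0, y0, x1, y1]

# Made-up coordinates standing in for a detected table bbox.
detected = [72.0, 215.0, 540.0, 610.0]
adjusted = set_top(detected, 200.0)
print(adjusted)
```

In the same spirit as the nudge above, the in-place form would be `table.bbox[1] = known_y0` before calling `formatter.extract(table)`, where `known_y0` is a value you have measured for your documents.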

And if you know that the headers will always be the same across all the tables you're extracting, you can simply hardcode the headers into the pandas DataFrame.
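A minimal sketch of the hardcoding approach, assuming the extracted DataFrame has its first data row wrongly promoted to column labels. The header names and the helper `restore_headers` are hypothetical, and the small DataFrame stands in for the result of `ft.df()`:

```python
import pandas as pd

# Hypothetical header names; replace them with the ones printed in your PDF.
KNOWN_HEADERS = ["Date", "Description", "Amount"]

def restore_headers(df: pd.DataFrame, headers: list) -> pd.DataFrame:
    """Move the wrongly promoted header row back into the data and apply known headers."""
    if len(headers) != df.shape[1]:
        raise ValueError(f"expected {df.shape[1]} headers, got {len(headers)}")
    # The current column labels are really the first data row.
    first_row = pd.DataFrame([list(df.columns)], columns=headers)
    body = df.copy()
    body.columns = headers
    return pd.concat([first_row, body], ignore_index=True)

# Stand-in for ft.df(): the first transaction ended up as the column labels.
extracted = pd.DataFrame(
    [["2024-01-03", "Coffee", "-4.50"]],
    columns=["2024-01-02", "Salary", "1000.00"],
)
fixed = restore_headers(extracted, KNOWN_HEADERS)
print(fixed.to_string(index=False))
```

If the header row is present but only mislabeled, assigning `df.columns = KNOWN_HEADERS` is enough; the extra step here exists only to avoid losing the row that was mistaken for the header.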

@wassim
Author

wassim commented Sep 22, 2024

@conjuncts thank you. Can we use an LLM as a fallback to detect the headers in this case?

@conjuncts
Owner

Sure. If you're able to get the header information, you can use it to update the table, akin to #25.
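One way such a fallback might be wired in, as a rough sketch: the header source is passed in as a plain callable, so an LLM call (or anything else) can be swapped in later. `apply_fallback_headers` and `stub_header_source` are hypothetical names, the stub returns fixed values instead of calling a model, and none of this is gmft API.

```python
import pandas as pd

def apply_fallback_headers(df, get_headers):
    """Rename the columns with headers from a fallback source, if they fit the table."""
    headers = get_headers(df)
    # Reject missing answers and answers with the wrong number of columns.
    if not headers or len(headers) != df.shape[1]:
        return df
    renamed = df.copy()
    renamed.columns = list(headers)
    return renamed

def stub_header_source(df):
    # Placeholder for a real LLM call; returns made-up header names.
    return ["Date", "Description", "Amount"]

# Stand-in for an extracted table whose headers were not detected.
extracted = pd.DataFrame([["2024-01-02", "Salary", "1000.00"]], columns=[0, 1, 2])
result = apply_fallback_headers(extracted, stub_header_source)
print(result.columns.tolist())
```

Checking the column count before renaming keeps a bad guess from silently mislabeling the table; a stricter version could also compare the guessed headers against text found near the top of the page.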
