Pages that failed to scan end up missing entirely from the index - should have rows with blank text instead #23
Comments
Demonstration that there are 102 files in
|
Added a bunch of debug output:
@@ -468,6 +487,7 @@ def index(bucket, database, **boto_options):
     to_fetch_job_ids = list(
         job_ids_in_ocr_jobs.intersection(available_job_ids - fetched_job_ids)
     )
+    print("len(to_fetch_job_ids)=", len(to_fetch_job_ids))
     # Figure out total length to retrieve in bytes, for the progress bar
     items_to_fetch = []
     for item in items:
@@ -477,6 +497,7 @@ def index(bucket, database, **boto_options):
             and ".s3_access_check" not in item["Key"]
         ):
             items_to_fetch.append(item)
+    print("len(items_to_fetch)=", len(items_to_fetch))
     total_length = sum(item["Size"] for item in items_to_fetch)
     with click.progressbar(length=total_length, label="Populating pages table") as bar:
         for item in items_to_fetch:
@@ -505,6 +526,7 @@ def index(bucket, database, **boto_options):
                     pages[page] = []
                 pages[page].append(block["Text"])
             # And insert those into the database
+            print("Inserting", path, "page", page, pages)
             for page_number, lines in pages.items():
                 db["pages"].insert(
                     {
|
There are 102 output lines in that block. I figured out which were missing from the pages table with this query:
with expected as (select '20.pdf' as p union
select '32.pdf' as p union
select '68.pdf' as p union
select '21.pdf' as p union
select '53.pdf' as p union
select '38.pdf' as p union
select '24.pdf' as p union
select '41.pdf' as p union
select '75.pdf' as p union
select '101.pdf' as p union
select '69.pdf' as p union
select '78.pdf' as p union
select '61.pdf' as p union
select '55.pdf' as p union
select '84.pdf' as p union
select '67.pdf' as p union
select '62.pdf' as p union
select '22.pdf' as p union
select '3.pdf' as p union
select '12.pdf' as p union
select '60.pdf' as p union
select '13.pdf' as p union
select '77.pdf' as p union
select '35.pdf' as p union
select '102.pdf' as p union
select '74.pdf' as p union
select '89.pdf' as p union
select '90.pdf' as p union
select '80.pdf' as p union
select '28.pdf' as p union
select '25.pdf' as p union
select '8.pdf' as p union
select '97.pdf' as p union
select '54.pdf' as p union
select '98.pdf' as p union
select '66.pdf' as p union
select '71.pdf' as p union
select '52.pdf' as p union
select '76.pdf' as p union
select '85.pdf' as p union
select '29.pdf' as p union
select '59.pdf' as p union
select '93.pdf' as p union
select '42.pdf' as p union
select '47.pdf' as p union
select '72.pdf' as p union
select '37.pdf' as p union
select '51.pdf' as p union
select '17.pdf' as p union
select '73.pdf' as p union
select '26.pdf' as p union
select '7.pdf' as p union
select '33.pdf' as p union
select '1.pdf' as p union
select '30.pdf' as p union
select '49.pdf' as p union
select '5.pdf' as p union
select '57.pdf' as p union
select '63.pdf' as p union
select '6.pdf' as p union
select '27.pdf' as p union
select '86.pdf' as p union
select '9.pdf' as p union
select '92.pdf' as p union
select '48.pdf' as p union
select '56.pdf' as p union
select '10.pdf' as p union
select '99.pdf' as p union
select '87.pdf' as p union
select '79.pdf' as p union
select '65.pdf' as p union
select '83.pdf' as p union
select '82.pdf' as p union
select '36.pdf' as p union
select '14.pdf' as p union
select '23.pdf' as p union
select '4.pdf' as p union
select '43.pdf' as p union
select '45.pdf' as p union
select '40.pdf' as p union
select '100.pdf' as p union
select '58.pdf' as p union
select '70.pdf' as p union
select '11.pdf' as p union
select '19.pdf' as p union
select '44.pdf' as p union
select '18.pdf' as p union
select '39.pdf' as p union
select '2.pdf' as p union
select '31.pdf' as p union
select '95.pdf' as p union
select '94.pdf' as p union
select '88.pdf' as p union
select '34.pdf' as p union
select '46.pdf' as p union
select '91.pdf' as p union
select '81.pdf' as p union
select '64.pdf' as p union
select '50.pdf' as p union
select '96.pdf' as p union
select '16.pdf' as p union
select '15.pdf' as p)
select p from expected where p not in (select path from pages)
It returned:
|
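The hand-written list of 102 `union` selects above can also be generated. This is a minimal sketch using Python's standard `sqlite3` module, assuming the files are named `1.pdf` through `N.pdf` and that the `pages` table has a `path` column as in the query; the helper name `missing_paths`, the other column names and the sample rows are made up for illustration.

```python
import sqlite3


def missing_paths(conn, expected):
    """Return the expected paths that have no row in the pages table."""
    present = {row[0] for row in conn.execute("select distinct path from pages")}
    return sorted(set(expected) - present)


# Demo against an in-memory database with made-up rows (not the real bucket)
conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the "path" column is taken from the query above
conn.execute("create table pages (path text, page integer, text text)")
conn.executemany(
    "insert into pages (path, page, text) values (?, ?, ?)",
    [("1.pdf", 1, "first"), ("3.pdf", 1, "third")],
)

expected = [f"{n}.pdf" for n in range(1, 5)]  # 1.pdf .. 4.pdf
missing = missing_paths(conn, expected)
print(missing)
```

Against the real database, `expected` would be `[f"{n}.pdf" for n in range(1, 103)]`.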
Investigating The
So the Job ID is I can't see a |
I tried re-submitting the files but it told me they were all done:
|
What's weird here is that there are 102 But it looks like the job IDs don't match up for some reason? |
This gets a list of all job IDs in the bucket:
It returns 102. So it looks like there are job IDs in the This suggests that they got submitted to Textract but for some reason it wrote the results out using a different job ID? No idea why that might happen. Short-term fix could be to scan for job IDs in |
Alternative query showing the pages that are missing:
select * from ocr_jobs where key not in (select path from pages)
This also returns 18 rows for me.
|
This returns 0 rows, showing that they are the same:
select job_id from fetched_jobs except select job_id from ocr_jobs
|
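A set difference in both directions is one way to check whether two tables hold the same job IDs. The sketch below does that with Python's `sqlite3` module. The table and column names (`fetched_jobs`, `ocr_jobs`, `job_id`) come from the query in this thread; the helper name `job_id_differences` and the sample rows are hypothetical.

```python
import sqlite3


def job_id_differences(conn):
    """Return (only_in_fetched, only_in_ocr) as sorted lists of job IDs."""
    only_in_fetched = sorted(
        row[0]
        for row in conn.execute(
            "select job_id from fetched_jobs except select job_id from ocr_jobs"
        )
    )
    only_in_ocr = sorted(
        row[0]
        for row in conn.execute(
            "select job_id from ocr_jobs except select job_id from fetched_jobs"
        )
    )
    return only_in_fetched, only_in_ocr


# Demo with made-up job IDs in an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("create table fetched_jobs (job_id text)")
conn.execute("create table ocr_jobs (job_id text)")
conn.executemany("insert into fetched_jobs values (?)", [("a",), ("b",)])
conn.executemany("insert into ocr_jobs values (?)", [("a",), ("b",), ("c",)])

only_in_fetched, only_in_ocr = job_id_differences(conn)
print(only_in_fetched, only_in_ocr)
```

Both lists being empty would mean the two tables contain exactly the same set of job IDs.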
The missing pages were highlighted in the above output:
Page 87 is one of the missing ones. |
I think I found the bug. Here's the content of that
{
"Blocks": [
{
"BlockType": "PAGE",
"ColumnIndex": null,
"ColumnSpan": null,
"Confidence": null,
"DocumentType": null,
"EntityTypes": null,
"Geometry": {
"BoundingBox": {
"Height": 1,
"Left": 0,
"Top": 0,
"Width": 1
},
"Polygon": [
{
"X": 1,
"Y": 0.0005257856100797653
},
{
"X": 0.999599814414978,
"Y": 1
},
{
"X": 0,
"Y": 0.9994742274284363
},
{
"X": 0.000400237477151677,
"Y": 0
}
]
},
"Hint": null,
"Id": "27a217ad-f736-4b37-866e-fccbc506e3bb",
"Page": 1,
"PageClassification": null,
"Query": null,
"Relationships": null,
"RowIndex": null,
"RowSpan": null,
"SelectionStatus": null,
"Text": null,
"TextType": null
}
],
"DetectDocumentTextModelVersion": "1.0",
"DocumentMetadata": {
"Pages": 1
},
"JobStatus": "SUCCEEDED",
"NextToken": null,
"StatusMessage": null,
"Warnings": null
}
That doesn't have any blocks of a type other than PAGE! Compare with one that worked:
{
"Blocks": [
{
"BlockType": "PAGE",
"ColumnIndex": null,
"ColumnSpan": null,
"Confidence": null,
"DocumentType": null,
"EntityTypes": null,
"Geometry": {
"BoundingBox": {
"Height": 1,
"Left": 0,
"Top": 0,
"Width": 1
},
"Polygon": [
{
"X": 0.000400237477151677,
"Y": 0
},
{
"X": 1,
"Y": 0.0005257856100797653
},
{
"X": 0.9995997548103333,
"Y": 1
},
{
"X": 0,
"Y": 0.9994742274284363
}
]
},
"Hint": null,
"Id": "845f5e22-e855-4056-865d-a10e1e5af905",
"Page": 1,
"PageClassification": null,
"Query": null,
"Relationships": [
{
"Ids": [
"5ef460d9-91c1-4336-a7b6-debe368c074d"
],
"Type": "CHILD"
}
],
"RowIndex": null,
"RowSpan": null,
"SelectionStatus": null,
"Text": null,
"TextType": null
},
{
"BlockType": "LINE",
"ColumnIndex": null,
"ColumnSpan": null,
"Confidence": 99.12808227539062,
"DocumentType": null,
"EntityTypes": null,
"Geometry": {
"BoundingBox": {
"Height": 0.030083946883678436,
"Left": 0.011475927196443081,
"Top": 0.012375177815556526,
"Width": 0.06685517728328705
},
"Polygon": [
{
"X": 0.012199141085147858,
"Y": 0.012375177815556526
},
{
"X": 0.0783311054110527,
"Y": 0.013356308452785015
},
{
"X": 0.0776078924536705,
"Y": 0.04245912656188011
},
{
"X": 0.011475927196443081,
"Y": 0.0414779931306839
}
]
},
"Hint": null,
"Id": "5ef460d9-91c1-4336-a7b6-debe368c074d",
"Page": 1,
"PageClassification": null,
"Query": null,
"Relationships": [
{
"Ids": [
"15503f5e-463d-4cbe-9994-bf5701ae6527"
],
"Type": "CHILD"
}
],
"RowIndex": null,
"RowSpan": null,
"SelectionStatus": null,
"Text": "{89}",
"TextType": null
},
{
"BlockType": "WORD",
"ColumnIndex": null,
"ColumnSpan": null,
"Confidence": 99.12808227539062,
"DocumentType": null,
"EntityTypes": null,
"Geometry": {
"BoundingBox": {
"Height": 0.030083946883678436,
"Left": 0.011475927196443081,
"Top": 0.012375177815556526,
"Width": 0.06685517728328705
},
"Polygon": [
{
"X": 0.012199141085147858,
"Y": 0.012375177815556526
},
{
"X": 0.0783311054110527,
"Y": 0.013356308452785015
},
{
"X": 0.0776078924536705,
"Y": 0.04245912656188011
},
{
"X": 0.011475927196443081,
"Y": 0.0414779931306839
}
]
},
"Hint": null,
"Id": "15503f5e-463d-4cbe-9994-bf5701ae6527",
"Page": 1,
"PageClassification": null,
"Query": null,
"Relationships": null,
"RowIndex": null,
"RowSpan": null,
"SelectionStatus": null,
"Text": "{89}",
"TextType": "PRINTED"
}
],
"DetectDocumentTextModelVersion": "1.0",
"DocumentMetadata": {
"Pages": 1
},
"JobStatus": "SUCCEEDED",
"NextToken": null,
"StatusMessage": null,
"Warnings": null
}
The file looks like this:
My best guess is that this particular shape confused the OCR - maybe it looked like something that wasn't a word - and so it didn't get correctly detected. Then there's a bug in this code which writes nothing to the
Lines 496 to 509 in ba47d9a
|
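The behaviour described above, where a result containing only a PAGE block produces no row at all, can be sketched in isolation. This is one possible shape for a fix, not necessarily the change that was made to the tool: every block's `Page` number registers the page, so a page with no LINE blocks still ends up with an entry and blank text. The function names `lines_by_page` and `page_rows`, the row keys, and the newline join are assumptions for illustration.

```python
def lines_by_page(blocks):
    """Group LINE text by page number, keeping pages that have no lines."""
    pages = {}
    for block in blocks:
        page = block.get("Page")
        if page is None:
            continue
        # Register the page even if this block carries no text (e.g. PAGE)
        pages.setdefault(page, [])
        if block.get("BlockType") == "LINE":
            pages[page].append(block["Text"])
    return pages


def page_rows(path, blocks):
    """Build one row per page; pages without text get an empty string."""
    return [
        {"path": path, "page": page_number, "text": "\n".join(lines)}
        for page_number, lines in sorted(lines_by_page(blocks).items())
    ]


# Cut-down versions of the two Textract results shown above
blank_result = [{"BlockType": "PAGE", "Page": 1, "Text": None}]
working_result = [
    {"BlockType": "PAGE", "Page": 1, "Text": None},
    {"BlockType": "LINE", "Page": 1, "Text": "{89}"},
    {"BlockType": "WORD", "Page": 1, "Text": "{89}"},
]

blank_rows = page_rows("blank.pdf", blank_result)  # hypothetical path
working_rows = page_rows("working.pdf", working_result)  # hypothetical path
print(blank_rows)
print(working_rows)
```

WORD blocks are skipped here because their text already appears in the parent LINE block.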
Thinking more about this: I think the way the tool works right now is actually correct. The tool works in terms of I think a better fix is to introduce a |
I changed my mind, for reasons described here: |
Example document showing that problem: https://sfms-history.vercel.app/docs/8e0175dd Which in S3 is: The
So job ID is I looked up its output in
{
"Blocks": [
{
"BlockType": "PAGE",
"ColumnIndex": null,
"ColumnSpan": null,
"Confidence": null,
"DocumentType": null,
"EntityTypes": null,
"Geometry": {
"BoundingBox": {
"Height": 1,
"Left": 0,
"Top": 0,
"Width": 1
},
"Polygon": [
{
"X": 1,
"Y": 0.0005188702489249408
},
{
"X": 0.9996047616004944,
"Y": 1
},
{
"X": 0,
"Y": 0.9994811415672302
},
{
"X": 0.0003951845574192703,
"Y": 0
}
]
},
"Hint": null,
"Id": "3444df06-df9a-4c0a-93b0-132ae48a9680",
"Page": 1,
"Query": null,
"Relationships": null,
"RowIndex": null,
"RowSpan": null,
"SelectionStatus": null,
"Text": null,
"TextType": null
},
{
"BlockType": "PAGE",
"ColumnIndex": null,
"ColumnSpan": null,
"Confidence": null,
"DocumentType": null,
"EntityTypes": null,
"Geometry": {
"BoundingBox": {
"Height": 1,
"Left": 0,
"Top": 0,
"Width": 1
},
"Polygon": [
{
"X": 0.00039482428110204637,
"Y": 0
},
{
"X": 1,
"Y": 0.0003579388139769435
},
{
"X": 0.9996051788330078,
"Y": 1
},
{
"X": 0,
"Y": 0.999642014503479
}
]
},
"Hint": null,
"Id": "72b8cafa-dfea-490d-bb7a-c62a6322156f",
"Page": 2,
"Query": null,
"Relationships": [
{
"Ids": [
"192777fe-155c-4485-8754-32b6c6de51a3",
"85cf50c3-31fe-4ab8-81b9-eba31d306fb3",
"c71c0c5e-8c0a-4962-9996-126b1bc6cda2",
"758e004d-1952-44d1-a1cd-682533531289",
"366f5dcd-bcbb-41d1-b008-f94eba11d2d9",
"97c6606e-a4d9-402a-a921-2d0f357d01f2",
"6ebbee1b-96eb-4c86-a5f8-8269a2da3eb7",
"a87c9dc2-2cb3-42f6-b7c3-9efedcb2b360",
"1ea431eb-96c9-422a-adab-b4468322274e"
],
"Type": "CHILD"
}
],
"RowIndex": null,
"RowSpan": null,
"SelectionStatus": null,
"Text": null,
"TextType": null
},
{
"BlockType": "LINE",
"ColumnIndex": null,
"ColumnSpan": null,
"Confidence": 56.1424446105957,
"DocumentType": null,
"EntityTypes": null,
"Geometry": {
"BoundingBox": {
"Height": 0.008917947299778461,
"Left": 0.42938727140426636,
"Top": 0.7764229774475098,
"Width": 0.13498438894748688
},
"Polygon": [
{
"X": 0.4295380413532257,
"Y": 0.7764229774475098
},
{
"X": 0.564371645450592,
"Y": 0.7783024907112122
},
{
"X": 0.5642209053039551,
"Y": 0.7853409051895142
},
{
"X": 0.42938727140426636,
"Y": 0.7834613919258118
}
]
},
"Hint": null,
"Id": "192777fe-155c-4485-8754-32b6c6de51a3",
"Page": 2,
"Query": null,
"Relationships": [
{
"Ids": [
"5b45ce15-b8c1-4c12-82c1-4ce37467e431",
"dc5c12ce-61c1-4fd6-8044-d8e7d3ee7c6b"
],
"Type": "CHILD"
}
],
"RowIndex": null,
"RowSpan": null,
"SelectionStatus": null,
"Text": "WAR DEPARTMENT,",
"TextType": null
},
So there is a
|
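In the JSON above, the PAGE block for page 1 has `"Relationships": null` while the one for page 2 lists CHILD IDs. A small diagnostic along these lines could flag pages with no detected content. It is only a sketch based on the fields visible in this output; the function name `pages_without_children` is made up.

```python
def pages_without_children(blocks):
    """Return page numbers of PAGE blocks that list no CHILD relationships."""
    blank = []
    for block in blocks:
        if block.get("BlockType") != "PAGE":
            continue
        # "Relationships" is null (None) when nothing was detected on the page
        relationships = block.get("Relationships") or []
        has_children = any(
            rel.get("Type") == "CHILD" and rel.get("Ids") for rel in relationships
        )
        if not has_children:
            blank.append(block.get("Page"))
    return blank


# Cut-down version of the two-page result shown above
sample_blocks = [
    {"BlockType": "PAGE", "Page": 1, "Relationships": None},
    {
        "BlockType": "PAGE",
        "Page": 2,
        "Relationships": [
            {"Ids": ["192777fe-155c-4485-8754-32b6c6de51a3"], "Type": "CHILD"}
        ],
    },
    {"BlockType": "LINE", "Page": 2, "Text": "WAR DEPARTMENT,"},
]

blank_pages = pages_without_children(sample_blocks)
print(blank_pages)
```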
Original title: s3-ocr index not catching every page - 84 out of 102
Spotted while working on:
This command runs against a bucket with 102 PDFs in, all of which have been OCRd:
The resulting DB looks like this:
The pages table should have 102 records in it, not 84.