Pages that failed to scan end up missing entirely from the index - should have rows with blank text instead #23

Closed
simonw opened this issue Aug 7, 2022 · 14 comments
Labels
bug Something isn't working

Comments

@simonw
Owner

simonw commented Aug 7, 2022

Original title: s3-ocr index not catching every page - 84 out of 102

Spotted while working on:

This command runs against a bucket with 102 PDFs in, all of which have been OCRd:

s3-ocr index s3-ocr-many-pdfs /tmp/many.db

The resulting DB looks like this:

(s3-ocr) s3-ocr % sqlite-utils tables --counts /tmp/many.db 
[{"table": "pages", "count": 84},
 {"table": "pages_fts", "count": 84},
 {"table": "pages_fts_data", "count": 8},
 {"table": "pages_fts_idx", "count": 6},
 {"table": "pages_fts_docsize", "count": 84},
 {"table": "pages_fts_config", "count": 1},
 {"table": "ocr_jobs", "count": 102},
 {"table": "fetched_jobs", "count": 102}]

The pages table should have 102 records in it, not 84.

@simonw simonw added the bug Something isn't working label Aug 7, 2022
@simonw
Owner Author

simonw commented Aug 7, 2022

Demonstration that there are 102 files in textract-output:

% s3-credentials list-bucket s3-ocr-many-pdfs --nl | grep '.s3-ocr.json' | wc -l
     102
% s3-credentials list-bucket s3-ocr-many-pdfs --nl | grep 'textract-output' | grep -v '.s3_acc' | wc -l
     102

@simonw
Owner Author

simonw commented Aug 7, 2022

Added a bunch of debug output:

@@ -468,6 +487,7 @@ def index(bucket, database, **boto_options):
     to_fetch_job_ids = list(
         job_ids_in_ocr_jobs.intersection(available_job_ids - fetched_job_ids)
     )
+    print("len(to_fetch_job_ids)=", len(to_fetch_job_ids))
     # Figure out total length to retrieve in bytes, for the progress bar
     items_to_fetch = []
     for item in items:
@@ -477,6 +497,7 @@ def index(bucket, database, **boto_options):
             and ".s3_access_check" not in item["Key"]
         ):
             items_to_fetch.append(item)
+    print("len(items_to_fetch)=", len(items_to_fetch))
     total_length = sum(item["Size"] for item in items_to_fetch)
     with click.progressbar(length=total_length, label="Populating pages table") as bar:
         for item in items_to_fetch:
@@ -505,6 +526,7 @@ def index(bucket, database, **boto_options):
                         pages[page] = []
                     pages[page].append(block["Text"])
             # And insert those into the database
+            print("Inserting", path, "page", page, pages)
             for page_number, lines in pages.items():
                 db["pages"].insert(
                     {
% s3-ocr index s3-ocr-many-pdfs /tmp/many.db
Fetching job details  [####################################]  100%          
len(to_fetch_job_ids)= 102
len(items_to_fetch)= 102
Populating pages table  [------------------------------------]    1%Inserting 20.pdf page 1 {1: ['{20}']}
Populating pages table  [------------------------------------]    2%Inserting 32.pdf page 1 {1: ['{32}']}
Populating pages table  [#-----------------------------------]    3%Inserting 68.pdf page 1 {1: ['{68}']}
Populating pages table  [#-----------------------------------]    4%Inserting 21.pdf page 1 {1: ['{21}']}
Populating pages table  [##----------------------------------]    5%Inserting 53.pdf page 1 {1: ['{53}']}
Populating pages table  [##----------------------------------]    6%Inserting 38.pdf page 1 {1: ['{38}']}
Populating pages table  [##----------------------------------]    7%Inserting 24.pdf page 1 {1: ['{24}']}
Populating pages table  [###---------------------------------]    8%  00:00:12Inserting 41.pdf page 1 {1: ['{41}']}
Populating pages table  [###---------------------------------]   10%  00:00:11Inserting 75.pdf page 1 {1: ['{75}']}
Populating pages table  [####--------------------------------]   11%  00:00:11Inserting 101.pdf page 1 {1: ['{101}']}
Populating pages table  [####--------------------------------]   12%  00:00:11Inserting 69.pdf page 1 {1: ['{69}']}
Populating pages table  [####--------------------------------]   13%  00:00:11Inserting 78.pdf page 1 {1: ['{78}']}
Populating pages table  [#####-------------------------------]   14%  00:00:11Inserting 61.pdf page 1 {1: ['{61}']}
Populating pages table  [#####-------------------------------]   15%  00:00:11Inserting 55.pdf page 1 {1: ['{55}']}
Populating pages table  [######------------------------------]   16%  00:00:11Inserting 84.pdf page 1 {1: ['{84}']}
Populating pages table  [######------------------------------]   17%  00:00:11Inserting 67.pdf page 1 {1: ['{67}']}
Populating pages table  [######------------------------------]   18%  00:00:11Inserting 62.pdf page 1 {1: ['{62}']}
Populating pages table  [#######-----------------------------]   20%  00:00:10Inserting 22.pdf page 1 {1: ['{22}']}
Inserting 3.pdf page 1 {}
Inserting 12.pdf page 1 {}
Populating pages table  [#######-----------------------------]   21%  00:00:11Inserting 60.pdf page 1 {1: ['{60}']}
Populating pages table  [########----------------------------]   22%  00:00:11Inserting 13.pdf page 1 {}
Populating pages table  [########----------------------------]   22%  00:00:10Inserting 77.pdf page 1 {}
Populating pages table  [########----------------------------]   23%  00:00:10Inserting 35.pdf page 1 {1: ['{35}']}
Populating pages table  [########----------------------------]   24%  00:00:10Inserting 102.pdf page 1 {1: ['{102}']}
Populating pages table  [#########---------------------------]   25%  00:00:10Inserting 74.pdf page 1 {1: ['{74}']}
Populating pages table  [#########---------------------------]   27%  00:00:10Inserting 89.pdf page 1 {1: ['{89}']}
Populating pages table  [##########--------------------------]   28%  00:00:10Inserting 90.pdf page 1 {1: ['{90}']}
Populating pages table  [##########--------------------------]   29%  00:00:10Inserting 80.pdf page 1 {1: ['{80}']}
Populating pages table  [##########--------------------------]   30%  00:00:10Inserting 28.pdf page 1 {1: ['{28}']}
Populating pages table  [###########-------------------------]   31%  00:00:10Inserting 25.pdf page 1 {1: ['{25}']}
Populating pages table  [###########-------------------------]   31%  00:00:09Inserting 8.pdf page 1 {}
Populating pages table  [###########-------------------------]   32%  00:00:09Inserting 97.pdf page 1 {1: ['{97}']}
Populating pages table  [############------------------------]   34%  00:00:09Inserting 54.pdf page 1 {1: ['{54}']}
Populating pages table  [############------------------------]   35%  00:00:09Inserting 98.pdf page 1 {1: ['{98}']}
Populating pages table  [#############-----------------------]   36%  00:00:09Inserting 66.pdf page 1 {1: ['{66}']}
Populating pages table  [#############-----------------------]   37%  00:00:09Inserting 71.pdf page 1 {1: ['{71}']}
Populating pages table  [#############-----------------------]   38%  00:00:09Inserting 52.pdf page 1 {1: ['{52}']}
Populating pages table  [##############----------------------]   39%  00:00:09Inserting 76.pdf page 1 {1: ['{76}']}
Populating pages table  [##############----------------------]   40%  00:00:08Inserting 85.pdf page 1 {}
Populating pages table  [##############----------------------]   41%  00:00:08Inserting 29.pdf page 1 {1: ['{29}']}
Populating pages table  [###############---------------------]   42%  00:00:08Inserting 59.pdf page 1 {1: ['{59}']}
Populating pages table  [###############---------------------]   43%  00:00:08Inserting 93.pdf page 1 {1: ['{93}']}
Populating pages table  [################--------------------]   44%  00:00:08Inserting 42.pdf page 1 {1: ['{42}']}
Populating pages table  [################--------------------]   45%  00:00:08Inserting 47.pdf page 1 {1: ['{47}']}
Populating pages table  [################--------------------]   46%  00:00:08Inserting 72.pdf page 1 {1: ['{72}']}
Populating pages table  [################--------------------]   47%  00:00:08Inserting 37.pdf page 1 {}
Populating pages table  [#################-------------------]   48%  00:00:07Inserting 51.pdf page 1 {1: ['{51}']}
Populating pages table  [#################-------------------]   49%  00:00:07Inserting 17.pdf page 1 {1: ['(1)']}
Inserting 73.pdf page 1 {}
Populating pages table  [##################------------------]   50%  00:00:07Inserting 26.pdf page 1 {1: ['{26}']}
Populating pages table  [##################------------------]   51%  00:00:07Inserting 7.pdf page 1 {}
Inserting 33.pdf page 1 {}
Populating pages table  [##################------------------]   52%  00:00:07Inserting 1.pdf page 1 {1: ['{1}']}
Populating pages table  [###################-----------------]   53%  00:00:07Inserting 30.pdf page 1 {1: ['{30}']}
Populating pages table  [###################-----------------]   54%  00:00:07Inserting 49.pdf page 1 {1: ['{49}']}
Populating pages table  [###################-----------------]   55%  00:00:07Inserting 5.pdf page 1 {}
Populating pages table  [####################----------------]   56%  00:00:06Inserting 57.pdf page 1 {1: ['{57}']}
Populating pages table  [####################----------------]   57%  00:00:06Inserting 63.pdf page 1 {1: ['{63}']}
Populating pages table  [#####################---------------]   58%  00:00:06Inserting 6.pdf page 1 {1: ['{6}']}
Populating pages table  [#####################---------------]   59%  00:00:06Inserting 27.pdf page 1 {1: ['{27}']}
Populating pages table  [#####################---------------]   60%  00:00:06Inserting 86.pdf page 1 {1: ['{86}']}
Populating pages table  [######################--------------]   61%  00:00:06Inserting 9.pdf page 1 {1: ['{9}']}
Populating pages table  [######################--------------]   62%  00:00:05Inserting 92.pdf page 1 {1: ['{92}']}
Populating pages table  [#######################-------------]   64%  00:00:05Inserting 48.pdf page 1 {1: ['{48}']}
Populating pages table  [#######################-------------]   65%  00:00:05Inserting 56.pdf page 1 {1: ['{56}']}
Populating pages table  [#######################-------------]   66%  00:00:05Inserting 10.pdf page 1 {1: ['{10}']}
Populating pages table  [########################------------]   67%  00:00:05Inserting 99.pdf page 1 {1: ['{99}']}
Inserting 87.pdf page 1 {}
Populating pages table  [########################------------]   68%  00:00:05Inserting 79.pdf page 1 {1: ['{79}']}
Populating pages table  [#########################-----------]   70%  00:00:04Inserting 65.pdf page 1 {1: ['{65}']}
Inserting 83.pdf page 1 {}
Populating pages table  [#########################-----------]   71%  00:00:04Inserting 82.pdf page 1 {1: ['{82}']}
Populating pages table  [##########################----------]   72%  00:00:04Inserting 36.pdf page 1 {1: ['{36}']}
Populating pages table  [##########################----------]   73%  00:00:04Inserting 14.pdf page 1 {1: ['{14}']}
Populating pages table  [##########################----------]   74%  00:00:04Inserting 23.pdf page 1 {1: ['{23}']}
Populating pages table  [###########################---------]   75%  00:00:03Inserting 4.pdf page 1 {1: ['{4}']}
Populating pages table  [###########################---------]   77%  00:00:03Inserting 43.pdf page 1 {1: ['{43}']}
Populating pages table  [############################--------]   78%  00:00:03Inserting 45.pdf page 1 {1: ['{45}']}
Populating pages table  [############################--------]   79%  00:00:03Inserting 40.pdf page 1 {1: ['{40}']}
Populating pages table  [############################--------]   80%  00:00:03Inserting 100.pdf page 1 {1: ['{100}']}
Populating pages table  [#############################-------]   81%  00:00:03Inserting 58.pdf page 1 {1: ['{58}']}
Populating pages table  [#############################-------]   82%  00:00:02Inserting 70.pdf page 1 {1: ['{70}']}
Inserting 11.pdf page 1 {}
Populating pages table  [##############################------]   83%  00:00:02Inserting 19.pdf page 1 {}
Populating pages table  [##############################------]   84%  00:00:02Inserting 44.pdf page 1 {1: ['{44}']}
Inserting 18.pdf page 1 {}
Populating pages table  [##############################------]   85%  00:00:02Inserting 39.pdf page 1 {1: ['{39}']}
Populating pages table  [###############################-----]   87%  00:00:02Inserting 2.pdf page 1 {1: ['{2}']}
Inserting 31.pdf page 1 {}
Populating pages table  [###############################-----]   88%  00:00:01Inserting 95.pdf page 1 {1: ['{95}']}
Populating pages table  [################################----]   89%  00:00:01Inserting 94.pdf page 1 {1: ['{94}']}
Inserting 88.pdf page 1 {}
Populating pages table  [################################----]   91%  00:00:01Inserting 34.pdf page 1 {1: ['{34}']}
Populating pages table  [#################################---]   92%  00:00:01Inserting 46.pdf page 1 {1: ['{46}']}
Populating pages table  [#################################---]   93%  00:00:01Inserting 91.pdf page 1 {1: ['{91}']}
Populating pages table  [#################################---]   94%  00:00:00Inserting 81.pdf page 1 {1: ['{81}']}
Populating pages table  [##################################--]   95%  00:00:00Inserting 64.pdf page 1 {1: ['{64}']}
Populating pages table  [##################################--]   96%  00:00:00Inserting 50.pdf page 1 {1: ['{50}']}
Populating pages table  [###################################-]   97%  00:00:00Inserting 96.pdf page 1 {1: ['{96}']}
Populating pages table  [###################################-]   98%  00:00:00Inserting 16.pdf page 1 {1: ['{16}']}
Populating pages table  [####################################]  100%          Inserting 15.pdf page 1 {1: ['($1)']}

@simonw
Owner Author

simonw commented Aug 7, 2022

There are 102 output lines in that block.

I figured out which were missing from the pages table with this query:

with expected as (select '20.pdf' as p union
select '32.pdf' as p union
select '68.pdf' as p union
select '21.pdf' as p union
select '53.pdf' as p union
select '38.pdf' as p union
select '24.pdf' as p union
select '41.pdf' as p union
select '75.pdf' as p union
select '101.pdf' as p union
select '69.pdf' as p union
select '78.pdf' as p union
select '61.pdf' as p union
select '55.pdf' as p union
select '84.pdf' as p union
select '67.pdf' as p union
select '62.pdf' as p union
select '22.pdf' as p union
select '3.pdf' as p union
select '12.pdf' as p union
select '60.pdf' as p union
select '13.pdf' as p union
select '77.pdf' as p union
select '35.pdf' as p union
select '102.pdf' as p union
select '74.pdf' as p union
select '89.pdf' as p union
select '90.pdf' as p union
select '80.pdf' as p union
select '28.pdf' as p union
select '25.pdf' as p union
select '8.pdf' as p union
select '97.pdf' as p union
select '54.pdf' as p union
select '98.pdf' as p union
select '66.pdf' as p union
select '71.pdf' as p union
select '52.pdf' as p union
select '76.pdf' as p union
select '85.pdf' as p union
select '29.pdf' as p union
select '59.pdf' as p union
select '93.pdf' as p union
select '42.pdf' as p union
select '47.pdf' as p union
select '72.pdf' as p union
select '37.pdf' as p union
select '51.pdf' as p union
select '17.pdf' as p union
select '73.pdf' as p union
select '26.pdf' as p union
select '7.pdf' as p union
select '33.pdf' as p union
select '1.pdf' as p union
select '30.pdf' as p union
select '49.pdf' as p union
select '5.pdf' as p union
select '57.pdf' as p union
select '63.pdf' as p union
select '6.pdf' as p union
select '27.pdf' as p union
select '86.pdf' as p union
select '9.pdf' as p union
select '92.pdf' as p union
select '48.pdf' as p union
select '56.pdf' as p union
select '10.pdf' as p union
select '99.pdf' as p union
select '87.pdf' as p union
select '79.pdf' as p union
select '65.pdf' as p union
select '83.pdf' as p union
select '82.pdf' as p union
select '36.pdf' as p union
select '14.pdf' as p union
select '23.pdf' as p union
select '4.pdf' as p union
select '43.pdf' as p union
select '45.pdf' as p union
select '40.pdf' as p union
select '100.pdf' as p union
select '58.pdf' as p union
select '70.pdf' as p union
select '11.pdf' as p union
select '19.pdf' as p union
select '44.pdf' as p union
select '18.pdf' as p union
select '39.pdf' as p union
select '2.pdf' as p union
select '31.pdf' as p union
select '95.pdf' as p union
select '94.pdf' as p union
select '88.pdf' as p union
select '34.pdf' as p union
select '46.pdf' as p union
select '91.pdf' as p union
select '81.pdf' as p union
select '64.pdf' as p union
select '50.pdf' as p union
select '96.pdf' as p union
select '16.pdf' as p union
select '15.pdf' as p)
select p from expected where p not in (select path from pages)

It returned:

11.pdf
12.pdf
13.pdf
18.pdf
19.pdf
3.pdf
31.pdf
33.pdf
37.pdf
5.pdf
7.pdf
73.pdf
77.pdf
8.pdf
83.pdf
85.pdf
87.pdf
88.pdf

@simonw
Owner Author

simonw commented Aug 7, 2022

Investigating 11.pdf.

The 11.pdf.s3-ocr.json file contains this:

{"job_id": "c9c7d4a969fae600e69cb07459299541ecc22df888ca963931c58fc0bc27a8a8", "etag": "\"024040f99b320904bc6876cc45ccb3a9\""}

So the Job ID is c9c7d4a969fae600e69cb07459299541ecc22df888ca963931c58fc0bc27a8a8.

I can't see a textract-output/xxx folder with that job ID.

@simonw
Owner Author

simonw commented Aug 7, 2022

I tried re-submitting the files but it told me they were all done:

% s3-ocr start s3-ocr-many-pdfs --all       
Found 102 files with .s3-ocr.json out of 102 PDFs

@simonw
Owner Author

simonw commented Aug 7, 2022

What's weird here is that there are 102 textract-output/xxx folders, one for each job.

But it looks like the job IDs don't match up for some reason?

@simonw
Owner Author

simonw commented Aug 7, 2022

This gets a list of all job IDs in the bucket:

s3-credentials list-bucket s3-ocr-many-pdfs --prefix textract-output/ | jq -r '.[] | .Key' | grep -v 's3_access_check'

It returns 102.

So it looks like there are job IDs in the .s3-ocr.json files that do not have a corresponding textract-output/xxx result.

This suggests that they got submitted to Textract but for some reason it wrote the results out using a different job ID? No idea why that might happen.

Short-term fix could be to scan for job IDs in .s3-ocr.json that don't have corresponding output and offer to resubmit those files as brand new jobs, over-writing the old .s3-ocr.json job ID for them with a new one.
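
A minimal sketch of that scan, assuming results are stored under `textract-output/<job_id>/...` (the `1` file name used in the example is a guess). The function name `jobs_missing_output` and the sample job IDs are hypothetical; the inputs would come from the `.s3-ocr.json` files and a bucket listing.

```python
def jobs_missing_output(job_ids_by_key, output_keys, prefix="textract-output/"):
    """Return {pdf_key: job_id} for jobs with no object under prefix + job_id."""
    job_ids_with_output = set()
    for key in output_keys:
        # Ignore the access check marker file, as the index command does
        if key.startswith(prefix) and ".s3_access_check" not in key:
            rest = key[len(prefix):]
            job_ids_with_output.add(rest.split("/")[0])
    return {
        pdf_key: job_id
        for pdf_key, job_id in job_ids_by_key.items()
        if job_id not in job_ids_with_output
    }


# Hypothetical sample data: 11.pdf has a job ID but no output folder
jobs = {"11.pdf": "aaa", "20.pdf": "bbb"}
keys = ["textract-output/.s3_access_check", "textract-output/bbb/1"]
missing = jobs_missing_output(jobs, keys)
print(missing)  # {'11.pdf': 'aaa'}
```

Any file returned here would be a candidate for resubmission as a new job.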

@simonw
Owner Author

simonw commented Aug 7, 2022

Alternative query showing the pages that are missing:

select * from ocr_jobs where key not in (select path from pages)

This also returns 18 rows for me.

key job_id etag s3_ocr_etag
11.pdf c9c7d4a969fae600e69cb07459299541ecc22df888ca963931c58fc0bc27a8a8 "024040f99b320904bc6876cc45ccb3a9" "7fe0df92a7342dc0dda02903a734702b"
12.pdf 2a32bf2d1326b25e707b713fafcc4186d2aa5d165352baa6c5f00dba6e281aa8 "7c17af4fa6729b1810cdc6a3cb97f034" "51af12fc9fe0255047304e97406404a2"
13.pdf 2ab37969dc430228be54f50243fadd838eb1c784ca390ca150b87c33a860aa6d "bd12aa0cf8052ca924d8f752d44c8337" "6e51cfb84511849fa3717fb1335c597a"
18.pdf d307e59e0479b88ffb1a241d09c012a02e579af1776ea75373be7226cf966766 "6e8870cc70c1c2cd7504894a0a6c64e3" "ee16ef163e497b81f0219b3cfa4d8127"
19.pdf ca74c41ae17e97512b57468a9f7dec366650573095b2aacab6fa2ebaf388cc39 "9e7f2167a8a4f8a17e5ea9ca0602e2db" "fbb5675243729135a6ef9be3a7e05b43"
3.pdf 2a2a7588512c19f6eb17ee8260a12c13ef6188284e9247a2f5bd1681677d4359 "8eacff9711e61c0e9e8342309784fd78" "ffe3bdd68cd694e22fdf613d5c4af8d9"
31.pdf df2f0b0cfb2a265be67c45f88186602d76b71dea3bb61d510155c61a915b58d6 "02e7f0219f775a4f6a513bad52670440" "6de16e9999181c95f707b8d53055dc52"
33.pdf 80aa8a896c5aa01a33982046dddb4a75b9658ec85fdca3ea417922e26b37e4a7 "9799039b040420a481be42c31de8dddf" "0dde6760f818ba04110376c74cf3e746"
37.pdf 75ed25985b90585dead686d030b3ec76a30138e0a6c8c9aec840339dc14af6a4 "a26406621da131859603247c028f27c6" "3734ba42d5533b5bc8530963aba1987b"
5.pdf 8f6ef4df68b9354efa58827ec70575a6993ecb318cd991ea5cd05442499fb923 "9f09c865fac13969124ba979c2ba4f17" "e21d2fdbc3cc5a1d37c93ef63e1fa957"
7.pdf 7be2dc6d68ba36378a39b3ed1b2e817e8be7705f4d3e2f49542d908f673a9757 "9ee4733cce4970b29f45d1173ba731b0" "80ea7758d092a0551e9656a93569aedb"
73.pdf 791988835b8f389bc173217000107585fdddcadbe07f95d0670954ae5caa6e86 "f3ed636c3a03c8f18560230a76d805f3" "42d80fa6143e72b44a040b6813cb9cc0"
77.pdf 2ba138f4fb1632718c687213e599cec67ce08e8a7e601b32a62e22696539bcce "197856a38d282a7a774b9bddfff4741f" "471ab5cb99b0a613b6f3c4e2edab17a9"
8.pdf 455471fcfb6825a72ead5fa3d200d777071579c793ccf76e2cd2487b064a2e35 "6a48dd04e6841c4a5fd679fc3da83aa9" "a0d8f3128fe03e736dbf9b30296ec3bc"
83.pdf ac5f5e6b19ed88cac3bd77808e784e3bb6bd8188a26b2cd71d82d7342fa1201b "e5456e5bde5c1fec5ac07ff24620a864" "a7785a74354b2e66f12388709f5ee11d"
85.pdf 6344316c010bdbcb7d745a004d60cd705e37390931bc99e5730832c96be58952 "75c7a92142714284531a84fd51951f67" "d2453b41adab436c9d7b9694f8c3e48d"
87.pdf a71d8d4aaec1893b090e7145d751aa0401b9e2e67d0b8afabfdb8ed156b8339e "46cc3664170935c562680cde97339520" "c6819976e2cb886c39c4b5c24496054e"
88.pdf e3770c4573bd54221045a8f572e8a4effa9e062364fbe647aa8d0c7eed2b22ee "50efb6b1dda0879add8f7c83919cf752" "0976a822e6d571c18702e105f0ce4e19"

@simonw
Owner Author

simonw commented Aug 7, 2022

This returns 0 rows, showing that they are the same:

select job_id from fetched_jobs except select job_id from ocr_jobs
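
A set difference in one direction only shows that nothing in the first table is absent from the second. A sketch of checking both directions with `except`, using an in-memory SQLite database and made-up rows (the `difference` helper and the sample job IDs are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table ocr_jobs (key text, job_id text);
create table fetched_jobs (job_id text);
insert into ocr_jobs values ('1.pdf', 'a'), ('2.pdf', 'b');
insert into fetched_jobs values ('a'), ('b');
""")


def difference(conn, left, right):
    """Job IDs present in table `left` but not in table `right`."""
    sql = f"select job_id from {left} except select job_id from {right}"
    return [row[0] for row in conn.execute(sql)]


only_fetched = difference(conn, "fetched_jobs", "ocr_jobs")
only_ocr = difference(conn, "ocr_jobs", "fetched_jobs")
print(only_fetched, only_ocr)  # [] []
```

Both lists being empty means the two tables hold the same set of job IDs.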

@simonw
Owner Author

simonw commented Aug 7, 2022

The missing pages were highlighted in the above output:

Populating pages table  [########################------------]   67%  00:00:05Inserting 99.pdf page 1 {1: ['{99}']}
Inserting 87.pdf page 1 {}

87.pdf is one of the missing ones: its pages dictionary was empty ({}), so the insert loop had nothing to write.
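
A rough sketch for pulling those file names out of the debug output automatically, assuming the `Inserting <path> page <n> <dict>` line format produced by the temporary `print()` call above. The function name `files_with_no_pages` is hypothetical.

```python
import re

# Matches the tail of a debug line, with or without a progress bar prefix
PATTERN = re.compile(r"Inserting (\S+) page \d+ (\{.*\})\s*$")


def files_with_no_pages(log_lines):
    """Return paths whose printed pages dictionary was empty."""
    empty = []
    for line in log_lines:
        match = PATTERN.search(line)
        if match and match.group(2) == "{}":
            empty.append(match.group(1))
    return empty


log = [
    "Populating pages table  [####]   67%  00:00:05Inserting 99.pdf page 1 {1: ['{99}']}",
    "Inserting 87.pdf page 1 {}",
]
empty_files = files_with_no_pages(log)
print(empty_files)  # ['87.pdf']
```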

@simonw
Owner Author

simonw commented Aug 7, 2022

I think I found the bug. 88.pdf is one of the missing files. Its job ID is e3770c4573bd54221045a8f572e8a4effa9e062364fbe647aa8d0c7eed2b22ee

Here's the content of that textract-output file:

{
    "Blocks": [
        {
            "BlockType": "PAGE",
            "ColumnIndex": null,
            "ColumnSpan": null,
            "Confidence": null,
            "DocumentType": null,
            "EntityTypes": null,
            "Geometry": {
                "BoundingBox": {
                    "Height": 1,
                    "Left": 0,
                    "Top": 0,
                    "Width": 1
                },
                "Polygon": [
                    {
                        "X": 1,
                        "Y": 0.0005257856100797653
                    },
                    {
                        "X": 0.999599814414978,
                        "Y": 1
                    },
                    {
                        "X": 0,
                        "Y": 0.9994742274284363
                    },
                    {
                        "X": 0.000400237477151677,
                        "Y": 0
                    }
                ]
            },
            "Hint": null,
            "Id": "27a217ad-f736-4b37-866e-fccbc506e3bb",
            "Page": 1,
            "PageClassification": null,
            "Query": null,
            "Relationships": null,
            "RowIndex": null,
            "RowSpan": null,
            "SelectionStatus": null,
            "Text": null,
            "TextType": null
        }
    ],
    "DetectDocumentTextModelVersion": "1.0",
    "DocumentMetadata": {
        "Pages": 1
    },
    "JobStatus": "SUCCEEDED",
    "NextToken": null,
    "StatusMessage": null,
    "Warnings": null
}

That doesn't have any blocks of a type other than PAGE!

Compare with one that worked:

{
    "Blocks": [
        {
            "BlockType": "PAGE",
            "ColumnIndex": null,
            "ColumnSpan": null,
            "Confidence": null,
            "DocumentType": null,
            "EntityTypes": null,
            "Geometry": {
                "BoundingBox": {
                    "Height": 1,
                    "Left": 0,
                    "Top": 0,
                    "Width": 1
                },
                "Polygon": [
                    {
                        "X": 0.000400237477151677,
                        "Y": 0
                    },
                    {
                        "X": 1,
                        "Y": 0.0005257856100797653
                    },
                    {
                        "X": 0.9995997548103333,
                        "Y": 1
                    },
                    {
                        "X": 0,
                        "Y": 0.9994742274284363
                    }
                ]
            },
            "Hint": null,
            "Id": "845f5e22-e855-4056-865d-a10e1e5af905",
            "Page": 1,
            "PageClassification": null,
            "Query": null,
            "Relationships": [
                {
                    "Ids": [
                        "5ef460d9-91c1-4336-a7b6-debe368c074d"
                    ],
                    "Type": "CHILD"
                }
            ],
            "RowIndex": null,
            "RowSpan": null,
            "SelectionStatus": null,
            "Text": null,
            "TextType": null
        },
        {
            "BlockType": "LINE",
            "ColumnIndex": null,
            "ColumnSpan": null,
            "Confidence": 99.12808227539062,
            "DocumentType": null,
            "EntityTypes": null,
            "Geometry": {
                "BoundingBox": {
                    "Height": 0.030083946883678436,
                    "Left": 0.011475927196443081,
                    "Top": 0.012375177815556526,
                    "Width": 0.06685517728328705
                },
                "Polygon": [
                    {
                        "X": 0.012199141085147858,
                        "Y": 0.012375177815556526
                    },
                    {
                        "X": 0.0783311054110527,
                        "Y": 0.013356308452785015
                    },
                    {
                        "X": 0.0776078924536705,
                        "Y": 0.04245912656188011
                    },
                    {
                        "X": 0.011475927196443081,
                        "Y": 0.0414779931306839
                    }
                ]
            },
            "Hint": null,
            "Id": "5ef460d9-91c1-4336-a7b6-debe368c074d",
            "Page": 1,
            "PageClassification": null,
            "Query": null,
            "Relationships": [
                {
                    "Ids": [
                        "15503f5e-463d-4cbe-9994-bf5701ae6527"
                    ],
                    "Type": "CHILD"
                }
            ],
            "RowIndex": null,
            "RowSpan": null,
            "SelectionStatus": null,
            "Text": "{89}",
            "TextType": null
        },
        {
            "BlockType": "WORD",
            "ColumnIndex": null,
            "ColumnSpan": null,
            "Confidence": 99.12808227539062,
            "DocumentType": null,
            "EntityTypes": null,
            "Geometry": {
                "BoundingBox": {
                    "Height": 0.030083946883678436,
                    "Left": 0.011475927196443081,
                    "Top": 0.012375177815556526,
                    "Width": 0.06685517728328705
                },
                "Polygon": [
                    {
                        "X": 0.012199141085147858,
                        "Y": 0.012375177815556526
                    },
                    {
                        "X": 0.0783311054110527,
                        "Y": 0.013356308452785015
                    },
                    {
                        "X": 0.0776078924536705,
                        "Y": 0.04245912656188011
                    },
                    {
                        "X": 0.011475927196443081,
                        "Y": 0.0414779931306839
                    }
                ]
            },
            "Hint": null,
            "Id": "15503f5e-463d-4cbe-9994-bf5701ae6527",
            "Page": 1,
            "PageClassification": null,
            "Query": null,
            "Relationships": null,
            "RowIndex": null,
            "RowSpan": null,
            "SelectionStatus": null,
            "Text": "{89}",
            "TextType": "PRINTED"
        }
    ],
    "DetectDocumentTextModelVersion": "1.0",
    "DocumentMetadata": {
        "Pages": 1
    },
    "JobStatus": "SUCCEEDED",
    "NextToken": null,
    "StatusMessage": null,
    "Warnings": null
}

The file looks like this:

image

My best guess is that this particular shape confused the OCR - maybe it looked like something that wasn't a word - and so it didn't get correctly detected.

Then there's a bug in this code which writes nothing to the pages table if the OCR output didn't include any LINE blocks:

s3-ocr/s3_ocr/cli.py

Lines 496 to 509 in ba47d9a

blocks = json.loads(
    s3.get_object(Bucket=bucket, Key=item["Key"])["Body"].read()
)["Blocks"]
# Just extract the line blocks
pages = {}
for block in blocks:
    if block["BlockType"] == "LINE":
        page = block["Page"]
        if page not in pages:
            pages[page] = []
        pages[page].append(block["Text"])
# And insert those into the database
for page_number, lines in pages.items():
    db["pages"].insert(
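
One possible way to make blank pages show up, sketched here as an assumption rather than the change that was actually made: seed the `pages` dictionary from the `PAGE` blocks, so a page with no `LINE` blocks still gets an entry with an empty list of lines. The helper name `extract_pages` and the trimmed sample blocks are hypothetical.

```python
def extract_pages(blocks):
    """Group LINE text by page number, keeping pages that have no LINE blocks."""
    pages = {}
    for block in blocks:
        page = block.get("Page")
        if page is None:
            continue
        if block["BlockType"] == "PAGE":
            # Seed every page so blank pages still produce a row
            pages.setdefault(page, [])
        elif block["BlockType"] == "LINE":
            pages.setdefault(page, []).append(block["Text"])
    return pages


# Trimmed-down versions of the two Textract outputs shown above
blank_only = [{"BlockType": "PAGE", "Page": 1, "Text": None}]
with_text = [
    {"BlockType": "PAGE", "Page": 1, "Text": None},
    {"BlockType": "LINE", "Page": 1, "Text": "{89}"},
    {"BlockType": "WORD", "Page": 1, "Text": "{89}"},
]
print(extract_pages(blank_only))  # {1: []}
print(extract_pages(with_text))   # {1: ['{89}']}
```

With this shape, the existing insert loop would iterate over the blank page too, and joining its empty list of lines would give an empty string for the text column.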

@simonw simonw changed the title s3-ocr index not catching every page - 84 out of 102 Pages that failed to scan end up missing entirely from the index - should have rows with blank text instead Aug 7, 2022
@simonw
Owner Author

simonw commented Aug 7, 2022

Thinking more about this: I think the way the tool works right now is actually correct.

The tool works in terms of pages. If a document fails to have any OCRd content, does it make sense to create a single page 1 row for that document with blank text?

I think a better fix is to introduce a documents database table, which can then be used to represent documents with 0 scanned pages.
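
A rough sketch of what that could look like, using an in-memory SQLite database. The schema is an assumption for illustration: only the `path` column of `pages` is confirmed by the queries above, and the `documents` columns and sample job IDs are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table documents (path text primary key, job_id text);
create table pages (
    path text references documents(path),
    page integer,
    text text,
    primary key (path, page)
);
-- 88.pdf produced no OCR text, so it has a document row but no page rows
insert into documents values ('88.pdf', 'job-88'), ('89.pdf', 'job-89');
insert into pages values ('89.pdf', 1, '{89}');
""")

rows = conn.execute("""
    select documents.path, count(pages.page) as num_pages
    from documents left join pages on pages.path = documents.path
    group by documents.path
    order by documents.path
""").fetchall()
print(rows)  # [('88.pdf', 0), ('89.pdf', 1)]
```

The left join keeps documents with zero scanned pages visible, without inventing a blank page 1 row for them.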

@simonw simonw added the wontfix This will not be worked on label Aug 7, 2022
@simonw simonw closed this as completed Aug 7, 2022
@simonw
Owner Author

simonw commented Aug 9, 2022

I changed my mind, for reasons described here:

@simonw simonw reopened this Aug 9, 2022
@simonw simonw removed the wontfix This will not be worked on label Aug 9, 2022
@simonw
Owner Author

simonw commented Aug 9, 2022

Example document showing that problem: https://sfms-history.vercel.app/docs/8e0175dd

Which in S3 is: INTAKE/Bancroft Library - May 2, 2022/cubanc00009905_ae_a.pdf

The .s3-ocr.json file is:

{"job_id": "86648b8b76c92c373ae82d7a9bd6e262d3c4f8cee8eb100bf9023d8ca6669001", "etag": "\"9316a3254c1b8cb12d8fb6cc251e46cf\""}

So job ID is 86648b8b76c92c373ae82d7a9bd6e262d3c4f8cee8eb100bf9023d8ca6669001.

I looked up its output in textract-output/ - that file starts with this, representing the blank page and then the second page with content on it:

{
    "Blocks": [
        {
            "BlockType": "PAGE",
            "ColumnIndex": null,
            "ColumnSpan": null,
            "Confidence": null,
            "DocumentType": null,
            "EntityTypes": null,
            "Geometry": {
                "BoundingBox": {
                    "Height": 1,
                    "Left": 0,
                    "Top": 0,
                    "Width": 1
                },
                "Polygon": [
                    {
                        "X": 1,
                        "Y": 0.0005188702489249408
                    },
                    {
                        "X": 0.9996047616004944,
                        "Y": 1
                    },
                    {
                        "X": 0,
                        "Y": 0.9994811415672302
                    },
                    {
                        "X": 0.0003951845574192703,
                        "Y": 0
                    }
                ]
            },
            "Hint": null,
            "Id": "3444df06-df9a-4c0a-93b0-132ae48a9680",
            "Page": 1,
            "Query": null,
            "Relationships": null,
            "RowIndex": null,
            "RowSpan": null,
            "SelectionStatus": null,
            "Text": null,
            "TextType": null
        },
        {
            "BlockType": "PAGE",
            "ColumnIndex": null,
            "ColumnSpan": null,
            "Confidence": null,
            "DocumentType": null,
            "EntityTypes": null,
            "Geometry": {
                "BoundingBox": {
                    "Height": 1,
                    "Left": 0,
                    "Top": 0,
                    "Width": 1
                },
                "Polygon": [
                    {
                        "X": 0.00039482428110204637,
                        "Y": 0
                    },
                    {
                        "X": 1,
                        "Y": 0.0003579388139769435
                    },
                    {
                        "X": 0.9996051788330078,
                        "Y": 1
                    },
                    {
                        "X": 0,
                        "Y": 0.999642014503479
                    }
                ]
            },
            "Hint": null,
            "Id": "72b8cafa-dfea-490d-bb7a-c62a6322156f",
            "Page": 2,
            "Query": null,
            "Relationships": [
                {
                    "Ids": [
                        "192777fe-155c-4485-8754-32b6c6de51a3",
                        "85cf50c3-31fe-4ab8-81b9-eba31d306fb3",
                        "c71c0c5e-8c0a-4962-9996-126b1bc6cda2",
                        "758e004d-1952-44d1-a1cd-682533531289",
                        "366f5dcd-bcbb-41d1-b008-f94eba11d2d9",
                        "97c6606e-a4d9-402a-a921-2d0f357d01f2",
                        "6ebbee1b-96eb-4c86-a5f8-8269a2da3eb7",
                        "a87c9dc2-2cb3-42f6-b7c3-9efedcb2b360",
                        "1ea431eb-96c9-422a-adab-b4468322274e"
                    ],
                    "Type": "CHILD"
                }
            ],
            "RowIndex": null,
            "RowSpan": null,
            "SelectionStatus": null,
            "Text": null,
            "TextType": null
        },
        {
            "BlockType": "LINE",
            "ColumnIndex": null,
            "ColumnSpan": null,
            "Confidence": 56.1424446105957,
            "DocumentType": null,
            "EntityTypes": null,
            "Geometry": {
                "BoundingBox": {
                    "Height": 0.008917947299778461,
                    "Left": 0.42938727140426636,
                    "Top": 0.7764229774475098,
                    "Width": 0.13498438894748688
                },
                "Polygon": [
                    {
                        "X": 0.4295380413532257,
                        "Y": 0.7764229774475098
                    },
                    {
                        "X": 0.564371645450592,
                        "Y": 0.7783024907112122
                    },
                    {
                        "X": 0.5642209053039551,
                        "Y": 0.7853409051895142
                    },
                    {
                        "X": 0.42938727140426636,
                        "Y": 0.7834613919258118
                    }
                ]
            },
            "Hint": null,
            "Id": "192777fe-155c-4485-8754-32b6c6de51a3",
            "Page": 2,
            "Query": null,
            "Relationships": [
                {
                    "Ids": [
                        "5b45ce15-b8c1-4c12-82c1-4ce37467e431",
                        "dc5c12ce-61c1-4fd6-8044-d8e7d3ee7c6b"
                    ],
                    "Type": "CHILD"
                }
            ],
            "RowIndex": null,
            "RowSpan": null,
            "SelectionStatus": null,
            "Text": "WAR DEPARTMENT,",
            "TextType": null
        },

So there is a PAGE block for every page, but the blank pages have "Relationships": null.
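That suggests one way to get a row for every page: seed the result from the PAGE blocks first, then attach any LINE text to them. The following is a minimal sketch of that idea and not necessarily what the fix does. `page_texts` and `sample_blocks` are names made up for this example, and the sample is a heavily trimmed version of the output above:

```python
def page_texts(blocks):
    """Map page number to text, with an entry for every PAGE block.

    Pages that have no LINE blocks end up with an empty string
    instead of being left out.
    """
    lines_by_page = {}
    # First pass: every PAGE block gets an entry, even if it is blank.
    for block in blocks:
        if block.get("BlockType") == "PAGE":
            lines_by_page.setdefault(block["Page"], [])
    # Second pass: collect the text of each LINE block under its page.
    for block in blocks:
        if block.get("BlockType") == "LINE":
            lines_by_page.setdefault(block["Page"], []).append(
                block.get("Text") or ""
            )
    return {
        page: "\n".join(lines) for page, lines in sorted(lines_by_page.items())
    }


# Trimmed sample: a blank page 1, then page 2 with a single line on it.
sample_blocks = [
    {"BlockType": "PAGE", "Page": 1, "Relationships": None, "Text": None},
    {"BlockType": "PAGE", "Page": 2, "Text": None},
    {"BlockType": "LINE", "Page": 2, "Text": "WAR DEPARTMENT,"},
]

print(page_texts(sample_blocks))
# {1: '', 2: 'WAR DEPARTMENT,'}
```

Grouping by the "Page" key rather than following the "Relationships" IDs keeps the sketch short. It assumes every LINE block carries the same "Page" number as the PAGE block it belongs to, which is the case in the output above.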

@simonw simonw closed this as completed in 76d4fdc Aug 10, 2022
simonw added a commit that referenced this issue Aug 10, 2022