Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User can't deposit JSON files unless they contain valid JSON #3075

Closed
honeybadger bot opened this issue May 30, 2023 · 13 comments · Fixed by #3083
Closed

User can't deposit JSON files unless they contain valid JSON #3075

honeybadger bot opened this issue May 30, 2023 · 13 comments · Fixed by #3083
Assignees

Comments

@honeybadger
Copy link

honeybadger bot commented May 30, 2023

Error message

RuntimeError: unexpected response: #<Faraday::Response:0x00007fc128e20ba8 @on_complete_callbacks=[], @env=#<Faraday::Env @method=:put @request_body=#<File:/data/h2-files/jy/fi/jyfi2xh06cszo21unpzmha8tis9m> @url=#<URI::HTTPS https://sdr-api-prod.stanford.edu/v1/disk/REDACTED--REDACTED> @request=#<Faraday::RequestOptions timeout=900, read_timeout=1800> @request_headers={"User-Agent"=>"Faraday v2.7.4", "Content-Type"=>"application/json", "Content-Length"=>"216381", "Authorization"=>"Bearer REDACTED"} @ssl=#<Faraday::SSLOptions verify=true> @response=#<Faraday::Response:0x00007fc128e20ba8 ...> @response_headers={"date"=>"Tue, 30 May 2023 06:00:12 GMT", "server"=>"Apache/2.4.41 (Ubuntu)", "cache-control"=>"no-cache", "x-request-id"=>"adc7687e-1ff2-4ce9-95b0-0b89e1d3032a", "x-runtime"=>"0.001560", "x-powered-by"=>"Phusion Passenger(R) 6.0.12", "strict-transport-security"=>"max-age=31536000; includeSubDomains", "status"=>"400 Bad Request", "connection"=>"close", "transfer-encoding"=>"chunked", "content-type"=>"application/json"} @status=400 @reason_phrase="Bad Request" @response_body="{\"id\":\"bad_request\",\"message\":\"Request body wasn't valid JSON.\"}">>

(short) Backtrace

line 62 of [PROJECT_ROOT]/app/jobs/deposit_job.rb: perform_upload
line 40 of [PROJECT_ROOT]/app/jobs/deposit_job.rb: update_dro_with_file_identifiers
line 18 of [PROJECT_ROOT]/app/jobs/deposit_job.rb: perform

(longer) Backtrace

[GEM_ROOT]/gems/sdr-client-2.4.0/lib/sdr_client/deposit/upload_files.rb:74 :in `upload_file`
[GEM_ROOT]/gems/sdr-client-2.4.0/lib/sdr_client/deposit/upload_files.rb:36 :in `block (2 levels) in run`
unknown:? :in `unknown`
[GEM_ROOT]/gems/sdr-client-2.4.0/lib/sdr_client/deposit/upload_files.rb:33 :in `block in run`
[GEM_ROOT]/gems/sdr-client-2.4.0/lib/sdr_client/deposit/upload_files.rb:32 :in `each`

[GEM_ROOT]/gems/sdr-client-2.4.0/lib/sdr_client/deposit/upload_files.rb:32 :in `map`
[GEM_ROOT]/gems/sdr-client-2.4.0/lib/sdr_client/deposit/upload_files.rb:32 :in `run`
[GEM_ROOT]/gems/sdr-client-2.4.0/lib/sdr_client/deposit/upload_files.rb:16 :in `upload`
[PROJECT_ROOT]/app/jobs/deposit_job.rb:62 :in `perform_upload`
[PROJECT_ROOT]/app/jobs/deposit_job.rb:40 :in `update_dro_with_file_identifiers`
[PROJECT_ROOT]/app/jobs/deposit_job.rb:18 :in `perform` 

View full backtrace and more info at honeybadger.io

upload fails here

content type from this response is what's used

which probably comes from DepositJob's original determination of file metadata here? https://github.com/sul-dlss/happy-heron/blob/398d3d6ee5da1836bc4321127a37e164bce4feba/app/jobs/deposit_job.rb#L72C1-L81

possible fixes:

  • sdr-api does not validate user deposited files with content type JSON, just because the file content happens to be in a POST body with a request type of application/json
  • H2 passes along a more generic content type
@jmartin-sul jmartin-sul changed the title [happy-heron/prod] RuntimeError: unexpected response: #<Faraday::Response:0x00007fc128e20ba8 @on_complete_callbacks=[], @env=#<Faraday::Env @method=:put @request_body=#<File:/data/h2-files/jy/fi/jyfi2xh06cszo21unpzmha8tis9m> @url=#<URI::HTTPS https://sdr-api-prod.stanford.edu/v1/disk/eyJfcmFpbHMiOnsibWVzc2FnZSI6IkJBaDdDam9JYTJWNVNTSWhkV2hoYW1sMGNtMW5jbWxyYkhKcmJEQjRZV1UzWVRCaFpUVmlZZ1k2QmtWVU9oRmpiMjUwWlc1MFgzUjVjR1ZKSWhWaGNIQnNhV05oZEdsdmJpOXFjMjl1QmpzR1ZEb1RZMjl1ZEdWdWRGOXNaVzVuZEdocEF6MU5Bem9OWTJobFky... User can't deposit JSON files unless they contain valid JSON May 30, 2023
@jmartin-sul
Copy link
Member

here's a branch where i tested a couple of changes suggested by @edsu in a slack discussion:
https://github.com/sul-dlss/sdr-api/compare/relax_put_disk_request_check

i deployed and tested both on stage friday evening, but neither worked, unfortunately. related HB alert for that branch and my test case to reproduce the issue: https://app.honeybadger.io/projects/77112/faults/96288372

i set sdr-api-stage back to main before signing off for the week, but when i did test those two commits, i still got errors back from sdr-api. they were slightly more generic, e.g.:

RuntimeError: unexpected response: #<Faraday::Response:0x00007f42bcc310a8 @on_complete_callbacks=[], @env=#<Faraday::Env @method=:put @request_body=#<File:/data/h2-files/oj/pp/ojppn63g3qny0mjwcddkwllo0ij1> @url=#<URI::HTTPS https://sdr-api-stage.stanford.edu/v1/disk/REDACTED=--REDACTED> @request=#<Faraday::RequestOptions timeout=900, read_timeout=1800> @request_headers={"User-Agent"=>"Faraday v2.7.4", "Content-Type"=>"application/json", "Content-Length"=>"29", "Authorization"=>"Bearer REDACTED"} @ssl=#<Faraday::SSLOptions verify=true> @response=#<Faraday::Response:0x00007f42bcc310a8 ...> @response_headers={"date"=>"Sat, 27 May 2023 01:03:31 GMT", "server"=>"Apache/2.4.41 (Ubuntu)", "x-request-id"=>"fec1171b-b2d3-4475-8456-c7dad007ce19", "x-runtime"=>"0.003580", "x-powered-by"=>"Phusion Passenger(R) 6.0.13", "strict-transport-security"=>"max-age=31536000; includeSubDomains", "content-length"=>"0", "status"=>"400 Bad Request", "connection"=>"close", "content-type"=>"text/html; charset=UTF-8"} @status=400 @reason_phrase="Bad Request" @response_body="">>

jmartin-sul referenced this issue in sul-dlss/sdr-api May 30, 2023
…penapi that's rejecting user uploads of .json files with invalid json content (which gets sent in POST body, so checked by openapi according to content type
@jmartin-sul
Copy link
Member

jmartin-sul commented May 30, 2023

my test case on stage: https://sdr-stage.stanford.edu/works/19265

HB alert for stage, very much like the prod one for which this was filed: https://app.honeybadger.io/projects/77112/faults/96288372/

zip file that should reproduce the issue (deposit to a collection in H2, and choose the Upload a single zip file (less than 10GB) which will be unzipped, including any file hierarchy option after selecting content type):
some_json_examples.zip

@jmartin-sul jmartin-sul self-assigned this May 30, 2023
@jmartin-sul
Copy link
Member

@andrewjbtw confirms that we don't want to accidentally validate as we're doing now: "yes, I think we should not validate that a .json file is valid JSON unless we’ve specifically planned to do that and have a process for returning files to the user, which we have not developed."

@peetucket
Copy link
Member

peetucket commented Jun 5, 2023

@jmartin-sul and I tested a couple theories here: #3078 and sul-dlss/sdr-client#283

Basically force H2 to ignore the actual content type of the files when posting to SDR-API (via sdr-client) and just say everything is application/octet-stream. This allowed the invalid JSON file to be sent without any errors, but since the content-type sent in this API call is then used in cocina, it carries through to the content structure and PURL page. See https://argo-stage.stanford.edu/view/yc662xq8687 and https://sul-purl-stage.stanford.edu/yc662xq8687

So this type of change would drop all content-types stored and shown to users and does not seem like the right approach.

So possible approaches would be:

  1. Use this approach and re-compute content types before creating cocina and after files are uploaded (seems like a bad idea)
  2. Create special cases in H2/sdr-client that looks for certain content types that are likely to be validated and can possibly be invalid (like application/json) and then switch these to something like application/text. This creates special case code and also would force the content type to be application/text all the way through to the user display. Seems sub-optimal.
  3. Continue on with the attempt to prevent inadvertant validation of an uploaded file by sdr-api. Could be tricky since we seem to use just standard ActiveStorage::DirectUploadsController (see https://github.com/sul-dlss/sdr-api/blob/main/app/controllers/direct_uploads_controller.rb) This is the best approach since it is the actual behavior we want (i.e. user can upload a file that is identified as application/json and remains identified as application/json even if it has some internal structural errors in it).

@jmartin-sul
Copy link
Member

jmartin-sul commented Jun 5, 2023

rails docs about parameter wrapping (the thing i tried to disable in the second commit of my sdr-api test branch from week before last): https://guides.rubyonrails.org/action_controller_overview.html#json-parameters

the other thing i tried on that branch was disabling request validation by the committee gem for file uploads (see first commit). that still ran into a very similar (but slightly differently messaged) error as the original glitch.

@peetucket
Copy link
Member

Basically I learned that if you post the file with application/octet-stream instead of it's actual content type (as shown here: main...do-not-post-content-types), it works just fine. Since we know the actual correct content type before we post the file and in fact seem to generate and post the cocina structural with the actual mime-types, i was hoping to just prevent cocina structural's mime type from being reset to application/octet-stream after the file is posted, but cannot figure out why it is getting reset, like this: https://argo-stage.stanford.edu/view/druid:jx898ys9576)

The better solution is to understand how activestorage is used by sdr-api and have it not care about a content type validating the actual content of the file (like this: https://github.com/sul-dlss/sdr-api/pull/585/files) but I cannot find any details on how that works

@jmartin-sul
Copy link
Member

one more data point: @andrewjbtw and @peetucket both discovered in testing that if a user uploads a .json file with invalid JSON contents to H2, without using the expand-a-zip file deposit functionality, the invalid .json file will be rejected in the deposit UI, disallowing deposit or draft saving. since this doesn't involve sdr-api at all, that seems like circumstantial evidence that the issue is committee request validation and/or rails controller machinery and/or activestorage machinery.

if the invalid JSON content is put in a file with a non-.json extension, no error.

@edsu
Copy link
Contributor

edsu commented Jun 8, 2023

@amyehodge since solving this is turning into a bit of a thorny issue, if the depositor is eager to move there deposit through the system I believe they could change the file extension on their JSON files to .jsonl or .ndjson which are two popular file extensions used for line oriented JSON which is what they are attempting to add?

The two files are:

  • SamplingPaper-1/data/app/10_2022/validation_10_2022.json
  • SamplingPaper-1/data/app/05_2022/bquxjob_4eb29efb_180f7d82eb2.json

Clearly this would not resolve this issue, but I just thought I'd share it as a possible strategy for the user since they have already been waiting a while.

edsu added a commit to sul-dlss/sdr-api that referenced this issue Jun 9, 2023
This additional test ensures that the file uploading API works.

Refs sul-dlss/happy-heron#3075
edsu added a commit to sul-dlss/sdr-api that referenced this issue Jun 9, 2023
This additional test ensures that the file uploading API works.

Refs sul-dlss/happy-heron#3075
@edsu
Copy link
Contributor

edsu commented Jun 9, 2023

@peetucket hopefully it's ok if I gave this to you since you have the most viable solution in a PR at the moment?

@jmartin-sul
Copy link
Member

reproduced bug on stage one last time with the old unfixed main branches that were deployed for this week's dep updates, then deployed the fixed main branches of H2 and sdr-api to stage, and confirmed the fix worked 🎉

my test deposits:

running infra integration tests against stage now...

@jmartin-sul
Copy link
Member

the original DepositJob that ran into this in prod H2 has now hit the dead jobs queue in sidekiq, but we can manually retry it from there once the fix is deployed to prod.

@jmartin-sul
Copy link
Member

infrastructure-integration-test suite passed against stage with the fixed H2 and sdr-api deployed (seems like no other SDR applications had been re-deployed since dep updates went out on monday).

@jmartin-sul
Copy link
Member

the fix for this issue went out with dependency updates this morning, and the user deposit on prod that originally got us to look into this behavior has now made it through: https://purl.stanford.edu/bk700vd4496

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
3 participants