invalid utf-8 characters appearing during characterization process #314

Closed
rococodogs opened this issue Oct 29, 2019 · 2 comments · Fixed by #318
@rococodogs
Member

this was a particularly nasty one i just worked through. a pdf sent to the characterization process returned:

{
  # ...
  :file_author => ["Microsoft\xAE Word 2010"],
  # ...
}

and trying to run file_set.characterization_proxy.save! after updating the metadata threw an ArgumentError: invalid byte sequence in UTF-8.

running this strips out the invalid bytes:

"Microsoft\xAE Word 2010".encode('UTF-8', invalid: :replace, replace: '')
=> "Microsoft Word 2010"

so we might want to do something like that?
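A minimal sketch of what "something like that" could look like, applied to the whole hash of characterization results before it is saved. `strip_invalid_utf8` is a hypothetical helper, not part of hydra-works, and the sample hash only mimics the shape shown above. It assumes the strings are already tagged as UTF-8, in which case `String#scrub('')` (Ruby 2.1+) should behave the same as the `encode` call.

```ruby
# Hypothetical helper (not part of hydra-works): drop invalid UTF-8 bytes
# from every string found in a characterization result.
def strip_invalid_utf8(value)
  case value
  when String
    # Same call as in the comment above; returns a new, valid string.
    value.encode('UTF-8', invalid: :replace, replace: '')
  when Array
    value.map { |v| strip_invalid_utf8(v) }
  when Hash
    value.transform_values { |v| strip_invalid_utf8(v) }
  else
    # Non-string values (numbers, nil, ...) pass through untouched.
    value
  end
end

# Sample data shaped like the result above; :page_count is illustrative only.
results = { file_author: ["Microsoft\xAE Word 2010"], page_count: [3] }
clean = strip_invalid_utf8(results)
```

Cleaning the values before they reach the proxy would avoid having to know in advance which field carries the bad bytes.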

relevant source code:

https://github.com/samvera/hydra-works/blob/master/lib/hydra/works/services/characterization_service.rb#L105-L111

@rococodogs
Member Author

essentially i'm just going through the sidekiq retries and getting the file_set ids and then running:

fs = FileSet.find('abc123def')
path = Hyrax::WorkingDirectory.find_or_retrieve(fs.characterization_proxy.id, fs.id)
service = Spot::RemoteCharacterizationService.new(fs.characterization_proxy, path, {})
service.characterize

# note which field has invalid values, it'll either be `:file_title` or `:file_author`
service.object.creator = service.object.creator.map { |v| v.encode('UTF-8', invalid: :replace, replace: '') }
service.object.save!

# then create derivatives
CreateDerivativesJob.perform_now(fs, fs.files.first.id, Hyrax::WorkingDirectory.find_or_retrieve(fs.files.first.id, fs.id))

@rococodogs
Member Author

also of note: a FitsServlet characterization tool already exists and i just never knew? might be worth:

  • removing our Spot::RemoteCharacterizationService in favor of the "officially supported" one
  • after characterizing (as above), iterating through the known problem fields (file.creator or file.file_title) and re-encoding those
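
The second bullet could be sketched roughly as follows. `clean_problem_fields`, `PROBLEM_FIELDS`, and `FakeFile` are hypothetical names used for illustration; `FakeFile` stands in for `fs.characterization_proxy` so the snippet runs outside Hyrax, and the sketch assumes each field holds an array of UTF-8-tagged strings.

```ruby
# Field names taken from the bullet above; adjust to what the proxy exposes.
PROBLEM_FIELDS = %i[creator file_title].freeze

# Hypothetical helper: re-encode each known problem field on a characterized
# file object. `file` only needs a reader and writer for each field.
def clean_problem_fields(file, fields = PROBLEM_FIELDS)
  fields.each do |field|
    values = Array(file.public_send(field))
    cleaned = values.map { |v| v.encode('UTF-8', invalid: :replace, replace: '') }
    file.public_send("#{field}=", cleaned)
  end
  file
end

# Stand-in for the characterization proxy (assumption, not a Hyrax class).
FakeFile = Struct.new(:creator, :file_title)
file = clean_problem_fields(FakeFile.new(["Microsoft\xAE Word 2010"], ['Report']))
```

In a real run the `save!` call would follow once the fields are cleaned, as in the console session above.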
