Nokogiri XML Namespaces and gzip decoding #13

Open
wants to merge 2 commits into
base: master
from

nathanstitt

This is needed for fixing a few issues that DocumentCloud has encountered while using calais for our entity extraction.

The first is that Calais sometimes returns gzipped content. When that happens, an exception is thrown because the content can't be decoded. This may have been an intermittent issue with the API, but our thinking was that it can't hurt to handle it. A further enhancement would be to request gzip encoding on the request, which would be more efficient.
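
As a rough illustration of that handling (a minimal sketch, not the code in this branch): the hypothetical helper `maybe_gunzip` checks for the two gzip magic bytes and only decompresses when they are present, so a plain XML reply passes through unchanged.

```ruby
require "zlib"
require "stringio"

# First two bytes of every gzip stream.
GZIP_MAGIC = "\x1f\x8b".b

# Hypothetical helper: returns the decompressed body when it looks
# gzipped, otherwise returns the body untouched.
def maybe_gunzip(body)
  bytes = body.b # binary copy, so the byte comparison ignores encoding
  return body unless bytes.start_with?(GZIP_MAGIC)

  Zlib::GzipReader.new(StringIO.new(bytes)).read
end

plain      = "<xml/>"
compressed = Zlib.gzip(plain)

decoded_plain      = maybe_gunzip(plain)
decoded_compressed = maybe_gunzip(compressed)
```

Sniffing the magic bytes rather than trusting a `Content-Encoding` header is a deliberate choice here, since the reply arrives gzipped even though it was not requested.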

The second is more pressing. Newer versions of Nokogiri differ in how they handle namespace prefixes. I believe issues #10 and #11 are attempting to fix the same bug. #11 indicates that the bug started with Nokogiri 1.5.6, but I haven't tracked down when the change occurred.

DocumentCloud has been running this branch in production for several months now without issue (https://github.com/documentcloud/documentcloud/blob/master/Gemfile#L5). We'd really like to get it merged and a new gem cut so we can remove the "git" references from our Gemfile.

Thanks for the excellent job you've done with the gem thus far. If I can help with any further testing or merging, please let me know.

Either newer versions of Nokogiri are stricter about parsing attribute
namespaces, or Calais has radically changed their schema.

Either way, just about all the parts of the code that read attribute
values were broken.

This is a first pass at correcting the issues and takes it to the point
that all the tests now pass successfully.

There are probably additional issues lurking that the tests aren't
covering. I'll fix those as DocumentCloud encounters them.

I don't think anything we're doing is triggering this.  I've verified that
the Accept-Encoding header isn't present, but Calais is still sending
gzip'ed xml as the reply.

Probably a misconfiguration on their end, but it is easy enough to handle.

A future enhancement might be to set the {"Accept-Encoding" => "gzip"}
header on the request, so that we get gzip data all the time.
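
A minimal sketch of that future enhancement, assuming the request is built with Ruby's standard `Net::HTTP` (the gem's actual HTTP code may differ). The endpoint URL and body are placeholders, not the real Calais API:

```ruby
require "net/http"
require "uri"

# Placeholder endpoint for illustration only.
uri = URI("https://api.example.com/enrich")

# Pass the header at construction time so it is part of the request
# from the start.
request = Net::HTTP::Post.new(uri, "Accept-Encoding" => "gzip")
request.body = "text to analyze"

accept_encoding = request["Accept-Encoding"]
```

As far as I know, when the header is set explicitly like this, `Net::HTTP` leaves decompression to the caller, so the response body would still need to be run through something like `Zlib::GzipReader`.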
@abhay
Owner

abhay commented Jun 25, 2014

@nathanstitt, I'm looking for someone to properly take ownership of this project since I don't have the cycles to do it myself. Any thoughts on DocumentCloud or yourself taking this on? I could see you guys actually running with it.

@nathanstitt
Author

@abhay I totally understand, stuff can get crazy and sometimes there just aren't enough days in the week.

We'd be very interested in taking over the project. I think it would fit very well with DocumentCloud's mission since we depend on it quite a bit for our entity support.

I'm not 100% sure how that would go down, but I'm assuming you could just transfer the repo to DocumentCloud's GitHub account and transfer the Ruby gem to us. Feel free to email me directly, or swing by the #documentcloud IRC channel if you'd like to discuss in real time.

@nathanstitt
Author

Hi Abhay,

Have you given any further thought to allowing DocumentCloud to take over support of the gem? We're still attempting to clean up the Gemfile. Please let us know if we can help further.

Thanks very much.
