Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Periodically Failing Tests resulting in failing master build #260

Closed
tpendragon opened this issue Jul 19, 2019 · 7 comments
Closed

Periodically Failing Tests resulting in failing master build #260

tpendragon opened this issue Jul 19, 2019 · 7 comments
Assignees
Labels

Comments

@tpendragon
Copy link
Contributor

https://circleci.com/gh/samvera/questioning_authority/766

@tpendragon tpendragon added the bug label Jul 19, 2019
@botimer botimer self-assigned this Jul 30, 2019
@botimer
Copy link
Contributor

botimer commented Jul 30, 2019

I've done a pretty thorough investigation today. The webmock response takes effect at the Net::HTTP level by way of the RDF gem's graph loading. The Faraday stuff for FAST is not implicated here. The failures only happened on one occasion that I see, and they all appear to be from mismatched expectations vs. mocked responses, not real HTTP requests leaking out.

Specifically, multiple were from getting a Qa::ServiceError (500) instead of Qa::ServiceUnavailable (503), but other mocked responses hit other examples, too.

My only suspicion at this point is that there was an unusual timing/parallel issue. I've run the suite with the random seed of the failed run (44328), and it passes. There is a bit of code in the RDF gem that uses Net::HTTP::start in an until loop, and then there is some class-level interception in webmock. This is likely a narrow isolation issue well upstream.

Since this has only failed once and we have confirmed that the mocks are at least being applied consistently, I'll close this for now, pending recurrent problems.

@botimer botimer closed this as completed Jul 30, 2019
@botimer
Copy link
Contributor

botimer commented Aug 1, 2019

This is failing more often now. There must be some sticky/leaking state. Looking into it...

@jrgriffiniii
Copy link
Contributor

Exploring these failures on CircleCI, I can replicate these over SSH and use that environment for debugging. From what I’m seeing on CI, https://github.com/samvera/questioning_authority/blob/master/app/services/qa/linked_data/graph_service.rb#L93 parses the error and looks for the HTTP status code when RDF::Graph.load(url) fails to load from a URL in https://github.com/samvera/questioning_authority/blob/master/app/services/qa/linked_data/graph_service.rb#L11
https://github.com/samvera/questioning_authority/blob/master/app/services/qa/linked_data/graph_service.rb#L104 is what generates the Internal Server Error - Fetch term... which fail the expectations

Further, https://github.com/ruby-rdf/rdf/blob/master/lib/rdf/reader.rb#L199 returns nil, and the failures seem to start on https://github.com/ruby-rdf/rdf/blob/master/lib/rdf/reader.rb#L209. I was able to trigger https://github.com/ruby-rdf/rdf/blob/master/lib/rdf/reader.rb#L234 to be raised.

@jrgriffiniii
Copy link
Contributor

elrayle further diagnosed this issue on CircleCI, finding that there was a disparity in error messages which are parsed for the HTTP status code:

• local msg=http://experimental.worldcat.org/fast/search?maximumRecords=3&query=cql.any%20all%20%22cornell%22&sortKeys=usage: (404)
• CCI msg=http://experimental.worldcat.org/fast/search?maximumRecords=3&query=cql.any%20all%20%22cornell%22&sortKeys=usage: 404

Please note the missing parenthesis in the error message raised on CircleCI.

@botimer
Copy link
Contributor

botimer commented Aug 8, 2019

This was caused by an older version of the Coveralls gem being included in the bundle on CircleCI. How?

  1. QA depends on the coveralls gem without a version specified
  2. The cached bundle for CI builds used version 0.7.2, which depends on rest-client
  3. When rest-client is loaded, RDF::Util::File uses it, rather than Net::HTTP
  4. The IOError raised only includes a formatted message, which is parsed to re-specify the error for downstream use within QA
  5. The adapters format their error messages differently (without and with parentheses, respectively), so the environmental difference between an old bundle and a new one (coveralls 0.8.23) results in different QA errors being raised by way of a missed case condition

As far as the underlying issues:

  1. The rdf gem should return structured info it already has (subclass IOError to include url and status code attributes)
  2. The rdf gem uses implicit config that it does not log or reveal in a meaningful way (i.e., RDF::Graph.load relies on a class variable of RDF::Util::File that is lazily set based on which gems are loaded, so even the existence of the adapter, let alone its identity, is not obvious)
  3. Implicit dependency versions combined with caching conspire to create a CI environment that is sensitive to when it was initialized, in a way that is challenging to detect or reproduce locally (even using circleci local execute on a workstation)

@botimer botimer closed this as completed Aug 8, 2019
@botimer
Copy link
Contributor

botimer commented Aug 8, 2019

PR #264 is merged and works around the issue. We will follow up with ruby-rdf upstream and, pending changes, move to a more robust solution (e.g., explicit status codes rather than parsing).

@elrayle
Copy link
Contributor

elrayle commented Aug 8, 2019

The only additional info I would add to this is that the occurrence of error message with parentheses and without was random causing random failures of the tests. The patch in PR #264 uses regex to extract the error code and allows for both formats of the error message. This is only slightly less fragile than before. The recommendation to move to a structured error message for RDF::Graph.load would allow the error code processing to be more robust.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants