Periodically Failing Tests resulting in failing master build #260

tpendragon · 2019-07-19T19:25:29Z

https://circleci.com/gh/samvera/questioning_authority/766

botimer · 2019-07-30T23:20:36Z

I've done a pretty thorough investigation today. The webmock response takes effect at the Net::HTTP level by way of the RDF gem's graph loading. The Faraday stuff for FAST is not implicated here. The failures only happened on one occasion that I see, and they all appear to be from mismatched expectations vs. mocked responses, not real HTTP requests leaking out.

Specifically, multiple were from getting a Qa::ServiceError (500) instead of Qa::ServiceUnavailable (503), but other mocked responses hit other examples, too.

My only suspicion at this point is that there was an unusual timing/parallel issue. I've run the suite with the random seed of the failed run (44328), and it passes. There is a bit of code in the RDF gem that uses Net::HTTP::start in an until loop, and then there is some class-level interception in webmock. This is likely a narrow isolation issue well upstream.

Since this has only failed once and we have confirmed that the mocks are at least being applied consistently, I'll close this for now, pending recurrent problems.

botimer · 2019-08-01T19:50:11Z

This is failing more often now. There must be some sticky/leaking state. Looking into it...

jrgriffiniii · 2019-08-08T15:40:56Z

Exploring these failures on CircleCI, I can replicate these over SSH and use that environment for debugging. From what I’m seeing on CI, https://github.com/samvera/questioning_authority/blob/master/app/services/qa/linked_data/graph_service.rb#L93 parses the error and looks for the HTTP status code when RDF::Graph.load(url) fails to load from a URL in https://github.com/samvera/questioning_authority/blob/master/app/services/qa/linked_data/graph_service.rb#L11
https://github.com/samvera/questioning_authority/blob/master/app/services/qa/linked_data/graph_service.rb#L104 is what generates the Internal Server Error - Fetch term... which fail the expectations

Further, https://github.com/ruby-rdf/rdf/blob/master/lib/rdf/reader.rb#L199 returns nil, and the failures seem to start on https://github.com/ruby-rdf/rdf/blob/master/lib/rdf/reader.rb#L209. I was able to trigger https://github.com/ruby-rdf/rdf/blob/master/lib/rdf/reader.rb#L234 to be raised.

jrgriffiniii · 2019-08-08T15:48:47Z

elrayle further diagnosed this issue on CircleCI, finding that there was a disparity in error messages which are parsed for the HTTP status code:

• local msg=http://experimental.worldcat.org/fast/search?maximumRecords=3&query=cql.any%20all%20%22cornell%22&sortKeys=usage: (404)
• CCI msg=http://experimental.worldcat.org/fast/search?maximumRecords=3&query=cql.any%20all%20%22cornell%22&sortKeys=usage: 404

Please note the missing parenthesis in the error message raised on CircleCI.

botimer · 2019-08-08T19:03:45Z

This was caused by an older version of the Coveralls gem being included in the bundle on CircleCI. How?

QA depends on the coveralls gem without a version specified
The cached bundle for CI builds used version 0.7.2, which depends on rest-client
When rest-client is loaded, RDF::Util::File uses it, rather than Net::HTTP
The IOError raised only includes a formatted message, which is parsed to re-specify the error for downstream use within QA
The adapters format their error messages differently (without and with parentheses, respectively), so the environmental difference between an old bundle and a new one (coveralls 0.8.23) results in different QA errors being raised by way of a missed case condition

As far as the underlying issues:

The rdf gem should return structured info it already has (subclass IOError to include url and status code attributes)
The rdf gem uses implicit config that it does not log or reveal in a meaningful way (i.e., RDF::Graph.load relies on a class variable of RDF::Util::File that is lazily set based on which gems are loaded, so even the existence of the adapter, let alone its identity, is not obvious)
Implicit dependency versions combined with caching conspire to create a CI environment that is sensitive to when it was initialized, in a way that is challenging to detect or reproduce locally (even using circleci local execute on a workstation)

botimer · 2019-08-08T19:05:43Z

PR #264 is merged and works around the issue. We will follow up with ruby-rdf upstream and, pending changes, move to a more robust solution (e.g., explicit status codes rather than parsing).

elrayle · 2019-08-08T19:16:05Z

The only additional info I would add to this is that the occurrence of error message with parentheses and without was random causing random failures of the tests. The patch in PR #264 uses regex to extract the error code and allows for both formats of the error message. This is only slightly less fragile than before. The recommendation to move to a structured error message for RDF::Graph.load would allow the error code processing to be more robust.

tpendragon added the bug label Jul 19, 2019

botimer self-assigned this Jul 30, 2019

botimer closed this as completed Jul 30, 2019

botimer reopened this Aug 1, 2019

jrgriffiniii mentioned this issue Aug 7, 2019

Investigate webmock dependencies samvera/maintenance#27

Open

3 tasks

botimer closed this as completed Aug 8, 2019

elrayle mentioned this issue Aug 8, 2019

allow error from RDF::Graph to be parsed with or without paren around code #264

Merged

jrgriffiniii mentioned this issue Aug 9, 2019

Refactor the RSpec test suites #267

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Periodically Failing Tests resulting in failing master build #260

Periodically Failing Tests resulting in failing master build #260

tpendragon commented Jul 19, 2019

botimer commented Jul 30, 2019

botimer commented Aug 1, 2019

jrgriffiniii commented Aug 8, 2019

jrgriffiniii commented Aug 8, 2019

botimer commented Aug 8, 2019

botimer commented Aug 8, 2019

elrayle commented Aug 8, 2019

Periodically Failing Tests resulting in failing master build #260

Periodically Failing Tests resulting in failing master build #260

Comments

tpendragon commented Jul 19, 2019

botimer commented Jul 30, 2019

botimer commented Aug 1, 2019

jrgriffiniii commented Aug 8, 2019

jrgriffiniii commented Aug 8, 2019

botimer commented Aug 8, 2019

botimer commented Aug 8, 2019

elrayle commented Aug 8, 2019