Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeout topology watch and return better error messages for missing topology / namespaces #926

Merged
merged 8 commits into from
Sep 27, 2018

Conversation

richardartoul
Copy link
Contributor

@richardartoul richardartoul commented Sep 20, 2018

We weren't passing a timeout properly to the ServiceClient so:

watch, err := services.Watch(opts.ServiceID(), opts.QueryOptions())

would hang forever if there was no topology.

In addition, in the case where there is no topology or namespace the error message is cryptic (timeout) when it could be slightly more helpful

InstrumentOpts instrument.Options
HashingSeed uint32
HostID string
ResolutionTimeout time.Duration
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to be b/w compatible when making changes to publicly configurable fields.

  • Could you mark the old fields deprecated,
  • Add the new field,
  • Error if both new and old are set, and pick the set field otherwise

Rob had to do some of this in #890 -

// StorageMetricsType is the namespace type.
//
// Deprecated: Use "Type" field when specifying config instead, it is
// invalid to use both.
StorageMetricsType storage.MetricsType `yaml:"storageMetricsType"`
}
func (c ClusterStaticNamespaceConfiguration) metricsType() (storage.MetricsType, error) {
result := storage.DefaultMetricsType
if c.Type != storage.DefaultMetricsType && c.StorageMetricsType != storage.DefaultMetricsType {
// Don't allow both to not be default
return result, errBothNamespaceTypeNewAndDeprecatedFieldsSet
}
if c.Type != storage.DefaultMetricsType {
// New field value set
return c.Type, nil
}
if c.StorageMetricsType != storage.DefaultMetricsType {
// Deprecated field value set
return c.StorageMetricsType, nil
}
// Both are default
return result, nil
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spoke offline, very unlikely anyone ever set these fields are they are undocumented in are YMLs so gonna make the backward incompatible change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spoke offlien again, gonna remove timeouts altogether

@@ -164,7 +164,7 @@ func NewDatabase(
nsInit := opts.NamespaceInitializer()
nsReg, err := nsInit.Init()
if err != nil {
return nil, err
return nil, fmt.Errorf("database: namespace initialized failed: %v. make sure a namespace has been set")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalize the new sentence. Also, include err in the Errorf.

@robskillington
Copy link
Collaborator

Hey, do you want to make sure this supports the case if bringing up a node and using it's coordinator endpoint to create a placement?

Before we set an indefinite timeout here we used to basically crash the node before you could create a placement, which was a fairly big issue.

Does this change re-introduce that behavior? (Just wanting to get to the bottom of how to best cover both cases)

@prateek
Copy link
Collaborator

prateek commented Sep 24, 2018

@robskillington Yep, we settled on removing the timeout all together for that reason. As long as we put appropriate log messages before waiting in all the places, it should fail the same as it does today. Sound reasonable?

@richardartoul richardartoul force-pushed the ra/timeout-topo-and-better-errors branch from 89a9567 to 3090ae1 Compare September 27, 2018 15:29
@richardartoul richardartoul force-pushed the ra/timeout-topo-and-better-errors branch from 3090ae1 to f504970 Compare September 27, 2018 15:29
@codecov
Copy link

codecov bot commented Sep 27, 2018

Codecov Report

Merging #926 into master will decrease coverage by 0.04%.
The diff coverage is 93.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #926      +/-   ##
==========================================
- Coverage   77.96%   77.91%   -0.05%     
==========================================
  Files         410      410              
  Lines       34399    34358      -41     
==========================================
- Hits        26818    26770      -48     
- Misses       5734     5741       +7     
  Partials     1847     1847
Flag Coverage Δ
#dbnode 81.45% <93.33%> (-0.05%) ⬇️
#m3ninx 75.25% <ø> (-0.08%) ⬇️
#query 64.43% <ø> (-0.04%) ⬇️
#x 80.55% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 835df4d...3482a5a. Read the comment docs.

@@ -185,19 +184,6 @@ func (t *dynamicTopology) MarkShardsAvailable(
return ps.MarkShardsAvailable(instanceID, shardIDs...)
}

func waitOnInit(w services.Watch, d time.Duration) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We finally removed this, noice!

InstrumentOpts instrument.Options
HashingSeed uint32
HostID string
NamespaceResolutionTimeout time.Duration
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't just delete these (b/w compatibility and all), need to mark deprecated instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spoke offline and agreed its ok in this case because it was undocumented config and we're breaking the functionaltiy anyways

Copy link
Collaborator

@prateek prateek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@richardartoul richardartoul merged commit 95b5f09 into master Sep 27, 2018
@prateek prateek deleted the ra/timeout-topo-and-better-errors branch September 29, 2018 18:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants