Describe the bug
I am seeing errors like this when restarting a worker instance:
Error in rebalance: AssignedToNotFoundForShard
Investigating the root cause, it seems that the DynamoCheckpoint.ListActiveWorkers method returns this error when a shard has no assigned lease owner. The name of the error makes sense conceptually, but is this really an error? Why does the ListActiveWorkers method need to fail in this scenario? My (possibly naive) assumption would be that if a shard is unassigned, that does not affect the list of active workers. The method should return whatever set of active workers it finds, ignoring unassigned shards. For example:
// ListActiveWorkers returns a map of workers and their shards
func (checkpointer *DynamoCheckpoint) ListActiveWorkers(shardStatus map[string]*par.ShardStatus) (map[string][]*par.ShardStatus, error) {
	err := checkpointer.syncLeases(shardStatus)
	if err != nil {
		return nil, err
	}
	workers := map[string][]*par.ShardStatus{}
	for _, shard := range shardStatus {
		if shard.GetCheckpoint() == ShardEnd {
			continue
		}
		leaseOwner := shard.GetLeaseOwner()
		if leaseOwner == "" {
			// Original code
			// checkpointer.log.Debugf("Shard Not Assigned Error. ShardID: %s, WorkerID: %s", shard.ID, checkpointer.kclConfig.WorkerID)
			// return nil, ErrShardNotAssigned
			checkpointer.log.Debugf("ListActiveWorkers: Shard Not Assigned. ShardID: %s, WorkerID: %s", shard.ID, checkpointer.kclConfig.WorkerID)
			continue
		}
		if w, ok := workers[leaseOwner]; ok {
			workers[leaseOwner] = append(w, shard)
		} else {
			workers[leaseOwner] = []*par.ShardStatus{shard}
		}
	}
	return workers, nil
}
Reproduction steps
Start multiple workers.
Restart a worker.
Expected behavior
If the intention is to restrict rebalancing until all shards have leases, I don't think an error from rebalance is appropriate. If that is the case, this is not an error condition. At worst, I would argue this is a warning, but in my honest opinion, this would just be something like this in Worker.rebalance():
Additional context
This may be intended functionality, but it seems odd to me, so I figured I'd open a bug report to ask. If this is intended, I'd appreciate some explanation of the error in question. It happens regularly during restarts and has thus far seemed to be a red herring rather than a real error, so it gives me a little fright every time I check the logs.
Also, if this is not something you have observed and you think I may be doing something wrong, please let me know. Happy to fix my code and close this issue if need be 😄