-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a Ring to IndexGateway #5358
Add a Ring to IndexGateway #5358
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few nits
18071fa
to
7d5b156
Compare
Signed-off-by: Jordan Rushing <[email protected]>
- Implement missing methods for the IndexGateway to be used as a BasicLifecyclerDelegate. The methods are stored in a lifecycle file and follow the same approach of the Ruler ring - Make Gateway public and add missing parameters to the IndexGateway's initialization method
Signed-off-by: Jordan Rushing <[email protected]>
- Add memberlist as a dependency of the IndexGateway - Add IndexGateway support for the common configuration section - Modify Loki to register IndexGateway flags. This fixes the default flag values for the IndexGateway - Make IndexGateway configurations public. Without this, we can't use them.
- If IndexGateway is running with a ring, it uses the IdleService. Otherwise, it uses the BasicService - Implement IndexGateway support to handle ring HTTP pages
- Implement new IndexGatewayRing service - Add IndexGatewayRing as a Store dependency - Modify store to pass IndexGatewayRing as a parameter - Implement IndexGatewayClient ring mode
- Moves IndexGateway configuration to the IndexGatewayClient struct within the boltdb-shipper - Reuse the IndexGatewayClient everywhere - Implement IndexGateway gRPC pool
- Instead, add it to the store configuration and bubble it down to deeper modules.
Signed-off-by: JordanRushing <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
way to put a ring on it, look good.
@@ -486,6 +490,7 @@ func (t *Loki) setupModuleManager() error { | |||
mm.RegisterModule(Compactor, t.initCompactor) | |||
mm.RegisterModule(IndexGateway, t.initIndexGateway) | |||
mm.RegisterModule(QueryScheduler, t.initQueryScheduler) | |||
mm.RegisterModule(IndexGatewayRing, t.initIndexGatewayRing, modules.UserInvisibleModule) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the reason this needs to be a top level module (vs. having the logic to enable/disable the ring be inside the initIndexGateway
code) because it's a shared dependency between both the Store
and the IndexGateway
? Can those module be enabled/disabled independent of one another?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For context: there's an index gateway server and index gateway clients. The clients needs a read ring (provided by the IndexGatewayRing service), while the server needs the whole index gateway service. So, the server needs way more stuff than the other services (ex: the store) as they only need a ring, and sharing the whole server with the store was too much.
Now, to your question: yes, this had to be a top-level module, but not because the ring is shared between the gateway server and client, because although both use the ring, they use it differently (ex: the client only requires a read ring, the server requires a lifecycler+a normal ring). This had to be a top-level module to isolate a very lean read ring, used by the clients.
To your last question: Yes, they can be disabled/independent because you can run an index gateway ring without any clients connecting to it, and you can also have an index gateway client without any server registered in the ring and it will still run - but won't communicate with any index gateway.
return nil, err | ||
} | ||
|
||
t.Server.HTTP.Path("/indexgateway/ring").Methods("GET", "POST").Handler(gateway) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note, when this is merged, we'll have to update the enterprise gateway in GEL as well as the nginx configs in a few helm charts to include this new route.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some minor suggestions but otherwise the changes LGTM! I am wondering if we should have some tests in the gateway client in ring mode but we will have to mock most of the things so whether it would make a good test 🤔
Signed-off-by: JordanRushing <[email protected]>
- Add `replication_factor` flag to the IndexGateway config struct - Modify Index Gateway ring config struct to be inline with a new struct, since it doesn't expose a replication factor config - Modify dynamic config wrapper to reuse a common replication factor on the Index Gateway ring
- If we don't randomize access, we'll always access same Index Gateway instances in same order for the same tenant
- This is accomplished by using `NewIgnoreUnhealthyInstancesReplicationStrategy` - It is already done by the server
- Remove unnecessary/wrong comments - Only set replication factor at a single place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking on this PR from a high level perspective, it is a great piece of work to make Loki tenant x horizontal scalable. So a big thanks for this!
if err != nil { | ||
level.Error(util_log.Logger).Log("msg", fmt.Sprintf("failed to get client for instance %s", addr), "err", err) | ||
continue | ||
} | ||
|
||
client := (genericClient.(indexgatewaypb.IndexGatewayClient)) | ||
if err := s.clientDoQueries(ctx, gatewayQueries, queryKeyQueryMap, callback, client); err != nil { | ||
level.Error(util_log.Logger).Log("msg", fmt.Sprintf("client do queries failed for instance %s", addr), "err", err) | ||
continue | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion here (or in any other place more suitable), it would be nice to have some sort of reporting on failures based on tenant IDs. For example we report per-tenant rate limits or distributor append failures. The index gateway is another great piece where looking on errors I should be able to determine which tenant is impacted by which index gateway replica. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey that's a great idea. We already have plans to add metrics in a follow-up PR (and there are other things missing too), but I'll make sure to add this to our TODO list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just giving you a heads-up: I tried but looks like we can't determine which tenant ID was related to a failure here. This should actually be reported by the index gateway and the table manager. I'm sure we already have error logs for when something unexpected happens there, I'll check if we can add any new metrics there too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Signed-off-by: Jordan Rushing [email protected]
What this PR does / why we need it:
This PR introduces a ring to Loki's
IndexGateway
to provide better horizontal scalability via per-tenant hashing with changes to the associated Index files provided by per-tenant compaction introduced in #5026 which will reduce overall startup time for newIndexGateway
instances.IndexGateway
now includes a new configuration,RingMode
. When this mode is enabled eachIndexGateway
instance becomes responsible for a subset of tenant indices.In PRs to follow, we will update
TableManager
to reduce the amount of tenant data downloaded and add metrics.Which issue(s) this PR fixes:
N/A
Special notes for your reviewer:
N/A
Checklist
CHANGELOG.md
about the changes.