Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a Ring to IndexGateway #5358

Merged
merged 44 commits into from
Apr 13, 2022

Conversation

JordanRushing
Copy link
Contributor

@JordanRushing JordanRushing commented Feb 9, 2022

Signed-off-by: Jordan Rushing [email protected]

What this PR does / why we need it:

This PR introduces a ring to Loki's IndexGateway to provide better horizontal scalability via per-tenant hashing with changes to the associated Index files provided by per-tenant compaction introduced in #5026 which will reduce overall startup time for new IndexGateway instances.

IndexGateway now includes a new configuration, RingMode. When this mode is enabled each IndexGateway instance becomes responsible for a subset of tenant indices.

In PRs to follow, we will update TableManager to reduce the amount of tenant data downloaded and add metrics.

Which issue(s) this PR fixes:

N/A

Special notes for your reviewer:

N/A

Checklist

  • Documentation added
  • Tests updated
  • Add an entry in the CHANGELOG.md about the changes.

Copy link
Contributor

@DylanGuedes DylanGuedes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few nits

@pull-request-size pull-request-size bot added size/L and removed size/M labels Feb 10, 2022
@DylanGuedes DylanGuedes force-pushed the add-ring-to-index-gateway branch from 18071fa to 7d5b156 Compare February 24, 2022 20:00
JordanRushing and others added 21 commits March 1, 2022 15:00
- Implement missing methods for the IndexGateway to be used as a
  BasicLifecyclerDelegate. The methods are stored in a lifecycle file
  and follow the same approach of the Ruler ring
- Make Gateway public and add missing parameters to the IndexGateway's
  initialization method
Signed-off-by: Jordan Rushing <[email protected]>
- Add memberlist as a dependency of the IndexGateway
- Add IndexGateway support for the common configuration section
- Modify Loki to register IndexGateway flags. This fixes the default
  flag values for the IndexGateway
- Make IndexGateway configurations public. Without this, we can't use
  them.
- If IndexGateway is running with a ring, it uses the IdleService.
  Otherwise, it uses the BasicService
- Implement IndexGateway support to handle ring HTTP pages
- Implement new IndexGatewayRing service
- Add IndexGatewayRing as a Store dependency
- Modify store to pass IndexGatewayRing as a parameter
- Implement IndexGatewayClient ring mode
- Moves IndexGateway configuration to the IndexGatewayClient struct
  within the boltdb-shipper
- Reuse the IndexGatewayClient everywhere
- Implement IndexGateway gRPC pool
- Instead, add it to the store configuration and bubble it down to
  deeper modules.
@JordanRushing JordanRushing marked this pull request as ready for review March 30, 2022 18:39
@JordanRushing JordanRushing requested a review from a team as a code owner March 30, 2022 18:39
@JordanRushing JordanRushing changed the title Begin to add a Ring to IndexGateway Add a Ring to IndexGateway Mar 30, 2022
Copy link
Collaborator

@trevorwhitney trevorwhitney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

way to put a ring on it, look good.

@@ -486,6 +490,7 @@ func (t *Loki) setupModuleManager() error {
mm.RegisterModule(Compactor, t.initCompactor)
mm.RegisterModule(IndexGateway, t.initIndexGateway)
mm.RegisterModule(QueryScheduler, t.initQueryScheduler)
mm.RegisterModule(IndexGatewayRing, t.initIndexGatewayRing, modules.UserInvisibleModule)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the reason this needs to be a top level module (vs. having the logic to enable/disable the ring be inside the initIndexGateway code) because it's a shared dependency between both the Store and the IndexGateway? Can those module be enabled/disabled independent of one another?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context: there's an index gateway server and index gateway clients. The clients needs a read ring (provided by the IndexGatewayRing service), while the server needs the whole index gateway service. So, the server needs way more stuff than the other services (ex: the store) as they only need a ring, and sharing the whole server with the store was too much.

Now, to your question: yes, this had to be a top-level module, but not because the ring is shared between the gateway server and client, because although both use the ring, they use it differently (ex: the client only requires a read ring, the server requires a lifecycler+a normal ring). This had to be a top-level module to isolate a very lean read ring, used by the clients.

To your last question: Yes, they can be disabled/independent because you can run an index gateway ring without any clients connecting to it, and you can also have an index gateway client without any server registered in the ring and it will still run - but won't communicate with any index gateway.

return nil, err
}

t.Server.HTTP.Path("/indexgateway/ring").Methods("GET", "POST").Handler(gateway)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, when this is merged, we'll have to update the enterprise gateway in GEL as well as the nginx configs in a few helm charts to include this new route.

Copy link
Contributor

@sandeepsukhani sandeepsukhani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some minor suggestions but otherwise the changes LGTM! I am wondering if we should have some tests in the gateway client in ring mode but we will have to mock most of the things so whether it would make a good test 🤔

pkg/storage/stores/shipper/indexgateway/gateway.go Outdated Show resolved Hide resolved
pkg/storage/stores/shipper/gateway_client.go Show resolved Hide resolved
JordanRushing and others added 4 commits April 8, 2022 11:18
- Add `replication_factor` flag to the IndexGateway config struct
- Modify Index Gateway ring config struct to be inline with a new
  struct, since it doesn't expose a replication factor config
- Modify dynamic config wrapper to reuse a common replication factor on
  the Index Gateway ring
- If we don't randomize access, we'll always access same Index Gateway
  instances in same order for the same tenant
- Remove unnecessary/wrong comments
- Only set replication factor at a single place
Copy link
Collaborator

@periklis periklis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking on this PR from a high level perspective, it is a great piece of work to make Loki tenant x horizontal scalable. So a big thanks for this!

Comment on lines +244 to +253
if err != nil {
level.Error(util_log.Logger).Log("msg", fmt.Sprintf("failed to get client for instance %s", addr), "err", err)
continue
}

client := (genericClient.(indexgatewaypb.IndexGatewayClient))
if err := s.clientDoQueries(ctx, gatewayQueries, queryKeyQueryMap, callback, client); err != nil {
level.Error(util_log.Logger).Log("msg", fmt.Sprintf("client do queries failed for instance %s", addr), "err", err)
continue
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion here (or in any other place more suitable), it would be nice to have some sort of reporting on failures based on tenant IDs. For example we report per-tenant rate limits or distributor append failures. The index gateway is another great piece where looking on errors I should be able to determine which tenant is impacted by which index gateway replica. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey that's a great idea. We already have plans to add metrics in a follow-up PR (and there are other things missing too), but I'll make sure to add this to our TODO list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just giving you a heads-up: I tried but looks like we can't determine which tenant ID was related to a failure here. This should actually be reported by the index gateway and the table manager. I'm sure we already have error logs for when something unexpected happens there, I'll check if we can add any new metrics there too.

Copy link
Contributor

@sandeepsukhani sandeepsukhani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants