Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

State of "Protobuf" format schema for clickhouse #8334

Closed
macobo opened this issue Jan 28, 2022 · 5 comments
Closed

State of "Protobuf" format schema for clickhouse #8334

macobo opened this issue Jan 28, 2022 · 5 comments
Labels
enhancement New feature or request performance Has to do with performance. For PRs, runs the clickhouse query performance suite team/infra Everything related to deploying PostHog

Comments

@macobo
Copy link
Contributor

macobo commented Jan 28, 2022

Is your feature request related to a problem?

In #2254 @fuziontech introduced protobuf for the clickhouse_events topic. Exact motivation if unknown by me, but I assume it has to do with cost savings around kafka traffic.

We never rolled out the feature to other topics and since then we've:

  • Introduced plugin-server
  • Introduced new topics that facilitate events service <-> plugin server communication (including for communicating events data)
  • Introduced dead letter queue and other topics
  • Are considering significantly changing the events table structure, which in turn requires updating the existing protobuf schema and test deploying these changes. This in turn might create potential posthog upgrade barriers.
  • Are considering supporting external clickhouse cloud providers (like altinity) which might not have support for using custom format schemas

All of this makes the protobuf solution we have right now less than ideal. It also doesn't really serve the potential cost-saving purpose as events are json-encoded in at least one topic already.

The discussion point here is - what shall we do about it?

Do we want to invest more and get protobuf on other topics? Do we want to make it entirely optional/configurable? Do we want to drop it completely?

cc @fuziontech @hazzadous @guidoiaquinti @tiina303 @yakkomajuri @marcushyett-ph for team-platform context

cc @EDsCODE and @timgl who might have additional context

Thank you for your feature request – we love each and every one!

@macobo macobo added enhancement New feature or request team/infra Everything related to deploying PostHog discussion performance Has to do with performance. For PRs, runs the clickhouse query performance suite labels Jan 28, 2022
@hazzadous
Copy link
Contributor

Personally I'd question the need for protobuf, and suspect it was a premature optimisation. The developer experience with protobuf, especially in python (and I suspect typescript) isn't amazing, and introduces annoying dependencies throughout the services e.g. clickhouse. JSON, with a sensible jsonschema/openapi schema I think should be enough for our purposes, but I may be missing some context.

@yakkomajuri
Copy link
Contributor

I'm also in favor of JSON, but yet another person without context into why we did protobuf in the first place. Only indication I see of it before is this #1742

@mariusandra
Copy link
Collaborator

mariusandra commented Jan 28, 2022

I tend to agree with premature optimization. @fuziontech can probably add more backstory here.

We've have two issues with protobuf:

  • this adds one extra compilation step for the plugin server, making yarn just a bit slower
  • it's the reason we can't just use clickhouse's ARM builds directly -> they come without protobuf built in

I think protocol buffers bring two advantages: speed and a strict schema validation. However

  • speed: with all the processing and JSON (de)serialization that's happening in the plugin server anyway, this is negligeable 🤷
  • validation: we get by just well with all other kafka tables, so do we really need it here 🤔

I'm in favor of getting rid of it if possible.

@macobo
Copy link
Contributor Author

macobo commented Mar 3, 2022

This has been causing a lot of pain trying to integrate with external systems (e.g. Altinity Cloud) even with the workaround env vars that were introduced.

Given @fuziontech hasn't pitched in, making the executive decision here to eventually remove protobuf due to the ongoing cost of maintaining it. Will formulate a migration plan once we're ready to put some focus into this.

@fuziontech
Copy link
Member

Nuke it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance Has to do with performance. For PRs, runs the clickhouse query performance suite team/infra Everything related to deploying PostHog
Projects
None yet
Development

No branches or pull requests

6 participants