Blog: How we improved feature flag resiliency #5546

Merged: 20 commits into master from neilkakkar-patch-2 on Sep 8, 2023

neilkakkar
Contributor

Changes

Please describe.

Add screenshots or screen recordings for visual / UI-focused changes.

Checklist

  • Titles are in sentence case
  • Feature names are in sentence case too
  • Words are spelled using American English
  • I have checked out our style guide
  • If I moved a page, I added a redirect in vercel.json

@neilkakkar neilkakkar marked this pull request as draft March 16, 2023 17:40
@neilkakkar
Contributor Author

Okay, rough structure is there.

I could use some feedback now @andyvan-ph @joethreepwood @liyiy @EDsCODE on content, structure, and if this all makes sense.

And @ellie @hazzadous on accuracy of the small infra things I've mentioned here.

@joethreepwood joethreepwood changed the title from "First draft, needs a lot more work, but getting it done in chunks" to "Blog: How we improved feature flag resiliency" Mar 21, 2023
@neilkakkar neilkakkar requested a review from ivanagas March 22, 2023 14:24
Contributor

@ivanagas ivanagas left a comment

A bunch of edits.

I think it is important to tighten up the intro and get to the "meat" of the article faster (as it is really interesting and I want to make sure as many people as possible get to that point)

contents/blog/how-we-improved-feature-flags-resiliency.md (outdated, resolved)

So, when thinking about reliability, we want to prioritise defending against things that happen frequently, or have a high chance of occurring over time. This includes things like redis, postgres, or pgbouncer going down. Then, if we have the resources and nothing better to prioritise, we can focus on defending against asteroids.

Today, we can't yet defend against asteroids, nor the entire infrastructure going down, but for other things, like postgres, we've found ways to defend against this, leveraging our special problem constraints.

Contributor

I like the light touch that a small joke about asteroids brings to the piece. I think it'd be even better if it was only contained in the last 1-2 sentences or only referenced 1-2x at most instead of 3-4x? 🤗

Contributor Author

LOL agree I think I liked it so much I forgot I already included it 😂

Have removed and combined the above sentence into one.

@liyiy
Contributor

liyiy commented Mar 22, 2023

Overall awesome work! Really helps to paint a picture of all the work the team has done in the past couple of months to improve feature flags 🎉

@liyiy
Contributor

liyiy commented Mar 22, 2023

I'm not sure what images we would put here, maybe some screenshots of latency improvements from our grafana as an example? Just to split up the text a bit and make it more (visual) reader-friendly

@neilkakkar neilkakkar marked this pull request as ready for review May 4, 2023 12:17
@neilkakkar
Contributor Author

> I think we could lift this with one or two diagrams / flow charts that visualize what you're writing about. Any thoughts on what these could be? (attached an example from a recent newsletter illustrating what I mean)

Hmmm, good question. I'm thinking about this, but nothing great comes to mind. Maybe a flow chart of how things are set up and where the borkages happen?

Like, a sample request flow?

Request comes in --> Django server for feature flags --> fetch feature flag definitions from redis

  --- first option --> try evaluating without database --> return result if evaluated
  --- second option --> pgbouncer (connection pooler) --> database to fetch person properties

And the arrows going to redis and pgbouncer are sources of problems & latency.
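
For anyone who wants to see that flow as code, here is a rough sketch. It is an illustration only, not the actual PostHog implementation: `FlagDefinition`, `try_evaluate_locally`, `evaluate_flag`, and `fetch_person_properties` are hypothetical names, a plain dict stands in for redis, and a callable stands in for the pgbouncer + database hop. Treating a still-undecided flag as `False` after the database lookup is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class FlagDefinition:
    """Hypothetical flag: matches when every required property has the expected value."""
    key: str
    required_properties: Dict[str, str] = field(default_factory=dict)


def try_evaluate_locally(flag: FlagDefinition, known: Dict[str, str]) -> Optional[bool]:
    """First option: evaluate without touching the database.

    Returns True or False when the known properties decide the outcome,
    or None when a required property is missing and nothing rules the flag out.
    """
    missing = False
    for name, expected in flag.required_properties.items():
        if name not in known:
            missing = True
        elif known[name] != expected:
            return False  # a known mismatch decides the result on its own
    return None if missing else True


def evaluate_flag(
    flag_key: str,
    distinct_id: str,
    request_properties: Dict[str, str],
    flag_cache: Dict[str, FlagDefinition],
    fetch_person_properties: Callable[[str], Dict[str, str]],
) -> bool:
    # Stand-in for fetching flag definitions from redis.
    flag = flag_cache[flag_key]

    # First option: no database involved.
    result = try_evaluate_locally(flag, request_properties)
    if result is not None:
        return result

    # Second option: fetch person properties (pgbouncer -> database in the real flow).
    stored = fetch_person_properties(distinct_id)
    merged = {**stored, **request_properties}
    # Assumption: if the flag is still undecided, treat it as not matching.
    return try_evaluate_locally(flag, merged) is True
```

The point of the sketch is that the database callable only runs when the properties sent with the request cannot decide the flag, which is where the second arrow in the flow adds latency and risk.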

@ivanagas
Contributor

Screen Shot 2023-08-31 at 8 48 13 AM

How's this? Could potentially do ones for Partial flag evaluation and database down if good

@neilkakkar
Contributor Author

Ooh yes this is great!

@neilkakkar
Contributor Author

I'd just tilt them to make the flow clearer.

PostHog server on the left, and top

SDK in the middle, both height & width wise,

and client at the right, and bottom.

Should read more naturally, with the arrows going something like:

image

@ivanagas
Contributor

Screen Shot 2023-08-31 at 9 32 06 AM

Partial flag evaluation

@ivanagas
Contributor

Screen Shot 2023-08-31 at 9 38 55 AM

@ivanagas
Contributor

Screen Shot 2023-08-31 at 9 41 19 AM

Another option

@neilkakkar
Contributor Author

yes 2nd option is very nice

@andyvan-ph
Contributor

@neilkakkar + @ivanagas: Thanks both for the graphic stuff. Have added both and done another light polish pass on the copy.

I was doing a mental Hacker News pre-mortem and I think the only thing this is missing is... evidence. We say we've made flags faster / more reliable, but there's nothing to back that up.

@neilkakkar is there something we can add near the end here? I don't think it needs to be super in-depth, it just needs something to prove it out.

@neilkakkar
Contributor Author

hmm, we don't have our latency logs anymore, because this was > 3 weeks old. So we can't show before & after, but I guess we can show current latency times

and then the status page: https://status.posthog.com/uptime/1t4b8gf5psbc?page=3 and how incident rate has gone down.

Will add a blurb to the end when I'm back tomorrow, thanks!

@neilkakkar
Contributor Author

just added this section as an appendix

@andyvan-ph andyvan-ph enabled auto-merge (squash) September 8, 2023 09:17
@andyvan-ph andyvan-ph merged commit 75930df into master Sep 8, 2023
@andyvan-ph andyvan-ph deleted the neilkakkar-patch-2 branch September 8, 2023 09:33
5 participants