Validate path connections #6041

Closed
EDsCODE opened this issue Sep 20, 2021 · 12 comments
Labels
bug Something isn't working right feature/paths Feature Tag: Paths

Comments

@EDsCODE
Member

EDsCODE commented Sep 20, 2021

Bug description

When calculating paths, we aggregate a lot of node and link data, often far more than would be reasonable to display. We apply a simple limit in our queries to cap the number of links we account for. As a result, we sometimes cut off links that other links depend on. For example, there might be links that go $pageview -> insight viewed -> viewed dashboard, but our limit cuts off the data for $pageview -> insight viewed.
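
As a rough illustration of the failure mode, here is a minimal sketch. The edge rows, their weights, and the helper names `apply_limit` and `stranded_edges` are all hypothetical, and the real query may order and cap rows differently. Two edges tie on weight, and the cap happens to keep the downstream one while dropping the edge that leads into it.

```python
from typing import List, Tuple

Edge = Tuple[str, str, int]  # (source, target, number of people)

# Hypothetical rows in the order a weight-sorted query might return them.
# The last two rows tie on weight, so which one survives the cap is arbitrary.
edges: List[Edge] = [
    ("$pageview", "signup", 50),
    ("signup", "onboarding", 45),
    ("insight viewed", "viewed dashboard", 20),
    ("$pageview", "insight viewed", 20),
]


def apply_limit(rows: List[Edge], limit: int) -> List[Edge]:
    """Keep only the first `limit` rows, like a plain LIMIT clause."""
    return rows[:limit]


def stranded_edges(rows: List[Edge], start: str) -> List[Edge]:
    """Return edges whose source is neither the start point nor the
    target of any other kept edge."""
    targets = {target for _, target, _ in rows}
    return [e for e in rows if e[0] != start and e[0] not in targets]


kept = apply_limit(edges, 3)
stranded = stranded_edges(kept, "$pageview")
print(stranded)
```

With a limit of 3, the `$pageview -> insight viewed` row is dropped, so `insight viewed -> viewed dashboard` is reported as stranded.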

Expected behavior

The limited data we return should be complete so that paths aren't stranded.

How to reproduce

Internal graph link here

Notice how a start point is defined, but there are start points on the visualization that are unrelated to it. The Sankey diagram is rendering stranded links that start at the 2nd or 3rd step but don't have a 1st step.

Thank you for your bug report – we love squashing them!

@EDsCODE EDsCODE added bug Something isn't working right team-core-analytics labels Sep 20, 2021
@EDsCODE EDsCODE added the feature/paths Feature Tag: Paths label Sep 20, 2021
@neilkakkar
Contributor

neilkakkar commented Sep 21, 2021

I was exploring if we can find smarter defaults, instead of trying to validate graphs.

https://metabase.posthog.net/question/143

So, I analyzed data for 1 month, and over all paths for PostHog, we have ~28,000 edges. It's very much a power law distribution. Average edge weight is ~7, but 95%ile is 11, 99.5%ile is 160, 99.9%ile is 1000.

So, we can't return all the data when there's no start or end point.
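
To make the shape of such a distribution concrete, here is a small sketch that computes the mean and a few percentiles over a list of edge weights. The weights are made up for illustration (they are not the Metabase data), and `percentile` is a hypothetical helper using linear interpolation between neighbouring values.

```python
import math
from typing import List


def percentile(sorted_weights: List[float], q: float) -> float:
    """Linear-interpolation percentile of an ascending list, q in [0, 100]."""
    position = (len(sorted_weights) - 1) * q / 100
    lower = math.floor(position)
    upper = min(lower + 1, len(sorted_weights) - 1)
    fraction = position - lower
    low_value = sorted_weights[lower]
    return low_value + (sorted_weights[upper] - low_value) * fraction


# Made-up, heavily skewed weights: most edges carry one person,
# a handful carry ten, and a single edge carries a thousand.
weights = sorted([1] * 90 + [10] * 9 + [1000])

mean_weight = sum(weights) / len(weights)
summary = {q: percentile(weights, q) for q in (50, 95, 99.5, 99.9)}
print(mean_weight, summary)
```

In this toy data the single heavy edge pulls the mean well above the median, which is the same pattern the numbers above show.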

Out of these, there are ~7,500 starting points. Edit: ~4000 unique starting points

Next up, I'll look into the start points here^ to figure out the same data for them and, if that reduces the sample space enough, I think we could have a high enough default for start and end points.

@neilkakkar
Contributor

neilkakkar commented Sep 21, 2021

Hmm, things aren't looking too great. I chose the largest ones, and then a few random ones:

| Event | Edges | Avg weight per edge | Quantiles: 10%, 20%, 30%, 50%, 80%, 95%, 99.5%, 99.9% |
| --- | --- | --- | --- |
| $autocapture | 19447 | 5.1789993315164295 | [1, 1, 1, 1, 2, 9, 132, 685.349000000012] |
| $identify | 17083 | 6.739038810513376 | [1, 1, 1, 1, 3, 13, 149.3150000000005, 670.439000000014] |
| https://posthog.com | 1427 | 28.142256482130342 | [1, 1, 1, 1, 5, 48.700000000000045, 1207.5799999999745, 3464.036000000001] |
| user logged in | 2179 | 4.343276732446077 | [1, 1, 1, 1, 2, 4, 127.55000000000064, 597.5500000000029] |
| $capture_failed_request | 4723 | 3.8816430235020114 | [1, 1, 1, 1, 2, 7, 106.78000000000065, 360.6680000000015] |
| https://posthog.com/careers | 374 | 11.366310160427808 | [1, 1, 1, 1, 4, 31.049999999999898, 602.9449999999999, 866.697000000004] |
| timezone component viewed | 3097 | 2.2156926057474977 | [1, 1, 1, 1, 1, 4, 32.51999999999998, 78.80799999999999] |
| Palette shown | 654 | 2.02 | [1.0, 1.0, 1.0, 1.0, 1.0, 4.0, 19.20500000000004, 106.03399999999556] |
| redacted | 41 | 5.853658536585366 | [1, 1, 2, 3, 6, 15, 51.79999999999991, 56.760000000000026] |
| redacted | 3 | 1 | [1, 1, 1, 1, 1, 1, 1, 1] |
| redacted | 70 | 2.414285714285714 | [1, 1, 1, 1, 2.200000000000003, 8.549999999999997, 25.44500000000002, 30.68899999999995] |
| redacted | 13 | 2.4615384615384617 | [1, 1, 1, 2, 4, 5.199999999999996, 6.8199999999999985, 6.963999999999999] |
| https://app.posthog.com' | 264 | 31.348484848484848 | [1, 1, 1, 2, 14.400000000000006, 147.09999999999997, 764.4100000000005, 1261.515000000014] |

This one fans out a LOT: link - there are supposed to be 374 edges starting at /careers/ but yeah, 50% of those have literally just one event.

Same for this one.

These are probably the worst of the worst cases^^^. But assuming a 10x bigger company, they start looking more reasonable.

@neilkakkar
Contributor

Another idea: We can leverage the shape of the distribution. Instead of having a limit on number of edges, how about we have a limit on [absolute|relative] edge weights?

So something like: if the number of people who did edge A->B is less than 1% / 25 people, discard the edge from the final result. It doesn't directly solve the above problem, but it gives us better initial data to work with. If we have N edges with, say, the same weight, and only half of those fall within the limit, it would be a shame to discard the other half, as they're just as useful (or useless) as the first half.

... And then we can consider completing this graph^^.
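
A minimal sketch of the weight-threshold idea, under a few assumptions: `filter_by_weight` and its parameters are hypothetical names, and the relative threshold is taken as a fraction of the total number of people in the query, which the comment above does not specify.

```python
from typing import List, Optional, Tuple

Edge = Tuple[str, str, int]  # (source, target, number of people)


def filter_by_weight(
    edges: List[Edge],
    total_people: int,
    min_people: Optional[int] = None,
    min_fraction: Optional[float] = None,
) -> List[Edge]:
    """Drop edges below an absolute and/or relative weight threshold.

    Assumption: the relative threshold is a fraction of `total_people`.
    """
    kept = []
    for edge in edges:
        weight = edge[2]
        if min_people is not None and weight < min_people:
            continue
        if min_fraction is not None and weight < min_fraction * total_people:
            continue
        kept.append(edge)
    return kept


# Made-up edges; B->C and B->D tie on weight.
sample = [("A", "B", 400), ("B", "C", 30), ("B", "D", 30), ("D", "E", 3)]

by_count = filter_by_weight(sample, total_people=1000, min_people=25)
by_share = filter_by_weight(sample, total_people=1000, min_fraction=0.05)
print(by_count)
print(by_share)
```

Unlike a cap on the number of edges, a weight threshold treats tied edges the same way: `B -> C` and `B -> D` are either both kept or both discarded.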

@neilkakkar
Contributor

neilkakkar commented Sep 23, 2021

Things turned a bit tricky. I'm exploring three different ways to solve this problem

Complete Dangling Edges

Ensures that, for whatever edges make the cut-off, the remaining edges needed to connect them are added back.

Pro: Paths don't look wrong.

Cons: Some very low weight edges show up, which can be hard to visualise / fill graph with useless information.

Implementation notes: This makes things hard. Here's a failed attempt: https://metabase.posthog.net/question/150 that makes the wrong tradeoffs. We don't want to bound above the maximum edge weights. I suspect any solution that goes this way will significantly slow down our queries, since this requires some sort of graph traversal.

Still trying to figure out if there's a better way to solve this.
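
For illustration only, here is one way completing dangling edges could look outside SQL. It assumes the full, unlimited edge list is available in memory, which is exactly the expensive part noted above. The function names are hypothetical, and choosing the back-path with the fewest hops is an arbitrary assumption of this sketch.

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str, int]  # (source, target, number of people)


def reachable_from(edges: List[Edge], start: str) -> Set[str]:
    """All nodes reachable from `start` by following edges forward."""
    outgoing = defaultdict(list)
    for source, target, _ in edges:
        outgoing[source].append(target)
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in outgoing[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def complete_dangling(all_edges: List[Edge], kept: List[Edge], start: str) -> List[Edge]:
    """Add back edges from `all_edges` so each kept edge connects to `start`."""
    incoming = defaultdict(list)
    for edge in all_edges:
        incoming[edge[1]].append(edge)
    result = list(kept)
    present = {(s, t) for s, t, _ in result}
    reachable = reachable_from(result, start)
    for source, _, _ in kept:
        if source in reachable:
            continue
        # Breadth-first search backwards until a connected node is found.
        next_edge: Dict[str, Edge] = {}
        seen, queue = {source}, deque([source])
        found: Optional[str] = None
        while queue and found is None:
            node = queue.popleft()
            for edge in incoming[node]:
                prev = edge[0]
                if prev in seen:
                    continue
                seen.add(prev)
                next_edge[prev] = edge  # edge leading from prev towards source
                if prev in reachable:
                    found = prev
                    break
                queue.append(prev)
        # Walk forward from the connected node, adding the missing edges.
        node = found
        while node is not None and node != source:
            edge = next_edge[node]
            if (edge[0], edge[1]) not in present:
                present.add((edge[0], edge[1]))
                result.append(edge)
            node = edge[1]
        reachable = reachable_from(result, start)
    return result


all_edges = [
    ("$pageview", "signup", 50),
    ("signup", "onboarding", 45),
    ("insight viewed", "viewed dashboard", 20),
    ("$pageview", "insight viewed", 20),
]
limited = all_edges[:3]
completed = complete_dangling(all_edges, limited, "$pageview")
print(completed)
```

If no back-path to the start point exists in the full edge list, the dangling edge is left as it is.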

Delete Dangling Edges

Ensures we validate edges before returning, so no dangling edges remain.

Pro: Paths don't look wrong.

Cons: Some high weight edges are removed, which might be carrying useful information.

Implementation notes: relatively straightforward to do outside of SQL. For up to ~1000 edges, it has a negligible effect on performance.
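
A minimal sketch of what this validation could look like when a start point is defined. The name `prune_dangling` is hypothetical; the idea is simply to keep the edges whose source can be reached from the start point. With an end point, the same walk would run in the opposite direction.

```python
from collections import defaultdict, deque
from typing import List, Tuple

Edge = Tuple[str, str, int]  # (source, target, number of people)


def prune_dangling(edges: List[Edge], start: str) -> List[Edge]:
    """Keep only edges whose source is reachable from `start`."""
    outgoing = defaultdict(list)
    for source, target, _ in edges:
        outgoing[source].append(target)

    # Breadth-first walk from the start point over the limited edge set.
    reachable = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in outgoing[node]:
            if nxt not in reachable:
                reachable.add(nxt)
                queue.append(nxt)

    return [edge for edge in edges if edge[0] in reachable]


# Made-up limited result: the last two edges form a chain with no 1st step.
limited_edges = [
    ("$pageview", "signup", 50),
    ("signup", "onboarding", 45),
    ("insight viewed", "viewed dashboard", 20),
    ("viewed dashboard", "exported", 18),
]
pruned = prune_dangling(limited_edges, "$pageview")
print(pruned)
```

The walk visits each edge once, so the cost grows linearly with the number of edges. Note that `viewed dashboard -> exported` is removed even though it has an incoming edge, because the chain it belongs to never connects to the start point.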

Defer control to users

The crux of the difference between solutions (1) and (2) is the amount of information we show. In some cases it's useful to see the extra information and go further in depth; in others, it's better to get rid of all low-weight nodes.

More importantly, it's hard to get the visualisation just right, a priori.

This solution gives users these advanced manipulation options. There are two controls:

(a) Control the maximum number of edges, and
(b) Control the minimum (& max) edge weight. "Make the graph display whatever you want it to display".

Note: The edge weight represents the number of people on that path.
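
The two controls could be combined roughly as follows. This is a sketch under assumptions: `PathControls`, its field names, and the default of 100 edges are illustrative, and applying the weight bounds before the edge cap is a choice made here, not something the thread specifies.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Edge = Tuple[str, str, int]  # (source, target, number of people)


@dataclass
class PathControls:
    """Hypothetical container for the user-facing controls."""
    edge_limit: int = 100
    min_edge_weight: Optional[int] = None
    max_edge_weight: Optional[int] = None


def apply_controls(edges: List[Edge], controls: PathControls) -> List[Edge]:
    """Filter by weight bounds first, then keep the heaviest edges."""
    selected = [
        edge
        for edge in edges
        if (controls.min_edge_weight is None or edge[2] >= controls.min_edge_weight)
        and (controls.max_edge_weight is None or edge[2] <= controls.max_edge_weight)
    ]
    selected.sort(key=lambda edge: edge[2], reverse=True)
    return selected[: controls.edge_limit]


# Made-up edges spanning a wide range of weights.
path_edges = [
    ("/", "/insights", 1000),
    ("/insights", "/settings", 200),
    ("/", "/billing", 150),
    ("/billing", "/logout", 5),
]

mid_band = apply_controls(path_edges, PathControls(min_edge_weight=100, max_edge_weight=300))
top_two = apply_controls(path_edges, PathControls(edge_limit=2))
print(mid_band)
print(top_two)
```

Setting both bounds isolates a band of moderately popular edges, while the edge limit alone controls how dense the graph gets.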

Overall Solution

The overall solution I'm leaning towards right now is: based on the above calculations, provide more meaningful defaults. The 95%ile has edge weights of ~10 for the more popular cases, which translates to ~100 edges. Make these the default (that's 5x more edges than right now), and increase them further if steps go above 5.

This won't solve the graph looking weird in some cases, so delete dangling nodes (only when a start or end point is defined), and tell the user that the graph is incomplete. And encourage using the advanced options I mentioned above for getting more in-depth information.


I think I need to play with these advanced options myself to figure out if there are heuristics we can find for even better default values.

It does make things a bit more complicated for users, but hopefully most users are happy with the default.

I do think it's important to allow this customization so users can drill down and up the graph. It gives a new dimension: allows not just number of step manipulation, but things like, "oh, I notice this specific segment of slightly unpopular paths (~200 edge weight) seem weird. Let me set edge weights between 100 and 300 to explore these more in depth, then find the specific people doing this, and see if I can figure out why they're doing things like this" etc. etc.

Whether it's worth doing is an open question, I guess.

cc: @paolodamico @clarkus @marcushyett-ph @EDsCODE @liyiy for more input :)

@marcushyett-ph
Contributor

I'll let @paolodamico and @clarkus chime in as they have the most context. But generally providing the best defaults we can sounds like a good approach to me.

I have a question related specifically to the terminology used (edge weight, etc.). Do we have a more user-friendly term in mind to describe this? It feels pretty technical and might be hard for users to adopt.

@neilkakkar
Contributor

Definitely. It's the "count of users on a path". So, min edge weight is something like: "Minimum count of users on a Path"

@paolodamico
Contributor

Hey @neilkakkar! In general, 100% agree with the overall approach of sensible defaults and advanced customization. Some questions:

  • Can you clarify what you mean by deleting dangling edges?
  • Can you clarify how edge weight would be understood from a user's perspective? Would I let you know how many min/max users should be on a path? Would the number of users be counted from the root step?
  • Would users need/can control both (a) & (b)?
  • I'm having a hard time understanding a user's reason for wanting to control the maximum weight. Wouldn't you always want to see paths with larger weights?
  • Unless it's not technically complex, I think we could start just with sensible defaults and then get feedback from users to understand better what and how advanced controls should behave. I'm thinking we'll get to a significantly better state by getting a lot of real data points from users.

@neilkakkar
Contributor

neilkakkar commented Sep 27, 2021

Dangling edges: Link

image
Notice how "/events 38" goes further ahead than the next row, and the same for "/events 30". In the first place, if these were the first event that happened, shouldn't they have been a part of "/events 138"?

Leaving them alone makes the graph look wrong.

This is the same as the problem shown in the original link in the issue: the start point is gone, these are intermediate edges, and thus dangling. Deleting dangling edges means getting rid of these in the final visualisation.

> Can you clarify how edge weight would be understood from a user's perspective? Would I let you know how many min/max users should be on a path? Would the number of users be counted from the root step?

Edge weight is indeed the count of users on a single path item (I'm not yet sure of the right terminology to use, judging by the confusion exposed on the PR). It doesn't mean the number of people on the entire path, but the number between any two consecutive path items.
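
To make the distinction concrete, here is a small sketch that derives edge weights from per-person event sequences. The data and the name `edge_weights` are hypothetical, and it assumes each person is counted at most once per edge.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def edge_weights(paths_by_person: Dict[str, List[str]]) -> Dict[Tuple[str, str], int]:
    """Count distinct people on each pair of consecutive path items."""
    people_on_edge = defaultdict(set)
    for person, events in paths_by_person.items():
        for source, target in zip(events, events[1:]):
            people_on_edge[(source, target)].add(person)
    return {edge: len(people) for edge, people in people_on_edge.items()}


# Made-up sequences for three people.
paths = {
    "user_1": ["/", "/events", "/insights"],
    "user_2": ["/", "/events"],
    "user_3": ["/", "/insights"],
}
weights_by_edge = edge_weights(paths)
print(weights_by_edge)
```

Here only one person follows the whole path `/ -> /events -> /insights`, yet the edge `/ -> /events` has a weight of 2, because the weight is counted per pair of consecutive items rather than per full path.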

> Would users need/can control both (a) & (b)?

Good question. We could remove some controls, but removing any of these feels incomplete to me. Since: (a) No. of edges controls how dense the graph gets. (b) Min-Max controls what kind of edges show up.

> I'm having a hard time understanding a user's reason for wanting to control the maximum weight. Wouldn't you always want to see paths with larger weights?

Say you're interested in where people drop off, and say it's a very successful product: most people convert (or vice versa, the case is identical). Most path items on the happy path then have a high weight, and these are the ones you don't want to see, since they are noise. Setting a max weight effectively removes all of them, and helps you visualise where the dropoffs really go.

Something similar can be achieved with excluding the popular events, but it's not the same, since you want to know if these "dropoffs" take some other route to the popular events. (Max weight would remove the popular paths, but not the small weight traversals to the popular items).

It made sense to me, but it's 100% an advanced use case - and not very obvious. But since these are advanced features anyway....

> Unless it's not technically complex, I think we could start just with sensible defaults and then get feedback from users to understand better what and how advanced controls should behave. I'm thinking we'll get to a significantly better state by getting a lot of real data points from users.

As in, don't show any advanced options at all? Or just show them populated with defaults?

The latter makes sense to me. The former not so much, because then the users wouldn't know how to control these advanced options at all?

100% agreed on getting real data points.

@clarkus
Contributor

clarkus commented Sep 27, 2021

The weight concept is new to me, so I'm catching up a bit on this. It seems like this might be the core reason a user would want to adjust weight for a paths insight:

> Say you're interested in where people drop off, and say it's a very successful product: most people convert (or vice versa, the case is identical). Most path items on the happy path then have a high weight, and these are the ones you don't want to see, since they are noise. Setting a max weight effectively removes all of them, and helps you visualise where the dropoffs really go.

Setting a maximum weight can optimize for analyzing dropoffs. Is the converse true for minimum weights? If so, that might be a good way to communicate the value of the feature to users. I think defaults make a ton of sense for this, but maybe there's some easy mode where the user just selects an "optimize for dropoffs" control or something similar?

@neilkakkar
Contributor

If you want to play around with edge weights, they're behind the new-paths-ui-edge-weights feature flag. Very interesting to play around with.

@neilkakkar
Contributor

Validation is done, so I'll close this.

@posthog-contributions-bot
Contributor

This issue has 1909 words. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:

  1. Write some code and submit a pull request! Code wins arguments
  2. Have a sync meeting to reach a conclusion
  3. Create a Request for Comments and submit a PR with it to the meta repo or product internal repo
