Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV exports (no UI) #3836

Merged
merged 2 commits into from
Mar 12, 2024
Merged

CSV exports (no UI) #3836

merged 2 commits into from
Mar 12, 2024

Conversation

ruslandoga
Copy link
Contributor

@ruslandoga ruslandoga commented Feb 27, 2024

Changes

This PR adds CSV exports logic without any web or UI changes.

Tests

  • Automated tests have been added

Changelog

  • Entry has been added to changelog

Documentation

  • Docs have been updated

Dark mode

  • This PR does not change the UI


import Ecto.Query

# TODO time_on_page
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the last metrics that needs work.

@ruslandoga ruslandoga changed the title CSV exports (logic, no UI) CSV exports (no UI) Feb 28, 2024
@ruslandoga ruslandoga force-pushed the csv-exports-logic branch 2 times, most recently from e421301 to 86dc8ed Compare March 11, 2024 08:59
@ruslandoga ruslandoga requested a review from zoldar March 11, 2024 09:00
@ruslandoga ruslandoga marked this pull request as ready for review March 11, 2024 09:00
@@ -0,0 +1,57 @@
defmodule Plausible.Workers.ExportCSV do
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to open a new PR which would add a UI action on the exports button that starts this job.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -11,7 +11,7 @@
"cachex": {:hex, :cachex, "3.6.0", "14a1bfbeee060dd9bec25a5b6f4e4691e3670ebda28c8ba2884b12fe30b36bf8", [:mix], [{:eternal, "~> 1.2", [hex: :eternal, repo: "hexpm", optional: false]}, {:jumper, "~> 1.0", [hex: :jumper, repo: "hexpm", optional: false]}, {:sleeplocks, "~> 1.1", [hex: :sleeplocks, repo: "hexpm", optional: false]}, {:unsafe, "~> 1.0", [hex: :unsafe, repo: "hexpm", optional: false]}], "hexpm", "ebf24e373883bc8e0c8d894a63bbe102ae13d918f790121f5cfe6e485cc8e2e2"},
"castore": {:hex, :castore, "1.0.5", "9eeebb394cc9a0f3ae56b813459f990abb0a3dedee1be6b27fdb50301930502f", [:mix], [], "hexpm", "8d7c597c3e4a64c395980882d4bca3cebb8d74197c590dc272cfd3b6a6310578"},
"certifi": {:hex, :certifi, "2.12.0", "2d1cca2ec95f59643862af91f001478c9863c2ac9cb6e2f89780bfd8de987329", [:rebar3], [], "hexpm", "ee68d85df22e554040cdb4be100f33873ac6051387baf6a8f6ce82272340ff1c"},
"ch": {:hex, :ch, "0.2.4", "d510fbb5542d009f7c5b00bb1ecab73307b6066d9fb9b220600257d462cba67f", [:mix], [{:db_connection, "~> 2.0", [hex: :db_connection, repo: "hexpm", optional: false]}, {:decimal, "~> 2.0", [hex: :decimal, repo: "hexpm", optional: false]}, {:ecto, "~> 3.5", [hex: :ecto, repo: "hexpm", optional: true]}, {:jason, "~> 1.0", [hex: :jason, repo: "hexpm", optional: false]}, {:mint, "~> 1.0", [hex: :mint, repo: "hexpm", optional: false]}], "hexpm", "8f065d15aaf912ae8da56c9ca5298fb2d1a09108d006de589bcf8c2b39a7e2bb"},
"ch": {:hex, :ch, "0.2.5", "b8d70689951bd14c8c8791dc72cdc957ba489ceae723e79cf1a91d95b6b855ae", [:mix], [{:db_connection, "~> 2.0", [hex: :db_connection, repo: "hexpm", optional: false]}, {:decimal, "~> 2.0", [hex: :decimal, repo: "hexpm", optional: false]}, {:ecto, "~> 3.5", [hex: :ecto, repo: "hexpm", optional: true]}, {:jason, "~> 1.0", [hex: :jason, repo: "hexpm", optional: false]}, {:mint, "~> 1.0", [hex: :mint, repo: "hexpm", optional: false]}], "hexpm", "97de104c8f513a23c6d673da37741f68ae743f6cdb654b96a728d382e2fba4de"},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This version improves streams in Ch.

@ruslandoga ruslandoga mentioned this pull request Mar 11, 2024
4 tasks
fragment("toUInt64(round(countIf(?='pageview')*any(_sample_factor)))", e.name),
:pageviews
),
# NOTE: are exits pageviews or any events?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#3870 might be relevant here. Currently, exit_page is set on all events, not only pageviews but that seems to be incorrect and leads to wrong calculation of other derivative metrics under some circumstances. We'll most likely switch to recording exit_pages only on events that are strictly pageviews.

bounces: bounces(s),
visits: visits(s),
visit_duration: visit_duration(s)
# NOTE: can we use just sessions_v2 table in this query? sum(pageviews) and visitors(s)?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While investigating issue that led to #3870, we have found cases where session.pageviews is inconsistent with actual related events - session had 0 for for pageviews but there was a pageview event associated with it, along with numerous other custom events. That might have been some one-off issue though as session updates don't do anything fancy here

pageviews:
if(event.name == "pageview", do: session.pageviews + 1, else: session.pageviews),
events: session.events + 1
. It might have been caused by cached session not being in sync with what's stored in the DB at the time of the update but can't tell for sure yet.

visitors_sessions_q =
from s in "sessions_v2",
# NOTE: no smapling is used right now
# hints: ["SAMPLE", unsafe_fragment(^@sample)],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is sampling skipped because of visit duration calculation or just for better accuracy of exports in general?

Copy link
Contributor Author

@ruslandoga ruslandoga Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first it was to get better accuracy... I'll re-enable it for full builds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from e in "events_v2",
# NOTE: no smapling is used right now
# hints: ["SAMPLE", unsafe_fragment(^@sample)],
where: e.site_id == ^site_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering it that shouldn't be limited to pageview events 🤔 If there's a custom event setup that for instance records custom events like "scrolled to the bottom", they might skew the time on page metric calculated like this. That's probably something that should be considered in the actual calculation logic at

defp window_breakdown_time_on_page(site, query, pages) do
first. It's also possible that I'm reading it completely wrong 😅

)
|> Plausible.S3.export_upload_multipart(s3_path)
end,
timeout: :infinity
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, we will probably want some upper time limit on this 😅

# 5 MiB is the smallest chunk size AWS S3 supports
|> chunk_into_parts(5 * 1024 * 1024)
|> ExAws.S3.upload(bucket, s3_path,
content_disposition: ~s|attachment; filename="Plausible.zip"|,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, it probably would be good to make the downloaded filename a bit more unique - we could at least embed current timestamp in it (ideally, a sanitized domain name the export is for would be nice too). But that's something to flesh out later on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made some attempts at a better filename in #3921

Copy link
Contributor

@zoldar zoldar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance to add at least one general integration test for the import routine? There was a helper function for running import/export cycle which could potentially be turned into one, I recall? Other than this, IMHO we could go ahead with this and work on further refinements in follow-ups as it's going to be hidden either way.

@ruslandoga
Copy link
Contributor Author

I'm going to add some tests next.

@@ -401,6 +397,157 @@ defmodule Plausible.Imported.CSVImporterTest do
end
end

describe "export -> import" do
Copy link
Contributor Author

@ruslandoga ruslandoga Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a basic e2e test here and I'll add more in subsequent PRs (e.g. #3875, #3845, export queries fixes)

@ruslandoga ruslandoga requested a review from zoldar March 12, 2024 07:33
@zoldar zoldar merged commit 5a3072c into plausible:master Mar 12, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants