Remove state/year/quarter specific parquet paths from the S3 logs during ETL #190

Open
1 task
zaneselvans opened this issue Sep 30, 2024 · 0 comments

Comments

@zaneselvans
Member

Overview

  • Once upon a time we accidentally deployed all of the pre-consolidation EPA CEMS parquet files to our S3 bucket, and there are more than 1000 of them.
  • The paths to these objects now show up as possible parquet tables in the usage metrics, which clogs up the dashboard display/legend.
  • These paths should never have appeared and I think in most cases were never downloaded, so we can remove them from the logging data during the ETL and have a cleaner, simpler output to work with.
  • Currently these paths are being filtered out of the dashboard display because they have 0 downloads, but if we go back to showing all valid paths, they'll reappear.
  • They do still appear in the dropdown on the left-hand side of the User Metrics dashboard, which is why it lists 1600+ tables.
  • It's possible that a state-year partitioned version of the data was deployed intentionally at some point, before we settled on the current output format, but there shouldn't be any under nightly, stable, or any of the existing versioned releases.
  • We should also keep an eye out for other accidentally deployed files. I think it was just the partitioned parquet files that were a problem, but it's possible there were others.
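
As a rough illustration of the filtering step described above, here is a minimal sketch of dropping partitioned EPA CEMS paths from a list of logged object keys. The key layout, the regex, and the function names are all assumptions made for illustration; the actual names of the accidentally deployed files in the bucket would need to be checked before using anything like this in the ETL.

```python
import re

# HYPOTHETICAL pattern: an EPA CEMS parquet key that carries a four-digit
# year, optionally followed by a quarter (q1-q4) and/or a two-letter state
# code. The real partitioned keys in the S3 logs may be named differently.
PARTITIONED_CEMS = re.compile(
    r"epacems[-_]\d{4}(?:q[1-4])?(?:[-_][a-z]{2})?\.parquet$",
    re.IGNORECASE,
)


def is_partitioned_cems_path(key: str) -> bool:
    """Return True if the key looks like a state/year/quarter partition file."""
    return PARTITIONED_CEMS.search(key) is not None


def drop_partitioned_cems(keys: list[str]) -> list[str]:
    """Keep only the keys that do not match the partitioned-file pattern."""
    return [key for key in keys if not is_partitioned_cems_path(key)]


# Made-up example keys, not taken from the real logs.
example_keys = [
    "nightly/hourly_emissions_epacems/epacems-2019-ak.parquet",
    "nightly/hourly_emissions_epacems/epacems-1995q1.parquet",
    "nightly/core_epacems__hourly_emissions.parquet",
    "stable/out_eia__yearly_generators.parquet",
]
cleaned_keys = drop_partitioned_cems(example_keys)
```

Matching on a pattern rather than on a hard-coded list keeps the filter short, but it only catches files that follow the assumed naming scheme, so other accidentally deployed files would still need a separate check.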

Success Criteria

No zombie EPA CEMS parquet paths show up in any of the following:

  • post-ETL data
  • the dropdowns in the sidebar
  • the data visualizations

Next steps
