Hard to read the pipeline #120
Reference for types: https://docs.censoredplanet.org/dns.html. For inputs, we can define some types along those lines.
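As a rough illustration of what such input types could look like, here is a minimal sketch using `TypedDict`. The type and field names are hypothetical and only loosely inspired by the fields documented at https://docs.censoredplanet.org/dns.html; they are not the actual pipeline schema.

```python
from typing import List, Optional, TypedDict

# Hypothetical answer record inside a Satellite scan result.
class SatelliteAnswer(TypedDict, total=False):
    ip: str
    asname: str
    matches_control: str

# Hypothetical top-level scan row; fields are illustrative only.
class SatelliteScanRow(TypedDict, total=False):
    domain: str
    resolver_ip: str
    answers: List[SatelliteAnswer]
    anomaly: Optional[bool]

# At runtime this is still a plain dict; the annotation documents its shape.
row: SatelliteScanRow = {
    'domain': 'example.com',
    'resolver_ip': '8.8.8.8',
    'answers': [{'ip': '93.184.216.34', 'matches_control': 'ip'}],
    'anomaly': False,
}
```

Because `TypedDict` imposes no runtime cost, such types can be introduced incrementally without changing pipeline behavior.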
I talked to @ohnorobo. We'll update the pipeline to remove the redundant duplication of the observations, which should significantly improve pipeline performance and address some of the modeling concerns. She described the change in a TODO here:
censoredplanet-analysis/pipeline/metadata/satellite.py
Lines 471 to 492 in 13a163a
I noticed that we are ingesting files of different types with the same functions in the pipeline. Worse, the types are sometimes only slightly different, which causes even more confusion. Add to that the fact that all the different data types are simply called "row", and it becomes a huge effort to make sense of how data is flowing.
We have to clean that up so we can make sense of what's going on. Especially with Satellite, which has many types.
Split flows
First and foremost, this call has to change:
censoredplanet-analysis/pipeline/beam_tables.py
Line 610 in 777f21b
We should not create a single PCollection with different types. Instead, each file should be its own PCollection.
Then process_satellite_lines should be removed in favor of different flows that process and join the different datasets.
The partition logic can be gone.
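To make the idea concrete without depending on Beam, here is a Beam-free sketch: instead of reading everything into one mixed collection and partitioning it later by guessing each element's source, each input kind gets its own collection and its own processing function. The plain lists stand in for PCollections, and the function and variable names are hypothetical.

```python
# Stand-in for a flow that only ever sees scan-result lines.
def process_results(lines):
    return [('result', line) for line in lines]

# Stand-in for a flow that only ever sees blockpage lines.
def process_blockpages(lines):
    return [('blockpage', line) for line in lines]

# Each input file type becomes its own collection, so no downstream
# step needs to inspect filenames or fields to tell elements apart.
result_lines = ['r1', 'r2']   # stands in for a results-only PCollection
blockpage_lines = ['b1']      # stands in for a blockpages-only PCollection

results = process_results(result_lines)
blockpages = process_blockpages(blockpage_lines)
```

In actual Beam terms, this would mean one `ReadFromText` (or equivalent) per input file pattern, each feeding its own transform chain, rather than a single read followed by a `Partition` step.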
As a rule of thumb, consider any selection logic based on filenames, like this, harmful:
censoredplanet-analysis/pipeline/metadata/flatten_satellite.py
Line 211 in 777f21b
Another similarly harmful practice is detecting the source file based on the presence of certain fields:
censoredplanet-analysis/pipeline/metadata/satellite.py
Line 124 in 777f21b
Both practices can be replaced by creating a separate PCollection for each input type.
This cleanup can be incremental. For instance, extracting the blockpage logic seems quite easy: just call _process_satellite_blockpages on a PCollection containing only the blockpage files.
Then you can extract each tag input file into its own flow.
I have the impression that this cleanup will also speed up the pipeline: many parts of the flow can run on more lightweight workers, instead of every worker loading all the data for all the flows, and joins will shuffle less data (each flow can be sorted separately).
Define and clarify row types
Another significant improvement is to name each data type and create type aliases for the existing Row (e.g. SatelliteScan or ResolverTags). We should not see the Row type anywhere. Note that they can all be a generic dictionary type, but the type annotations will help understand what goes where. We should also rename the variables to reflect their types.
/cc @ohnorobo @avirkud
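A minimal sketch of what such aliases might look like, assuming everything stays a generic dictionary at runtime as described above. SatelliteScan and ResolverTags are mentioned in this issue as example names; the function and its fields are hypothetical.

```python
from typing import Any, Dict

# All of these are plain dictionaries at runtime; the names exist purely
# to document which shape flows where.
Row = Dict[str, Any]            # the current catch-all type, to be phased out
SatelliteScan = Dict[str, Any]  # e.g. one Satellite scan result
ResolverTags = Dict[str, Any]   # e.g. metadata tags for one resolver

def merge_tags(scan: SatelliteScan, tags: ResolverTags) -> SatelliteScan:
    # The annotations, not the runtime types, carry the meaning: a reader
    # can now tell a scan from a tags record at a glance.
    merged: SatelliteScan = {**scan, **tags}
    return merged

merged = merge_tags({'domain': 'example.com'}, {'asname': 'GOOGLE'})
```

Since the aliases are interchangeable with Dict[str, Any], they can be introduced one function at a time, with variable renames following as each flow is split out.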
TODOs