Write input data to Parquet file #23

Open
aufdenkampe opened this issue Feb 10, 2021 · 1 comment

@aufdenkampe
Member

We've decided to write all input data to a Parquet file. Parquet is a high-performance, columnar binary storage format designed for big data and cloud computing.

Parquet is tightly integrated with pandas and is designed to handle complex, hierarchical, nested data structures, similar to HDF5.

Our intent is to support both HDF5 and Parquet for storage of input and output data.
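As a rough illustration of the pandas integration mentioned above, here is a minimal sketch of writing a table of input time series to Parquet and reading back only selected columns. The function names, the file layout (one file, one column per series), and the column names in the usage note are hypothetical and are not taken from the project; `DataFrame.to_parquet` and `pandas.read_parquet` are real pandas calls, but both need an optional engine (`pyarrow` or `fastparquet`) to be installed.

```python
from pathlib import Path

import pandas as pd


def write_input_parquet(df: pd.DataFrame, path: Path) -> None:
    """Write one table of input time series to a single Parquet file.

    Hypothetical helper: assumes one column per time series and a
    datetime index. Requires pyarrow or fastparquet.
    """
    df.to_parquet(path, index=True)


def read_input_columns(path: Path, columns: list[str]) -> pd.DataFrame:
    """Read back only the named columns.

    Parquet is columnar, so columns that are not requested do not
    have to be loaded into memory.
    """
    return pd.read_parquet(path, columns=columns)
```

For example, after `write_input_parquet(df, Path("inputs.parquet"))`, calling `read_input_columns(Path("inputs.parquet"), ["precip"])` would load just the hypothetical `precip` series rather than the whole table.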

@ptomasula
Member

ptomasula commented Feb 10, 2021

@aufdenkampe @steveskrip @htaolimno Switching the code to support Parquet files may prove to be a larger undertaking than initially expected. It seems the main method takes an HDF file as an argument, opens it as an HDFStore, and then passes that store to the various sub-processes. Making the switch will require updating all of those methods to work with a different file format as well.

I'm also not super keen on having a single file format be the only one supported by the business logic. I think there's a strong argument for making a pandas DataFrame the level at which the main code interfaces with the input data, with utilities that read other file formats and store their contents as uniformly formatted pandas DataFrames. I'd like to think that through some more. One immediate issue that comes to mind is that holding all of the input data in memory as a single DataFrame could be a problem. Maybe the solution is to write a method that pulls out just the time series (TS) needed for a specific operation (similar to this line) but is not specific to a file format. Open to other suggestions on how best to handle this.
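One possible shape for such a format-agnostic accessor is sketched below. Everything here is an assumption for discussion, not the project's actual design: the class names, the `get(key)` method, the suffix-based dispatch in `open_source`, and the Parquet layout (one column per time series) are all hypothetical. The idea is that the business logic only ever asks a source for one named time series and receives a DataFrame, so neither the file format nor the full dataset has to be visible to it.

```python
from pathlib import Path
from typing import Mapping, Protocol

import pandas as pd


class TimeSeriesSource(Protocol):
    """Anything that can hand back one named time series as a DataFrame."""

    def get(self, key: str) -> pd.DataFrame: ...


class HDFSource:
    """Loads one key at a time from an HDF5 file (needs PyTables)."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)

    def get(self, key: str) -> pd.DataFrame:
        # read_hdf opens the file, loads only the requested key, then closes it.
        return pd.read_hdf(self.path, key=key)


class ParquetSource:
    """Loads one column at a time from a Parquet file.

    Assumes a hypothetical layout with one column per time series.
    Needs pyarrow or fastparquet.
    """

    def __init__(self, path: Path) -> None:
        self.path = Path(path)

    def get(self, key: str) -> pd.DataFrame:
        return pd.read_parquet(self.path, columns=[key])


class InMemorySource:
    """Wraps DataFrames that are already loaded; handy for tests."""

    def __init__(self, frames: Mapping[str, pd.DataFrame]) -> None:
        self._frames = dict(frames)

    def get(self, key: str) -> pd.DataFrame:
        return self._frames[key]


def open_source(path: Path) -> TimeSeriesSource:
    """Pick a reader from the file suffix (hypothetical dispatch rule)."""
    suffix = Path(path).suffix.lower()
    if suffix in {".h5", ".hdf", ".hdf5"}:
        return HDFSource(path)
    if suffix == ".parquet":
        return ParquetSource(path)
    raise ValueError(f"Unsupported input format: {suffix!r}")


def total_for(source: TimeSeriesSource, key: str) -> float:
    """Stand-in for business logic: it sees only DataFrames, never files."""
    return float(source.get(key).sum().sum())
```

With this arrangement, only the single requested series is held in memory at a time, and supporting another format would mean adding one more class with a `get` method rather than touching each sub-process.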


3 participants