-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reading Fortran data using python interface #1665
Comments
The way Python APIs are designed is to follow |
It'd certainly be nice to figure out the dimensions of a dataset without looking at the source code that has written it, but unfortunately, that's exactly what's currently not possible. If I give you the bp file that the sample code created, how can you tell whether it's a 10x5 or a 5x10 matrix? You can use bpls, and it'll tell you 10x5, but that's only true if you know that the code that wrote it is was C or Python. If it was written by Fortran or Matlab, it was actually a 5x10 matrix, even though bpls says differently. And that's easily fixable. At the minimum, bpls could say
though I think it'd be better if bpls showed it correctly (ie, what it was when written, not transposed). Python could (and should, IMO) just load it as a 5x10 dataset, because that's what was written, no performance penalty. If someone runs a C program to create a 512x128 image and gives it to you, you'll get a 512x128 image in python, as you'd expect. If someone else creates their images in Fortran and gives you their data file, python/adios2 will present it as a 128x512 image. Sure, from your view that's correct, because you define correct as "what bpls shows" rather than what the original dataset was, but that's probably not what most people expect. People in general don't really care about thinking about row-major vs col-major, transposing dims and indices or whatever. What they care about is that when they write a MxN dataset and read it back they get a MxN dataset, no matter how it's stored on either end or in the file. You can google "hdf5 transpose" or "hdf5 reverse dims" and you'll easily find 100s of people being confused by their design. I think one should learn from that. |
adios2 preserves fast and slow index information, on top of that the C++ read API tells you if the dimensions were reversed. I'd argue that most people would expect handling numpy array as row-major that's why it's their default, the order options are very confusing, order : {‘K’, ‘A’, ‘C’, ‘F’} I have to admit. BTW, I am not denying it's confusing (it's since underlying languages are different). I am trying examples in #1661 to understand better the interactions. |
So this particular thread's suggestion is that we should not reverse dimensions automatically because, "What they care about is that when they write a MxN dataset and read it back they get a MxN dataset, no matter how it's stored on either end or in the file." In adios design, in general since the beginning, we consciously did not want to allow for any feature that slows down adios all the time for the sake of some user who wants that feature. For this case, we have a fast (and maybe ugly) way of reading data. And the rules have been fairly simple: you write in Fortran/Matlab and read in C/python or vice versa, you see your dimensions reversed. The suggestion that we should represent the dimensions in all languages as it was written goes against our design philosophy. Scalable performance comes before comfort requirements. So this is for this thread. However, the request and discussion is there to allow for defining the layout and querying the layout, which will need some changes and that discussion is going on in the other thread. That will allow more control for those users who need that control. |
So what I'm proposing is actually like 'K', that is, preserve the original order (on disk / in the BP3 stream). If the data was created in Fortran, it was written in col-major and would be read and presented as col major. And vice versa. |
@pnorbert: What I'm proposing is to not reverse the dimensions if the language supports both data layouts. I'm not proposing to transpose data ever, because it's too expensive (and because it's a change in behavior). So if you write row-major data from C and read it in Fortran, which only supports col-major, the only option is to reverse dims, and one just has to live with that. If you write col-major data from Fortran and read it in Python, though, you can make a ndarray that points to the original (untransposed) data and tell python that it's in column major order, so you get the data back with the original index order. That alleviates a lot of confusion when using the python API as a user. Python's default layout is 'K', which is whatever layout the data is already in. In fact, python doesn't even have a function to query the underlying data layout from an ndarray. You can use For C++, nothing would change by default. But querying the underlying layout, a program like bpls could go and say "I'm presenting the data layout as if it were row-major" because I've been written in C++, but I'll let you know that it's actually transposed from what it originally was. Or it could actually still read it back as row-major, but transpose it when displaying it. Because people who use bpls probably don't care about what language bpls is written in, they care about the data that's in the bp file. |
@pnorbert thanks for explanation. Yeah, I am trying to understand in #1661 the different options for @germasch proposed APIs. I am taking this as an opportunity to see if we have an opportunity (pun intended). First starting to understand how well-established libraries handle this, basically boost, numpy order, Eigen. |
I'd add xtensor and kokkos to the list. I'm not familiar with the boost types, but I believe all the other ones support both row-major and col-major. |
Okay, I see. I am a backward C programmer who assumed python and hence numpy was a row-major language. Now I accept hardcore Numpy users may want to use it more flexibly. So how do we make this simpler with the API just for Python, simpler than for C++ in the other thread? You need to inquire the adios variable first to know the layout in the input, then allocate a numpy array (with your chosen layout) and then read the data with adios. In the high-level API, in a serial python script, an extra flag in f.read() could do the trick:
Maybe we could get the user intention in a layout flag in adios2.open()? |
Yes,
is exactly what I was going to propose except that I'd call it I think there's an open issue about what the default should be, because I think the natural thing would be |
@germasch agree, I'd stick with numpy's convention, so users (and I mean users, not application developers) don't have to add to their learning curve. |
@pnorbert adding a stream-level setting is a good suggestion, the high-level Python API allows for a call to set_parameter (
|
@williamfgc What about the low-level API? I tried to address both with a flag in open(). |
@pnorbert that's a good question and I don't have a good answer, there are a few options:
|
When reading data written by a Fortran code using python, indices end up being reversed:
Sample Fortran code (it's based on the included heatmap test)
Output:
[kai@macbook build-fortran (pr/fix-mgard *)]$ bin/test_fortran sample temperature(2,8) = 0.13062623044039723
Reading the bp file from Python:
gives
To access the Fortran data element
t(2, 8)
from Python, one has to reverse the index order:This is related to #1661 , but actually fixable without API additions.
The text was updated successfully, but these errors were encountered: