Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading Fortran data using python interface #1665

Open
germasch opened this issue Aug 9, 2019 · 14 comments
Open

Reading Fortran data using python interface #1665

germasch opened this issue Aug 9, 2019 · 14 comments

Comments

@germasch
Copy link
Contributor

germasch commented Aug 9, 2019

When reading data written by a Fortran code using python, indices end up being reversed:

Sample Fortran code (it's based on the included heatmap test)

program TestBPWriteReadHeatMap2D
  use mpi
  use adios2

  implicit none

  type(adios2_adios) :: adios
  type(adios2_io) :: ioPut
  type(adios2_engine) :: bpWriter
  type(adios2_variable) :: var

  real(kind=8) :: temperature(0:4, 0:9)

  integer(kind=8), parameter :: ishape(2) = (/5, 10/)
  integer(kind=8), parameter :: istart(2) = (/0, 0/)
  integer :: ierr

  call mpi_init(ierr)
  call adios2_init(adios, MPI_COMM_WORLD, adios2_debug_mode_on, ierr)
  
  call random_number(temperature)
  print*, 'sample temperature(2,8) =', temperature(2, 8)

  call adios2_declare_io(ioPut, adios, 'HeatMapWrite', ierr)
  call adios2_define_variable(var, ioPut, 'temperature', adios2_type_dp, &
                              2, ishape, istart, ishape, &
                              adios2_constant_dims, ierr)
  call adios2_open(bpWriter, ioPut, 'HeatMap2D_f.bp', adios2_mode_write, &
                   ierr)
  call adios2_put(bpWriter, var, temperature, ierr)
  call adios2_close(bpWriter, ierr)

  call adios2_finalize(adios, ierr)
  call mpi_finalize(ierr)

end program TestBPWriteReadHeatMap2D

Output:

[kai@macbook build-fortran (pr/fix-mgard *)]$ bin/test_fortran
 sample temperature(2,8) =  0.13062623044039723

Reading the bp file from Python:

import adios2
fh = adios2.open("/Users/kai/build/ADIOS2/build-fortran/HeatMap2D_f.bp", "r")
t = fh.read('temperature')
print(t[2, 8])

gives

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-10-cf0fcf0017a8> in <module>
      2 fh = adios2.open("/Users/kai/build/ADIOS2/build-fortran/HeatMap2D_f.bp", "r")
      3 t = fh.read('temperature')
----> 4 print(t[2, 8])

IndexError: index 8 is out of bounds for axis 1 with size 5

To access the Fortran data element t(2, 8) from Python, one has to reverse the index order:

>>> print(t[8, 2])
0.13062623044039723

This is related to #1661 , but actually fixable without API additions.

@williamfgc
Copy link
Contributor

The way Python APIs are designed is to follow bpls. Rarely data users will look into the source code that generate the files (in fact, it could be some commit somewhere). bpls is assumed to be the starting point or a Python inspection function. Our ecosystem tools are natural to those languages (C++, Python). Changing this would imply changing bpls as well.

@germasch
Copy link
Contributor Author

germasch commented Aug 9, 2019

It'd certainly be nice to figure out the dimensions of a dataset without looking at the source code that has written it, but unfortunately, that's exactly what's currently not possible. If I give you the bp file that the sample code created, how can you tell whether it's a 10x5 or a 5x10 matrix? You can use bpls, and it'll tell you 10x5, but that's only true if you know that the code that wrote it is was C or Python. If it was written by Fortran or Matlab, it was actually a 5x10 matrix, even though bpls says differently. And that's easily fixable. At the minimum, bpls could say

double   temperature  {10, 5} /* transposed from original {5, 10} */

though I think it'd be better if bpls showed it correctly (ie, what it was when written, not transposed).

Python could (and should, IMO) just load it as a 5x10 dataset, because that's what was written, no performance penalty. If someone runs a C program to create a 512x128 image and gives it to you, you'll get a 512x128 image in python, as you'd expect. If someone else creates their images in Fortran and gives you their data file, python/adios2 will present it as a 128x512 image. Sure, from your view that's correct, because you define correct as "what bpls shows" rather than what the original dataset was, but that's probably not what most people expect. People in general don't really care about thinking about row-major vs col-major, transposing dims and indices or whatever. What they care about is that when they write a MxN dataset and read it back they get a MxN dataset, no matter how it's stored on either end or in the file.

You can google "hdf5 transpose" or "hdf5 reverse dims" and you'll easily find 100s of people being confused by their design. I think one should learn from that.

@williamfgc
Copy link
Contributor

williamfgc commented Aug 9, 2019

Sure, from your view that's correct, because you define correct as "what bpls shows" rather than what the original dataset was, but that's probably not what most people expect.

adios2 preserves fast and slow index information, on top of that the C++ read API tells you if the dimensions were reversed. I'd argue that most people would expect handling numpy array as row-major that's why it's their default, the order options are very confusing, order : {‘K’, ‘A’, ‘C’, ‘F’} I have to admit.

BTW, I am not denying it's confusing (it's since underlying languages are different). I am trying examples in #1661 to understand better the interactions.

@pnorbert
Copy link
Contributor

pnorbert commented Aug 9, 2019

So this particular thread's suggestion is that we should not reverse dimensions automatically because, "What they care about is that when they write a MxN dataset and read it back they get a MxN dataset, no matter how it's stored on either end or in the file."

In adios design, in general since the beginning, we consciously did not want to allow for any feature that slows down adios all the time for the sake of some user who wants that feature. For this case, we have a fast (and maybe ugly) way of reading data. And the rules have been fairly simple: you write in Fortran/Matlab and read in C/python or vice versa, you see your dimensions reversed.

The suggestion that we should represent the dimensions in all languages as it was written goes against our design philosophy. Scalable performance comes before comfort requirements.

So this is for this thread. However, the request and discussion is there to allow for defining the layout and querying the layout, which will need some changes and that discussion is going on in the other thread. That will allow more control for those users who need that control.

@germasch
Copy link
Contributor Author

germasch commented Aug 9, 2019

So what I'm proposing is actually like 'K', that is, preserve the original order (on disk / in the BP3 stream). If the data was created in Fortran, it was written in col-major and would be read and presented as col major. And vice versa.

@germasch
Copy link
Contributor Author

germasch commented Aug 9, 2019

@pnorbert: What I'm proposing is to not reverse the dimensions if the language supports both data layouts. I'm not proposing to transpose data ever, because it's too expensive (and because it's a change in behavior).

So if you write row-major data from C and read it in Fortran, which only supports col-major, the only option is to reverse dims, and one just has to live with that.

If you write col-major data from Fortran and read it in Python, though, you can make a ndarray that points to the original (untransposed) data and tell python that it's in column major order, so you get the data back with the original index order. That alleviates a lot of confusion when using the python API as a user. Python's default layout is 'K', which is whatever layout the data is already in. In fact, python doesn't even have a function to query the underlying data layout from an ndarray. You can use .strides to draw your own conclusions, though it could be row-major, col-major or many other things.

For C++, nothing would change by default. But querying the underlying layout, a program like bpls could go and say "I'm presenting the data layout as if it were row-major" because I've been written in C++, but I'll let you know that it's actually transposed from what it originally was. Or it could actually still read it back as row-major, but transpose it when displaying it. Because people who use bpls probably don't care about what language bpls is written in, they care about the data that's in the bp file.

@williamfgc
Copy link
Contributor

williamfgc commented Aug 9, 2019

@pnorbert thanks for explanation. Yeah, I am trying to understand in #1661 the different options for @germasch proposed APIs. I am taking this as an opportunity to see if we have an opportunity (pun intended). First starting to understand how well-established libraries handle this, basically boost, numpy order, Eigen.

@germasch
Copy link
Contributor Author

germasch commented Aug 9, 2019

I'd add xtensor and kokkos to the list. I'm not familiar with the boost types, but I believe all the other ones support both row-major and col-major.

@pnorbert
Copy link
Contributor

pnorbert commented Aug 9, 2019

Okay, I see. I am a backward C programmer who assumed python and hence numpy was a row-major language. Now I accept hardcore Numpy users may want to use it more flexibly.

So how do we make this simpler with the API just for Python, simpler than for C++ in the other thread? You need to inquire the adios variable first to know the layout in the input, then allocate a numpy array (with your chosen layout) and then read the data with adios.

In the high-level API, in a serial python script, an extra flag in f.read() could do the trick:

fh = adios2.open("/Users/kai/build/ADIOS2/build-fortran/HeatMap2D_f.bp", "r")
t = fh.read('temperature', layout='K')

Maybe we could get the user intention in a layout flag in adios2.open()?

@germasch
Copy link
Contributor Author

germasch commented Aug 9, 2019

Yes,

t = fh.read('temperature', layout='K') 

is exactly what I was going to propose except that I'd call it order='K' for consistency with numpy.

I think there's an open issue about what the default should be, because I think the natural thing would be K as that's what people who don't think about layouts would naturally expect. But it's a change from current behavior, which I suppose would be specified as order='C'. And changing behavior is not a good thing...

@williamfgc
Copy link
Contributor

williamfgc commented Aug 9, 2019

@germasch agree, I'd stick with numpy's convention, so users (and I mean users, not application developers) don't have to add to their learning curve.

@williamfgc
Copy link
Contributor

williamfgc commented Aug 9, 2019

Maybe we could get the user intention in a layout flag in adios2.open()?

@pnorbert adding a stream-level setting is a good suggestion, the high-level Python API allows for a call to set_parameter (open is deferred until the first read).

fh = adios2.open("/Users/kai/build/ADIOS2/build-fortran/HeatMap2D_f.bp", "r")
f.set_parameter('order', 'K')
t = fh.read('temperature')

@pnorbert
Copy link
Contributor

pnorbert commented Aug 9, 2019

@williamfgc What about the low-level API? I tried to address both with a flag in open().

@williamfgc
Copy link
Contributor

@pnorbert that's a good question and I don't have a good answer, there are a few options:

  1. Use adios2::Operator of type "order" and use either IO.AddOperation or Variable.AddOperation for the granularity. This option won't change current APIs.
  2. Use IO::SetParameter and Variable::SetOrder (this would be new).
  3. Add IO::SetOrder (this would be new) and Variable::SetOrder (this would be new).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants