Best practices for zarr and GCS streaming applications #595
Thanks for the great question @skgbanga. I will try to provide some answers, but first I just wanted to tag this as closely related to pangeo-forge/pangeo-forge-recipes#11.
After reviewing your GCS code, I think I understand the main issue that is making it slow.

You are doing a write operation for each row. But zarr does not currently support partial chunk reads / writes (see #521 and #584; this is work in progress). So each of these row assignments effectively touches a whole chunk, which means a read and a write against GCS for every row.

Your data is tiny, so throughput is not an issue, but latency is. Each of these i/o ops will incur ~200 ms of latency (+/- depending on how far you are from your GCS region): 0.2 s * 2 * 100 = 40 s, so we are in the right ballpark. Instead, I would recommend batching your writes together to cover a full chunk. We are also working on adding some async features (e.g. #536), which should help with these latency issues. However, I don't think this would help you, since your writes need to be synchronous.

Edit: if you batch your writes together, you should be able to reduce the time by a factor of roughly the number of rows per chunk.
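For concreteness, a minimal sketch of what I mean by batching, assuming a gcsfs-backed store (the bucket path, array sizes, and variable names are placeholders, not your actual code):

```python
import gcsfs
import numpy as np
import zarr

fs = gcsfs.GCSFileSystem()
store = fs.get_mapper("my-bucket/example.zarr")  # placeholder bucket/path

n, xs, chunk_size = 100, 2, 10
z = zarr.create(store=store, shape=(n, xs), chunks=(chunk_size, xs), dtype="float")

buffer = []
for i in range(n):
    buffer.append(np.arange(xs, dtype="float"))  # one incoming row
    if len(buffer) == chunk_size:
        # one store write covering exactly one chunk, instead of one write per row
        z[i + 1 - chunk_size:i + 1, :] = np.array(buffer)
        buffer = []
```

With this pattern there are n / chunk_size store writes instead of n, so the latency cost drops accordingly.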
@rabernat Thanks for the quick reply!
So my initial numbers were from a machine that accessed Google infrastructure over the public internet. I have now moved to a cloud host that can access GCS via Google's internal 10G(?) network. To show the difference:
So a ~3x difference in ping times (ping is not a great metric, but it gives some idea). On this host, my original numbers become:
Much better than the original 54 secs. Now on to your suggestion about this line:
I tried doing what you suggested:

```python
import numpy as np
import zarr

# `timer`, `chunk_size`, `xs`, and `n` are defined elsewhere in the original script.

@timer
def iterate(store):
    z = zarr.create(store=store, shape=(chunk_size, xs), chunks=(chunk_size, None), dtype="float")
    rows = []
    num_chunks = 1
    for i in range(n):
        row = np.arange(xs, dtype="float")
        rows.append(row)
        if (i + 1) % chunk_size == 0:
            # write a full chunk in one operation, then grow the array for the next one
            start = (num_chunks - 1) * chunk_size
            end = num_chunks * chunk_size
            z[start:end, :] = np.array(rows)
            rows = []
            num_chunks += 1
            z.resize(num_chunks * chunk_size, xs)
    assert not rows  # TODO: handle leftover rows when n is not a multiple of chunk_size
    z.resize(n, xs)
```

With the above code, the numbers are:
So that's fantastic. Note that in the minuscule sample data above, we have 100 rows arranged in 10 chunks of 10 rows each. If I use just a single chunk of 100 rows (note that at this point it is not really streaming), the time drops even further.

So the biggest guideline when it comes to streaming/zarr/GCS seems to be: write data a full chunk at a time. I will say that this is slightly non-intuitive, since streaming backends typically support some sort of caching before doing the 'flush', but it seems that zarr unconditionally forwards the call to `__setitem__`, which calls the GCS Python API to do the actual write.

Thanks a lot for your help! I am going to test this with a real workload now, and will reopen this issue if I hit more roadblocks. It would also be great if all the knowledge on this subject could be consolidated into a section in the documentation. Cheers.

PS: I am also going to check the other issues you linked and see if I can leverage those.
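For reference, here is roughly the kind of write-side buffering I have in mind (a sketch rather than the code I actually ran; class and variable names are made up):

```python
import numpy as np
import zarr


class ChunkBufferedWriter:
    """Accumulate rows in memory and write them to a zarr array one full chunk at a time."""

    def __init__(self, z, chunk_rows):
        self.z = z                    # 2D zarr array chunked as (chunk_rows, ncols)
        self.chunk_rows = chunk_rows
        self.buffer = []
        self.written = 0              # number of rows already flushed to the store

    def append(self, row):
        self.buffer.append(np.asarray(row))
        if len(self.buffer) == self.chunk_rows:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        block = np.array(self.buffer)
        end = self.written + len(block)
        if end > self.z.shape[0]:
            self.z.resize(end, self.z.shape[1])  # grow the array before writing
        self.z[self.written:end, :] = block      # one store write per flush
        self.written = end
        self.buffer = []


# usage sketch with an in-memory store
z = zarr.create(store={}, shape=(0, 2), chunks=(10, 2), dtype="float")
writer = ChunkBufferedWriter(z, chunk_rows=10)
for i in range(105):
    writer.append(np.arange(2, dtype="float"))
writer.flush()  # the leftover 5 rows go out as a final partial write
```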
I'm curious if you can verify whether the data written to GCS matches your original source data exactly. pangeo-forge/pangeo-forge-recipes#11 describes a bug where a zarr appending workflow with GCS does not work as expected, but we haven't been able to track down the cause of that problem.
@rabernat I can verify that the data is exactly equal between the two arrays. But note that I am building a chunk locally and then writing exactly that chunk to GCS. (The pangeo-forge/pangeo-forge-recipes#11 example is doing a general append, so it is probably exposing more edge cases?) While we are on the subject of equality, I find the following behavior quite unintuitive:
The equality operator for `Array`:

```python
def __eq__(self, other):
    return (
        isinstance(other, Array) and
        self.store == other.store and
        self.read_only == other.read_only and
        self.path == other.path and
        not self._is_view
        # N.B., no need to compare other properties, should be covered by
        # store comparison
    )
```

I am coming from a C++ background, and this seems like comparing two vectors (https://en.cppreference.com/w/cpp/container/vector) based on their allocators rather than their actual data. I think two arrays should compare equal iff they have the same data, irrespective of where that data is coming from.
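For what it's worth, comparing the actual contents is straightforward to do by hand; a minimal sketch of the check I would want `==` to perform (the in-memory arrays here just stand in for a local copy and a GCS-backed copy):

```python
import numpy as np
import zarr

data = np.arange(20, dtype="float").reshape(10, 2)

# Two zarr arrays with identical contents, backed by two separate stores.
a = zarr.array(data, chunks=(5, 2))
b = zarr.array(data, chunks=(5, 2))

# Compare the data itself, independent of which store it lives in.
print(np.array_equal(a[:], b[:]))  # True
```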
Hello,
We are exploring zarr as a potential file format for our application. It is a streaming application that generates rows of data, which are continuously appended to a 2D matrix.
I couldn't find any 'best practices' guidelines for streaming with zarr and GCS (or, for that matter, any cloud storage). Please point me in the right direction if something like this already exists.
To evaluate zarr, I wrote a small script (Kudos on good docs! I was able to write this small app in very little time).
Note that this is NOT optimized at all. The point of this issue/post is to figure out the best practices for such an application.
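(The actual script isn't reproduced here; a minimal sketch of the naive per-row pattern described below might look like this, with the bucket path, sizes, and chunking as placeholders:)

```python
import gcsfs
import numpy as np
import zarr

fs = gcsfs.GCSFileSystem()
store = fs.get_mapper("my-bucket/example.zarr")  # placeholder bucket/path

n, xs = 100, 2
z = zarr.create(store=store, shape=(n, xs), chunks=(10, xs), dtype="float")

for i in range(n):
    row = np.arange(xs, dtype="float")
    z[i, :] = row  # one GCS round-trip per row -- this is the slow part
```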
Results:
The above is a 100 x 2 matrix, so 200 floats.
As you can see, this naive method of appending rows to zarr is clearly not the right way to do it. (For reference, if I manually upload `example.zarr` to GCS using `gsutil`, it takes ~1.6 secs.) My guess is that every time I do `z[i, :] = row`, it does a GCS write, and that is destroying the performance. So my major question is:
PS: A quick look at `strace ./foo.py --gcs` showed a lot of this:

I know zarr supports parallel writes to an archive. Are these `futex` calls because of those?