Questions about the train_x_lpd_5_phr.npz file #100
Hi, thanks for the feedback. I just checked the size of the training data. The dense array takes 7.91 GB, but the decompressed NPZ file also needs 5.82 GB of RAM, so the total required RAM would be roughly 14 GB. A temporary workaround would be casting the nonzero index array to a smaller integer dtype (e.g. `np.uint32`) while building the dense array:

```python
def load_data_from_npz(filename):
    """Load and return the training data from a npz file (sparse format)."""
    with np.load(filename) as f:
        data = np.zeros(f["shape"], np.bool_)
        # cast the indices to uint32 to roughly halve their memory footprint
        data[tuple(f["nonzero"].astype(np.uint32))] = True
    return data
```

For building the data pipeline, you could then save each sample to its own NPY file:

```python
for i, x in enumerate(data):
    np.save("./data/{}.npy".format(i), x)
```

But still, you need to fit the whole dense array in RAM first to do so. Another way is to slice the nonzero array. Something like:

```python
with np.load(filename) as f:
    # read these once; an NpzFile decompresses the data on every access
    shape, nonzero = f["shape"], f["nonzero"]
    start = 0
    for i in range(shape[0]):
        # entries belonging to sample i occupy nonzero[:, start:end]
        end = np.searchsorted(nonzero[0], i, "right")
        data = np.zeros(shape[1:], np.bool_)
        # drop the first (sample-index) row when indexing a single sample
        data[tuple(nonzero[1:, start:end])] = True
        np.save("./data/{}.npy".format(i), data)
        start = end
```

This would save each sample into an NPY file; the disadvantage is that it would probably take a great amount of space on your hard disk. A better way would be to use a similar approach to create a TensorFlow dataset that compiles the desired samples on the fly by slicing the nonzero array:

```python
# assumes `shape` and `nonzero` were loaded from the NPZ file as above
def get_sample(i):
    start = np.searchsorted(nonzero[0], i, "left")
    end = np.searchsorted(nonzero[0], i, "right")
    data = np.zeros(shape[1:], np.bool_)
    data[tuple(nonzero[1:, start:end])] = True
    return data
```

Not sure how to use this with `tf.data`, though. Hope this helps.
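A minimal sketch of one way to hook this on-the-fly slicing into `tf.data`, via `Dataset.from_generator` (the filename, shuffle buffer, and batch size below are illustrative assumptions, not from the thread):

```python
import numpy as np
import tensorflow as tf

# only the sparse index array (~5.8 GB) needs to fit in RAM, never the dense tensor
with np.load("train_x_lpd_5_phr.npz") as f:
    shape, nonzero = f["shape"], f["nonzero"]

def gen():
    """Yield one dense boolean sample at a time by slicing the nonzero array."""
    start = 0
    for i in range(shape[0]):
        end = np.searchsorted(nonzero[0], i, "right")
        data = np.zeros(shape[1:], np.bool_)
        data[tuple(nonzero[1:, start:end])] = True
        start = end
        yield data

dataset = (
    tf.data.Dataset.from_generator(
        gen, output_signature=tf.TensorSpec(tuple(shape[1:]), tf.bool)
    )
    .shuffle(1000)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```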
I ended up doing this. In my application I sub-sampled the dataset, hence I did not use memory mapping, but in theory this approach should work with memory-mapped arrays as well. Good luck with your research!
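For completeness, a minimal sketch of the single-file plus memory-mapping route described above (the filenames, seed, and sub-sample size are hypothetical, and the one-time conversion still needs the full ~14 GB of RAM once):

```python
import numpy as np

# One-time conversion: materialize the dense array and write it to a single
# NPY file, e.g.:
# np.save("train_x_lpd_5_phr.npy", load_data_from_npz("train_x_lpd_5_phr.npz"))

# Later, open it memory-mapped: slices are read from disk only when touched.
data = np.load("train_x_lpd_5_phr.npy", mmap_mode="r")

# Sub-sample the dataset; fancy indexing copies only the selected
# samples into RAM.
rng = np.random.default_rng(seed=0)
indices = np.sort(rng.choice(data.shape[0], size=1024, replace=False))
subset = data[indices]
```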
Hi,
first of all thank you very much for releasing your source code.
I'd like to use the LPD-5-cleansed data that you provide for my project, but I am facing some issues with the `train_x_lpd_5_phr.npz` file. By running your code, I get my RAM (12 GB) maxed out (starting from ~11.6 GB free), but, as @salu133445 said in #46, it should take only ~5 GB.
Even trying to create a SharedArray produces the same results (since the code is almost identical).
How can I access this data?
My ultimate goal is to create a TensorFlow (2) dataset, so I'm not really interested in having the data in a dense format, but, as far as I know, having a `.npy` file containing the whole dataset lets me build a tf dataset that can handle offline data access when the whole tensor does not fit inside the available RAM. I tried this both on Google Colab (12 GB RAM / no swap) and on my laptop (8 GB RAM / 4 GB swap) and got the same results.
Thank you!