Questions about the train_x_lpd_5_phr.npz file #100

Closed
w00zie opened this issue May 27, 2020 · 2 comments

@w00zie
w00zie commented May 27, 2020

Hi,

First of all, thank you very much for releasing your source code.

I'd like to use the LPD-5-cleansed data that you provide for my project, but I am facing some issues with the train_x_lpd_5_phr.npz file.
By running your code

def load_data_from_npz(filename):
    """Load and return the training data from a npz file (sparse format)."""
    with np.load(filename) as f:
        data = np.zeros(f['shape'], np.bool_)
        data[tuple(x for x in f['nonzero'])] = True
    return data

I get my RAM (12 GB) maxed out (starting from ~11.6 GB free), even though, as @salu133445 said in #46, it should take only ~5 GB.
Trying to create a SharedArray instead produces the same result (since the code is almost identical).

How can I access this data?
My ultimate goal is to create a TensorFlow (2) dataset, so I'm not really interested in having the data in a dense format. However, as far as I know, having a .npy file containing the whole dataset would let me build a tf.data pipeline with offline (out-of-core) data access when the whole tensor does not fit in the available RAM.
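
To make "offline data access" concrete, what I have in mind is memory-mapping a dense dump, roughly along these lines (the file name is just a placeholder, I don't have such a dump yet):

import numpy as np

# Hypothetical dense NPY dump of the whole dataset on disk; mmap_mode="r"
# avoids loading the full array into RAM and reads slices only when accessed.
data = np.load("train_x_dense.npy", mmap_mode="r")
sample = np.array(data[0])  # only this one sample is copied into memory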

I tried this both on Google Colab (12 GB RAM, no swap) and on my laptop (8 GB RAM, 4 GB swap) and got the same result.

Thank you!

@salu133445
Owner

salu133445 commented May 27, 2020

Hi,

Thanks for the feedback. I just checked the size of the training data. The dense array takes 7.91 GB, but the decompressed nonzero-index array from the NPZ file also takes 5.82 GB of RAM, so the total required RAM is roughly 14 GB. A temporary workaround is to cast the nonzero array (f["nonzero"]) to np.uint32, which halves its memory footprint and is safe since its values range from 0 to 102377.

def load_data_from_npz(filename):
    """Load and return the training data from a npz file (sparse format)."""
    with np.load(filename) as f:
        data = np.zeros(f["shape"], np.bool_)
        # index with a tuple of per-axis index arrays, cast to uint32 to save memory
        data[tuple(f["nonzero"].astype(np.uint32))] = True
    return data

For building the data pipeline with tf.data, I guess the easiest way is to store slices of the dense array on disk, for example one sample per file, and load them back later. Something like

for i, x in enumerate(data):
    np.save("./data/{}.npy".format(i), x)

But you would still need to fit the whole dense array in RAM first to do so. Another way is to slice the nonzero array directly. Something like

with np.load(filename) as f:
    nonzero = f["nonzero"]  # shape (n_dims, n_nonzeros); row 0 holds the sample index
    n_samples = f["shape"][0]
    start = 0
    for i in range(n_samples):
        end = np.searchsorted(nonzero[0], i, "right")
        data = np.zeros(f["shape"][1:], np.bool_)
        # drop the sample-index row and index with a tuple of per-axis index arrays
        data[tuple(nonzero[1:, start:end])] = True
        np.save("./data/{}.npy".format(i), data)
        start = end

This saves each sample to its own NPY file; the disadvantage is that it would probably take a large amount of space on your hard disk. A better way would be to use a similar approach to create a TensorFlow dataset that compiles the desired samples on the fly by slicing the nonzero array.

def get_sample(i):
    """Compile the i-th sample into a dense boolean array (the NPZ file `f` must be open)."""
    start = np.searchsorted(f["nonzero"][0], i, "left")
    end = np.searchsorted(f["nonzero"][0], i, "right")
    data = np.zeros(f["shape"][1:], np.bool_)
    data[tuple(f["nonzero"][1:, start:end])] = True
    return data
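
One untested guess at wiring this into tf.data would be Dataset.from_generator, keeping the NPZ file open for the lifetime of the dataset (the batch size below is a placeholder, and reading f["nonzero"] into a plain in-memory array once beforehand would probably be much faster, since every access to f decompresses it again):

import numpy as np
import tensorflow as tf

f = np.load("train_x_lpd_5_phr.npz")  # kept open; get_sample reads from it

def sample_generator():
    for i in range(int(f["shape"][0])):
        yield get_sample(i)  # dense boolean array for the i-th sample

dataset = (
    tf.data.Dataset.from_generator(
        sample_generator,
        output_types=tf.bool,
        output_shapes=tuple(f["shape"][1:]),
    )
    .batch(64)
    .prefetch(tf.data.experimental.AUTOTUNE)
)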

I have not actually tried the tf.data part myself, though. If you figure out a working pipeline, I would appreciate it if you shared your approach here, as I believe many people, including me, would be interested.

Hope this helps.

@w00zie
Author

w00zie commented Sep 10, 2020

I ended up doing this. In the main section I tested the running time of this approach on a small portion of the LPD-5-cleansed dataset (a 2048-sample dense NPY array) and found that, with appropriate pre-fetching and caching, it is possible to reach decent batching times.

In my application I sub-sampled the dataset, hence I did not use memory mapping, but in theory this approach should work with memory-mapped arrays as well.
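
For reference, the core of my pipeline looks roughly like the sketch below; the file name, batch size and shuffle buffer are just placeholders, and I have only tested it on an in-memory subset, not on a memory-mapped dump of the full dataset:

import numpy as np
import tensorflow as tf

# Dense NPY dump of the (sub-sampled) dataset; with mmap_mode="r" the array
# stays on disk and only the accessed samples are read into RAM.
data = np.load("train_subset.npy", mmap_mode="r")

def gen():
    for sample in data:
        yield np.asarray(sample, dtype=np.bool_)  # copies one sample at a time

dataset = (
    tf.data.Dataset.from_generator(
        gen,
        output_types=tf.bool,
        output_shapes=data.shape[1:],
    )
    .cache()       # keep decoded samples in memory after the first epoch
    .shuffle(1024)
    .batch(64)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
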
Let me know if this (somehow) helped; I'd be glad to give something back.

Good luck with your research
