Questions about the train_x_lpd_5_phr.npz file #100

Closed
w00zie opened this issue May 27, 2020 · 2 comments

@w00zie
w00zie commented May 27, 2020

Hi,

First of all, thank you very much for releasing your source code.

I'd like to use the LPD-5-cleansed data that you provide for my project, but I am facing some issues with the train_x_lpd_5_phr.npz file.
By running your code

def load_data_from_npz(filename):
    """Load and return the training data from a npz file (sparse format)."""
    with np.load(filename) as f:
        data = np.zeros(f['shape'], np.bool_)
        data[tuple(x for x in f['nonzero'])] = True
    return data

I get my RAM (12 GB) maxed out (starting from ~11.6 GB free), even though, as @salu133445 said in #46, it should take only ~5 GB.
Trying to create a SharedArray instead produces the same result (since the code is almost identical).

How can I access this data?
My ultimate goal is to create a TensorFlow (2) dataset, so I'm not really interested in having the data in a dense format. However, as far as I know, having a .npy file containing the whole dataset would let me build a tf.data pipeline with offline (out-of-core) data access when the whole tensor does not fit in the available RAM.
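
To make "offline data access" concrete, what I have in mind is memory-mapping a dense dump, roughly along these lines (the file name is just a placeholder, I don't have such a dump yet):

import numpy as np

# Hypothetical dense NPY dump of the whole dataset on disk; mmap_mode="r"
# avoids loading the full array into RAM and reads slices only when accessed.
data = np.load("train_x_dense.npy", mmap_mode="r")
sample = np.array(data[0])  # only this one sample is copied into memory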

I tried this both on Google Colab (12 GB RAM, no swap) and on my laptop (8 GB RAM, 4 GB swap) and got the same result.

Thank you!

@salu133445
Owner

salu133445 commented May 27, 2020

Hi,

Thanks for the feedback. I just checked the size of the training data. The dense array takes 7.91 GB, but the decompressed nonzero-index array from the NPZ file also takes 5.82 GB of RAM, so the total required RAM is roughly 14 GB. A temporary workaround is to cast the nonzero array (f["nonzero"]) to np.uint32, which halves its memory footprint and is safe since its values range from 0 to 102377.

def load_data_from_npz(filename):
    """Load and return the training data from a npz file (sparse format)."""
    with np.load(filename) as f:
        data = np.zeros(f["shape"], np.bool_)
        # index with a tuple of per-axis index arrays, cast to uint32 to save memory
        data[tuple(f["nonzero"].astype(np.uint32))] = True
    return data

For building the data pipeline with tf.data, I guess the easiest way is to store slices of the dense array on disk, for example one sample per file, and load them back later. Something like

for i, x in enumerate(data):
    np.save("./data/{}.npy".format(i), x)

But you would still need to fit the whole dense array in RAM first to do so. Another way is to slice the nonzero array directly. Something like

with np.load(filename) as f:
    nonzero = f["nonzero"]  # shape (n_dims, n_nonzeros); row 0 holds the sample index
    n_samples = f["shape"][0]
    start = 0
    for i in range(n_samples):
        end = np.searchsorted(nonzero[0], i, "right")
        data = np.zeros(f["shape"][1:], np.bool_)
        # drop the sample-index row and index with a tuple of per-axis index arrays
        data[tuple(nonzero[1:, start:end])] = True
        np.save("./data/{}.npy".format(i), data)
        start = end

This saves each sample to its own NPY file; the disadvantage is that it would probably take a large amount of space on your hard disk. A better way would be to use a similar approach to create a TensorFlow dataset that compiles the desired samples on the fly by slicing the nonzero array.

def get_sample(i):
    """Compile the i-th sample into a dense boolean array (the NPZ file `f` must be open)."""
    start = np.searchsorted(f["nonzero"][0], i, "left")
    end = np.searchsorted(f["nonzero"][0], i, "right")
    data = np.zeros(f["shape"][1:], np.bool_)
    data[tuple(f["nonzero"][1:, start:end])] = True
    return data
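
One untested guess at wiring this into tf.data would be Dataset.from_generator, keeping the NPZ file open for the lifetime of the dataset (the batch size below is a placeholder, and reading f["nonzero"] into a plain in-memory array once beforehand would probably be much faster, since every access to f decompresses it again):

import numpy as np
import tensorflow as tf

f = np.load("train_x_lpd_5_phr.npz")  # kept open; get_sample reads from it

def sample_generator():
    for i in range(int(f["shape"][0])):
        yield get_sample(i)  # dense boolean array for the i-th sample

dataset = (
    tf.data.Dataset.from_generator(
        sample_generator,
        output_types=tf.bool,
        output_shapes=tuple(f["shape"][1:]),
    )
    .batch(64)
    .prefetch(tf.data.experimental.AUTOTUNE)
)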

I have not actually tried the tf.data part myself, though. If you figure out a working pipeline, I would appreciate it if you shared your approach here, as I believe many people, including me, would be interested.

Hope this helps.

@w00zie
Author

w00zie commented Sep 10, 2020

I ended up doing this. In the main section I tested the running time of this approach on a small portion of the LPD-5-cleansed dataset (a 2048-sample dense NPY array) and found that, with appropriate pre-fetching and caching, it is possible to reach decent batching times.

In my application I sub-sampled the dataset, hence I did not use memory mapping, but in theory this approach should work with memory-mapped arrays as well.
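
For reference, the core of my pipeline looks roughly like the sketch below; the file name, batch size and shuffle buffer are just placeholders, and I have only tested it on an in-memory subset, not on a memory-mapped dump of the full dataset:

import numpy as np
import tensorflow as tf

# Dense NPY dump of the (sub-sampled) dataset; with mmap_mode="r" the array
# stays on disk and only the accessed samples are read into RAM.
data = np.load("train_subset.npy", mmap_mode="r")

def gen():
    for sample in data:
        yield np.asarray(sample, dtype=np.bool_)  # copies one sample at a time

dataset = (
    tf.data.Dataset.from_generator(
        gen,
        output_types=tf.bool,
        output_shapes=data.shape[1:],
    )
    .cache()       # keep decoded samples in memory after the first epoch
    .shuffle(1024)
    .batch(64)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
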
Let me know if this (somehow) helped; I'd be glad to give something back.

Good luck with your research
