
Add data count to tfrecords #321

Open
hvgazula opened this issue Apr 10, 2024 · 6 comments
@hvgazula
Contributor

We decided to add an extra feature labeled "data_count" to each record/example. While we do this, we also need to add logic to adjust the number of volumes in each epoch (in case drop_remainder is set to True during batching). This is also important because the Bayesian MeshNet requires the number of examples upfront. See
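For reference, a minimal sketch of what the extra feature could look like when an example is serialized, assuming plain tf.train.Example protos (the serializer name and the feature keys other than "data_count" are hypothetical, not nobrainer's actual writer):

import tensorflow as tf

def _int64_feature(value):
    # Wrap a scalar int in a tf.train.Feature.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def to_example_with_count(feature_bytes, label_bytes, data_count):
    # Hypothetical serializer: stores the raw feature/label bytes plus the
    # total number of examples in the dataset under "data_count".
    return tf.train.Example(
        features=tf.train.Features(
            feature={
                "feature/value": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[feature_bytes])
                ),
                "label/value": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[label_bytes])
                ),
                "data_count": _int64_feature(data_count),
            }
        )
    )

With drop_remainder=True during batching, the number of steps per epoch could then be computed as data_count // batch_size.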

@hvgazula hvgazula self-assigned this Apr 10, 2024
@hvgazula
Contributor Author

@satra how about saving the indices in the filename itself? Something like kwyk-train-{00000..00150}.tfrecord, kwyk-train-{00151..00300}.tfrecord, and so on.

@satra
Contributor

satra commented Apr 11, 2024

sharded representations mean filenames won't carry appropriate indices. there is a default shard size included, but it can be overwritten.

shard_size=300,

@hvgazula
Contributor Author

hvgazula commented Apr 11, 2024

I think I understand your idea of "shard" but just to make sure, do you agree that the "shards" created by the API are merely the globbed files (no randomness), split into groups of 300 (aka shard_size) each (using array_split), and then serialized sequentially? If you agree with my explanation, it only means the sharded representations can be tweaked to carry the appropriate indices. I gave you an example using 150, but more generally, the following snippet (in tfrecord.py):

n_examples = len(features_labels)
n_shards = math.ceil(n_examples / examples_per_shard)
shards = np.array_split(features_labels, n_shards)

will be replaced with

n_examples = len(features_labels)
n_shards = math.ceil(n_examples / examples_per_shard)
shards = np.array_split(list(enumerate(features_labels)), n_shards)

where the first element of the first and last items in each shard gives the appropriate indices for the filename, and this is tied to the shard_size specified at the time of creation (so no loss of generality).

PS: the enumerate snippet I wrote was only for demo purposes
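To illustrate the idea a bit further, a minimal sketch of how the index-carrying filenames could be derived, splitting the global indices the same way array_split splits the examples (the placeholder paths and the kwyk-train template are for demonstration only):

import math
import numpy as np

# placeholder list of (feature_path, label_path) pairs
features_labels = [(f"vol{i:05d}.nii.gz", f"aseg{i:05d}.nii.gz") for i in range(1000)]
examples_per_shard = 300

n_examples = len(features_labels)
n_shards = math.ceil(n_examples / examples_per_shard)

# Split the global indices exactly the way tfrecord.py splits the examples,
# so each shard knows which indices it contains.
index_shards = np.array_split(np.arange(n_examples), n_shards)

for indices in index_shards:
    first, last = int(indices[0]), int(indices[-1])
    # e.g. the first shard here would be kwyk-train-{00000..00249}.tfrecord
    filename = f"kwyk-train-{{{first:05d}..{last:05d}}}.tfrecord"
    shard_examples = [features_labels[i] for i in indices]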

@satra
Contributor

satra commented Apr 11, 2024

yes, shards break a binary data stream into accessible pieces without changing the overall structure.

however, nobrainer has a notion of volumes and blocks. if you break a volume into blocks, what matters from the dataset perspective is not the volume index but the block index. hence, len(filenames) is less important than len(blocks).

i'm still not seeing why we want to stick semantics in the filename when the count can be stored internally as metadata and accessed directly through the tfrecord.
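for example, if each serialized example carried a "data_count" feature (hypothetical key, following the proposal above), the count could be read straight from the first record of any shard instead of being parsed out of the filename:

import tensorflow as tf

def read_data_count(tfrecord_path):
    # Peek at the first record in a shard and parse only the count field.
    raw = next(iter(tf.data.TFRecordDataset(tfrecord_path)))
    parsed = tf.io.parse_single_example(
        raw, {"data_count": tf.io.FixedLenFeature([], tf.int64)}
    )
    return int(parsed["data_count"])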

@hvgazula
Contributor Author

hvgazula commented Apr 17, 2024

The only problem with this approach is that the count is tied to the original dataset. That is, if I want to use a subset of the dataset for testing purposes, I have to create the shards from scratch again. Nevertheless, I will go ahead and add the full data count (and optionally the volumes in that shard).

@satra
Contributor

satra commented Apr 17, 2024

just create another dataset for now. yes, in the ideal world (an MVP+1 problem), we would be able to select any subset for train/eval from a dataset or have something that trims a dataset.
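as an MVP+1 sketch, trimming could be as simple as a take() on the tf.data pipeline, with the caveat that any data_count baked into the records would then need to be overridden downstream (the function and argument names here are placeholders):

import tensorflow as tf

def trim_for_testing(dataset: tf.data.Dataset, n_test: int) -> tf.data.Dataset:
    # Keep only the first n_test examples; a stored "data_count" would no
    # longer match and would have to be replaced with n_test.
    return dataset.take(n_test)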
