Add data count to tfrecords #321
Comments
@satra how about saving the indices in the filename itself? Something like:
sharded representations mean filenames won't carry appropriate indices. there is a default shard size included, but it can be overwritten (see nobrainer/nobrainer/dataset.py, line 155 in 976691d).
I think I understand your idea of "shard", but just to make sure: do you agree that the "shards" created by the API are merely the globbed files (no randomness), split into groups of 300 (i.e., shard_size) using array_split, and then serialized sequentially? If you agree with that description, it means the sharded representations can be tweaked to carry the appropriate indices. I gave you an example using 150, but more generally, the following snippet (in tfrecord.py):
will be replaced with
where the first element of the first and last items in the list gives the appropriate indices for the filename, and this is tied to the shard_size specified at creation time (so no loss of generality); a sketch of this follows below. PS: the zip/enumerate snippet I wrote was only for demo purposes.
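A minimal sketch of what that filename convention could look like, assuming the shards come from np.array_split over the globbed file list. The function name and filename pattern here are hypothetical, not nobrainer's actual tfrecord.py code:

```python
import numpy as np


def shard_with_indices(filepaths, shard_size=300):
    """Yield (filename, paths) pairs where the filename carries the
    first and last global volume index covered by that shard."""
    n_shards = max(1, int(np.ceil(len(filepaths) / shard_size)))
    # np.array_split keeps the original order, so each shard covers a
    # contiguous range of global indices.
    for shard_indices in np.array_split(np.arange(len(filepaths)), n_shards):
        first, last = shard_indices[0], shard_indices[-1]
        shard_paths = [filepaths[i] for i in shard_indices]
        yield f"data-shard-{first:06d}-{last:06d}.tfrec", shard_paths
```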
yes, shards break a binary data stream into accessible pieces without changing the overall structure. however, nobrainer has a notion of volumes and blocks. if you break a volume into blocks, what matters from the dataset perspective is not the volume index but the block index. hence, len(filenames) is less important than len(blocks). i'm still not seeing why we want to stick semantics in the filename when the count can be kept as metadata and accessed directly through the tfrecord.
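For contrast, a hedged sketch of what reading such a count directly from the tfrecord might look like, so that no semantics need to live in the filename. The "data_count" feature key is an assumption here (it matches the name agreed on later in this thread):

```python
import tensorflow as tf


def read_data_count(tfrecord_path):
    # Grab the first serialized example from the shard.
    raw = next(iter(tf.data.TFRecordDataset(tfrecord_path).take(1)))
    # Parse only the hypothetical "data_count" feature; other features are ignored.
    parsed = tf.io.parse_single_example(
        raw, {"data_count": tf.io.FixedLenFeature([], tf.int64)}
    )
    return int(parsed["data_count"])
```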
The only problem with this approach is that the count is tied to the original dataset. That is, if I want to use a subset of the dataset for testing purposes, I have to create the shards from scratch again. Nevertheless, I will go ahead and add the full data count (and optionally the volumes in that shard).
just create another dataset for now. yes, in the ideal world (an MVP+1 problem), we would be able to select any subset for train/eval from a dataset or have something that trims a dataset. |
We decided to add an extra feature to each record/example labeled "data_count". While we do this, we also need to add logic to adjust the number of volumes in each epoch (in case drop_remainder is set to True during batching). This is also important because the Bayesian MeshNet requires the number of examples upfront; see nobrainer/nobrainer/models/bayesian_meshnet.py, line 20 in 976691d.
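A rough sketch of those two pieces, under the assumption that "data_count" is stored as an int64 feature on every example; the function names and feature keys are illustrative, not the actual nobrainer implementation:

```python
import tensorflow as tf


def to_example(volume_bytes, label_bytes, data_count):
    """Serialize one volume/label pair, embedding the total example count."""
    feature = {
        "volume": tf.train.Feature(bytes_list=tf.train.BytesList(value=[volume_bytes])),
        "label": tf.train.Feature(bytes_list=tf.train.BytesList(value=[label_bytes])),
        # Total number of examples in the dataset, repeated on every record.
        "data_count": tf.train.Feature(int64_list=tf.train.Int64List(value=[data_count])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


def examples_per_epoch(data_count, batch_size, drop_remainder=True):
    """Number of examples actually seen per epoch; drop_remainder=True
    discards the final partial batch during batching."""
    if drop_remainder:
        return (data_count // batch_size) * batch_size
    return data_count
```

The second function gives the per-epoch count a model that needs the number of examples upfront (such as the Bayesian MeshNet) would use when the final partial batch is dropped.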