
Add data count to tfrecords #321

Open
hvgazula opened this issue Apr 10, 2024 · 6 comments
@hvgazula
Contributor

We decided to add an extra feature labeled "data_count" to each record/example. While we do this, we also need to add logic to adjust the number of volumes in each epoch (in case drop_remainder is set to True during batching). This is also important because the Bayesian MeshNet requires the number of examples upfront. See
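For reference, a minimal sketch of what the extra feature could look like when an example is serialized, assuming plain tf.train.Example protos (the serializer name and the feature keys other than "data_count" are hypothetical, not nobrainer's actual writer):

import tensorflow as tf

def _int64_feature(value):
    # Wrap a scalar int in a tf.train.Feature.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def to_example_with_count(feature_bytes, label_bytes, data_count):
    # Hypothetical serializer: stores the raw feature/label bytes plus the
    # total number of examples in the dataset under "data_count".
    return tf.train.Example(
        features=tf.train.Features(
            feature={
                "feature/value": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[feature_bytes])
                ),
                "label/value": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[label_bytes])
                ),
                "data_count": _int64_feature(data_count),
            }
        )
    )

With drop_remainder=True during batching, the number of steps per epoch could then be computed as data_count // batch_size.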

@hvgazula hvgazula self-assigned this Apr 10, 2024
@hvgazula
Contributor Author

@satra how about saving the indices in the filename itself? Something like kwyk-train-{00000..00150}.tfrecord, kwyk-train-{00151..00300}.tfrecord, and so on.

@satra
Contributor

satra commented Apr 11, 2024

sharded representations mean filenames won't carry appropriate indices. there is a default shard size included, but it can be overwritten.

shard_size=300,

@hvgazula
Contributor Author

hvgazula commented Apr 11, 2024

I think I understand your idea of "shard" but just to make sure, do you agree that the "shards" created by the API are merely the globbed files (no randomness), split into groups of 300 (aka shard_size) each (using array_split), and then serialized sequentially? If you agree with my explanation, it only means the sharded representations can be tweaked to carry the appropriate indices. I gave you an example using 150, but more generally, the following snippet (in tfrecord.py):

n_examples = len(features_labels)
n_shards = math.ceil(n_examples / examples_per_shard)
shards = np.array_split(features_labels, n_shards)

will be replaced with

n_examples = len(features_labels)
n_shards = math.ceil(n_examples / examples_per_shard)
shards = np.array_split(list(enumerate(features_labels)), n_shards)

where the first element of the first and last items in each shard gives the appropriate indices for the filename, and this is tied to the shard_size specified at the time of creation (so no loss of generality).

PS: the enumerate snippet I wrote was only for demo purposes
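To illustrate the idea a bit further, a minimal sketch of how the index-carrying filenames could be derived, splitting the global indices the same way array_split splits the examples (the placeholder paths and the kwyk-train template are for demonstration only):

import math
import numpy as np

# placeholder list of (feature_path, label_path) pairs
features_labels = [(f"vol{i:05d}.nii.gz", f"aseg{i:05d}.nii.gz") for i in range(1000)]
examples_per_shard = 300

n_examples = len(features_labels)
n_shards = math.ceil(n_examples / examples_per_shard)

# Split the global indices exactly the way tfrecord.py splits the examples,
# so each shard knows which indices it contains.
index_shards = np.array_split(np.arange(n_examples), n_shards)

for indices in index_shards:
    first, last = int(indices[0]), int(indices[-1])
    # e.g. the first shard here would be kwyk-train-{00000..00249}.tfrecord
    filename = f"kwyk-train-{{{first:05d}..{last:05d}}}.tfrecord"
    shard_examples = [features_labels[i] for i in indices]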

@satra
Contributor

satra commented Apr 11, 2024

yes, shards break a binary data stream into accessible pieces without changing the overall structure.

however, nobrainer has a notion of volumes and blocks. if you break a volume into blocks, what matters from the dataset perspective is not the volume index but the block index. hence, len(filenames) is less important than len(blocks).

i'm still not seeing why we want to stick semantics in the filename when the count can be stored internally as metadata and accessed directly through the tfrecord.
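for example, if each serialized example carried a "data_count" feature (hypothetical key, following the proposal above), the count could be read straight from the first record of any shard instead of being parsed out of the filename:

import tensorflow as tf

def read_data_count(tfrecord_path):
    # Peek at the first record in a shard and parse only the count field.
    raw = next(iter(tf.data.TFRecordDataset(tfrecord_path)))
    parsed = tf.io.parse_single_example(
        raw, {"data_count": tf.io.FixedLenFeature([], tf.int64)}
    )
    return int(parsed["data_count"])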

@hvgazula
Contributor Author

hvgazula commented Apr 17, 2024

The only problem with this approach is that the count is tied to the original dataset. That is, if I want to use a subset of the dataset for testing purposes, I have to create the shards from scratch again. Nevertheless, I will go ahead and add the full data count (and optionally the volumes in that shard).

@satra
Contributor

satra commented Apr 17, 2024

just create another dataset for now. yes, in the ideal world (an MVP+1 problem), we would be able to select any subset for train/eval from a dataset or have something that trims a dataset.
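as an MVP+1 sketch, trimming could be as simple as a take() on the tf.data pipeline, with the caveat that any data_count baked into the records would then need to be overridden downstream (the function and argument names here are placeholders):

import tensorflow as tf

def trim_for_testing(dataset: tf.data.Dataset, n_test: int) -> tf.data.Dataset:
    # Keep only the first n_test examples; a stored "data_count" would no
    # longer match and would have to be replaced with n_test.
    return dataset.take(n_test)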
