Cannot subclass dataset_ops.DatasetV2 #61394

Closed
AyushExel opened this issue Jul 26, 2023 · 5 comments
Labels
comp:data tf.data related issues comp:ops OPs related issues stale This label marks the issue/pr stale - to be closed automatically if no activity stat:awaiting response Status - Awaiting response from author type:feature Feature requests

Comments


AyushExel commented Jul 26, 2023

Issue type

Support

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

2.x

Custom code

Yes

OS platform and distribution

Mac OS 13.0

Mobile device

No response

Python version

3.9

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

Hi, I'm from the LanceDB team and we're trying to build native support for tf.data. See the WIP PR here: lancedb/lance#1087.
Ideally, we'd like to simply subclass dataset_ops.DatasetV2 so that all the metadata needed to recreate the dataset can be pushed down to our file format, which would enable parallelism elegantly.
So it would be something like this:

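import tensorflow as tf
from tensorflow.python.data.ops import dataset_ops

class LanceTfDataset(dataset_ops.DatasetV2):
    def __init__(self):
        ...
        # This is the line that fails: a Python object is passed where a variant tensor is expected.
        variant_tensor = tf.Tensor(self, (), dtype=tf.variant)
        super().__init__(variant_tensor)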

The code above fails with a complaint that it cannot convert a LanceTfDataset to a tf.Tensor/variant.

Issue: what exactly is variant_tensor, and how do we go about creating one? I read through the docs but couldn't find anything concrete. There was a mention that variant_tensor is a special tensor that describes the type of the dataset and that it is equivalent to tf.variant, but the code above doesn't work.
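As a rough illustration of what a variant tensor looks like from Python, the sketch below uses the public tf.data.experimental.to_variant and tf.data.experimental.from_variant helpers on a built-in dataset. It only shows that the variant is a scalar handle produced by TensorFlow's own dataset ops. It is not a way to wrap an arbitrary Python object, and nothing in it is specific to Lance.

```python
import tensorflow as tf

# Any built-in dataset is backed by a scalar tensor of dtype tf.variant.
ds = tf.data.Dataset.range(5)
variant = tf.data.experimental.to_variant(ds)
print(variant.dtype, variant.shape)

# The variant does not carry the Python-side element structure, so it has
# to be supplied again when rebuilding a Dataset object from the variant.
restored = tf.data.experimental.from_variant(variant, structure=ds.element_spec)
values = [int(x) for x in restored]
print(values)
```

The variant here comes out of a dataset op, which suggests why constructing one directly from a Python instance, as in the snippet above, does not work.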

Having a version of tf.data.Dataset that can capture extra metadata would also let us improve the interface:
instead of lance.tf.data.from_dataset(uri, columns, filter, batch_size) we could just have from_lance(uri).filter(..).batch_size(...).shuffle().
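One way to get a chained interface without subclassing at all is to record the scan options in a plain Python builder and only create the tf.data.Dataset at the end. The sketch below is a minimal illustration under that assumption. LanceQuery, select, to_tf, read_batches, fake_read_batches, and the example URI are all hypothetical names invented here, not part of Lance or TensorFlow, and the stand-in reader only honors batch_size.

```python
import dataclasses
from typing import Optional, Tuple

import tensorflow as tf


@dataclasses.dataclass(frozen=True)
class LanceQuery:
    """Hypothetical builder: collects scan options before any dataset exists."""
    uri: str
    columns: Optional[Tuple[str, ...]] = None
    row_filter: Optional[str] = None
    batch_size: int = 2

    def select(self, *columns):
        return dataclasses.replace(self, columns=columns)

    def filter(self, expr):
        return dataclasses.replace(self, row_filter=expr)

    def batch(self, size):
        return dataclasses.replace(self, batch_size=size)

    def to_tf(self, read_batches, output_signature):
        # read_batches is a hypothetical callable that receives the full
        # query, so columns and filter could be pushed down to the reader.
        return tf.data.Dataset.from_generator(
            lambda: read_batches(self), output_signature=output_signature)


def fake_read_batches(query):
    # Stand-in for a real reader: yields six ids in chunks of batch_size.
    ids = list(range(6))
    for start in range(0, len(ids), query.batch_size):
        yield {"id": ids[start:start + query.batch_size]}


query = LanceQuery("memory://example").select("id").filter("id >= 0").batch(4)
signature = {"id": tf.TensorSpec(shape=(None,), dtype=tf.int64)}
ds = query.to_tf(fake_read_batches, signature)
batches = [b["id"].numpy().tolist() for b in ds]
print(batches)
```

The trade-off is that the metadata lives outside the tf.data graph, so transformations applied after to_tf are ordinary tf.data operations and are not pushed down.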

So what is the recommended way to subclass a tf.data Dataset?

Standalone code to reproduce the issue

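import tensorflow as tf
from tensorflow.python.data.ops import dataset_ops

class LanceTfDataset(dataset_ops.DatasetV2):
    def __init__(self):
        ...
        # This is the line that fails: a Python object is passed where a variant tensor is expected.
        variant_tensor = tf.Tensor(self, (), dtype=tf.variant)
        super().__init__(variant_tensor)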

Relevant log output

No response

@google-ml-butler google-ml-butler bot added the type:support Support issues label Jul 26, 2023
@tilakrayal tilakrayal added comp:ops OPs related issues type:feature Feature requests comp:data tf.data related issues and removed type:support Support issues labels Jul 27, 2023
@tilakrayal
Contributor

@AyushExel,
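A variant tensor (dtype DT_VARIANT) can hold a value of any supported type, for example:

import tensorflow as tf

a = 1                                 # integer
b = 2.0                               # float
c = (1, 2)                            # tuple
d = {"a": (2, 2), "b": 3}             # dict
e = tf.data.Dataset.from_tensors(10)  # dataset holding a single element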

Please see the explanation of variant tensors (DT_VARIANT) in the following source file:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/variant.h#L54

// This is an implementation of a type-erased container that can store an
// object of any type. The implementation is very similar to std::any, but has
// restrictions on the types of objects that can be stored, and eschews some of
// the fancier constructors available for std::any. An object of
// tensorflow::Variant is intended to be used as the value that will be stored
// in a tensorflow::Tensor object when its type is DT_VARIANT.
//
// tensorflow::Variant can store an object of a class that satisfies the
// following constraints:
//

@tilakrayal tilakrayal added the stat:awaiting response Status - Awaiting response from author label Jul 27, 2023
@AyushExel
Author

@tilakrayal So subclassing should work if I initialize it the way I do in the example, right? But that doesn't work. How, then, can I subclass DatasetV2?

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jul 27, 2023
@sachinprasadhs
Contributor

Hi,

Please see the implementation below, which subclasses dataset_ops.DatasetV2.

if not isinstance(choice_dataset, data_types.DatasetV2):

class _ZipDataset(dataset_ops.DatasetV2):
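Built-in subclasses such as _ZipDataset obtain their variant tensor from a generated dataset op rather than constructing it in Python. A simpler variation, sketched below, reuses the variant tensor of an existing dataset so that a subclass can carry extra Python-side attributes. This is only a sketch under assumptions: RangeWithMetadata, source_uri, and the example URI are hypothetical, and _variant_tensor and _inputs are private TensorFlow members that may change between releases.

```python
import tensorflow as tf


class RangeWithMetadata(tf.data.Dataset):
    """Hypothetical subclass that borrows the variant tensor of an inner dataset."""

    def __init__(self, stop, source_uri):
        self.source_uri = source_uri  # extra metadata kept on the Python object
        self._inner = tf.data.Dataset.range(stop)
        # Assumption: relies on the private _variant_tensor attribute.
        super().__init__(self._inner._variant_tensor)

    def _inputs(self):
        # No upstream datasets are reported for this source.
        return []

    @property
    def element_spec(self):
        return self._inner.element_spec


ds = RangeWithMetadata(4, "memory://example")
values = [int(x) for x in ds]

# Standard transformations return ordinary tf.data classes, so the
# subclass and its metadata do not follow through the chain.
doubled = ds.map(lambda x: x * 2)
doubled_values = [int(x) for x in doubled]
print(values, doubled_values, type(doubled).__name__)
```

Because map, filter, and batch return TensorFlow's own dataset classes, this approach alone would not capture later transformations for pushdown.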

@sachinprasadhs sachinprasadhs added the stat:awaiting response Status - Awaiting response from author label Jul 31, 2023

github-actions bot commented Aug 8, 2023

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Aug 8, 2023
@github-actions

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.
