Hi @satra, we took a look at Croissant last year when it came out. It definitely seems appealing, but we opted not to use it for a few reasons.

1. TensorFlow Datasets native

As we're currently in the process of migrating to PyTorch due to the highly opinionated and often difficult-to-navigate decisions Google has made with dependencies for their MLOps tools, the idea of depending on TensorFlow Datasets seemed counterproductive. The more standalone our stack is, the better. They do support PyTorch DataPipes, which is fantastic, except that the PyTorch team is deprecating DataPipes in favor of a polylithic dataloader. We're likely to revisit this whole story once that's done, since we're on dataloader framework refactor v3 already 😓.

2. Pose models are usually trained on images, not videos... sort of...

Most current pose estimation pipelines rely on single frames, not contiguous sequences. Consecutive frames are heavily autocorrelated, so annotating multiple frames in a row is far less efficient than sampling across longer time gaps. With importance sampling, you can optimize for diversity in the images and minimize the number of frames you need to label (one illustrative way of doing this is sketched at the end of this comment). The catch is that you need access to millions (or billions) of images to sample down to the thousands that will actually form the training set.

Our dataloaders are pretty optimized for random access to the source video data. This lets us bypass the intermediate frame extraction step where images are pulled out of MP4s, as is often done in computer vision pipelines (also sketched below). It also means we don't have to worry as much about provenance metadata, since the data is left in situ and we just extract what we need on the fly. Practically, in our current workflow, it would often take longer to pull out the frames, serialize them to disk, and load them back into memory before training starts than it does to just decode the frames on the fly (and cache them) during the first pass through the dataset.

3. But what if we do want to properly use video?

While we could take advantage of adjacent frames for context (even without labels, as Lightning Pose does), storing image sequences efficiently is tricky. It's basically impossible to beat H.264/H.265, and a tremendously futile effort to try given how much engineering has gone into those codecs over the past few decades. The issue with video dataloaders is their lack of proper compatibility with how these videos are actually stored. Right now, most folks just save out individual frames as JPEG bytestrings (and often recompress them with GZIP, for no reason other than to waste CPU cycles). This is changing now with … If we could store the bytestream that makes up the H.264 packets and decode it in memory with performance comparable to decoding off the source files, we would consider it, but it's not a high priority at the moment.

All that said, it's not a bad idea to standardize around Croissant for the DCAIC. It especially makes sense for benchmark data that will be trained on many times. If the DCAIC goes for it, what we'd probably do is add an exporter over in …
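To make the importance-sampling point a bit more concrete, here is a minimal sketch of one way to pick a diverse subset of frames. This is not our actual pipeline: it assumes you already have a per-frame feature vector (an embedding, or even just downsampled pixel statistics), and the greedy farthest-point (k-center) selection simply stands in for whatever diversity-driven sampling scheme is used in practice.

```python
import numpy as np

def select_diverse_frames(features: np.ndarray, n_select: int, seed: int = 0) -> np.ndarray:
    """Greedy farthest-point (k-center) selection over per-frame features.

    features: (n_frames, dim) array, e.g. embeddings or downsampled pixels.
    Returns the indices of the selected frames.

    Illustrative stand-in for an importance-sampling step that favors
    diversity; not the pipeline described in the comment above.
    """
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    selected = [int(rng.integers(n))]                 # start from a random frame
    # Distance from every frame to its closest already-selected frame.
    dists = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < min(n_select, n):
        idx = int(np.argmax(dists))                   # farthest frame from the current set
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(features - features[idx], axis=1))
    return np.array(selected)

# Usage (hypothetical file): pick ~2k diverse frames out of millions of candidates.
# feats = np.load("frame_features.npy")
# keep = select_diverse_frames(feats, n_select=2000)
```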
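Likewise, the random-access decoding described under point 2 could look roughly like the following: a PyTorch Dataset that seeks into the source MP4 and decodes single frames on demand with PyAV, rather than extracting frames to disk first. The class name, the single-video assumption, and the lack of caching are all simplifications for illustration, not the dataloader referenced above.

```python
import av                      # PyAV: FFmpeg bindings
import numpy as np
import torch
from torch.utils.data import Dataset

class RandomAccessVideoFrames(Dataset):
    """Decode individual frames straight out of a video file on demand.

    Sketch only: no caching, no multi-worker handling, single video.
    """

    def __init__(self, video_path: str, frame_indices: list[int]):
        self.video_path = video_path
        self.frame_indices = frame_indices

    def __len__(self) -> int:
        return len(self.frame_indices)

    def __getitem__(self, i: int) -> torch.Tensor:
        target = self.frame_indices[i]
        with av.open(self.video_path) as container:
            stream = container.streams.video[0]
            fps = float(stream.average_rate)
            # Convert frame index -> presentation timestamp, seek to the
            # nearest preceding keyframe, then decode forward to the target.
            target_pts = int((target / fps) / float(stream.time_base))
            container.seek(target_pts, stream=stream)
            for frame in container.decode(stream):
                if frame.pts is not None and frame.pts >= target_pts:
                    rgb = frame.to_ndarray(format="rgb24")    # (H, W, 3) uint8
                    return torch.from_numpy(np.ascontiguousarray(rgb))
        raise IndexError(f"frame {target} not found in {self.video_path}")
```

Wrapping something like this in a standard DataLoader gives random access into the videos without any intermediate image files, which is the property that matters for the workflow above.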
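And for point 3, decoding stored H.264 packets directly in memory is something PyAV can already do from a file-like buffer. A sketch, under the assumption that a raw Annex-B elementary stream was saved (a fragmented MP4 buffer would just be opened without the explicit format hint):

```python
import io
import av
import numpy as np

def decode_h264_buffer(raw_bytes: bytes) -> list[np.ndarray]:
    """Decode every frame from an in-memory H.264 bytestream.

    Assumes `raw_bytes` is a raw Annex-B H.264 elementary stream, e.g.
    packets copied out of a source file with
    `ffmpeg -i src.mp4 -c:v copy -f h264 clip.h264`.
    """
    frames = []
    with av.open(io.BytesIO(raw_bytes), format="h264") as container:
        for frame in container.decode(video=0):
            frames.append(frame.to_ndarray(format="rgb24"))
    return frames

# Usage: store the bytestream however the dataset format likes, then decode
# just the clip you need without touching the original MP4.
# clip = decode_h264_buffer(open("clip.h264", "rb").read())
```

Whether this gets performance comparable to decoding off the source files is exactly the open question raised in the comment above.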
Croissant (https://research.google/blog/croissant-a-metadata-format-for-ml-ready-datasets/) is a metadata layer that's being touted across several channels for ML-ready datasets. Given the discussion about behavior standards and ML readiness in BBQS, any thoughts on the utility or futility of such an approach for video data?