Hi @satra, we took a look at Croissant last year when it came out. It definitely seems appealing, but we opted not to use it for a few reasons.

1. TensorFlow Datasets native

As we're currently in the process of migrating to PyTorch due to the highly opinionated and often difficult-to-navigate decisions Google has made with dependencies for their MLOps tools, the idea of depending on TensorFlow Datasets seemed counterproductive. The more standalone our stack is, the better. They do support PyTorch DataPipes, which is fantastic, except that the PyTorch team is deprecating DataPipes in favor of a polylithic dataloader. We're likely to revisit this whole story once that's done, since we're on dataloader framework refactor v3 already 😓.

2. Pose models are usually trained on images, not videos... sort of...

Most current pose estimation pipelines rely on single frames, not contiguous sequences. Consecutive frames are heavily autocorrelated, so annotating multiple frames in a row is far less efficient than sampling across longer time gaps. With importance sampling, you can optimize for diversity in the images and minimize the number of frames you need to label (one illustrative way of doing this is sketched at the end of this comment). The catch is that you need access to millions (or billions) of images to sample down to the thousands that will actually form the training set.

Our dataloaders are pretty optimized for random access to the source video data. This lets us bypass the intermediate frame extraction step where images are pulled out of MP4s, as is often done in computer vision pipelines (also sketched below). It also means we don't have to worry as much about provenance metadata, since the data is left in situ and we just extract what we need on the fly. Practically, in our current workflow, it would often take longer to pull out the frames, serialize them to disk, and load them back into memory before training starts than it does to just decode the frames on the fly (and cache them) during the first pass through the dataset.

3. But what if we do want to properly use video?

While we could take advantage of adjacent frames for context (even without labels, as Lightning Pose does), storing image sequences efficiently is tricky. It's basically impossible to beat H.264/H.265, and a tremendously futile effort to try given how much engineering has gone into those codecs over the past few decades. The issue with video dataloaders is their lack of proper compatibility with how these videos are actually stored. Right now, most folks just save out individual frames as JPEG bytestrings (and often recompress them with GZIP, for no reason other than to waste CPU cycles). This is changing now with … If we could store the bytestream that makes up the H.264 packets and decode it in memory with performance comparable to decoding off the source files, we would consider it, but it's not a high priority at the moment.

All that said, it's not a bad idea to standardize around Croissant for the DCAIC. It especially makes sense for benchmark data that will be trained on many times. If the DCAIC goes for it, what we'd probably do is add an exporter over in …
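To make the importance-sampling point a bit more concrete, here is a minimal sketch of one way to pick a diverse subset of frames. This is not our actual pipeline: it assumes you already have a per-frame feature vector (an embedding, or even just downsampled pixel statistics), and the greedy farthest-point (k-center) selection simply stands in for whatever diversity-driven sampling scheme is used in practice.

```python
import numpy as np

def select_diverse_frames(features: np.ndarray, n_select: int, seed: int = 0) -> np.ndarray:
    """Greedy farthest-point (k-center) selection over per-frame features.

    features: (n_frames, dim) array, e.g. embeddings or downsampled pixels.
    Returns the indices of the selected frames.

    Illustrative stand-in for an importance-sampling step that favors
    diversity; not the pipeline described in the comment above.
    """
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    selected = [int(rng.integers(n))]                 # start from a random frame
    # Distance from every frame to its closest already-selected frame.
    dists = np.linalg.norm(features - features[selected[0]], axis=1)
    while len(selected) < min(n_select, n):
        idx = int(np.argmax(dists))                   # farthest frame from the current set
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(features - features[idx], axis=1))
    return np.array(selected)

# Usage (hypothetical file): pick ~2k diverse frames out of millions of candidates.
# feats = np.load("frame_features.npy")
# keep = select_diverse_frames(feats, n_select=2000)
```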
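Likewise, the random-access decoding described under point 2 could look roughly like the following: a PyTorch Dataset that seeks into the source MP4 and decodes single frames on demand with PyAV, rather than extracting frames to disk first. The class name, the single-video assumption, and the lack of caching are all simplifications for illustration, not the dataloader referenced above.

```python
import av                      # PyAV: FFmpeg bindings
import numpy as np
import torch
from torch.utils.data import Dataset

class RandomAccessVideoFrames(Dataset):
    """Decode individual frames straight out of a video file on demand.

    Sketch only: no caching, no multi-worker handling, single video.
    """

    def __init__(self, video_path: str, frame_indices: list[int]):
        self.video_path = video_path
        self.frame_indices = frame_indices

    def __len__(self) -> int:
        return len(self.frame_indices)

    def __getitem__(self, i: int) -> torch.Tensor:
        target = self.frame_indices[i]
        with av.open(self.video_path) as container:
            stream = container.streams.video[0]
            fps = float(stream.average_rate)
            # Convert frame index -> presentation timestamp, seek to the
            # nearest preceding keyframe, then decode forward to the target.
            target_pts = int((target / fps) / float(stream.time_base))
            container.seek(target_pts, stream=stream)
            for frame in container.decode(stream):
                if frame.pts is not None and frame.pts >= target_pts:
                    rgb = frame.to_ndarray(format="rgb24")    # (H, W, 3) uint8
                    return torch.from_numpy(np.ascontiguousarray(rgb))
        raise IndexError(f"frame {target} not found in {self.video_path}")
```

Wrapping something like this in a standard DataLoader gives random access into the videos without any intermediate image files, which is the property that matters for the workflow above.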
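And for point 3, decoding stored H.264 packets directly in memory is something PyAV can already do from a file-like buffer. A sketch, under the assumption that a raw Annex-B elementary stream was saved (a fragmented MP4 buffer would just be opened without the explicit format hint):

```python
import io
import av
import numpy as np

def decode_h264_buffer(raw_bytes: bytes) -> list[np.ndarray]:
    """Decode every frame from an in-memory H.264 bytestream.

    Assumes `raw_bytes` is a raw Annex-B H.264 elementary stream, e.g.
    packets copied out of a source file with
    `ffmpeg -i src.mp4 -c:v copy -f h264 clip.h264`.
    """
    frames = []
    with av.open(io.BytesIO(raw_bytes), format="h264") as container:
        for frame in container.decode(video=0):
            frames.append(frame.to_ndarray(format="rgb24"))
    return frames

# Usage: store the bytestream however the dataset format likes, then decode
# just the clip you need without touching the original MP4.
# clip = decode_h264_buffer(open("clip.h264", "rb").read())
```

Whether this gets performance comparable to decoding off the source files is exactly the open question raised in the comment above.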
Croissant (https://research.google/blog/croissant-a-metadata-format-for-ml-ready-datasets/) is a metadata layer that's being touted across several channels for ML-ready datasets. Given the discussion about behavior standards and ML readiness in BBQS, any thoughts on the utility or futility of such an approach for video data?