
feat: parse TFRecords as Arrow data #1166

Merged 14 commits into main on Aug 30, 2023

Conversation

@wjones127 (Contributor) commented on Aug 23, 2023:

Adds utility functions for reading TFRecords files:

from typing import List, Optional

import pyarrow as pa

def infer_tfrecord_schema(
    uri: str, 
    tensor_features: Optional[List[str]],
    string_features: Optional[List[str]]
) -> pa.Schema:
    """Infer the schema for a TFRecord dataset"""
    ...

def read_tfrecord(uri: str, schema: pa.Schema) -> pa.RecordBatchReader:
    """
    Read a TFRecord file as a stream of record batches. This can be fed directly
    into lance.write_dataset().
    """
    ...

Closes: #1165

@wjones127 changed the title from "wip: parse TFRecords as Arrow data" to "feat: parse TFRecords as Arrow data" on Aug 24, 2023
@wjones127 force-pushed the wjones127/tfrecord-to-arrow branch from 98b3503 to e6113c9 on August 25, 2023 17:20
Comment on lines -286 to -304
} else if !expanded_path.is_dir() {
    return Err(Error::IO {
        message: format!("{} is not a lance directory", str_path),
    });
@wjones127 (Contributor, Author):

This is necessary so we can reuse the object store for other purposes. Otherwise it is tightly coupled to the Lance directory.

@wjones127 wjones127 marked this pull request as ready for review August 25, 2023 17:50
@wjones127 wjones127 requested review from eddyxu and westonpace August 25, 2023 17:50
@westonpace (Contributor) left a comment:
This is great work, especially so quickly. I think my main concern is that there are a lot of unwraps in the proto parsing. Wouldn't it be better for malformed files to lead to errors instead of panics?

python/Cargo.toml (resolved)
python/src/lib.rs (outdated, resolved)
rust/src/utils/tfrecord.rs (resolved)
python/python/tests/test_tf.py (resolved)
rust/src/utils/tfrecord.rs (outdated, resolved)

/// Check if a feature has more than 1 value.
fn feature_is_repeated(feature: &tfrecord::Feature) -> bool {
    match feature.kind.as_ref().unwrap() {
@westonpace (Contributor):
Why is there an unwrap here? Could it fail?

@wjones127 (Contributor, Author):
All protobuf fields are "potentially" missing, but I think it's safe to say this one will always be filled in? What do you think?
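
For fields that genuinely may be absent, a small fallible accessor avoids the panic. A minimal sketch, reusing the Error::IO variant and the tfrecord Feature/Kind types from the diffs above (the helper name is illustrative, not part of the PR):

/// Hypothetical helper: report a missing `kind` field as an error
/// instead of panicking via unwrap.
fn feature_kind(feature: &tfrecord::Feature) -> Result<&Kind, Error> {
    feature.kind.as_ref().ok_or_else(|| Error::IO {
        message: "TFRecord feature is missing its `kind` field".to_string(),
    })
}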

rust/src/utils/tfrecord.rs (outdated, resolved)
Comment on lines 202 to 222
Kind::BytesList(data) => match self.feature_type {
    FeatureType::String => FeatureType::String,
    FeatureType::Binary => FeatureType::Binary,
    FeatureType::Tensor { .. } => {
        let val = &data.value[0];
        let tensor_proto = TensorProto::decode(val.as_slice()).unwrap();
        FeatureType::Tensor {
            shape: tensor_proto
                .tensor_shape
                .as_ref()
                .unwrap()
                .dim
                .iter()
                .map(|d| d.size)
                .collect(),
            dtype: tensor_proto.dtype(),
        }
    }
    _ => {
        return Err(Error::IO {
            message: format!(
                "Data type mismatch: expected {:?}, got {:?}",
                self.feature_type,
                feature.kind.as_ref().unwrap()
            ),
        })
    }
},
Kind::FloatList(_) => FeatureType::Float,
Kind::Int64List(_) => FeatureType::Integer,
@westonpace (Contributor):
This block is very similar to the block in new(). Is there a helper method here, e.g. extract_type(feature: &Feature) -> FeatureType?

@wjones127 (Contributor, Author):
The full block is subtly different from new() (it doesn't have access to the same parameters), but I did extract the tensor parsing into a helper, which I think makes it simpler. LMK what you think.
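
For reference, a minimal sketch of what such a helper can look like, reusing the TensorProto, FeatureType, and Error::IO types from the diff above (the name parse_tensor and the error handling are illustrative, not the PR's exact code):

/// Hypothetical helper: decode one serialized TensorProto value from a
/// bytes-list feature, reporting malformed data as errors rather than
/// panicking.
fn parse_tensor(val: &[u8]) -> Result<FeatureType, Error> {
    let tensor_proto = TensorProto::decode(val).map_err(|e| Error::IO {
        message: format!("Failed to decode TensorProto: {}", e),
    })?;
    let shape = tensor_proto
        .tensor_shape
        .as_ref()
        .ok_or_else(|| Error::IO {
            message: "TensorProto is missing its shape".to_string(),
        })?
        .dim
        .iter()
        .map(|d| d.size)
        .collect();
    Ok(FeatureType::Tensor {
        shape,
        dtype: tensor_proto.dtype(),
    })
}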

rust/src/io/object_store.rs (resolved)
@westonpace (Contributor) left a comment:
Looks ready to me. Awesome feature!

Comment on lines +78 to +97
/// Spawn a task in the background
pub fn spawn_background<T>(&self, py: Option<Python<'_>>, task: T)
where
    T: Future + Send + 'static,
    T::Output: Send + 'static,
{
    if let Some(py) = py {
        py.allow_threads(|| {
            self.runtime.spawn(task);
        })
    } else {
        // Python::with_gil is a no-op if the GIL is already held by the thread.
        Python::with_gil(|py| {
            py.allow_threads(|| {
                self.runtime.spawn(task);
            })
        })
    }
}

@westonpace (Contributor):
Is this still needed?

@wjones127 (Contributor, Author):
Yeah, that's used in read_tfrecord. The pattern I've found for now for exporting streams as RecordBatchReaders is to shove the stream onto a background task and have it push batches onto the iterator via a channel. It's a little awkward but seems to work okay.
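
A minimal sketch of that pattern, assuming a tokio runtime and an async stream of Arrow batches (the function name, channel capacity, and arrow crate paths are illustrative; the PR routes the spawn through the spawn_background helper shown above):

use std::sync::mpsc;

use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::{RecordBatch, RecordBatchIterator, RecordBatchReader};
use futures::{Stream, StreamExt};

/// Hypothetical helper: drive an async stream on the runtime and expose
/// it as a synchronous RecordBatchReader.
fn stream_to_reader<S>(
    runtime: &tokio::runtime::Runtime,
    schema: SchemaRef,
    mut stream: S,
) -> impl RecordBatchReader
where
    S: Stream<Item = Result<RecordBatch, ArrowError>> + Send + Unpin + 'static,
{
    // A bounded channel gives backpressure between the async producer
    // and the synchronous consumer.
    let (tx, rx) = mpsc::sync_channel(4);
    runtime.spawn(async move {
        while let Some(batch) = stream.next().await {
            // If the reader was dropped, stop producing.
            if tx.send(batch).is_err() {
                break;
            }
        }
    });
    RecordBatchIterator::new(rx, schema)
}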

Comment on lines 759 to 781
/// Given a potentially unaligned slice, append the slice to the builder.
fn append_primitive_from_slice<T>(
    builder: &mut PrimitiveBuilder<T>,
    slice: &[u8],
    parse_val: impl Fn(&[u8]) -> T::Native,
) where
    T: arrow::datatypes::ArrowPrimitiveType,
{
    let (prefix, middle, suffix) = unsafe { slice.align_to::<T::Native>() };
    for val in prefix.chunks_exact(T::get_byte_width()) {
        builder.append_value(parse_val(val));
    }

    builder.append_slice(middle);

    for val in suffix.chunks_exact(T::get_byte_width()) {
        builder.append_value(parse_val(val));
    }
}
@westonpace (Contributor):
Just out of curiosity, what's going on here? What's the general policy for unsafe blocks in lance?

@wjones127 (Contributor, Author):
Oh, this is to convert from a &[u8] (slice of bytes) to a &[T::Native] (f32 for a Float32Array, i64 for an Int64Array, etc.). The slice of bytes isn't guaranteed to be aligned, so this breaks it up into three pieces: the unaligned prefix, an aligned middle part, and the unaligned suffix.

> What's the general policy for unsafe blocks in lance?

We haven't established a policy, but one we could adopt is: whenever you have an unsafe block, add a comment above explaining why the next line is sound.

Here, the soundness of align_to depends on whether the bytes that form the middle piece are actually valid values for the T::Native type. I'm not 100% certain, but I don't think this is an issue for f32 or i64. Are there any 32-bit values that aren't a valid f32? Or 64-bit values that aren't a valid i64? I could see this being an issue for boolean values represented with a byte; there are only two valid values but 256 possible values, so 254 possible invalid values.
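
Applied to the f32 case, that policy might look like this minimal sketch (illustrative only, not the PR's code; it also falls back to explicit decoding when the slice start is unaligned, since then the aligned middle wouldn't coincide with element boundaries):

/// Illustrative only: sum f32 values from a possibly unaligned byte
/// slice (assumes bytes.len() is a multiple of 4 and native-endian
/// data; see the endianness discussion below).
fn sum_unaligned_f32(bytes: &[u8]) -> f32 {
    // SAFETY: every 32-bit pattern is a valid f32 (including NaNs), so
    // reinterpreting correctly aligned bytes as f32 can never produce
    // an invalid value; align_to guarantees `middle` is aligned.
    let (prefix, middle, suffix) = unsafe { bytes.align_to::<f32>() };
    if prefix.is_empty() && suffix.is_empty() {
        // Fast path: the whole slice was already aligned.
        return middle.iter().sum();
    }
    // Slow path: decode each 4-byte chunk from the start of the slice.
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_ne_bytes(c.try_into().unwrap()))
        .sum()
}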

@westonpace (Contributor):
I see now; I just read up on align_to. It's a nice convenience :). The only other thing I could think of would be endianness. For the half_val stuff we are fortunate that protobuf will handle endianness for us. However, for the tensor_content option it looks like tensorflow is just reinterpret-casting the data bytes. So if the tfrecord file is created on a machine with one endianness and then read on a machine with a different endianness, you will get back garbage.

However, maybe just a comment / to-do PR, and we can worry about it later when we have users / use cases for big-endian machines.
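
Such a to-do could eventually become an explicit decode; a minimal sketch for tensor_content, assuming the file was written by a little-endian producer (the function name is illustrative, not part of the PR):

/// Illustrative only: decode raw tensor_content bytes as f32 with the
/// byte order fixed explicitly, so big-endian readers get correct
/// values too (assumes a little-endian producer).
fn decode_f32_tensor_content(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes(c.try_into().unwrap()))
        .collect()
}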

@wjones127 force-pushed the wjones127/tfrecord-to-arrow branch from 532f175 to 33d4209 on August 30, 2023 19:01
@wjones127 force-pushed the wjones127/tfrecord-to-arrow branch from 33d4209 to 0a1ad2e on August 30, 2023 21:00
@wjones127 merged commit 3a11144 into main on Aug 30, 2023
@wjones127 deleted the wjones127/tfrecord-to-arrow branch on August 30, 2023 21:45
Linked issue: Add utility to read TFRecord file as Arrow data (#1165)