
Dataframe queries 0: bootstrap & data model #7338

Merged
merged 8 commits into main from cmc/dataframe_queries_0_boilerplate on Sep 4, 2024

Conversation

teh-cmc
Member

@teh-cmc teh-cmc commented Sep 3, 2024

All the boilerplate for the new re_dataframe.

Also introduces all the new types:

  • QueryExpression, LatestAtQueryExpression, RangeQueryExpression
  • QueryHandle, LatestAtQueryHandle (unimplemented), RangeQueryHandle (unimplemented)
  • ColumnDescriptor, ControlColumnDescriptor, TimeColumnDescriptor, ComponentColumnDescriptor

No actual code logic, just definitions.
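
For orientation, here is a rough sketch of what the two concrete query expression types might look like, inferred purely from how the example programs later in this series construct them; the field types and doc comments are assumptions, not the actual re_dataframe definitions:

```rust
// Hypothetical sketch only: field names follow the examples further down in this
// series, but the concrete field types are guesses.
use re_chunk::{TimeInt, Timeline};
use re_chunk_store::ComponentColumnDescriptor;
use re_log_types::ResolvedTimeRange;

pub struct LatestAtQueryExpression {
    /// Which entities to query, e.g. `"/helix/structure/scaffolding/**"`.
    pub entity_path_expr: String,
    /// Which timeline to query on, e.g. `Timeline::log_time()`.
    pub timeline: Timeline,
    /// The query time; `TimeInt::MAX` means "as of the latest data".
    pub at: TimeInt,
}

pub struct RangeQueryExpression {
    pub entity_path_expr: String,
    pub timeline: Timeline,
    /// E.g. `ResolvedTimeRange::new(0, 30)`.
    pub time_range: ResolvedTimeRange,
    /// The point-of-view column that drives which rows are yielded
    /// (see the review discussion below).
    pub pov: ComponentColumnDescriptor,
}
```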


Dataframe APIs PR series:

Checklist

  • I have read and agree to the Contributor Guide and the Code of Conduct
  • I've included a screenshot or gif (if applicable)
  • I have tested the web demo (if applicable)
  • The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG
  • If applicable, add a new check to the release checklist!
  • I have noted any breaking changes to the log API in CHANGELOG.md and the migration guide

To run all checks from main, comment on the PR with @rerun-bot full-check.

@teh-cmc teh-cmc added the ⛃ re_datastore (affects the datastore itself), 🔍 re_query (affects re_query itself), do-not-merge (Do not merge this PR), and include in changelog labels on Sep 3, 2024
@teh-cmc teh-cmc marked this pull request as ready for review September 3, 2024 11:33
/// multiple rows at a given timestamp.
//
// TODO(cmc): issue for multi-pov support
pub pov: ComponentColumnDescriptor,
Member Author


Using a full-blown ComponentColumnDescriptor for the pov has already proven to be horrible in practice.
We'll improve on that in the follow-up series.

Member

@abey79 abey79 left a comment


Very promising!

/// `None` if the data wasn't logged through an archetype.
///
/// Example: `rerun.archetypes.Points3D`.
pub archetype_name: Option<ArchetypeName>,
Member


Do we have a clear plan to populate this in the short term? best effort if no ambiguity based on the indicator component? but that would require accessing codegen'd reflection from re_dataframe, right?

Or should we drop this until we have tagged components?

Member Author


Do we have a clear plan to populate this in the short term? best effort if no ambiguity based on the indicator component? but that would require accessing codegen'd reflection from re_dataframe, right?

I investigated such shenanigans and quickly came to the conclusion that the pain wasn't worth it.

Or should we drop this until we have tagged components?

That is where my investigation led me 🙃
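
As an aside, here is a hedged sketch of what the rejected "best effort from the indicator component" heuristic might have looked like; the helper name and the string-based convention check are purely illustrative and not anything that ships in re_dataframe:

```rust
/// Purely illustrative: indicator components follow a naming convention like
/// `rerun.components.Points3DIndicator` for `rerun.archetypes.Points3D`, so a
/// best-effort archetype name can be derived from an indicator's name alone.
/// Figuring out *which* indicator applies (and whether it's unambiguous) is the
/// part that would have required store access / codegen'd reflection.
fn archetype_name_from_indicator(component_name: &str) -> Option<String> {
    let short = component_name
        .strip_prefix("rerun.components.")?
        .strip_suffix("Indicator")?;
    Some(format!("rerun.archetypes.{short}"))
}

#[test]
fn indicator_heuristic() {
    assert_eq!(
        archetype_name_from_indicator("rerun.components.Points3DIndicator").as_deref(),
        Some("rerun.archetypes.Points3D")
    );
    // Regular (non-indicator) components don't map to an archetype this way.
    assert_eq!(
        archetype_name_from_indicator("rerun.components.Position3D"),
        None
    );
}
```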


/// Describes a time column, such as `log_time`.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TimeColumnDescriptor {
Member


The other two columns include a:

pub component_name: ComponentName,

Keeping one here too feels like it could be nice symmetry. In particular, I'm thinking this eventually becomes IndexColumnDescriptor with timeline just being one of the component types that supports indexing.

Member Author


I wouldn't mind it but for now I wouldn't know what to put in it and I'm in a hurry to ship this 😶
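
For the record, a purely hypothetical sketch of the symmetry being suggested (not part of this PR; what to actually put in `component_name` for a plain timeline is exactly the open question above):

```rust
// Hypothetical only, imports elided: what TimeColumnDescriptor could grow into
// if it generalizes to an IndexColumnDescriptor.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct IndexColumnDescriptor {
    /// The timeline (or, eventually, any indexable component) this column is indexed on.
    pub timeline: Timeline,
    /// Mirrors the `component_name` carried by the other two descriptors.
    pub component_name: ComponentName,
}
```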

Comment on lines 283 to 287
    ArrowField::new(
        component_name.short_name().to_owned(),
        datatype.clone(),
        false, /* nullable */
    )
Member

@jleibs jleibs Sep 4, 2024


In order to successfully build an arrow-rs RecordBatch using the produced schema, I had to make the following change:

Suggested change, replacing:

    ArrowField::new(
        component_name.short_name().to_owned(),
        datatype.clone(),
        false, /* nullable */
    )

with:

    ArrowField::new(
        component_name.short_name().to_owned(),
        ArrowDatatype::List(std::sync::Arc::new(ArrowField::new(
            "item",
            datatype.clone(),
            true, /* is_nullable = true; This seems backwards to me. */
        ))),
        false, /* is_nullable = false; This seems backwards to me. */
    )

However, this change appears to require nullability flags that are backwards from my intuition. I would have expected the INNER list to be non-nullable -- we can't null individual elements within a batch -- but the OUTER list to be nullable, since individual rows can be null.

Member Author


The way you've set those nullability flags makes sense to me:

  • for a list datatype, the nullability flag refers to its items.
  • for a top-level field, the nullability flag should logically describe whether the entire field can be missing from the RecordBatch payload -- though from what I've seen in practice it's mostly treated as gibberish, and you have to set it to whatever value makes the system you're working with happy.

The fact that the list layer is missing at all is a very real problem though >< I forgot that TransportChunk made that one automagically work ✨
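
To make that concrete outside rerun's own types, here is a minimal, self-contained arrow-rs sketch of the convention from the suggested change above: the inner list items are declared nullable, the top-level column field is not. The `positions` column name and `Int64` item type are made up for illustration.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Builder, ListBuilder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), arrow::error::ArrowError> {
    // Inner item field: nullable, as in the suggested change above.
    let item_field = Arc::new(Field::new("item", DataType::Int64, true));
    // Outer column field: the list column itself is declared non-nullable.
    let column_field = Field::new("positions", DataType::List(item_field), false);

    // One row containing the list [1, 2].
    let mut builder = ListBuilder::new(Int64Builder::new());
    builder.values().append_value(1);
    builder.values().append_value(2);
    builder.append(true);
    let column: ArrayRef = Arc::new(builder.finish());

    let schema = Arc::new(Schema::new(vec![column_field]));
    let batch = RecordBatch::try_new(schema, vec![column])?;
    println!("{batch:?}");

    Ok(())
}
```

Note that `ListBuilder` produces a list whose item field is named `item` and marked nullable by default, so the schema has to carry those exact flags for `RecordBatch::try_new` to accept the column.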

@teh-cmc teh-cmc merged commit e685541 into main Sep 4, 2024
27 of 30 checks passed
@teh-cmc teh-cmc deleted the cmc/dataframe_queries_0_boilerplate branch September 4, 2024 07:59
teh-cmc added a commit that referenced this pull request Sep 4, 2024
The schema resolution logic.

* Part of #7284 

---

Dataframe APIs PR series:
- #7338
- #7339
- #7340
- #7341
- #7345
@teh-cmc teh-cmc removed the do-not-merge Do not merge this PR label Sep 4, 2024
teh-cmc added a commit that referenced this pull request Sep 4, 2024
Implements the latest-at dataframe API.

Examples:
```
cargo r --all-features -p re_dataframe --example latest_at -- /tmp/helix.rrd
cargo r --all-features -p re_dataframe --example latest_at -- /tmp/helix.rrd /helix/structure/scaffolding/**
```

```rust
use itertools::Itertools as _;

use re_chunk::{TimeInt, Timeline};
use re_chunk_store::{ChunkStore, ChunkStoreConfig, LatestAtQueryExpression, VersionPolicy};
use re_dataframe::QueryEngine;
use re_log_types::StoreKind;

fn main() -> anyhow::Result<()> {
    let args = std::env::args().collect_vec();

    let get_arg = |i| {
        let Some(value) = args.get(i) else {
            eprintln!(
                "Usage: {} <path_to_rrd> <entity_path_expr>",
                args.first().map_or("$BIN", |s| s.as_str())
            );
            std::process::exit(1);
        };
        value
    };

    let path_to_rrd = get_arg(1);
    let entity_path_expr = args.get(2).map_or("/**", |s| s.as_str());

    let stores = ChunkStore::from_rrd_filepath(
        &ChunkStoreConfig::DEFAULT,
        path_to_rrd,
        VersionPolicy::Warn,
    )?;

    for (store_id, store) in &stores {
        if store_id.kind != StoreKind::Recording {
            continue;
        }

        let cache = re_dataframe::external::re_query::Caches::new(store);
        let engine = QueryEngine {
            store,
            cache: &cache,
        };

        let query = LatestAtQueryExpression {
            entity_path_expr: entity_path_expr.into(),
            timeline: Timeline::log_time(),
            at: TimeInt::MAX,
        };

        let query_handle = engine.latest_at(&query, None /* columns */);
        let batch = query_handle.get();

        eprintln!("{query}:\n{batch}");
    }

    Ok(())
}
```

* Part of #7284 

---

Dataframe APIs PR series:
- #7338
- #7339
- #7340
- #7341
- #7345
teh-cmc added a commit that referenced this pull request Sep 4, 2024
Implements the dense range dataframe APIs.

Examples:
```
cargo r --all-features -p re_dataframe --example range -- /tmp/data.rrd /helix/structure/scaffolding/beads
cargo r --all-features -p re_dataframe --example range -- /tmp/data.rrd /helix/structure/scaffolding/beads /helix/structure/scaffolding/**
```

```rust
use itertools::Itertools as _;

use re_chunk_store::{
    ChunkStore, ChunkStoreConfig, ComponentColumnDescriptor, RangeQueryExpression, Timeline,
    VersionPolicy,
};
use re_dataframe::QueryEngine;
use re_log_types::{ResolvedTimeRange, StoreKind};

fn main() -> anyhow::Result<()> {
    let args = std::env::args().collect_vec();

    let get_arg = |i| {
        let Some(value) = args.get(i) else {
            eprintln!(
                "Usage: {} <path_to_rrd_with_position3ds> <entity_path_pov> [entity_path_expr]",
                args.first().map_or("$BIN", |s| s.as_str())
            );
            std::process::exit(1);
        };
        value
    };

    let path_to_rrd = get_arg(1);
    let entity_path_pov = get_arg(2).as_str();
    let entity_path_expr = args.get(3).map_or("/**", |s| s.as_str());

    let stores = ChunkStore::from_rrd_filepath(
        &ChunkStoreConfig::DEFAULT,
        path_to_rrd,
        VersionPolicy::Warn,
    )?;

    for (store_id, store) in &stores {
        if store_id.kind != StoreKind::Recording {
            continue;
        }

        let cache = re_dataframe::external::re_query::Caches::new(store);
        let engine = QueryEngine {
            store,
            cache: &cache,
        };

        let query = RangeQueryExpression {
            entity_path_expr: entity_path_expr.into(),
            timeline: Timeline::log_tick(),
            time_range: ResolvedTimeRange::new(0, 30),
            pov: ComponentColumnDescriptor::new::<re_types::components::Position3D>(
                entity_path_pov.into(),
            ),
        };

        let query_handle = engine.range(&query, None /* columns */);
        eprintln!("{query}:");
        for batch in query_handle.into_iter() {
            eprintln!("{batch}");
        }
    }

    Ok(())
}
```

* Fixes #7284 

---

Dataframe APIs PR series:
- #7338
- #7339
- #7340
- #7341
- #7345
teh-cmc added a commit that referenced this pull request Sep 4, 2024
Implements the paginated dense range dataframe APIs.

If there's no off-by-one anywhere in there, I will eat my hat.
Getting this in the hands of people is the highest prio though, I'll add
tests later.


![image](https://github.com/user-attachments/assets/e865ba62-21db-41c1-9899-35a0e7aea134)

![image](https://github.com/user-attachments/assets/32934ba8-2673-401a-aafc-409dfbe9b2c5)


* Fixes #7284 

---

Dataframe APIs PR series:
- #7338
- #7339
- #7340
- #7341
- #7345