Dataframe queries 0: bootstrap & data model #7338
Conversation
```rust
/// multiple rows at a given timestamp.
//
// TODO(cmc): issue for multi-pov support
pub pov: ComponentColumnDescriptor,
```
Using a full-blown `ComponentColumnDescriptor` for the pov has already proven to be horrible in practice. We'll improve on that in the follow-up series.
Very promising!
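A minimal sketch of what a slimmer pov selector could look like, assuming the follow-up only needs to identify the pov component (this type and its fields are hypothetical, not the actual follow-up design; string types stand in for the real `EntityPath`/`ComponentName`):

```rust
// Hypothetical: a point-of-view selector carrying only what is needed to
// identify the pov component, instead of a full ComponentColumnDescriptor.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct PovSelector {
    /// Entity path of the pov component, e.g. "/helix/structure/scaffolding/beads".
    entity_path: String,
    /// Fully-qualified component name, e.g. "rerun.components.Position3D".
    component_name: String,
}

fn main() {
    let pov = PovSelector {
        entity_path: "/helix/structure/scaffolding/beads".to_owned(),
        component_name: "rerun.components.Position3D".to_owned(),
    };
    println!("{pov:?}");
}
```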
```rust
/// `None` if the data wasn't logged through an archetype.
///
/// Example: `rerun.archetypes.Points3D`.
pub archetype_name: Option<ArchetypeName>,
```
Do we have a clear plan to populate this in the short term? Best effort if there's no ambiguity, based on the indicator component? But that would require accessing codegen'd reflection from `re_dataframe`, right?
Or should we drop this until we have tagged components?
> Do we have a clear plan to populate this in the short term? Best effort if there's no ambiguity, based on the indicator component? But that would require accessing codegen'd reflection from `re_dataframe`, right?

I investigated such shenanigans and quickly came to the conclusion that the pain wasn't worth it.

> Or should we drop this until we have tagged components?

That is where my investigation led me 🙃
```rust
/// Describes a time column, such as `log_time`.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TimeColumnDescriptor {
```
The other two columns include a `pub component_name: ComponentName,`. Keeping one here too feels like it could be nice symmetry. In particular, I'm thinking this eventually becomes `IndexColumnDescriptor`, with timeline just being one of the component types that supports indexing.
I wouldn't mind it but for now I wouldn't know what to put in it and I'm in a hurry to ship this 😶
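To illustrate the `IndexColumnDescriptor` idea floated above, here is a minimal sketch assuming a timeline is just one kind of index a column can be keyed on (the types and fields are hypothetical; strings stand in for the real timeline/component name types):

```rust
// Hypothetical: what an IndexColumnDescriptor could look like if timelines
// were just one of several index kinds.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum IndexKind {
    /// A timeline index, e.g. `log_time` or `log_tick`.
    Timeline { name: String },
    /// Some other component that supports indexing (hypothetical).
    Component { component_name: String },
}

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct IndexColumnDescriptor {
    kind: IndexKind,
}

fn main() {
    let log_time = IndexColumnDescriptor {
        kind: IndexKind::Timeline {
            name: "log_time".to_owned(),
        },
    };
    println!("{log_time:?}");
}
```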
```rust
ArrowField::new(
    component_name.short_name().to_owned(),
    datatype.clone(),
    false, /* nullable */
)
```
In order to successfully build an arrow-rs `RecordBatch` using the produced schema, I had to make the following change:

```rust
// Before:
ArrowField::new(
    component_name.short_name().to_owned(),
    datatype.clone(),
    false, /* nullable */
)

// After:
ArrowField::new(
    component_name.short_name().to_owned(),
    ArrowDatatype::List(std::sync::Arc::new(ArrowField::new(
        "item",
        datatype.clone(),
        true, /* is_nullable = true; This seems backwards to me. */
    ))),
    false, /* is_nullable = false; This seems backwards to me. */
)
```
However, this change appears to require nullability flags that are backwards from my intuition. I would have expected the INNER list to be non-nullable -- we can't null individual elements within a batch -- but the OUTER list to be nullable, since individual rows can be null.
The way you've set those nullability flags makes sense to me:
- for a list datatype, the nullability flag refers to its items.
- for a top-level field, the nullability flag should logically describe whether the entire field can be missing from the record batch payload -- though from what I've seen in practice it mostly just means gibberish and you have to set it to whatever value makes the system you're working with happy.

The fact that the list layer is missing at all is a very real problem though >< I forgot that `TransportChunk` made that one automagically work ✨
Implements the latest-at dataframe API.

Examples:
```
cargo r --all-features -p re_dataframe --example latest_at -- /tmp/helix.rrd
cargo r --all-features -p re_dataframe --example latest_at -- /tmp/helix.rrd /helix/structure/scaffolding/**
```
```rust
use itertools::Itertools as _;

use re_chunk::{TimeInt, Timeline};
use re_chunk_store::{ChunkStore, ChunkStoreConfig, LatestAtQueryExpression, VersionPolicy};
use re_dataframe::QueryEngine;
use re_log_types::StoreKind;

fn main() -> anyhow::Result<()> {
    let args = std::env::args().collect_vec();

    let get_arg = |i| {
        let Some(value) = args.get(i) else {
            eprintln!(
                "Usage: {} <path_to_rrd> <entity_path_expr>",
                args.first().map_or("$BIN", |s| s.as_str())
            );
            std::process::exit(1);
        };
        value
    };

    let path_to_rrd = get_arg(1);
    let entity_path_expr = args.get(2).map_or("/**", |s| s.as_str());

    let stores = ChunkStore::from_rrd_filepath(
        &ChunkStoreConfig::DEFAULT,
        path_to_rrd,
        VersionPolicy::Warn,
    )?;

    for (store_id, store) in &stores {
        if store_id.kind != StoreKind::Recording {
            continue;
        }

        let cache = re_dataframe::external::re_query::Caches::new(store);
        let engine = QueryEngine {
            store,
            cache: &cache,
        };

        let query = LatestAtQueryExpression {
            entity_path_expr: entity_path_expr.into(),
            timeline: Timeline::log_time(),
            at: TimeInt::MAX,
        };

        let query_handle = engine.latest_at(&query, None /* columns */);
        let batch = query_handle.get();

        eprintln!("{query}:\n{batch}");
    }

    Ok(())
}
```

* Part of #7284

---

Dataframe APIs PR series:
- #7338
- #7339
- #7340
- #7341
- #7345
Implements the dense range dataframe APIs.

Examples:
```
cargo r --all-features -p re_dataframe --example range -- /tmp/data.rrd /helix/structure/scaffolding/beads
cargo r --all-features -p re_dataframe --example range -- /tmp/data.rrd /helix/structure/scaffolding/beads /helix/structure/scaffolding/**
```
```rust
use itertools::Itertools as _;

use re_chunk_store::{
    ChunkStore, ChunkStoreConfig, ComponentColumnDescriptor, RangeQueryExpression, Timeline,
    VersionPolicy,
};
use re_dataframe::QueryEngine;
use re_log_types::{ResolvedTimeRange, StoreKind};

fn main() -> anyhow::Result<()> {
    let args = std::env::args().collect_vec();

    let get_arg = |i| {
        let Some(value) = args.get(i) else {
            eprintln!(
                "Usage: {} <path_to_rrd_with_position3ds> <entity_path_pov> [entity_path_expr]",
                args.first().map_or("$BIN", |s| s.as_str())
            );
            std::process::exit(1);
        };
        value
    };

    let path_to_rrd = get_arg(1);
    let entity_path_pov = get_arg(2).as_str();
    let entity_path_expr = args.get(3).map_or("/**", |s| s.as_str());

    let stores = ChunkStore::from_rrd_filepath(
        &ChunkStoreConfig::DEFAULT,
        path_to_rrd,
        VersionPolicy::Warn,
    )?;

    for (store_id, store) in &stores {
        if store_id.kind != StoreKind::Recording {
            continue;
        }

        let cache = re_dataframe::external::re_query::Caches::new(store);
        let engine = QueryEngine {
            store,
            cache: &cache,
        };

        let query = RangeQueryExpression {
            entity_path_expr: entity_path_expr.into(),
            timeline: Timeline::log_tick(),
            time_range: ResolvedTimeRange::new(0, 30),
            pov: ComponentColumnDescriptor::new::<re_types::components::Position3D>(
                entity_path_pov.into(),
            ),
        };

        let query_handle = engine.range(&query, None /* columns */);
        eprintln!("{query}:");
        for batch in query_handle.into_iter() {
            eprintln!("{batch}");
        }
    }

    Ok(())
}
```

* Fixes #7284

---

Dataframe APIs PR series:
- #7338
- #7339
- #7340
- #7341
- #7345
Implements the paginated dense range dataframe APIs. If there's no off-by-one anywhere in there, I will eat my hat. Getting this in the hands of people is the highest prio though; I'll add tests later.

* Fixes #7284

---

Dataframe APIs PR series:
- #7338
- #7339
- #7340
- #7341
- #7345
All the boilerplate for the new `re_dataframe`. Also introduces all the new types:
- `QueryExpression`, `LatestAtQueryExpression`, `RangeQueryExpression`
- `QueryHandle`, `LatestAtQueryHandle` (unimplemented), `RangeQueryHandle` (unimplemented)
- `ColumnDescriptor`, `ControlColumnDescriptor`, `TimeColumnDescriptor`, `ComponentColumnDescriptor`

No actual code logic, just definitions.
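To make the relationship between the descriptor types listed above concrete, here is a minimal sketch; the real definitions live in the rerun codebase, and the field choices here are hypothetical (strings stand in for the real path/name types):

```rust
// Hypothetical, illustrative-only versions of the descriptor types.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct ControlColumnDescriptor {
    /// e.g. "rerun.controls.RowId" (hypothetical field).
    component_name: String,
}

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct TimeColumnDescriptor {
    /// e.g. "log_time" (hypothetical field).
    timeline: String,
}

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct ComponentColumnDescriptor {
    /// e.g. "/helix/structure/scaffolding/beads" (hypothetical field).
    entity_path: String,
    /// e.g. "rerun.components.Position3D" (hypothetical field).
    component_name: String,
}

/// A column in a query result is one of the three kinds above.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum ColumnDescriptor {
    Control(ControlColumnDescriptor),
    Time(TimeColumnDescriptor),
    Component(ComponentColumnDescriptor),
}

fn main() {
    let col = ColumnDescriptor::Time(TimeColumnDescriptor {
        timeline: "log_time".to_owned(),
    });
    println!("{col:?}");
}
```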
Dataframe APIs PR series:
- #7338
- #7339
- #7340
- #7341
- #7345

Checklist
- Using examples from latest `main` build: rerun.io/viewer
- Using full set of examples from `nightly` build: rerun.io/viewer
- `CHANGELOG.md` and the migration guide

To run all checks from `main`, comment on the PR with `@rerun-bot full-check`.