Define DSL for analysis and query span data #1811
Comments
Would it make sense to start by implementing GraphQL (#169)?
@yurishkuro I am wondering whether we want the library to work in a distributed way (like a Spark RDD)? In previous discussions we have also mentioned that we could reuse existing graph traversal frameworks, e.g. Gremlin. I am not sure whether GraphQL #169 provides the same capabilities or whether it is used only for UI integrations.
We should also think about the use cases this feature would solve:
I am not sure how we could use this query language without the backend supporting it. To use Gremlin we would have to provide a Gremlin-compatible layer to allow query execution. @jaegertracing/data-analytics @yurishkuro any ideas? Maybe running the query on a subset of the data directly in-memory would work.
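For illustration, a minimal sketch of the in-memory idea, assuming plain TinkerPop/TinkerGraph and made-up span properties (this is not an agreed API, just what querying a small subset of spans without backend support could look like):

```java
import java.util.List;

import org.apache.tinkerpop.gremlin.process.traversal.P;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.T;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class InMemorySpanQuery {
  public static void main(String[] args) {
    // Build a tiny span graph in memory; in practice the spans would be
    // fetched from jaeger-query, a JSON file, or a storage backend.
    TinkerGraph graph = TinkerGraph.open();
    Vertex parent = graph.addVertex(T.label, "span",
        "operationName", "HTTP GET /orders", "duration", 250L);
    Vertex child = graph.addVertex(T.label, "span",
        "operationName", "SELECT orders", "duration", 180L);
    parent.addEdge("child", child);

    // Run the query directly on the in-memory subset: spans slower than 100.
    GraphTraversalSource g = graph.traversal();
    List<Vertex> slow = g.V().has("duration", P.gt(100L)).toList();
    slow.forEach(v -> System.out.println(v.value("operationName")));
  }
}
```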
Would it be easier if we could curate traces from a (relatively) complex system that someone in the community is running in production and would volunteer to publish? It would move the focus from data collection to actual analysis, and it would also help different teams collate and confirm results while working on the same data set. I didn't dig very deep, but this seems relevant - https://github.com/google/cluster-data
@pavolloffay there are several parts to the DSL/library:

1. A way to define a stream of traces

This may include: …

In case of a source providing just spans, there needs to be a pre-aggregation step that assembles them into traces. This creates interesting challenges when done on a live stream as opposed to historical data, since on historical data we can simply group-by, while with a live stream we need to use window aggregation. The output of the first step is an RDD-like stream of traces.

2. Filtering step

This is where the first part of the DSL comes in - how to express a query on a trace, when the trace is represented as a graph. Joe's proposal didn't really address the graph nature of the trace, only filtering conditions on individual spans (which could also be a valid use case).

3. Evaluation / feature extraction step

The second part of the DSL - expressing feature extraction computation on the graph, like Facebook's Canopy example in the issue description. Note an interesting thing in that example - it operates on a trace almost like on a flat collection of spans. They probably have expressions that can walk the graph, like …

I think the minimum DSL we need is just an ability to walk the in-memory representation of the trace as a graph (i.e. …). In other words, what we need is just a data model, and maybe some simple helper functions for finding things, like …
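To make the "data model plus simple helper functions" idea concrete, here is a rough sketch with entirely made-up class and method names (Span, Trace, and findSpans are illustrative, not a proposed API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical in-memory trace model: a trace is just the root of a span tree.
class Span {
  String operationName;
  long durationMicros;
  List<Span> children = new ArrayList<>();

  Span(String operationName, long durationMicros) {
    this.operationName = operationName;
    this.durationMicros = durationMicros;
  }
}

class Trace {
  Span root;

  Trace(Span root) {
    this.root = root;
  }

  // Simple helper: depth-first walk collecting the spans that match a predicate.
  List<Span> findSpans(Predicate<Span> predicate) {
    List<Span> result = new ArrayList<>();
    collect(root, predicate, result);
    return result;
  }

  private void collect(Span span, Predicate<Span> predicate, List<Span> out) {
    if (predicate.test(span)) {
      out.add(span);
    }
    for (Span child : span.children) {
      collect(child, predicate, out);
    }
  }
}
```

With something like this, the feature extraction step is just plain code walking the graph, e.g. `trace.findSpans(s -> s.durationMicros > 100_000)`.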
I have started defining the DSL with Gremlin in https://github.com/pavolloffay/jaeger-tracedsl. Here is an example from the app class https://github.com/pavolloffay/jaeger-tracedsl/blob/master/src/main/java/io/jaegertracing/dsl/gremlin/App.java:

```java
TraceTraversalSource traceSource = graph.traversal(TraceTraversalSource.class);
GraphTraversal<Vertex, Vertex> spans = traceSource
    .hasTag(Tags.SPAN_KIND.getKey(), Tags.SPAN_KIND_CLIENT)
    .duration(P.gt(100));

for (Vertex v : spans.toList()) {
  System.out.println(v.label());
  System.out.println(v.property(Keys.OPERATION_NAME).value());
  System.out.println(v.keys());
}
```

You can see what the filtering and extraction look like. The API allows using the trace DSL and the core Gremlin API at the same time. This is a simple example, but it should be possible to do things like:
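For instance, a purely hypothetical traversal (not code that exists in the repository) combining the `hasTag`/`duration` steps above with core Gremlin steps could look like this, assuming child spans are reachable via outgoing edges:

```java
// Reuses traceSource from the example above; __ is
// org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.
// Hypothetical query: slow client spans with at least one error-tagged child.
GraphTraversal<Vertex, Vertex> slowWithErrors = traceSource
    .hasTag(Tags.SPAN_KIND.getKey(), Tags.SPAN_KIND_CLIENT) // trace DSL step
    .duration(P.gt(100))                                     // trace DSL step
    .where(__.out().has(Tags.ERROR.getKey(), true));         // core Gremlin steps
```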
Any suggestions are welcome. My next step would be:
@yurishkuro - Is this aggregator component available in open source?
I have made some progress in my repository. The repository so far contains:

- Gremlin trace DSL - defined methods for easier filtering and iteration over the graph (extraction)

The next steps are:
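As a small illustration of the "iteration over graph (extraction)" part, a plain Gremlin aggregation over span vertices might look roughly like this (it reuses `graph` and `Keys.OPERATION_NAME` from the earlier example; whether the DSL exposes this exact shape is an assumption):

```java
import java.util.Map;

// Hypothetical extraction: count spans per operation name across the graph,
// using the core Gremlin groupCount step rather than a dedicated DSL method.
Map<Object, Long> spansPerOperation = graph.traversal().V()
    .groupCount()
    .by(Keys.OPERATION_NAME)
    .next();
System.out.println(spansPerOperation);
```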
It would be great if somebody could help with moving the protos to IDL and configuring the build process for different languages #1213.
Created based on #1639 (comment).
Define a domain-specific language (DSL) for analyzing and querying span data. An example from Facebook's Canopy system:
The library should be able to connect to any span source - jaeger-query, a JSON file, storage.
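For example, the "any span source" requirement could boil down to a small interface like the following sketch (TraceSource and Trace are hypothetical names, not an agreed design):

```java
import java.util.Iterator;

// Placeholder for whatever in-memory trace representation the library settles on.
class Trace {}

// Hypothetical abstraction over span sources: jaeger-query, a JSON file, storage, etc.
// Each implementation would fetch spans, assemble them into traces, and hand them over.
interface TraceSource {
  Iterator<Trace> traces();
}
```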
DSL in Canopy https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/
cc @jaegertracing/data-analytics