Add more data frame examples #388

Closed
2 tasks
lcawl opened this issue Jun 25, 2019 · 4 comments
lcawl commented Jun 25, 2019

This issue tracks the addition of more examples similar to https://www.elastic.co/guide/en/elastic-stack-overview/master/ecommerce-dataframes.html

  • Duration aggregation using bucket_script
  • Count of errors / Count of 200’s using scripted_metric

lcawl commented Jun 25, 2019

ecommerce - simple pivot - finding my best customers

POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_ecommerce"
  },
  "dest" : {
    "index" : "sample_ecommerce_orders_by_customer"
  },
  "pivot": {
    "group_by": { --[1]
      "user": { "terms": { "field": "user" }}, 
      "customer_id": { "terms": { "field": "customer_id" }}
    },
    "aggregations": {
      "order_count": { "value_count": { "field": "order_id" }},
      "total_order_amt": { "sum": { "field": "taxful_total_price" }},
      "avg_amt_per_order": { "avg": { "field": "taxful_total_price" }},
      "avg_unique_products_per_order": { "avg": { "field": "total_unique_products" }},
      "total_unique_products": { "cardinality": { "field": "products.product_id" }}
    }
  }
}
  • [1] Two group_by fields have been selected, so the data frame will contain a unique row per user + customer_id combination. Within this dataset both fields are unique; including both in the data frame gives more context to the final results.

Note: condensed JSON formatting has been used for easier readability of the pivot object.

The above example transforms order data into a customer-centric index. This makes it easier to answer questions such as:

  • Which customers spend the most?
  • Which customers spend the most per order?
  • Which customers order most often?
  • Which customers ordered the least number of different products?

It is possible to answer these questions using aggregations alone; however, data frames let you persist this data as a customer-centric index. This enables you to analyze data at scale, gives you more flexibility to explore and navigate the data from a customer-centric perspective, and in some cases makes creating visualizations much simpler.
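To see what the pivoted rows look like without a running cluster, the group_by + aggregations above can be approximated in plain Python. This is a minimal sketch, not how the transform actually executes; the toy documents and their values are invented, while the field names come from the request above:

```python
from collections import defaultdict

# Toy stand-in for documents in kibana_sample_data_ecommerce (values invented).
orders = [
    {"user": "eddie", "customer_id": 38, "order_id": 1, "taxful_total_price": 53.98, "total_unique_products": 2},
    {"user": "eddie", "customer_id": 38, "order_id": 2, "taxful_total_price": 20.00, "total_unique_products": 1},
    {"user": "mary",  "customer_id": 20, "order_id": 3, "taxful_total_price": 100.00, "total_unique_products": 4},
]

# Two-field group_by: one bucket per user + customer_id combination.
buckets = defaultdict(list)
for doc in orders:
    buckets[(doc["user"], doc["customer_id"])].append(doc)

# Per-bucket aggregations, mirroring the pivot's value_count, sum, and avg.
pivot = {}
for key, docs in buckets.items():
    prices = [d["taxful_total_price"] for d in docs]
    uniques = [d["total_unique_products"] for d in docs]
    pivot[key] = {
        "order_count": len(docs),
        "total_order_amt": sum(prices),
        "avg_amt_per_order": sum(prices) / len(prices),
        "avg_unique_products_per_order": sum(uniques) / len(uniques),
    }
print(pivot)
```

Each key of `pivot` corresponds to one row of the resulting customer-centric index.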


lcawl commented Jun 25, 2019

flights stats - finding airline carriers with the most delays

POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_flights",
    "query": { --[1]
      "bool": {
        "filter": [
          { "term":  { "Cancelled": false } }
        ]
      }
    }
  },
  "dest" : {
    "index" : "sample_flight_delays_by_carrier"
  },
  "pivot": {
    "group_by": { --[2] 
      "carrier": { "terms": { "field": "Carrier" }}
    },
    "aggregations": {
      "flights_count": { "value_count": { "field": "FlightNum" }},
      "delay_mins_total": { "sum": { "field": "FlightDelayMin" }},
      "flight_mins_total": { "sum": { "field": "FlightTimeMin" }},
      "delay_time_percentage": { --[3]
        "bucket_script": {
          "buckets_path": {
            "delay_time": "delay_mins_total.value",
            "flight_time": "flight_mins_total.value"
          },
          "script": "(params.delay_time / params.flight_time) * 100"
        }
      }
    }
  }
}
  • [1] Filter the source data to select only flights that were not cancelled
  • [2] Group by Carrier, which is the airline carrier name
  • [3] Use a bucket script to perform calculations on results returned by the aggregation, in this case to calculate what percentage of travel time was taken up by delays
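The bucket_script runs once per carrier bucket, over the values of the two sum aggregations. A hedged Python equivalent of the Painless expression `(params.delay_time / params.flight_time) * 100` (the example numbers are invented):

```python
def delay_time_percentage(delay_mins_total, flight_mins_total):
    """Percentage of total flight time taken up by delays, per bucket.

    Like the Painless script above, this does not guard against
    division by zero (see the possible todos noted below).
    """
    return (delay_mins_total / flight_mins_total) * 100

# Example bucket: 500 delay minutes out of 20,000 flight minutes.
print(delay_time_percentage(500, 20_000))  # 2.5
```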

The above example transforms flight data into an entity-centric index for flight carriers. This makes it easier to answer questions such as:

  • Which airline carrier has the most delays as a percentage of flight time?

Please note that this data is fictional and does not reflect actual delays or flight stats for any of the featured destination or origin airports.

Possible todos: protect against division by zero; use flights_count to weight averages.


lcawl commented Jun 25, 2019

web logs - with scripted metrics - finding suspicious clientips

POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_logs",
    "query": { --[1]
      "range" : {
        "timestamp" : {
          "gte" : "now-30d/d"
        }
      }
    }    
  },
  "dest" : {
    "index" : "sample_weblogs_by_clientip"
  },
  "pivot": {
    "group_by": {  --[2]
      "clientip": { "terms": { "field": "clientip" } }
    },
    "aggregations": {
      "url_dc": { "cardinality": { "field": "url.keyword" }},
      "bytes_sum": { "sum": { "field": "bytes" }},
      "geo.src_dc": { "cardinality": { "field": "geo.src" }},
      "agent_dc": { "cardinality": { "field": "agent.keyword" }},
      "geo.dest_dc": { "cardinality": { "field": "geo.dest" }},
      "responses.total": { "value_count": { "field": "timestamp" }},
      "responses.counts": { --[3]
        "scripted_metric": { 
          "init_script": "state.responses = ['error':0L,'success':0L,'other':0L]",
          "map_script": """
            def code = doc['response.keyword'].value;
            if (code.startsWith('5') || code.startsWith('4')) {
              state.responses.error += 1 ;
            } else if(code.startsWith('2')) {
              state.responses.success += 1;
            } else {
              state.responses.other += 1;
            }
            """,
          "combine_script": "state.responses",
          "reduce_script": """
            def counts = ['error': 0L, 'success': 0L, 'other': 0L];
            for (responses in states) {
              counts.error += responses['error'];
              counts.success += responses['success'];
              counts.other += responses['other'];
            }
            return counts;
            """
          }
        },
      "timestamp.min": { "min": { "field": "timestamp" }},
      "timestamp.max": { "max": { "field": "timestamp" }},
      "timestamp.duration_ms": { --[4]
        "bucket_script": {
          "buckets_path": {
            "min_time": "timestamp.min.value",
            "max_time": "timestamp.max.value"
          },
          "script": "(params.max_time - params.min_time)"
        }
      }
    }
  }
}
  • [1] Use a range query to limit analysis to documents that are within the last 30 days at the point in time the data frame task is executed
  • [2] Group by clientip
  • [3] Use scripted_metric to perform a distributed operation on the web log data, counting specific types of HTTP responses
  • [4] Use a bucket_script to calculate how long each clientip was active, based on the min and max timestamps returned by the aggregation
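The init/map/combine/reduce flow of the scripted_metric above can be sketched in plain Python. This is an approximation of the execution model, not the real thing: map runs per document on each shard, combine returns each shard's state, and reduce merges the per-shard states. The shard contents below are invented:

```python
def map_doc(state, response_code):
    # map_script: classify one document's HTTP response code.
    if response_code.startswith("5") or response_code.startswith("4"):
        state["error"] += 1
    elif response_code.startswith("2"):
        state["success"] += 1
    else:
        state["other"] += 1

def reduce_states(states):
    # reduce_script: merge the per-shard counts into one result.
    counts = {"error": 0, "success": 0, "other": 0}
    for responses in states:
        for key in counts:
            counts[key] += responses[key]
    return counts

# Two imaginary shards, each holding a few response codes.
shards = [["200", "404", "200"], ["503", "301"]]
states = []
for shard in shards:
    state = {"error": 0, "success": 0, "other": 0}  # init_script
    for code in shard:                              # map_script, per document
        map_doc(state, code)
    states.append(state)                            # combine_script returns the state

print(reduce_states(states))  # {'error': 2, 'success': 2, 'other': 1}
```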

The above example transforms web log data into a clientip-centric index. This makes it easier to answer questions such as:

  • Which clientips are transferring the largest amounts of data?
  • Which clientips are interacting with many different URLs?
  • Which clientips have high error rates?
  • Which clientips are interacting with many destination countries?
    (or combinations of these questions)

This is a basic example of a scripted metric. Scripted metrics are very flexible and allow for very complex processing.

szabosteve commented

I picked up the web logs and best customers scenarios.
Created a draft PR for the examples: #389
It assumes we'll put all the examples on one page; if that turns out not to be the case, I'll amend it accordingly.

lcawl closed this as completed Mar 24, 2020