Add more data frame examples #388

Closed
2 tasks
lcawl opened this issue Jun 25, 2019 · 4 comments
lcawl commented Jun 25, 2019

This issue tracks the addition of more examples similar to https://www.elastic.co/guide/en/elastic-stack-overview/master/ecommerce-dataframes.html

  • Duration aggregation using bucket_script
  • Count of errors / Count of 200’s using scripted_metric

lcawl commented Jun 25, 2019

ecommerce - simple pivot - finding my best customers

POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_ecommerce"
  },
  "dest" : {
    "index" : "sample_ecommerce_orders_by_customer"
  },
  "pivot": {
    "group_by": { --[1]
      "user": { "terms": { "field": "user" }}, 
      "customer_id": { "terms": { "field": "customer_id" }}
    },
    "aggregations": {
      "order_count": { "value_count": { "field": "order_id" }},
      "total_order_amt": { "sum": { "field": "taxful_total_price" }},
      "avg_amt_per_order": { "avg": { "field": "taxful_total_price" }},
      "avg_unique_products_per_order": { "avg": { "field": "total_unique_products" }},
      "total_unique_products": { "cardinality": { "field": "products.product_id" }}
    }
  }
}
  • [1] Two group_by fields have been selected, so the data frame will contain a unique row per user + customer_id combination. Within this dataset both fields are unique; including both in the data frame gives more context to the final results.

Note: condensed JSON formatting has been used for easier readability of the pivot object.

The above example transforms order data into a customer-centric index. This makes it easier to answer questions such as:

  • Which customers spend the most?
  • Which customers spend the most per order?
  • Which customers order most often?
  • Which customers ordered the least number of different products?

It is possible to answer these questions using aggregations alone; however, data frames let you persist this data as a customer-centric index. This enables you to analyze data at scale, gives you more flexibility to explore and navigate the data from a customer-centric perspective, and in some cases makes creating visualizations much simpler.
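To see what the pivoted rows look like without a running cluster, the group_by + aggregations above can be approximated in plain Python. This is a minimal sketch, not how the transform actually executes; the toy documents and their values are invented, while the field names come from the request above:

```python
from collections import defaultdict

# Toy stand-in for documents in kibana_sample_data_ecommerce (values invented).
orders = [
    {"user": "eddie", "customer_id": 38, "order_id": 1, "taxful_total_price": 53.98, "total_unique_products": 2},
    {"user": "eddie", "customer_id": 38, "order_id": 2, "taxful_total_price": 20.00, "total_unique_products": 1},
    {"user": "mary",  "customer_id": 20, "order_id": 3, "taxful_total_price": 100.00, "total_unique_products": 4},
]

# Two-field group_by: one bucket per user + customer_id combination.
buckets = defaultdict(list)
for doc in orders:
    buckets[(doc["user"], doc["customer_id"])].append(doc)

# Per-bucket aggregations, mirroring the pivot's value_count, sum, and avg.
pivot = {}
for key, docs in buckets.items():
    prices = [d["taxful_total_price"] for d in docs]
    uniques = [d["total_unique_products"] for d in docs]
    pivot[key] = {
        "order_count": len(docs),
        "total_order_amt": sum(prices),
        "avg_amt_per_order": sum(prices) / len(prices),
        "avg_unique_products_per_order": sum(uniques) / len(uniques),
    }
print(pivot)
```

Each key of `pivot` corresponds to one row of the resulting customer-centric index.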


lcawl commented Jun 25, 2019

flights stats - finding airline carriers with the most delays

POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_flights",
    "query": { --[1]
      "bool": {
        "filter": [
          { "term":  { "Cancelled": false } }
        ]
      }
    }
  },
  "dest" : {
    "index" : "sample_flight_delays_by_carrier"
  },
  "pivot": {
    "group_by": { --[2] 
      "carrier": { "terms": { "field": "Carrier" }}
    },
    "aggregations": {
      "flights_count": { "value_count": { "field": "FlightNum" }},
      "delay_mins_total": { "sum": { "field": "FlightDelayMin" }},
      "flight_mins_total": { "sum": { "field": "FlightTimeMin" }},
      "delay_time_percentage": { --[3]
        "bucket_script": {
          "buckets_path": {
            "delay_time": "delay_mins_total.value",
            "flight_time": "flight_mins_total.value"
          },
          "script": "(params.delay_time / params.flight_time) * 100"
        }
      }
    }
  }
}
  • [1] Filter the source data to select only flights that were not cancelled
  • [2] Group by Carrier, which is the airline carrier name
  • [3] Use a bucket script to perform calculations on results returned by the aggregation, in this case to calculate what percentage of travel time was taken up by delays
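The bucket_script runs once per carrier bucket, over the values of the two sum aggregations. A hedged Python equivalent of the Painless expression `(params.delay_time / params.flight_time) * 100` (the example numbers are invented):

```python
def delay_time_percentage(delay_mins_total, flight_mins_total):
    """Percentage of total flight time taken up by delays, per bucket.

    Like the Painless script above, this does not guard against
    division by zero (see the possible todos noted below).
    """
    return (delay_mins_total / flight_mins_total) * 100

# Example bucket: 500 delay minutes out of 20,000 flight minutes.
print(delay_time_percentage(500, 20_000))  # 2.5
```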

The above example transforms flight data into an entity-centric index for flight carriers. This makes it easier to answer questions such as:

  • Which airline carrier has the most delays as a percentage of flight time?

Please note that this data is fictional and does not reflect actual delays or flight stats for any of the featured destination or origin airports.

Possible todos: protect against division by zero; use flights_count to weight averages.


lcawl commented Jun 25, 2019

web logs - with scripted metrics - finding suspicious clientips

POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_logs",
    "query": { --[1]
      "range" : {
        "timestamp" : {
          "gte" : "now-30d/d"
        }
      }
    }    
  },
  "dest" : {
    "index" : "sample_weblogs_by_clientip"
  },
  "pivot": {
    "group_by": {  --[2]
      "clientip": { "terms": { "field": "clientip" } }
    },
    "aggregations": {
      "url_dc": { "cardinality": { "field": "url.keyword" }},
      "bytes_sum": { "sum": { "field": "bytes" }},
      "geo.src_dc": { "cardinality": { "field": "geo.src" }},
      "agent_dc": { "cardinality": { "field": "agent.keyword" }},
      "geo.dest_dc": { "cardinality": { "field": "geo.dest" }},
      "responses.total": { "value_count": { "field": "timestamp" }},
      "responses.counts": { --[3]
        "scripted_metric": { 
          "init_script": "state.responses = ['error':0L,'success':0L,'other':0L]",
          "map_script": """
            def code = doc['response.keyword'].value;
            if (code.startsWith('5') || code.startsWith('4')) {
              state.responses.error += 1 ;
            } else if(code.startsWith('2')) {
              state.responses.success += 1;
            } else {
              state.responses.other += 1;
            }
            """,
          "combine_script": "state.responses",
          "reduce_script": """
            def counts = ['error': 0L, 'success': 0L, 'other': 0L];
            for (responses in states) {
              counts.error += responses['error'];
              counts.success += responses['success'];
              counts.other += responses['other'];
            }
            return counts;
            """
          }
        },
      "timestamp.min": { "min": { "field": "timestamp" }},
      "timestamp.max": { "max": { "field": "timestamp" }},
      "timestamp.duration_ms": { --[4]
        "bucket_script": {
          "buckets_path": {
            "min_time": "timestamp.min.value",
            "max_time": "timestamp.max.value"
          },
          "script": "(params.max_time - params.min_time)"
        }
      }
    }
  }
}
  • [1] Use a range query to limit analysis to documents that are within the last 30 days at the point in time the data frame task is executed
  • [2] Group by clientip
  • [3] Use scripted_metric to perform a distributed operation on the web log data, counting specific types of HTTP responses
  • [4] Use a bucket_script to calculate how long each clientip was active, based on the min and max timestamps returned by the aggregation
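The init/map/combine/reduce flow of the scripted_metric above can be sketched in plain Python. This is an approximation of the execution model, not the real thing: map runs per document on each shard, combine returns each shard's state, and reduce merges the per-shard states. The shard contents below are invented:

```python
def map_doc(state, response_code):
    # map_script: classify one document's HTTP response code.
    if response_code.startswith("5") or response_code.startswith("4"):
        state["error"] += 1
    elif response_code.startswith("2"):
        state["success"] += 1
    else:
        state["other"] += 1

def reduce_states(states):
    # reduce_script: merge the per-shard counts into one result.
    counts = {"error": 0, "success": 0, "other": 0}
    for responses in states:
        for key in counts:
            counts[key] += responses[key]
    return counts

# Two imaginary shards, each holding a few response codes.
shards = [["200", "404", "200"], ["503", "301"]]
states = []
for shard in shards:
    state = {"error": 0, "success": 0, "other": 0}  # init_script
    for code in shard:                              # map_script, per document
        map_doc(state, code)
    states.append(state)                            # combine_script returns the state

print(reduce_states(states))  # {'error': 2, 'success': 2, 'other': 1}
```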

The above example transforms web log data into a clientip-centric index. This makes it easier to answer questions such as:

  • Which clientips are transferring the largest amounts of data?
  • Which clientips are interacting with many different URLs?
  • Which clientips have high error rates?
  • Which clientips are interacting with many destination countries?
    (or combinations of these questions)

This is a basic example of a scripted metric. Scripted metrics are very flexible and allow for very complex processing.

szabosteve commented

I picked up the web logs and best customers scenarios.
Created a draft PR for the examples: #389
It assumes we'll put all the examples on one page; if that turns out not to be the case, I'll amend it accordingly.

lcawl closed this as completed Mar 24, 2020