[DOCS] Creates a data frame examples page #389

Merged
merged 33 commits into from
Jul 2, 2019
Changes from 30 commits
Commits
1eccf7e
Added dataframes intro file to the ML section, amended overview.ascii…
szabosteve May 29, 2019
e4a91a3
Fixed paragraph style.
szabosteve May 29, 2019
ca68a7a
Fixed markup.
szabosteve May 29, 2019
016d36f
Fixed markup.
szabosteve May 29, 2019
1916513
Rephrased sentences to improve readability.
szabosteve May 29, 2019
65f6dab
Fixed header.
szabosteve May 30, 2019
a564230
Extended the text: pivoting, aggregation, continuous data frames.
szabosteve May 31, 2019
236ba3a
Amended the promise about continuous data frames.
szabosteve May 31, 2019
04051ba
Fixed typo.
szabosteve May 31, 2019
7e89296
Added simple example to the intro.
szabosteve May 31, 2019
00e6e92
Quick fix.
szabosteve May 31, 2019
a09d80f
Amended the text based on the technical and peer reviews.
szabosteve Jun 11, 2019
09246fe
Added beta tag to the page.
szabosteve Jun 11, 2019
648cb7d
Improve readability.
szabosteve Jun 12, 2019
0ed95f8
Adds screenshot to the example.
szabosteve Jun 12, 2019
17fb7cd
Merge branch 'master' of github.com:elastic/stack-docs
szabosteve Jun 13, 2019
59d2520
Merge branch 'master' of github.com:elastic/stack-docs
szabosteve Jun 13, 2019
5fb3569
Merge branch 'master' of github.com:elastic/stack-docs
szabosteve Jun 17, 2019
a80e070
Merge branch 'master' of github.com:elastic/stack-docs
szabosteve Jun 24, 2019
c93b440
Merge branch 'master' of github.com:elastic/stack-docs
szabosteve Jun 26, 2019
1bafd48
[DOCS] Creates a data frame examples page.
szabosteve Jun 26, 2019
a90c521
Fixes titleabbrev.
szabosteve Jun 26, 2019
f79da15
Adds sample response to the web log example.
szabosteve Jun 26, 2019
6a18931
[DOCS] Shortens data frame examples navigation title
lcawl Jun 27, 2019
6e16553
Fine-tunes the web log example.
szabosteve Jun 27, 2019
ea3c148
Adds best customer example to the page.
szabosteve Jun 27, 2019
3e0f07d
Adds Kibana sample data reference to the example intro.
szabosteve Jun 27, 2019
498ed35
Adds airline example section.
szabosteve Jun 27, 2019
bc03c8c
Adds the flight example to the example pool.
szabosteve Jun 27, 2019
3da0fec
Fixes typos.
szabosteve Jun 27, 2019
9e54139
Addresses feedback.
szabosteve Jun 28, 2019
d2e55c9
Addresses feedback.
szabosteve Jul 1, 2019
80ea756
Addresses feedback.
szabosteve Jul 2, 2019
362 changes: 362 additions & 0 deletions docs/en/stack/data-frames/dataframe-examples.asciidoc
@@ -0,0 +1,362 @@
[role="xpack"]
[testenv="basic"]
[[dataframe-examples]]
== {dataframe-cap} examples
++++
<titleabbrev>Examples</titleabbrev>
++++

beta[]

This page provides examples of how to use {dataframe-transforms} to derive useful
insights from your data. All the examples use one of the
{kibana-ref}/add-sample-data.html[{kib} sample datasets]. For a more detailed,
step-by-step example, see
<<ecommerce-dataframes,Transforming your data with {dataframes}>>.

* <<example-best-customers>>
* <<example-airline>>
* <<example-clientips>>

[float]
[[example-best-customers]]
=== Finding your best customers

In this example, we use the eCommerce orders sample dataset to find the customers
who spent the most in our hypothetical webshop. Let's transform the data such
that the destination index contains, for each customer, the number of orders,
the total price of the orders, the average price per order, the average number
of unique products per order, and the total number of unique products ordered.

[source,js]
----------------------------------
POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_ecommerce"
  },
  "dest" : {
    "index" : "sample_ecommerce_orders_by_customer"
  },
  "pivot": {
    "group_by": { <1>
      "user": { "terms": { "field": "user" }},
      "customer_id": { "terms": { "field": "customer_id" }}
    },
    "aggregations": {
      "order_count": { "value_count": { "field": "order_id" }},
      "total_order_amt": { "sum": { "field": "taxful_total_price" }},
      "avg_amt_per_order": { "avg": { "field": "taxful_total_price" }},
      "avg_unique_products_per_order": { "avg": { "field": "total_unique_products" }},
      "total_unique_products": { "cardinality": { "field": "products.product_id" }}
    }
  }
}
----------------------------------
// CONSOLE
// TEST[skip:SETUP]

<1> Two `group_by` fields have been selected. This means the {dataframe} will
contain a unique row per `user` and `customer_id` combination. Within this
dataset, both of these fields are unique. Including both in the {dataframe}
gives more context to the final results.

NOTE: In the example above, condensed JSON formatting has been used for easier
readability of the pivot object.

The API returns the following response (this example is shortened and shows
only the first document object):

[source,js]
----------------------------------
{
  "preview" : [
    {
      "total_order_amt" : 3946.9765625,
      "order_count" : 59.0,
      "total_unique_products" : 116.0,
      "avg_unique_products_per_order" : 2.0,
      "customer_id" : "10",
      "user" : "recip",
      "avg_amt_per_order" : 66.89790783898304
    },
    ...
  ]
}
----------------------------------
// CONSOLE
// TEST[skip:SETUP]

The example above transforms order data into a customer-centric index. This
makes it easier to answer questions such as:

* Which customers spend the most?

* Which customers spend the most per order?

* Which customers order most often?

* Which customers ordered the fewest different products?

It's possible to answer these questions using aggregations alone; however,
{dataframes} allow us to persist this data as a customer-centric index. This
enables us to analyze data at scale and gives more flexibility to explore and
navigate data from a customer-centric perspective. In some cases, it can even
make creating visualizations much simpler.
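
To make the pivot more concrete, the following Python sketch mimics what the
`group_by` and `aggregations` sections above compute, using a few invented
orders held in memory. It is an illustration only, not how {es} executes a
{dataframe-transform}: the sample values and the flattened `product_ids` field
are assumptions, and the real `cardinality` aggregation is approximate whereas
this sketch counts exactly.

```python
from collections import defaultdict

# Hypothetical orders. Field names loosely mirror the eCommerce sample dataset;
# the values and the flattened "product_ids" list are invented for illustration.
orders = [
    {"customer_id": "10", "user": "recip", "order_id": 1,
     "taxful_total_price": 50.0, "total_unique_products": 2,
     "product_ids": [101, 102]},
    {"customer_id": "10", "user": "recip", "order_id": 2,
     "taxful_total_price": 30.0, "total_unique_products": 2,
     "product_ids": [102, 103]},
    {"customer_id": "11", "user": "mary", "order_id": 3,
     "taxful_total_price": 20.0, "total_unique_products": 1,
     "product_ids": [101]},
]


def pivot_by_customer(docs):
    """Group orders by (user, customer_id) and build one summary row per group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[(doc["user"], doc["customer_id"])].append(doc)

    rows = []
    for (user, customer_id), group in sorted(groups.items()):
        prices = [d["taxful_total_price"] for d in group]
        unique_products = {p for d in group for p in d["product_ids"]}
        rows.append({
            "user": user,
            "customer_id": customer_id,
            "order_count": len(group),                       # value_count
            "total_order_amt": sum(prices),                  # sum
            "avg_amt_per_order": sum(prices) / len(prices),  # avg
            "avg_unique_products_per_order":
                sum(d["total_unique_products"] for d in group) / len(group),
            "total_unique_products": len(unique_products),   # exact distinct count
        })
    return rows


rows = pivot_by_customer(orders)

# "Which customers spend the most?" becomes a lookup on the summary rows.
best = max(rows, key=lambda row: row["total_order_amt"])
```

Once the data has one row per customer, each of the questions above reduces to
sorting or filtering on a single column.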

[float]
[[example-airline]]
=== Finding air carriers with the most delays

In this example, we use the Flights sample dataset to find out which air carrier
had the most delays. First, we filter the source data by using a query filter so
that it excludes all the cancelled flights. Then we transform the data to
contain the number of flights, the sum of delayed minutes, and the sum of the
flight minutes by air carrier. Finally, we use a `bucket_script` to determine
what percentage of the flight time was actually delay.

[source,js]
----------------------------------
POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_flights",
    "query": { <1>
      "bool": {
        "filter": [
          { "term": { "Cancelled": false } }
        ]
      }
    }
  },
  "dest" : {
    "index" : "sample_flight_delays_by_carrier"
  },
  "pivot": {
    "group_by": { <2>
      "carrier": { "terms": { "field": "Carrier" }}
    },
    "aggregations": {
      "flights_count": { "value_count": { "field": "FlightNum" }},
      "delay_mins_total": { "sum": { "field": "FlightDelayMin" }},
      "flight_mins_total": { "sum": { "field": "FlightTimeMin" }},
      "delay_time_percentage": { <3>
        "bucket_script": {
          "buckets_path": {
            "delay_time": "delay_mins_total.value",
            "flight_time": "flight_mins_total.value"
          },
          "script": "(params.delay_time / params.flight_time) * 100"
        }
      }
    }
  }
}
----------------------------------
// CONSOLE
// TEST[skip:SETUP]

<1> Filter the source data to select only flights that were not cancelled.
<2> The data is grouped by the `Carrier` field, which contains the airline name.
<3> This `bucket_script` performs calculations on the results that are returned
by the aggregation. In this particular example, it calculates what percentage of
travel time was taken up by delays.

The API returns the following response:

[source,js]
----------------------------------
{
  "preview" : [
    {
      "carrier" : "ES-Air",
      "flights_count" : 2802.0,
      "flight_mins_total" : 1436927.5130677223,
      "delay_time_percentage" : 9.335543983955839,
      "delay_mins_total" : 134145.0
    },
    {
      "carrier" : "JetBeats",
      "flights_count" : 2833.0,
      "flight_mins_total" : 1451143.6898144484,
      "delay_time_percentage" : 8.937088787987832,
      "delay_mins_total" : 129690.0
    },
    {
      "carrier" : "Kibana Airlines",
      "flights_count" : 2832.0,
      "flight_mins_total" : 1419081.404241085,
      "delay_time_percentage" : 9.088273556017194,
      "delay_mins_total" : 128970.0
    },
    {
      "carrier" : "Logstash Airways",
      "flights_count" : 2914.0,
      "flight_mins_total" : 1503620.8713908195,
      "delay_time_percentage" : 9.544959286661593,
      "delay_mins_total" : 143520.0
    }
  ]
}
----------------------------------
// CONSOLE
// TEST[skip:SETUP]

The example above transforms flight data into an entity-centric index for
air carriers. This makes it easier to answer questions such as:

* Which air carrier has the most delays as a percentage of flight time?

NOTE: This data is fictional and does not reflect actual delays or flight stats
for any of the featured destination or origin airports.
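
As a sanity check, the `bucket_script` arithmetic can be reproduced outside
{es}. The Python sketch below applies the same formula to the
`delay_mins_total` and `flight_mins_total` values from the preview response
above. The function and variable names are our own, and the sketch assumes the
flight time is never zero.

```python
def delay_time_percentage(delay_mins_total, flight_mins_total):
    """Mirror the bucket_script: share of flight time that was delay, in percent."""
    return (delay_mins_total / flight_mins_total) * 100


# (delay_mins_total, flight_mins_total) copied from the preview response.
carriers = {
    "ES-Air": (134145.0, 1436927.5130677223),
    "JetBeats": (129690.0, 1451143.6898144484),
    "Kibana Airlines": (128970.0, 1419081.404241085),
    "Logstash Airways": (143520.0, 1503620.8713908195),
}

percentages = {
    name: delay_time_percentage(delay, flight)
    for name, (delay, flight) in carriers.items()
}

# The carrier with the highest share of delayed flight time.
worst = max(percentages, key=percentages.get)
```

With the sample values, `worst` is the carrier whose `delay_time_percentage` is
the highest in the response.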

[float]
[[example-clientips]]
=== Finding suspicious client IPs by using scripted metrics

With {dataframe-transforms}, you can use
{ref}/search-aggregations-metrics-scripted-metric-aggregation.html[scripted
metric aggregations] on your data. These aggregations are flexible and make
it possible to perform very complex processing. Let's use scripted metrics to
identify suspicious client IPs in the web log sample dataset.

We transform the data such that the new index contains the sum of bytes and the
number of distinct URLs, agents, incoming requests by location, and geographic
destinations for each client IP. We also use a scripted metric aggregation to
count the specific types of HTTP responses that each client IP receives.
Ultimately, the example below transforms web log data into an entity-centric
index where the entity is `clientip`.

[source,js]
----------------------------------
POST _data_frame/transforms/_preview
{
  "source": {
    "index": "kibana_sample_data_logs",
    "query": { <1>
      "range" : {
        "timestamp" : {
          "gte" : "now-30d/d"
        }
      }
    }
  },
  "dest" : {
    "index" : "sample_weblogs_by_clientip"
  },
  "pivot": {
    "group_by": { <2>
      "clientip": { "terms": { "field": "clientip" } }
    },
    "aggregations": {
      "url_dc": { "cardinality": { "field": "url.keyword" }},
      "bytes_sum": { "sum": { "field": "bytes" }},
      "geo.src_dc": { "cardinality": { "field": "geo.src" }},
      "agent_dc": { "cardinality": { "field": "agent.keyword" }},
      "geo.dest_dc": { "cardinality": { "field": "geo.dest" }},
      "responses.total": { "value_count": { "field": "timestamp" }},
      "responses.counts": { <3>
        "scripted_metric": {
          "init_script": "state.responses = ['error':0L,'success':0L,'other':0L]",
          "map_script": """
            def code = doc['response.keyword'].value;
            if (code.startsWith('5') || code.startsWith('4')) {
              state.responses.error += 1;
            } else if (code.startsWith('2')) {
              state.responses.success += 1;
            } else {
              state.responses.other += 1;
            }
            """,
          "combine_script": "state.responses",
          "reduce_script": """
            def counts = ['error': 0L, 'success': 0L, 'other': 0L];
            for (responses in states) {
              counts.error += responses['error'];
              counts.success += responses['success'];
              counts.other += responses['other'];
            }
            return counts;
            """
        }
      },
      "timestamp.min": { "min": { "field": "timestamp" }},
      "timestamp.max": { "max": { "field": "timestamp" }},
      "timestamp.duration_ms": { <4>
        "bucket_script": {
          "buckets_path": {
            "min_time": "timestamp.min.value",
            "max_time": "timestamp.max.value"
          },
          "script": "(params.max_time - params.min_time)"
        }
      }
    }
  }
}
----------------------------------
// CONSOLE
// TEST[skip:SETUP]

<1> This range query limits the transform to documents that are within the
last 30 days at the point in time the {dataframe-transform} is started.
<2> The data is grouped by the `clientip` field.
<3> This `scripted_metric` performs a distributed operation on the web log data
to count specific types of HTTP responses (error, success, and other).
<4> This `bucket_script` calculates the duration of the `clientip` access based
on the results of the aggregation.
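
The four phases of the `scripted_metric` can be easier to follow outside of
Painless. The Python sketch below imitates the `init_script`, `map_script`,
`combine_script`, and `reduce_script` from the request above for a single client
IP. It is a simplified model, not the way {es} runs the aggregation: the split
of documents across two shards and the sample response codes are invented for
illustration.

```python
def init_state():
    # init_script: each shard starts with its own zeroed counters.
    return {"responses": {"error": 0, "success": 0, "other": 0}}


def map_doc(state, response_code):
    # map_script: classify the HTTP response code of one document.
    if response_code.startswith(("4", "5")):
        state["responses"]["error"] += 1
    elif response_code.startswith("2"):
        state["responses"]["success"] += 1
    else:
        state["responses"]["other"] += 1


def combine(state):
    # combine_script: hand back the partial result of one shard.
    return state["responses"]


def reduce_states(states):
    # reduce_script: merge the partial results from all shards.
    counts = {"error": 0, "success": 0, "other": 0}
    for responses in states:
        for key in counts:
            counts[key] += responses[key]
    return counts


# Hypothetical response codes for one client IP, spread over two shards.
shards = [["200", "200", "404"], ["503", "304", "200"]]

partials = []
for shard_docs in shards:
    state = init_state()
    for code in shard_docs:
        map_doc(state, code)
    partials.append(combine(state))

counts = reduce_states(partials)
```

Because each shard only returns its small map of counters, the reduce phase
merges a handful of numbers rather than the underlying documents.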

The API returns the following response (this example is shortened and shows
only the first document object):

[source,js]
----------------------------------
{
  "preview" : [
    {
      "geo" : {
        "src_dc" : 12.0,
        "dest_dc" : 9.0
      },
      "clientip" : "0.72.176.46",
      "agent_dc" : 3.0,
      "responses" : {
        "total" : 14.0,
        "counts" : {
          "other" : 0,
          "success" : 14,
          "error" : 0
        }
      },
      "bytes_sum" : 74808.0,
      "timestamp" : {
        "duration_ms" : 4.919943239E9,
        "min" : "2019-06-17T07:51:57.333Z",
        "max" : "2019-08-13T06:31:00.572Z"
      },
      "url_dc" : 11.0
    },
    ...
  ]
}
----------------------------------
// CONSOLE
// TEST[skip:SETUP]
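
The `timestamp.duration_ms` value in the response can be cross-checked against
the `min` and `max` timestamps. The Python sketch below repeats the subtraction
that the `bucket_script` performs. The helper name `duration_ms` is our own, and
the sketch assumes both timestamps are UTC strings in the format shown in the
response.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def duration_ms(min_ts, max_ts):
    """Return the difference between two UTC timestamps in milliseconds."""
    start = datetime.strptime(min_ts, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    end = datetime.strptime(max_ts, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)
    return (end - start) / timedelta(milliseconds=1)


# Timestamps copied from the preview response above.
elapsed = duration_ms("2019-06-17T07:51:57.333Z", "2019-08-13T06:31:00.572Z")
```

The result agrees with the `4.919943239E9` milliseconds reported for this
client IP, which is roughly 57 days.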

This {dataframe-transform} makes it easier to answer questions such as:

* Which client IPs are transferring the largest amounts of data?

* Which client IPs are interacting with a high number of different URLs?

* Which client IPs have high error rates?

* Which client IPs are interacting with a high number of destination countries?
3 changes: 2 additions & 1 deletion docs/en/stack/data-frames/index.asciidoc
@@ -1,3 +1,4 @@
include::dataframes.asciidoc[]
include::ecommerce-example.asciidoc[]
include::api-quickref.asciidoc[]
include::api-quickref.asciidoc[]
include::dataframe-examples.asciidoc[]