- Hadoop
- Hive
- Pig
- Spark
- Kafka
- Presto
- Impala
- Apache Drill
- Apache Flink
- Apache Beam
- Druid DB
- ZooKeeper
- Diagram
## Hadoop

See Hadoop doc.

## Hive

See Hive doc.

## Pig

See Pig doc.

## Spark

See Spark doc.

## Kafka

See Kafka doc.

## Presto

See Presto doc.

## Impala

See Impala doc.

## Apache Drill

See Apache Drill doc.
## Apache Flink

Used by Capital One: http://www.slideshare.net/FlinkForward/flink-case-study-capital-one
- streaming iterative data flow framework
- DataSet API (batch) - Java/Scala/Python
- DataStream API (streaming) - Java/Scala
### Diagram

```
Kafka -> Apache Flink -> Elasticsearch -> Kibana 4
                      \
                       -> HDFS (long term storage + batch processing)
```
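A minimal sketch of the first leg of the diagram above (Kafka -> Flink) using the DataStream API, assuming Flink 1.x with the flink-connector-kafka dependency on the classpath; the broker address, consumer group and topic name are hypothetical:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaToFlinkDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // hypothetical broker
        props.setProperty("group.id", "flink-demo");          // hypothetical consumer group

        env.addSource(new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props))
           .filter(line -> !line.isEmpty()) // stand-in for real transformations/enrichments
           .print();                        // stdout sink; Elasticsearch/HDFS sinks would go here

        // like Spark, the plan is lazy - nothing runs until execute() is called
        env.execute("kafka-to-flink-demo");
    }
}
```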
- real-time
- stateful
- checkpointing
- exactly once event processing (no duplicate re-computation)
- accurate
- event-time-based windowing - correct even when events arrive out of order or late
- flexible windowing with bounded lateness - can't wait forever for stragglers (see the sketch after this list)
- real-time alerts, transformations, enrichments and lookups with very low overhead
- advanced windowing, machine learning (event correlation, fraud detection, event clustering, anomaly detection, user session analysis)
- builds a lazy execution plan like Spark; the execute() call triggers the job
- updates its managed state automatically, without an explicit state-update function as in Spark (e.g. updateStateByKey)
- Spark must define micro-batches in either time or size
- Flink does not require defining a batch size (record-at-a-time streaming)
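A sketch of the checkpointing and event-time windowing bullets above, assuming Flink 1.11+ (the WatermarkStrategy API); the user/count tuples, the 30-second out-of-orderness bound and the 1-minute window are invented, and timestamps come from the wall clock just to keep the demo self-contained:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeWindowDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // snapshot operator state every 10s; on failure Flink replays from the last
        // checkpoint, giving exactly-once state updates (no duplicate re-computation)
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        DataStream<Tuple2<String, Long>> events = env.fromElements(
                Tuple2.of("alice", 1L), Tuple2.of("bob", 1L), Tuple2.of("alice", 1L));

        events
            // event-time watermarks: tolerate events up to 30s out of order,
            // then close the window - bounded waiting, can't wait forever
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                                 .withTimestampAssigner((event, ts) -> System.currentTimeMillis()))
            .keyBy(event -> event.f0)                              // per user
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))  // 1-minute event-time windows
            .sum(1)                                                // count per user per window
            .print();

        env.execute("event-time-window-demo");
    }
}
```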
## Apache Beam

- analytics abstraction layer
- engine backends to:
- Spark
- Flink
- Apex
- Google Cloud DataFlow
- APIs:
- Python
- Java
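A minimal Beam sketch of the abstraction layer in Java: the pipeline code is engine-neutral, and the backend is chosen at launch time with the --runner flag (e.g. SparkRunner, FlinkRunner or DataflowRunner). The in-memory input is made up:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;

public class BeamDemo {
    public static void main(String[] args) {
        // --runner=FlinkRunner etc. is parsed from args; defaults to the DirectRunner
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(Create.of("a", "b", "a"))  // hypothetical in-memory input
         .apply(Count.perElement());       // engine-neutral transform -> KV<String, Long> counts

        p.run().waitUntilFinish();
    }
}
```

The same jar can then be resubmitted to a different engine without changing the pipeline code.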
## Druid DB

See also: Pivot - an exploratory analytics UI for Druid
- OLAP-style ad-hoc, interactive, low-latency "slice-n-dice" queries
- inverted index for needle-in-a-haystack queries
- columnar DB
- optimized for scans
- real-time ingest
- fast aggregation + ingest
- rollups on ingest (may reduce storage by 100x)
- schema required (for roll-ups)
- does not support full-text search like Elasticsearch
- use Spark to process and upload results to Druid
- doesn't support full joins (only large to small table joins)
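A hedged sketch of an ad-hoc slice-n-dice query through Druid's SQL endpoint (POST to /druid/v2/sql on the broker); the broker host, "events" datasource and "channel" column are invented, while __time is Druid's built-in timestamp column:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DruidSqlDemo {
    public static void main(String[] args) throws Exception {
        // top 10 channels by event count over the last hour
        String body = "{\"query\": \"SELECT channel, COUNT(*) AS cnt "
                    + "FROM events WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR "
                    + "GROUP BY channel ORDER BY cnt DESC LIMIT 10\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://druid-broker:8082/druid/v2/sql")) // hypothetical broker
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body()); // JSON rows, aggregated by the broker
    }
}
```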
## ZooKeeper

See ZooKeeper doc.
Ported from various private Knowledge Base pages 2010+