Definition

RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operator [Spark paper]

Enables efficient data reuse & hence allow for efficient iterative algorithms
Coarse-grained transformations are applied with persisted logging to enable the data to be recomputed (hence fault-tolerant).

Usage

Best suited to maps over a large dataset (simple lineage, small commit log)

unsuited to fine grained shared state (large commit log)

Comparison with Distributed shared memory

Aspect	RDDs	DSM
Reads	Coarse- or fine-grained	Fine-grained
Writes	Coarse-grained	Fine-grained
Consistency	Trivial	Up to app/runtime
Fault Recovery	Fine grained & low overhead using lineage (can reproduce from operators)	Checkpoints and program rollback
Straggler Mitigation	Possible using backup tasks	Difficult
Work Placement	automatic based on data locality	Up to app (runtimes try to be transparent)
Out of memory	Same as existing flow systems	Poor performance due to swap

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resilient Distributed Datasets.md

Resilient Distributed Datasets.md

Definition

Usage

Comparison with Distributed shared memory

Files

Resilient Distributed Datasets.md

Latest commit

History

Resilient Distributed Datasets.md

File metadata and controls

Definition

Usage

Comparison with Distributed shared memory