This repo contains the sources for both the slides and the Databricks notebooks for my Introduction to Apache Spark using Frameless talk, given at ScalaIO and at Scale by the Bay in 2018.
The slides are written in Markdown and must be translated to HTML+Reveal.JS using Pandoc. The following executables must be present in your shell's PATH to build the slides:
- pandoc (version 2.3.1 or better)
- lessc (version 3.0.4 or better), for LESS stylesheet translation
- git, to check out the Reveal.js repository
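A quick pre-flight check along these lines can confirm the prerequisites before you build. It is only a sketch: `check_tools` and `version_ge` are hypothetical helpers, not part of `build.sh`, and the version parsing assumes the first line of each tool's `--version` output looks like `<name> <version>`.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of build.sh.

# True when version $1 is at least version $2 (relies on `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Print the status of each named tool; return non-zero if any is missing.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

if check_tools pandoc lessc git; then
  # Assumes the first line of --version output looks like "<name> <version>".
  pandoc_version=$(pandoc --version | head -n 1 | awk '{print $2}')
  lessc_version=$(lessc --version | head -n 1 | awk '{print $2}')
  version_ge "$pandoc_version" 2.3.1 || echo "pandoc $pandoc_version is older than 2.3.1" >&2
  version_ge "$lessc_version" 3.0.4 || echo "lessc $lessc_version is older than 3.0.4" >&2
fi
```

The script only reports problems; it does not install anything or stop the build.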
To build the slides, just run ./build.sh. It'll build a standalone slides.html file in the top-level directory.
The notebooks
folder contains the individual notebooks used during the
presentation. You'll need all three. If you want, you can import them
individually. Or, you can simply download and import the notebooks.dbc
file in this directory; it contains all three notebooks.
For information on how to import notebooks into Databricks, including Databricks Community Edition, see https://docs.databricks.com/user-guide/notebooks/notebook-manage.html#import-a-notebook
There are three notebooks:
- Defs.scala: definitions shared across the other two notebooks (each of which invokes Defs)
- 00-Create-Data-Files.scala, which downloads a data file of tweets from early 2018 and also parses a Kafka stream of current tweets, producing the new data files needed by the presentation. Follow the instructions in this notebook to create local copies of the data. BUT, also, see below.
- 01-Presentation.scala is the hands-on notebook part of the presentation.
I ran the notebooks in Databricks, with:
- Spark 2.3
- Scala 2.11
- frameless-dataset_2.11-0.7.0
- frameless-cats_2.11-0.7.0
You can use the 00-Create-Data-Files.scala notebook
to download and create the data.
However, if you'd prefer to use existing data, you can also just get existing
Parquet files from the following locations:
- https://s3.amazonaws.com/ardentex-spark/spark-frameless/tweets.parquet.zip
- https://s3.amazonaws.com/ardentex-spark/spark-frameless/old-tweets.parquet.zip
My recommendation:
- Download those zip files.
- Unzip them.
- Upload them to your own S3 bucket.
- In a Databricks workspace (such as Databricks Community Edition), mount your S3 bucket to DBFS.
- Update the paths (in the Defs.scala notebook) to point to your S3 bucket.
- Enjoy.
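The download, unzip, and upload steps above could be scripted roughly as follows. This is a sketch under several assumptions: the bucket name `my-example-bucket` and the `spark-frameless/` key prefix are hypothetical, each archive is assumed to unpack to a directory named after the zip file minus its `.zip` suffix, and `curl`, `unzip`, and the AWS CLI are assumed to be installed. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of the download / unzip / upload steps. Dry run by default:
# commands are printed, not executed. Set DRY_RUN=0 to run them.
set -eu

BASE_URL="https://s3.amazonaws.com/ardentex-spark/spark-frameless"
BUCKET="${BUCKET:-my-example-bucket}"   # hypothetical bucket name
DRY_RUN="${DRY_RUN:-1}"

# Echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Download one archive, unpack it, and copy the result to the bucket.
# Assumes the archive unpacks to a directory named "$name".
stage_dataset() {
  local name="$1"
  run curl -fsSLO "$BASE_URL/$name.zip"
  run unzip -q -o "$name.zip"
  run aws s3 cp --recursive "$name" "s3://$BUCKET/spark-frameless/$name"
}

for name in tweets.parquet old-tweets.parquet; do
  stage_dataset "$name"
done
```

Review the printed commands first, then rerun with `DRY_RUN=0 BUCKET=<your bucket>` once they look right. Mounting the bucket and updating the paths still happen inside Databricks.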
Feel free to drop me email ([email protected]) if you need help.