In this project, we use a real-world dataset of San Francisco crime incidents, extracted from Kaggle, and provide statistical analyses of the data using Apache Spark Structured Streaming. We create a Kafka server to produce the data and ingest it through Spark Structured Streaming.
- Spark 2.4.3
- Scala 2.11.x
- Java 1.8.x
- Kafka build with Scala 2.11.x
- Python 3.6.x or 3.7.x
- Run `./start.sh` to install the project requirements. If you use pip rather than conda, run `pip install -r requirements.txt` instead.
- Start the Zookeeper server:
  `/usr/bin/zookeeper-server-start config/zookeeper.properties`
- Open a new terminal and start the Kafka server:
  `/usr/bin/kafka-server-start.sh config/server.properties`
- Open a new terminal and run the Kafka producer (a sketch of this script appears after these steps):
  `python kafka_server.py`
- Open a new terminal and start a console consumer to verify that the topic is receiving data:
  `/usr/bin/kafka-console-consumer --bootstrap-server localhost:9091 --topic service.calls --from-beginning`
- Run the Spark job:
  `spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 --master local[*] data_stream.py`
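The exact contents of `kafka_server.py` depend on the project code, but a minimal sketch of a producer that reads the source JSON file and publishes one record per second to the topic used above might look like the following; the input file name and the use of the `kafka-python` package are assumptions:

```python
# kafka_server.py -- hypothetical sketch, not the project's actual implementation.
import json
import time

from kafka import KafkaProducer  # assumes the kafka-python package is installed


def run_producer(input_file="police-department-calls-for-service.json",  # assumed file name
                 topic="service.calls",
                 bootstrap_servers="localhost:9091"):
    # Serialize each record as UTF-8 encoded JSON before sending it to Kafka.
    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    with open(input_file) as f:
        records = json.load(f)

    # Emit roughly one record per second.
    for record in records:
        producer.send(topic, value=record)
        time.sleep(1)

    producer.flush()


if __name__ == "__main__":
    run_producer()
```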
- How did changing values on the SparkSession property parameters affect the throughput and latency of the data?
The Kafka producer emits messages at a rate of about one per second, and with the current SparkSession capping `maxOffsetsPerTrigger` at 200, changing the property parameters does not make a noticeable difference in throughput or latency given the small size of the dataset.
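For context, a minimal sketch of how `data_stream.py` might subscribe to the topic and cap each micro-batch with `maxOffsetsPerTrigger` is shown below; the broker address and topic name follow the consumer command above, while `startingOffsets` and the app name are assumptions:

```python
# Hypothetical sketch of the Kafka source setup in data_stream.py.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("KafkaSparkStructuredStreaming")
         .getOrCreate())

# Subscribe to the topic; maxOffsetsPerTrigger limits how many offsets
# are processed in each micro-batch.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9091")
      .option("subscribe", "service.calls")
      .option("startingOffsets", "earliest")
      .option("maxOffsetsPerTrigger", 200)
      .load())

# Kafka delivers the payload as bytes; cast it to a string before parsing the JSON.
kafka_df = df.selectExpr("CAST(value AS STRING)")
```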
- What were the 2-3 most efficient SparkSession property key/value pairs? Through testing multiple variations on values, how can you tell these were the most optimal?
Properties: `spark.sql.shuffle.partitions`, `spark.streaming.kafka.maxRatePerPartition`, `spark.default.parallelism`.
To determine whether we have reached the optimal combination of these property values, we compare runs and look for the configuration that yields the highest `processedRowsPerSecond` in the streaming progress reports.
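As a rough illustration of how these properties can be set, here is a sketch on the SparkSession builder; the specific values below are placeholders that would need to be tuned against the observed `processedRowsPerSecond`:

```python
# Hypothetical sketch: setting the properties discussed above on the SparkSession.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("KafkaSparkStructuredStreaming")
         # Number of partitions used for shuffles in joins and aggregations.
         .config("spark.sql.shuffle.partitions", 8)
         # Default parallelism for RDD operations.
         .config("spark.default.parallelism", 8)
         # Per-partition read-rate cap for the Spark Streaming Kafka integration.
         .config("spark.streaming.kafka.maxRatePerPartition", 200)
         .getOrCreate())
```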