
In this project, I work with a real-world dataset from Kaggle on San Francisco crime incidents and provide statistical analyses of the data using Apache Spark Structured Streaming. A Kafka server produces the data, which is then ingested through Spark Structured Streaming.
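
Reading the incident data from Kafka with Structured Streaming looks roughly like the sketch below. The broker address and topic name are placeholder assumptions, not values taken from this repository, and the spark-sql-kafka package must be on the classpath.

```python
from pyspark.sql import SparkSession

# Requires the spark-sql-kafka-0-10 package on the classpath.
spark = (
    SparkSession.builder
    .appName("SFCrimeStatistics")
    .getOrCreate()
)

# Subscribe to the incident topic; the broker address and topic name
# below are assumptions for illustration only.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sf.crime.police.calls")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers key/value as binary, so cast the payload to a string
# before parsing it against a JSON schema.
calls_df = df.selectExpr("CAST(value AS STRING)")
```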


onkargagare/San-Francisco-crime-statistics-and-spark-streaming


1. How did changing values of the SparkSession property parameters affect the throughput and latency of the data?

--> Changing these values raised or lowered processedRowsPerSecond, which directly reflects throughput and latency: since the next micro-batch is held until the current batch finishes processing, a higher processing rate drains batches faster and reduces end-to-end latency.
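
One way to observe this effect is to poll the streaming query's lastProgress report while the job runs. The sketch below is a self-contained example that uses Spark's built-in rate source so it runs without a Kafka broker; the rows-per-second value is an arbitrary placeholder.

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ThroughputProbe").getOrCreate()

# Any streaming source works for probing throughput; the built-in "rate"
# source is used here so the snippet runs without a Kafka broker.
stream = spark.readStream.format("rate").option("rowsPerSecond", 500).load()
query = stream.writeStream.format("console").start()

time.sleep(10)  # let a few micro-batches complete

progress = query.lastProgress  # dict for the most recent micro-batch, or None
if progress:
    print(progress["processedRowsPerSecond"])  # throughput of the last batch
    print(progress["inputRowsPerSecond"])      # arrival rate at the source

query.stop()
spark.stop()
```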

2. What were the 2-3 most efficient SparkSession property key/value pairs? After testing multiple variations of values, how can you tell these were optimal?

--> processedRowsPerSecond depends on these settings:

- spark.streaming.kafka.maxRatePerPartition: caps how many records are read per second from each Kafka partition
- spark.default.parallelism: sets the default number of partitions for distributed operations
- spark.sql.shuffle.partitions: sets the number of partitions used when shuffling data for aggregations and joins

A combination of values counted as more efficient when it produced a higher, stable processedRowsPerSecond.
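
These keys are set on the SparkSession builder, as in the minimal sketch below; the specific values are illustrative starting points for experimentation, not the tuned values from this project.

```python
from pyspark.sql import SparkSession

# The values below are illustrative assumptions to experiment with,
# not the tuned values from this project.
spark = (
    SparkSession.builder
    .appName("SFCrimeStatistics")
    .config("spark.streaming.kafka.maxRatePerPartition", "100")
    .config("spark.default.parallelism", "100")
    .config("spark.sql.shuffle.partitions", "10")
    .getOrCreate()
)
```

Raising parallelism helps only until the executor cores are saturated, so processedRowsPerSecond from the progress report is the signal for whether a given change actually helped.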
