Skip to content

fjbecerra/data-platform-enron-emails

Repository files navigation

Streaming Data Platform - Enron Emails V2 xml

ScreenShot


  • ENRON-CONNECTOR

Generates events feeding .ZIP files (.XML format) off a local folder linked to AWS ESB remotely. Transforms XML's document tag + the content of its .TXT file (regardless it's an txt attachment) into AVRO format in streaming and send them to Kafka.

Files are processed asynchronously, taking advantage of the Hazelcast's ExecuteService destributed. Allowing to scale it out as needed.


  • KAFKA & SCHEMA REGISTRY

Kafka streams Avro files. Each Avro represents an single email details + body message.

Schema Registry, tracks Avro schema versions. UI. http://ADVERTISER-IP-HOST>:3030


  • ENRON-CONSUMER on SPARK.

The streaming job running on Spark, feeds messages from Kafka. It uses the spark streaming api, gets rid of the .TXT attachments.

Calculates the number of word within the body message per RDD, cleaning static things, such as company signature, automatic words generated by the mail service, numbers, asteriks...

Figures out the email of each recipient and calculates the relevancy. 1 for TO and 0.5 for CC and BCC recipients.

Sinks the results in Cassandra, using the Cassandra connector for Spark.


  • CASSANDRA

Contains analitics tables.

mail table. Contains the mail details + number of words within a message in real time

select * from enron.mail;

recipients_state. Contains all the existing unique recipients + their total relevancy regardless of order, in real time.

select * from enron.recipients_state;

mail_word_avg Avarage length, in words, of the emails.

select * from enron.mail_word_avg;

recipients_total top 100 recipient email addresses

select * from enron.recipients_total;

  • ZEPPELIN

http://IP-HOST:8080

Data visualization, allowing business users, developers, analytics run their scripts.

To link to cassandra, go to Interpreter -> Cassandra, and set in Host your host ip.

To get started, create a note, click on the save button, in order to have available those interpreters. To start query, type %cassandra and underneath, the query.


Requiremnts:

  • JAVA 8, 7 and SCALA 2.10

  • Docker 1.+

  • Docker Compose 1.+

  • Apache maven 3.+

  • sshfs - mount a remote system -


Configuration:

  • Mounting a remote system to a local folder. - Description for AWS -

1.Mount an AWS ESB and restore the Enron Emails Snapshot.

2.In order to read those files in ESB we mount a folder in our local host remotely using sshfs.

$ sudo apt-get install sshfs

SSHFS is a very powerful tool but its commands are a bit naughty.

Login as Root user. Pay attention the full path of your public/private key and no slash at the end of your target local.

# sshfs -o identityfile=<full path of your private key>.pem <instance user>@<instance aws public dns>:/<aws esb mount path>/edrm-enron-v2/ /<target local folder>

  • Editing docker-compose.yml.

1.Avro schema registry needs an advertiser host, so, we replace the value of all the environment variables called KAFKA_SCHEMA_REGISTRY to http://ADVERTISER-IP-HOST:8081

2.For enron-connector to read files, in enron-connector container -> volumes, we replace the path prefix :/enron/input to the target local folder (you configured this at the sshfs step).


  • Building the jars. We have to build it separately due to enron-connector should be built with JAVA 8 and enron-consumer with JAVA 7 since it is written in SCALA 2.10.

(Assuming JAVA 8 is by default)

1.Enron connector:

$ mvn -f enron-connector\pom.xml clean install

2.Enron consumer:

$ export JAVA_HOME = "<JAVA 7 install folder location>"
$ mvn -f enron-consumer\pom.xml clean install
  • Running Everything

Execute the bash file with name start-data-platform.sh

$ ./start-data-platform.sh

###Results

Execute this job to calculate the result once the data has landed to cassandra. Update your local folder to be mounted if needed

docker exec -i -t spark-master bash spark-submit --class com.fjbecerra.sql.AggregationJob --master local[*] --driver-memory 1G --executor-memory 1G /app/enron-consumer-0.0.1-SNAPSHOT-jar-with-dependencies.jar 

Quetion 1. Average length, in words, of the emails

ScreenShot

Question 2. Top 100 recipient email addresses

ScreenShot

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published