chicagolang2016/go-data.slide

Data Science/Engineering with Go
ChicaGolang 2016, #GoDataScience, @dwhitena 

Daniel Whitenack
Data Scientist at Telnyx, Mentor with Thinkful
http://datadan.io/
@dwhitena

* Outline

What is data science (in practice)?

What challenges are facing the data science community?

How Go helps overcome these challenges.

Examples.

Get Involved / Contribute.

* What is data science?

* What is #datascience?

.image images/alphago.jpg
.caption _AlphaGo_ defeats world champion Go player, image via [[https://goo.gl/ZuveiF][The Guardian]]

* What is data science (really)?

.image images/time.jpg
.caption via [[http://goo.gl/RhuhFC][Forbes]]

* What is data science (really)?

The process of transforming collections of _data_ into actionable insights.

Sometimes that includes "deep learning," etc., but it almost always includes:

- Creative problem solving
- ETL
- Arithemetic 
- Building a robust data-driven service... _maybe_

* What challenges are facing the data science community?

* What challenges facing the DS community?

.image images/sadness.png
.caption via [[https://twitter.com/josh_wills][Josh Wills]], Head of Data Engineering, Slack

* What challenges are facing the DS community?

Conflicting mindsets contributing to the infinite loop of sadness:

- Data Scientists - Problem solving, arithemetic, nifty modeling, ...
- Data Engineers - Scaling, pipelines, ...
- Devops - Reliability, monitoring, automation, ...
- Business - Money.

* What challenges are facing the DS community?

As a result of the infinite loop of sadness:

- Data scientists work in isolation on their laptops, and their nifty models are never utilized.

- Data Engineers build scalable pipelines that move data, but the data is not interesting or useable by data scientists.

- Devops is afraid to release pipelines and/or models into production and lives in a constant state of frustration with "data people."

- Business people don't see any actual value coming from all this data work.

* How Go helps overcome these challenges

* How Go will NOT overcome these challenges

by morphing into R/Python,

by integrating with every other ML/DS or "big data" framework, or

by abandoning our [[https://go-proverbs.github.io/][Go Proverbs]]

- "Don't communicate by sharing memory, share memory by communicating"
- "A little copying is better than a little dependency"
- "The bigger the interface, the weaker the abstraction"
- ...

* How Go can help overcome these challenges

By doing data science/engineering with a unique Go mindset:

- unifying data scientists and data engineers around simple, efficient, and scalable components/consumers of data pipelines.

- unifying data scientists and devops around easily deployable and maintainable Go services.

- unifying data scientists and business people around cheap, efficient data-driven services that are valuable because they Go to production.

* Examples

* Example 1: Unifying data scientists and data engineers

* Example 1: Unifying data scientists and data engineers 

You may have seen this:

.image images/million.jpg
.caption via [[http://marcio.io/2015/07/handling-1-million-requests-per-minute-with-golang/][Marcio Castilho]], Principal Architect, Malwarebytes

* Example 1: Unifying data scientists and engineers

"A little copying is better than a little dependency."

We can do data science in the same way (and make data engineers happy):

	func (d *Dispatcher) dispatch() {
		for {
			select {
			case job := <-JobQueue:
				
				go func(job Job) {
					
					jobChannel := <-d.WorkerPool
					jobChannel <- job
				
				}(job)
			}
		}
	}
	
* Example 1: Unifying data scientists and data engineers

	func (w Worker) Start() {

		rt := RethinkMsg{}

		go func() {
			for {
				w.WorkerPool <- w.JobChannel
				select {
				case job := <-w.JobChannel:
					switch {
					case "event" == job.Payload.Type:
						_ := EventSuggest(job.Payload, rt)
					case "availability" == job.Payload.Type:
						_ := AvailabilitySuggest(job.Payload, rt)
					}
				case <-w.quit:
					return
				}
			}
		}()
	}

* Example 1: Unifying data scientists and data engineers

	func EventSuggest(pl Payload, rt RethinkMsger) error {

		// nifty ML stuff
		// ...

	}

	func AvailabilitySuggest(pl Payload, rt RethinkMsger) error {

		// nifty ML stuff
		// ...

	}

* Example 2: Unifying data scientists and devops

* Example 2: Unifying data scientists and devops

For deployment, start with this for python DS (from [[https://github.com/dataquestio/ds-containers][here]]):

	FROM dataquestio/ubuntu-base
	ENV TERM=xterm
	ENV LANG en_US.UTF-8
	RUN apt-get update -y && apt-get install build-essential -y
	ADD apt-packages.txt /tmp/apt-packages.txt
	RUN xargs -a /tmp/apt-packages.txt apt-get install -y
	RUN pip install virtualenv
	RUN /usr/local/bin/virtualenv /opt/ds --distribute
	ADD /requirements/ /tmp/requirements
	ADD python2/requirements.txt /tmp/requirements/additional-reqs.txt
	RUN /opt/ds/bin/pip install -r /tmp/requirements/pre-requirements.txt
	RUN /opt/ds/bin/pip install -r /tmp/requirements/requirements.txt
	RUN /opt/ds/bin/pip install -r /tmp/requirements/additional-reqs.txt
	RUN useradd --create-home --home-dir /home/ds --shell /bin/bash ds
	RUN chown -R ds /opt/ds
	RUN adduser ds sudo
	ADD run_ipython.sh /home/ds

Continued on the next slide...

* Example 2: Unifying data scientists and devops
	
	RUN chmod +x /home/ds/run_ipython.sh
	RUN chown ds /home/ds/run_ipython.sh
	ADD .bashrc.template /home/ds/.bashrc
	EXPOSE 8888
	RUN usermod -a -G sudo ds
	RUN echo "ds ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
	USER ds
	RUN mkdir -p /home/ds/notebooks
	ENV HOME=/home/ds
	ENV SHELL=/bin/bash
	ENV USER=ds
	VOLUME /home/ds/notebooks
	WORKDIR /home/ds/notebooks
	CMD ["/home/ds/run_ipython.sh"]

Wow...
Let's rethink this.

* Example 2: Unifying data scientists and devops

With Go:

	FROM scratch
	ADD myservice /myservice
	EXPOSE 8080
	CMD ["/myservice"]

* Example 2: Unifying data scientists and devops

More generally:

"Clear is better than clever."

"Don't just check errors, handle them gracefully."

"Design the architecture, name the components, document the details."

"Documentation is for users."

"Don't panic."

* Example 3: Unifying data scientists and business people

* Example 3: Unifying data scientists and business people

(see Examples 2 and 3)

* Get Involved / Contribute

* Get Involved / Contribute

.image images/goeco.jpg

* Get Involved / Contribute

[[https://github.com/gophergala2016/gophernotes][gophernotes]] - please try it out, submit pull requests, and/or contact me!

.image images/smaller.gif

* Get Involved / Contribute

[[https://github.com/gonum][gonum]] - numerical computing, statistics, visualization

[[https://github.com/sjwhitworth/golearn][GoLearn]] - general purpose ML

[[https://github.com/chrislusf/glow][glow]] - distributed computation system (map reduce)

[[http://pachyderm.io/][pachyderm]] - distributed, containerized data pipelines

[[https://github.com/hybridgroup/gobot/][GoBot]] - Go for sensors, IoT

[[https://github.com/kniren/gota][gota]] - DataFrames and data wrangling

(not to mention all the database, messaging, etc. things written in Go, such as BoltDB, InfluxDB, etc.)