thedataincubator · ZachGlassman · Oct 4, 2018
diff --git a/.dockerignore b/.dockerignore
@@ -0,0 +1,8 @@
+code/
+data/
+graphs/
+img/
+output/
+__pycache__
+*.pyc
+.git/
diff --git a/.gitignore b/.gitignore
@@ -4,3 +4,5 @@ code/secrets
 *.pyc
 .RData
 .Rhistory
+__pycache__/
+.pytest_cache/
diff --git a/.travis.yml b/.travis.yml
@@ -1,7 +1,16 @@
 language: node_js
 
+sudo: required
+
+services:
+  - docker
+
 node_js:
   - "node"
 
 install:
   - npm i markdown-spellcheck -g
+
+script:
+  - docker build -t tester .
+  - docker run -i tester /bin/bash -c "pytest -v --tests-per-worker auto"
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,14 @@
+FROM alpine:edge
+
+RUN apk update && apk add --no-cache \
+    python3 \
+    bash \
+    py3-lxml && \
+    python3 -m ensurepip
+
+COPY ./tests/requirements.txt /tmp/requirements.txt
+
+RUN pip3 install -qr /tmp/requirements.txt
+
+COPY . /src/
+WORKDIR /src
diff --git a/deep-learning-libraries.md b/deep-learning-libraries.md
@@ -17,7 +17,7 @@ The ranking is based on equally weighing its three components: Github (stars and
 `TensorFlow` is at least two standard deviations above the mean on all calculated metrics. `TensorFlow` has almost three times as many Github forks and more than six times as many Stack Overflow questions than the second most popular framework, `Caffe`. First open-sourced by the Google Brain team in 2015, `TensorFlow` has climbed over more senior libraries such as `Theano` (4) and `Torch` (8) for the top spot on our list. While `TensorFlow` is distributed with a Python API running on a C++ engine, several of the libraries on our list can utilize `TensorFlow` as a back-end and offer their own interfaces. These include `Keras` (2), which will [soon be part of core TensorFlow](https://twitter.com/fchollet/status/820746845068505088) and `Sonnet` (6). The popularity of `TensorFlow` is likely due to a combination of its general-purpose deep learning framework, flexible interface, good-looking computational graph visualizations, and Google’s significant developer and community resources.
 
 ## `Caffe` has yet to be replaced by `Caffe2`
-`Caffe` takes a strong third place on our list with more Github activity than all of its competitors (excluding `TensorFlow`). `Caffe` is traditionally thought of as more specialized than `Tensorflow` and was developed with a focus on image processing, objection recognition, and pre-trained convolutional neural networks. Facebook released `Caffe2` (11) in April 2017, and it already ranks in the top half the deep learning libraries. `Caffe2` is a more lightweight, modular, and scalable version of `Caffe` that includes recurrent neural networks. `Caffe` and `Caffe2` are separate repos, so data scientists can continue to use the original `Caffe`. However, there are migration tools such as [Caffe Translator](https://github.com/caffe2/caffe2/blob/master/caffe2/python/caffe_translator.py) that provide a means of using `Caffe2` to drive existing `Caffe` models.
+`Caffe` takes a strong third place on our list with more Github activity than all of its competitors (excluding `TensorFlow`). `Caffe` is traditionally thought of as more specialized than `TensorFlow` and was developed with a focus on image processing, object recognition, and pre-trained convolutional neural networks. Facebook released `Caffe2` (11) in April 2017, and it already ranks in the top half of the deep learning libraries. `Caffe2` is a more lightweight, modular, and scalable version of `Caffe` that includes recurrent neural networks. `Caffe` and `Caffe2` are separate repos, so data scientists can continue to use the original `Caffe`. However, there are migration tools such as [Caffe Translator](https://github.com/pytorch/pytorch/blob/master/caffe2/python/caffe_translator.py) that provide a means of using `Caffe2` to drive existing `Caffe` models.
 
 ## `Keras` is the most popular front-end for deep learning
 `Keras` (2) is highest ranked non-framework library. `Keras` can be used as a front-end for `TensorFlow` (1), `Theano` (4), `MXNet` (7), `CNTK` (9), or `deeplearning4j` (14). `Keras` performed better than average on all three metrics measured. The popularity of `Keras` is likely due to its simplicity and ease-of-use. `Keras` allows for fast prototyping at the cost of some of the flexibility and control that comes from working directly with a framework. `Keras` is favorited by data scientists experimenting with deep learning on their data sets. The development and popularity of `Keras` continues with R Studio recently releasing [an interface](https://keras.rstudio.com) in `R` for `Keras`.  

diff --git a/js-viz-packages.md b/js-viz-packages.md
@@ -48,7 +48,7 @@ The data presented a few difficulties:
 
 All source code and data is on [our Github Page](https://github.com/thedataincubator/data-science-blogs).
 
-We first generated a list of 141 Data Science packages [from](https://github.com/fasouto/awesome-dataviz') [these](https://github.com/wbkd/awesome-d3) [four](https://en.wikipedia.org/wiki/Comparison_of_JavaScript_charting_frameworks) [sources](https://cssauthor.com/javascript-charting-libraries), and then collected metrics for all of them, to come up with the ranking. Github data is based on both stars and forks, while Stack Overflow data is based on tags and questions containing the package name. Downloads data is from npmjs. Downloads were totaled over a six month period, and the compound monthly growth rate was calculated over the same period. After scraping other sites for JS visualization package names, we had gathered over 200 package names. Many of them were aliases for the same packages (d3, D3JS). If a the first result of Github search returned the same repo as another package, we treated them as the same package, but saved the aliases to search Stack Overflow questions. 
+We first generated a list of 141 Data Science packages [from](https://github.com/fasouto/awesome-dataviz) [these](https://github.com/wbkd/awesome-d3) [four](https://en.wikipedia.org/wiki/Comparison_of_JavaScript_charting_frameworks) [sources](https://cssauthor.com/javascript-charting-libraries), and then collected metrics for all of them to come up with the ranking. Github data is based on both stars and forks, while Stack Overflow data is based on tags and questions containing the package name. Downloads data is from npmjs. Downloads were totaled over a six-month period, and the compound monthly growth rate was calculated over the same period. After scraping other sites for JS visualization package names, we had gathered over 200 package names. Many of them were aliases for the same packages (d3, D3JS). If the first result of a Github search returned the same repo as another package, we treated them as the same package, but saved the aliases to search Stack Overflow questions.
 
 A few other notes:
 

diff --git a/python-packages.md b/python-packages.md
@@ -100,7 +100,7 @@ tables), and `shogun` (machine learning). They were all below average compared
 to the ranked packages, in all categories.
 
 Importantly,
-the [Anaconda distribution](https://www.continuum.io/anaconda-overview) bundles
+the [Anaconda distribution](https://www.anaconda.com/what-is-anaconda/) bundles
 together many of these packages, and this was not considered.
 
 Further, naturally, some packages that have been around longer will have higher

diff --git a/tests/requirements.txt b/tests/requirements.txt
@@ -0,0 +1,5 @@
+requests
+pytest
+beautifulsoup4
+markdown
+pytest-parallel
diff --git a/tests/test_links.py b/tests/test_links.py
@@ -0,0 +1,56 @@
+import os
+import re
+import codecs
+import pytest
+import requests
+from bs4 import BeautifulSoup
+import markdown
+
+def _get_files():
+    return [i for i in os.listdir()
+            if i.split('.')[-1] == 'md']
+
+def _parse_links(filename):
+    with codecs.open(filename,
+                     mode="r",
+                     encoding="utf-8") as input_file:
+        text = input_file.read()
+    soup = BeautifulSoup(markdown.markdown(text), "lxml")
+    return [link['href'] for link in soup.find_all('a', href=True)]
+
+def _valid(link):
+    if '.md' in link:
+        return False
+    if '.csv' in link:
+        return False
+    if '.' not in link:
+        return False
+    return True
+
+def _get_links_from_page(filename):
+    links = []
+    for link in _parse_links(filename):
+        if _valid(link):
+            links.append((filename, link))
+    return links
+
+def _get_links():
+    links = []
+    for filename in _get_files():
+        links += _get_links_from_page(filename)
+    return links
+
+@pytest.mark.parametrize("filename,link", _get_links())
+def test_link(filename, link):
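+    # Try the request up to three times. A network-level error counts as
+    # a failed attempt; the first response received ends the loop.
+    response = None
+    for _ in range(3):
+        try:
+            response = requests.get(link, timeout=10)
+            break
+        except requests.RequestException:
+            continue
+
+    assert response is not None, "no response from " + link
+    assert response.status_code == 200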