What's in this repo?

This repo supports a presentation I gave to the fall 2023 Multivariate Density Estimation course at Rice University. It covers some software tools I wish I'd known more about in grad school, including:

spark
docker
r packaging
ray
git/github
sql
automated testing

How does the repo work?

I split the presentation with another alumnus, so I covered only half of the topics above. The repo is a worked example of using spark, docker, and ray to explore the tax filings of nonprofits. Here are the main pieces:

Parsing the data

Run bash download_2023.sh to download the 2023 tax filings.
Use either python/parse_irs_xml.py or r/xml_benchmark.R to parse the data.
The point of these files is to show that there are low-effort ways to greatly speed up parsing, not to be exemplary parsing code.
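
As an illustration of the kind of low-effort speedup meant here, the sketch below parses many XML files in parallel with a process pool from the Python standard library. It is not the code in python/parse_irs_xml.py. The tag names (EIN, TaxYr), the data directory, and the function names are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

# Hypothetical tag names; swap in whichever fields you actually need.
WANTED_TAGS = {"EIN", "TaxYr"}


def local_name(tag):
    """Strip an XML namespace prefix of the form '{uri}Name'."""
    return tag.rsplit("}", 1)[-1]


def parse_filing(path):
    """Stream one XML file and keep the first text value of each wanted tag."""
    record = {"file": Path(path).name}
    for _, elem in ET.iterparse(path, events=("end",)):
        name = local_name(elem.tag)
        if name in WANTED_TAGS and name not in record:
            record[name] = (elem.text or "").strip()
        elem.clear()  # free memory for elements we have already seen
    return record


def parse_all(paths, workers=None):
    """Parse files in parallel, one file per task, across CPU cores."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_filing, paths, chunksize=64))


if __name__ == "__main__":
    # Assumes the downloaded filings sit in ./data as .xml files.
    files = sorted(str(p) for p in Path("data").glob("*.xml"))
    records = parse_all(files)
    print(f"parsed {len(records)} filings")
```

Each filing is independent of the others, so the work splits cleanly by file. Moving from a single loop to pool.map is a small change, and on a multi-core machine it is usually the cheapest speedup available.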

Further analysis using spark

Build an analysis docker image

docker build --rm -t my_docker ./docker

Launch a local standalone spark cluster and the analysis image

docker-compose -f spark/docker-compose.yml up

Start an analysis shell

docker exec -it spark-rclient-1 /bin/bash

The data will be in the app directory. Play with the data as desired. There are some examples in r/spark_exploration.

r-kosar/MVDE_RKOSAR

Folders and files

Latest commit

History

Repository files navigation

What's in this repo?

How does the repo work?

Parsing the data

Further analysis using spark

Build an analysis docker

Launch a local standalone spark cluster and the analysis image

Start an analysis shell

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages