For simplicity we will use the bash variables $DATA and $DPROC to hold the paths to the raw data directory and the directory for file-based processed data, respectively.
# pwd points to the root of our project repo
export DATA=$(pwd)/Data/Illinois-20200302-text/data
export DPROC=$(pwd)/Data/Processed
# decompress the archive next to the raw data so later steps can read it
xzcat $DATA/data.jsonl.xz > $DATA/data.jsonl
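Before running the full pipeline, one quick sanity check (an extra step we suggest, not part of the pipeline itself) is to stream a single record out of the archive without decompressing the whole thing:
# pretty-print just the first record straight from the compressed archive
xzcat $DATA/data.jsonl.xz | head -n 1 | jq '.'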
The compressed file contains records in JSON Lines (JSONL) format, meaning that each JSON object sits on its own line, delimited by a newline instead of a comma, i.e. instead of:
[
{"id": 0},
{"id": 1}
]
We have:
{"id": 0}
{"id": 1}
The JSONL format will cause issues in our jq filters, which expect a single array, so to convert it we can use the "slurp" (-s) feature of jq:
# slurp the newline-delimited records into one top-level JSON array
jq -s '.' $DATA/data.jsonl > $DPROC/data.json
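To verify the slurp worked, we can ask jq for the length of the resulting top-level array (the exact count will depend on the dataset snapshot you downloaded):
# count the records in the slurped array
jq 'length' $DPROC/data.json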
First, get the first record of the case data:
jq '.[0]' $DPROC/data.json > $DPROC/data_0.json
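Before generating a schema, it can be useful to glance at the record's top-level field names (an optional inspection step):
# list the top-level keys of the first case record
jq 'keys' $DPROC/data_0.json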
Then, after we've extracted the first case, let's generate a schema from it:
genson $DPROC/data_0.json > $DPROC/data_0_schema.json
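genson emits a standard JSON Schema document, so we can inspect it with jq as well; for an object schema the inferred field names live under .properties (this assumes genson's usual top-level layout):
# list the property names genson inferred for a case record
jq '.properties | keys' $DPROC/data_0_schema.json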
To speed up our ETL iteration it's a good idea to work with a smaller slice of the data set; for our project we've taken the first 200 records:
jq '.[:200]' $DPROC/data.json > $DPROC/data_first_200.json
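As a final sanity check, the sample should contain exactly 200 records (assuming the full file has at least that many):
# should print 200
jq 'length' $DPROC/data_first_200.json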