An ETL application that processes legislation data, deployed on Azure.
- Deploying locally processes files from the legislation_files folder and inserts them into a local SQL Server instance
- The Azure deployment processes files already present in a blob container and inserts them into a serverless Azure SQL instance
- legislation_files - contains JSON files with legislation data
- local - Docker files for building and running the local dockerised app
- trigger_etl - core functionality
  - azure_etl.py - Azure app entry point
  - etl.py - core functionality for both Azure and local deployment
- Legislation: represents a real-world legal document that contains rules and regulations.
- Legislation version: a specific version of a piece of legislation.
- Issuing body: the organization responsible for creating and issuing the legislation.
- Jurisdiction: the geographic area where a particular set of laws and regulations applies.
- Parts: the various components of legislation that organize and structure the document into sections, chapters, paragraphs, and clauses.
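The exact schema of the input files isn't documented here, but as a rough sketch (field names are illustrative assumptions, not the actual schema), a single legislation JSON file might map onto these entities like so:

```python
import json

# Illustrative sketch only: the field names are assumptions, not the real schema
# used by the ETL. It shows how one legislation JSON file might be shaped
# around the entities described above.
sample_legislation = {
    "title": "Consumption Tax Act",          # the legislation itself
    "version": "2023-04-01",                 # a specific legislation version
    "issuing_body": "National Diet",         # organisation that issued it
    "jurisdiction": "Japan",                 # where the law applies
    "parts": [                               # sections, chapters, paragraphs, clauses
        {
            "type": "section",
            "number": "1",
            "content": "This Act shall come into force on...",
        }
    ],
}

print(json.dumps(sample_legislation, indent=2))
```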
Bringing up SQL Server:
```bash
cd local
docker-compose up
```
Building the app image:
```bash
docker build -f local/Dockerfile -t cube-etl .
```
The image was developed on a Linux machine, so there is no guarantee that the ODBC setup script will work on Windows. If you hit any issues:
- Replace the setup_odbc.sh file with a suitable ODBC setup script (and update the Dockerfile line `RUN bash -c "./setup_odbc.sh"` accordingly), or
- Call the CLI directly with Python instead (see below)
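To sanity-check the ODBC setup before running the CLI, you can list the SQL Server drivers that pyodbc can see (this assumes pyodbc is installed in your environment):

```python
import pyodbc

# Print any installed ODBC drivers that look like SQL Server drivers;
# an empty list means the ODBC setup did not take effect.
print([d for d in pyodbc.drivers() if "SQL Server" in d])
```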
Running the app:
```bash
# Running the Docker container (volume for data access: -v <dir of data files>:/app/legislation_files)
docker run -v $(pwd)/legislation_files:/app/legislation_files -it cube-etl -P <password> -S <IP Address> -U sa

# Running without Docker
python3 local_etl_cli.py -P <password> -S <IP Address> -U sa -f legislation_files
```
Checking the app CLI params:
```bash
python3 local_etl_cli.py --help
docker run -it cube-etl --help
```
The API can be called either from a web browser or via an HTTP request, as demonstrated below.
To run the ETL (note: this takes ~5 minutes, so expect to wait about that long before receiving an HTTP response):
```python
import requests

url = "https://cube-etl.azurewebsites.net/api/trigger_etl?function=etl"
response = requests.get(url)
print(response.content)
```
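Since the call blocks for roughly five minutes, it can be worth setting an explicit client-side timeout so the script fails cleanly if the service never responds; the value below is just an illustration, not a documented requirement:

```python
import requests

url = "https://cube-etl.azurewebsites.net/api/trigger_etl?function=etl"
# Allow a generous timeout for the ~5 minute ETL run (illustrative value).
response = requests.get(url, timeout=600)
print(response.content)
```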
To drop the database (~10 seconds):
```python
import requests

url = "https://cube-etl.azurewebsites.net/api/trigger_etl?function=drop"
response = requests.get(url)
print(response.content)
```
To say hello:
```python
import requests

url = "https://cube-etl.azurewebsites.net/api/trigger_etl?name=John"
response = requests.get(url)
print(response.content)
```
- search_legislation (at least one parameter required): searches legislation parts based on four optional criteria:
- Jurisdiction (optional)
- IssuingBody (optional)
- Title (optional)
- Content (optional)
```sql
EXEC search_legislation
    @Jurisdiction = 'Japan',
    @Content = 'Enforcement'
```
- search_any: searches legislation for a substring in the following fields: legislation title, jurisdiction name, issuing body name, parts content.
- Search
```sql
EXEC search_any
    @Search = 'tax'
```
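For reference, here is a minimal sketch of calling these procedures from Python with pyodbc. The connection details are placeholders and the driver name assumes the Microsoft ODBC Driver 17 for SQL Server is installed; adjust for your deployment:

```python
import pyodbc

# Placeholder connection details; swap in your own server, database, and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<IP Address>;DATABASE=<database>;UID=sa;PWD=<password>"
)
cursor = conn.cursor()

# Search parts by jurisdiction and a content substring.
cursor.execute("EXEC search_legislation @Jurisdiction = ?, @Content = ?", "Japan", "Enforcement")
for row in cursor.fetchall():
    print(row)

# Substring search across title, jurisdiction, issuing body, and parts content.
cursor.execute("EXEC search_any @Search = ?", "tax")
for row in cursor.fetchall():
    print(row)

conn.close()
```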
- Unit and integration testing
- Move DB credentials to environment variables (create config files for other hard-coded vars)
- Event trigger for the ETL when new files are placed in the blob container
- Possibly move processed files into a different container
- Data validation for input JSON files
- Power BI dashboard for insights
- Look for any inefficiencies (container/Azure Function memory, query performance, etc.)
- CI/CD (GitHub Actions / Jenkins)
- Develop / automate infrastructure deployment with Terraform