new pipeline framework and mnist scenario (#6)
* preprocessing and model saving

* pipelines

* delete configuration

* gitignore

* mnist scenario

* aci related changes for supporting pipelines

* remove query configuration

* updated policy

* mnist logic

* mnist changes for aci

* readme files

* readme

* contract service URL

* sample reference

* batch size change
* batch size change
kapilvgit authored Jan 5, 2024
1 parent debf5a3 commit fc99972
Showing 55 changed files with 2,057 additions and 770 deletions.
46 changes: 42 additions & 4 deletions README.md
@@ -1,21 +1,59 @@
# DEPA for Training

[DEPA for Training](https://depa.world) is a techno-legal framework that enables privacy-preserving sharing of bulk, de-identified datasets for large scale analytics and training. This repository contains a reference implementation of [Confidential Clean Rooms](https://depa.world/training/confidential_clean_room_design), which together with the [Contract Service](https://github.com/kapilvgit/contract-ledger/tree/main), forms the basis of this framework. The repository also includes a [sample training scenario](./scenarios/covid/README.md) that can be deployed using the DEPA Training Framework. The reference implementation is provided on an As-Is basis. It is work-in-progress and should not be used in production.
[DEPA for Training](https://depa.world) is a techno-legal framework that enables privacy-preserving sharing of bulk, de-identified datasets for large scale analytics and training. This repository contains a reference implementation of [Confidential Clean Rooms](https://depa.world/training/confidential_clean_room_design), which together with the [Contract Service](https://github.com/kapilvgit/contract-ledger/tree/main), forms the basis of this framework. The reference implementation is provided on an As-Is basis. It is work-in-progress and should not be used in production.

# Getting Started

Clone this repo as follows, and follow [instructions](./scenarios/covid/README.md) to deploy a sample CCR.
## GitHub Codespaces

The simplest way to set up a development environment is using [GitHub Codespaces](https://github.com/codespaces). The repository includes a [devcontainer.json](./.devcontainer/devcontainer.json), which customizes your codespace to install all required dependencies. Please ensure you allocate at least 64GB of disk space in your codespace. Also, run the following command in the codespace to update submodules.

```bash
git submodule update --init --recursive
```

## Local Development Environment

Alternatively, you can build and develop locally in a Linux environment (we have tested with Ubuntu 20.04 and 22.04), or on Windows with WSL 2. Install the following dependencies.

- [docker](https://docs.docker.com/engine/install/ubuntu/) and docker-compose. After installing docker, add your user to the docker group using `sudo usermod -aG docker $USER`, and log back in to a shell.
- make (install using ```sudo apt-get install make```)
- Python 3.6.9 and pip
- Python wheel package (install using ```pip install wheel```)

Clone this repo as follows.

```bash
git clone --recursive https://github.com/iSPIRT/depa-training
```

You can also use GitHub Codespaces to create a development environment. Please ensure you allocate at least 64GB of disk space in your codespace. Also, run the following command in the codespace to update submodules.
## Build CCR containers

To build your own CCR container images, use the following command from the root of the repository.

```bash
./ci/build.sh
```

This script builds the following containers.

- ```depa-training```: Container with the core CCR logic for joining datasets and running differentially private training.
- ```depa-training-encfs```: Container for loading encrypted data into the CCR.

Alternatively, you can use pre-built container images from the ```ispirt``` repository by setting the following environment variable.
```bash
export CONTAINER_REGISTRY=ispirt
```
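To illustrate the intent of this variable, here is a hypothetical helper (not part of the repository's scripts) showing how an image reference could be resolved: when `CONTAINER_REGISTRY` is unset, locally built images are used; when set to `ispirt`, pre-built images from that registry are referenced instead.

```python
import os

def image_ref(name, tag="latest"):
    # Prefix the image name with CONTAINER_REGISTRY when it is set;
    # otherwise fall back to the locally built image name.
    registry = os.environ.get("CONTAINER_REGISTRY", "")
    return f"{registry}/{name}:{tag}" if registry else f"{name}:{tag}"
```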

# Scenarios

This repository contains two samples that illustrate the kinds of scenarios DEPA for Training can support.

- [Training a differentially private COVID prediction model on private datasets](./scenarios/covid/README.md)
- [Convolutional Neural Network training on MNIST dataset](./scenarios/mnist/README.md)

Follow these links to build and deploy these scenarios.

# Contributing

This project welcomes feedback and contributions. Before you start, please take a moment to review our [Contribution Guidelines](./CONTRIBUTING.md). These guidelines provide information on how to contribute, set up your development environment, and submit your changes.
7 changes: 5 additions & 2 deletions ci/Dockerfile.train
@@ -17,6 +17,9 @@ RUN pip3 --default-timeout=1000 install pyspark pandas opacus onnx onnx2pytorch

RUN apt-get install -y jq

COPY train/ccr_join.py ccr_join.py
COPY train/ccr_train.py ccr_train.py
# Install contract ledger client
COPY train/dist/pytrain-0.0.1-py3-none-any.whl .
RUN pip3 install pytrain-0.0.1-py3-none-any.whl

# Install script to run training
COPY train/run.sh run.sh
5 changes: 5 additions & 0 deletions ci/build.sh
@@ -1,5 +1,10 @@
#!/bin/bash

# Build pytrain
pushd src/train
python3 setup.py bdist_wheel
popd

# Build training container
docker build -f ci/Dockerfile.train src -t depa-training:latest

34 changes: 6 additions & 28 deletions scenarios/covid/README.md
@@ -11,43 +11,21 @@ The end-to-end training pipeline consists of the following phases.
5. Deployment and execution of CCR
6. Model decryption

## Pre-requisites
## Build container images

### GitHub Codespaces

The simplest way to set up a development environment is using [GitHub Codespaces](https://github.com/codespaces). The repository includes a [devcontainer.json](../../.devcontainer/devcontainer.json), which customizes your codespace to install all required dependencies.

### Local Development Environment

Alternatively, you can deploy the sample locally on Linux (we have tested with Ubuntu 20.04), or on Windows with WSL 2. You will need to install the following dependencies.

- [docker](https://docs.docker.com/engine/install/ubuntu/) and docker-compose. After installing docker, add your user to the docker group using `sudo usermod -aG docker $USER`, and log back in to a shell.
- make (install using ```sudo apt-get install make```)
- Python 3.6.9 and pip
- Python wheel package (install using ```pip install wheel```)

## Build CCR containers

To build your own CCR container images, use the following command from the root of the repository.
Build container images required for this sample as follows.

```bash
cd scenarios/covid
./ci/build.sh
```

These scripts build the following containers.
This script builds the following container images.

- ```depa-training```: Container with the core CCR logic for joining datasets and running differentially private training.
- ```depa-training-encfs```: Container for loading encrypted data into the CCR.
- ```preprocess-icmr, preprocess-cowin, preprocess-index```: Containers that pre-process and de-identify datasets.
- ```ccr-model-save```: Container that saves the model to be trained in ONNX format.

Alternatively, you can use pre-built container images from the ```ispirt``` repository by setting the following environment variable.
```bash
export CONTAINER_REGISTRY=ispirt
```

## Data pre-processing and de-identification

The folder ```scenarios/covid/data``` contains three sample training datasets. Acting as training data providers (TDPs) for these datasets, run the following scripts to de-identify the datasets.
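The de-identification itself is done by the preprocessing containers. As a minimal sketch of the general idea (not the actual preprocessing code; the `deidentify` helper is hypothetical), a joining key such as `pk_mobno` can be replaced by a hashed column like the `pk_mobno_hashed` key referenced in the pipeline configuration:

```python
import hashlib

def deidentify(record, key_columns):
    # Replace each joining key with a SHA-256 hash of its value and
    # drop the cleartext value, so records can still be joined on the
    # hashed key without exposing the identifier itself.
    out = dict(record)
    for col in key_columns:
        value = str(out.pop(col)).encode()
        out[col + "_hashed"] = hashlib.sha256(value).hexdigest()
    return out

row = {"pk_mobno": "9876543210", "icmr_test_result": 1}
anon = deidentify(row, ["pk_mobno"])
```

Note that hashing alone is not sufficient de-identification for low-entropy identifiers; the sample's preprocessing steps also drop and generalize other columns, as listed in the pipeline configuration.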
@@ -76,7 +54,7 @@ Assuming you have cleartext access to all the de-identified datasets, you can tr
```bash
./train.sh
```
The script joins the datasets using a configuration defined in [query_config.json](./config/query_config.json) and trains the model using a configuration defined in [model_config.json](./config/model_config.json). If all goes well, you should see output similar to the following output, and the trained model will be saved under the folder `/tmp/output`.
The script joins the datasets and trains the model using a pipeline configuration defined in [pipeline_config.json](./config/pipeline_config.json). If all goes well, you should see output similar to the following output, and the trained model will be saved under the folder `/tmp/output`.

```
docker-train-1 | {'input_dataset_path': '/tmp/sandbox_icmr_cowin_index_without_key_identifiers.csv', 'saved_model_path': '/mnt/remote/model/model.onnx', 'saved_model_optimizer': '/mnt/remote/model/dpsgd_model_opimizer.pth', 'saved_weights_path': '', 'batch_size': 2, 'total_epochs': 5, 'max_grad_norm': 0.1, 'epsilon_threshold': 1.0, 'delta': 0.01, 'sample_size': 60000, 'target_variable': 'icmr_a_icmr_test_result', 'test_train_split': 0.2, 'metrics': ['accuracy', 'precision', 'recall']}
@@ -152,7 +130,7 @@ cd scenarios/covid/data

### Sign and Register Contract

Next, follow instructions [here](./../../external/contract-ledger/README.md) to sign and register a contract the contract service. The registered contract must contain references to the datasets with matching names, keyIDs and Azure Key Vault endpoints used in this sample. A sample contract is provided [here](https://github.com/kapilvgit/contract-ledger/blob/main/demo/contract/contract.json). After signing and registering the contract, retain the contract service URL and sequence number of the contract for the rest of this sample.
Next, follow instructions [here](./../../external/contract-ledger/README.md) to sign and register a contract with the contract service. You can either deploy your own contract service or use a test contract service hosted at ```https://contract-service.westeurope.cloudapp.azure.com:8000```. The registered contract must contain references to the datasets with matching names, keyIDs and Azure Key Vault endpoints used in this sample. A sample contract template for this scenario is provided [here](./contract/contract.json). After updating, signing and registering the contract, retain the contract service URL and sequence number of the contract for the rest of this sample.

### Import encryption keys

20 changes: 0 additions & 20 deletions scenarios/covid/config/model_config.json

This file was deleted.

191 changes: 191 additions & 0 deletions scenarios/covid/config/pipeline_config.json
@@ -0,0 +1,191 @@
{
"pipeline": [
{
"name": "Join",
"config": {
"datasets": [
{
"id": "19517ba8-bab8-11ed-afa1-0242ac120002",
"name": "icmr",
"file": "dp_icmr_standardised_anon.csv",
"select_variables": [
"icmr_test_result",
"fk_genetic_strain",
"test_ct_value",
"sample_genetic_sequenced"
],
"mount_path": "/mnt/remote/icmr/"
},
{
"id": "216d5cc6-bab8-11ed-afa1-0242ac120002",
"name": "cowin",
"file": "dp_cowin_standardised_anon.csv",
"select_variables": [
"age",
"vaccine_name"
],
"mount_path": "/mnt/remote/cowin/"
},
{
"id": "2830a144-bab8-11ed-afa1-0242ac120002",
"name": "index",
"file": "dp_index_standardised_anon.csv",
"select_variables": [
"pasymp",
"dsinfection",
"potravel",
"page"
],
"mount_path": "/mnt/remote/index/"
}
],
"joined_dataset": {
"joined_dataset": "sandbox_icmr_cowin_index_without_key_identifiers.csv",
"joining_query": "select * from icmr, index, cowin where index.pk_mobno_hashed == icmr.pk_mobno_hashed and index.pk_mobno_hashed == cowin.pk_mobno_hashed",
"joining_key": "pk_mobno_hashed",
"model_output_folder": "/tmp/",
"drop_columns": [
"pk_icmrno",
"pk_mobno",
"ref_srfno",
"ref_index_id",
"ref_tr_index_id",
"fk_pname",
"ref_paddress",
"cowin_beneficiary_name"
],
"identifiers": [
"pk_mobno_hashed",
"ref_srfno_hashed",
"pk_icmrno_hashed",
"index_idcpatient",
"index_pname",
"index_paddress",
"index_phcname",
"pk_icmrno_hashed",
"pk_mobno_hashed",
"ref_bucode_hashed",
"cowin_beneficiary_name",
"cowin_d1_vaccinated_at",
"cowin_d2_vaccinated_at",
"pk_beneficiary_id_hashed",
"ref_uhid_hashed",
"pk_mobno_hashed",
"ref_id_verified_hashed",
"ref_index_id_hashed",
"fk_pname_hashed",
"fk_cpatient_hashed",
"ref_tr_index_id_hashed",
"ref_pahospital_code_hashed",
"ref_tpahospital_code_hashed",
"ref_paddress_hashed",
"fk_icmr_labid_hashed",
"index_labcode",
"ref_labid",
"index_pgender",
"index_pstate",
"index_pdistrict",
"index_plocation",
"index_pzone",
"index_pward",
"index_ptaluka",
"index_astatus",
"index_anumber",
"index_adname",
"index_adnumber",
"index_stambulance",
"index_stdate",
"index_audate",
"index_cdate",
"index_pcdate",
"index_adddate",
"index_bmdate",
"index_moddate",
"index_admdate",
"index_movdate",
"index_disdate",
"index_pudate",
"index_cudate",
"index_distcode",
"index_distrefno",
"index_ptype",
"ref_labid_hashed",
"index_fdate",
"index_todate",
"index_apcnumber",
"index_pbtype",
"index_pbquota",
"icmr_pupdate",
"index_bucode",
"index_trbucode",
"index_commstatus",
"index_commby",
"index_dristatus",
"index_driby",
"index_vehnumber",
"index_hosstatus",
"index_hosby",
"index_padone",
"index_padate",
"index_pstatus",
"index_dsummary",
"index_statusuby",
"index_ureason",
"index_usummary",
"index_padmitted",
"index_hcode",
"index_pahospital",
"index_tpahospital",
"index_htype",
"index_pbedcode",
"index_labname",
"index_pid",
"cowin_dose_1_date",
"cowin_d1_vaccinated_by",
"cowin_dose_2_date",
"cowin_d2_vaccinated_by",
"cowin_pupdate",
"cowin_gender",
"index_remarks",
"icmr_a_icmr_test_type"
],
"joined_result_columns": [
"icmr_a_icmr_test_result",
"icmr_a_test_ct_value",
"icmr_a_sample_genetic_sequenced",
"fk_genetic_strain",
"index_pasymp",
"index_dsinfection",
"index_potravel",
"index_page",
"cowin_age",
"cowin_vaccine_name"
]
}
}
},
{
"name": "PrivateTrain",
"config": {
"input_dataset_path": "/tmp/sandbox_icmr_cowin_index_without_key_identifiers.csv",
"saved_model_path": "/mnt/remote/model/model.onnx",
"saved_model_optimizer": "/mnt/remote/model/dpsgd_model_opimizer.pth",
"trained_model_output_path": "/mnt/remote/output/model.onnx",
"saved_weights_path": "",
"batch_size": 2,
"total_epochs": 5,
"max_grad_norm": 0.1,
"epsilon_threshold": 1.0,
"delta": 0.01,
"sample_size": 60000,
"target_variable": "icmr_a_icmr_test_result",
"test_train_split": 0.2,
"metrics": [
"accuracy",
"precision",
"recall"
]
}
}
]
}
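The structure of this file — a top-level `"pipeline"` list of named steps, each carrying its own `"config"` — suggests how the runner processes it. The following is an illustrative sketch only (the real runner lives in the `pytrain` package, and these handler names and signatures are assumptions, not its actual API):

```python
import json

def run_join(config):
    # Placeholder for the dataset-join stage.
    return f"joined {len(config['datasets'])} datasets"

def run_private_train(config):
    # Placeholder for the differentially private training stage.
    return f"trained with batch_size={config['batch_size']}"

# Map step names from the config to their handlers.
HANDLERS = {"Join": run_join, "PrivateTrain": run_private_train}

def run_pipeline(config_text):
    # Walk the "pipeline" list in order, dispatching each step's
    # "config" object to the handler registered for its "name".
    results = []
    for step in json.loads(config_text)["pipeline"]:
        results.append(HANDLERS[step["name"]](step["config"]))
    return results
```

Keeping both the join query and the training hyperparameters (batch size, gradient clipping norm, epsilon/delta budget) in one declarative file lets the contract reference a single pipeline configuration for the whole clean-room run.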
