new pipeline framework and mnist scenario (#6)
* preprocessing and model saving

* pipelines

* delete configuration

* gitignore

* mnist scenario

* aci related changes for supporting pipelines

* remove query configuration

* updated policy

* mnist logic

* mnist changes for aci

* readme files

* readme

* contract service URL

* sample reference

* batch size change
* batch size change
kapilvgit authored Jan 5, 2024
1 parent debf5a3 commit fc99972
Showing 55 changed files with 2,057 additions and 770 deletions.
46 changes: 42 additions & 4 deletions README.md
@@ -1,21 +1,59 @@
# DEPA for Training

[DEPA for Training](https://depa.world) is a techno-legal framework that enables privacy-preserving sharing of bulk, de-identified datasets for large scale analytics and training. This repository contains a reference implementation of [Confidential Clean Rooms](https://depa.world/training/confidential_clean_room_design), which together with the [Contract Service](https://github.com/kapilvgit/contract-ledger/tree/main), forms the basis of this framework. The repository also includes a [sample training scenario](./scenarios/covid/README.md) that can be deployed using the DEPA Training Framework. The reference implementation is provided on an As-Is basis. It is work-in-progress and should not be used in production.
[DEPA for Training](https://depa.world) is a techno-legal framework that enables privacy-preserving sharing of bulk, de-identified datasets for large scale analytics and training. This repository contains a reference implementation of [Confidential Clean Rooms](https://depa.world/training/confidential_clean_room_design), which together with the [Contract Service](https://github.com/kapilvgit/contract-ledger/tree/main), forms the basis of this framework. The reference implementation is provided on an As-Is basis. It is work-in-progress and should not be used in production.

# Getting Started

Clone this repo as follows, and follow [instructions](./scenarios/covid/README.md) to deploy a sample CCR.
## GitHub Codespaces

The simplest way to set up a development environment is using [GitHub Codespaces](https://github.com/codespaces). The repository includes a [devcontainer.json](./.devcontainer/devcontainer.json), which customizes your codespace to install all required dependencies. Please ensure you allocate at least 64GB of disk space in your codespace. Also, run the following command in the codespace to update submodules.

```bash
git submodule update --init --recursive
```

## Local Development Environment

Alternatively, you can build and develop locally in a Linux environment (we have tested with Ubuntu 20.04 and 22.04), or on Windows with WSL 2. Install the following dependencies.

- [docker](https://docs.docker.com/engine/install/ubuntu/) and docker-compose. After installing docker, add your user to the docker group using `sudo usermod -aG docker $USER`, and log back in to a shell.
- make (install using ```sudo apt-get install make```)
- Python 3.6.9 and pip
- Python wheel package (install using ```pip install wheel```)

Clone this repo as follows.

```bash
git clone --recursive https://github.com/iSPIRT/depa-training
```

You can also use GitHub Codespaces to create a development environment. Please ensure you allocate at least 64GB of disk space in your codespace. Also, run the following command in the codespace to update submodules.
## Build CCR containers

To build your own CCR container images, use the following command from the root of the repository.

```bash
./ci/build.sh
```

This script builds the following containers.

- ```depa-training```: Container with the core CCR logic for joining datasets and running differentially private training.
- ```depa-training-encfs```: Container for loading encrypted data into the CCR.

Alternatively, you can use pre-built container images from the ```ispirt``` repository by setting the following environment variable.
```bash
export CONTAINER_REGISTRY=ispirt
```
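To illustrate the intent of this variable, here is a hypothetical helper (not part of the repository's scripts) showing how an image reference could be resolved: when `CONTAINER_REGISTRY` is unset, locally built images are used; when set to `ispirt`, pre-built images from that registry are referenced instead.

```python
import os

def image_ref(name, tag="latest"):
    # Prefix the image name with CONTAINER_REGISTRY when it is set;
    # otherwise fall back to the locally built image name.
    registry = os.environ.get("CONTAINER_REGISTRY", "")
    return f"{registry}/{name}:{tag}" if registry else f"{name}:{tag}"
```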

# Scenarios

This repository contains two samples that illustrate the kinds of scenarios DEPA for Training can support.

- [Training a differentially private COVID prediction model on private datasets](./scenarios/covid/README.md)
- [Convolutional Neural Network training on MNIST dataset](./scenarios/mnist/README.md)

Follow these links to build and deploy these scenarios.

# Contributing

This project welcomes feedback and contributions. Before you start, please take a moment to review our [Contribution Guidelines](./CONTRIBUTING.md). These guidelines provide information on how to contribute, set up your development environment, and submit your changes.
7 changes: 5 additions & 2 deletions ci/Dockerfile.train
@@ -17,6 +17,9 @@ RUN pip3 --default-timeout=1000 install pyspark pandas opacus onnx onnx2pytorch

RUN apt-get install -y jq

COPY train/ccr_join.py ccr_join.py
COPY train/ccr_train.py ccr_train.py
# Install contract ledger client
COPY train/dist/pytrain-0.0.1-py3-none-any.whl .
RUN pip3 install pytrain-0.0.1-py3-none-any.whl

# Install script to run training
COPY train/run.sh run.sh
5 changes: 5 additions & 0 deletions ci/build.sh
@@ -1,5 +1,10 @@
#!/bin/bash

# Build pytrain
pushd src/train
python3 setup.py bdist_wheel
popd

# Build training container
docker build -f ci/Dockerfile.train src -t depa-training:latest

34 changes: 6 additions & 28 deletions scenarios/covid/README.md
@@ -11,43 +11,21 @@ The end-to-end training pipeline consists of the following phases.
5. Deployment and execution of CCR
6. Model decryption

## Pre-requisites
## Build container images

### GitHub Codespaces

The simplest way to set up a development environment is using [GitHub Codespaces](https://github.com/codespaces). The repository includes a [devcontainer.json](../../.devcontainer/devcontainer.json), which customizes your codespace to install all required dependencies.

### Local Development Environment

Alternatively, you can deploy the sample locally on Linux (we have tested with Ubuntu 20.04), or on Windows with WSL 2. You will need to install the following dependencies.

- [docker](https://docs.docker.com/engine/install/ubuntu/) and docker-compose. After installing docker, add your user to the docker group using `sudo usermod -aG docker $USER`, and log back in to a shell.
- make (install using ```sudo apt-get install make```)
- Python 3.6.9 and pip
- Python wheel package (install using ```pip install wheel```)

## Build CCR containers

To build your own CCR container images, use the following command from the root of the repository.
Build container images required for this sample as follows.

```bash
cd scenarios/covid
./ci/build.sh
```

These scripts build the following containers.
This script builds the following container images.

- ```depa-training```: Container with the core CCR logic for joining datasets and running differentially private training.
- ```depa-training-encfs```: Container for loading encrypted data into the CCR.
- ```preprocess-icmr, preprocess-cowin, preprocess-index```: Containers that pre-process and de-identify datasets.
- ```ccr-model-save```: Container that saves the model to be trained in ONNX format.

Alternatively, you can use pre-built container images from the ```ispirt``` repository by setting the following environment variable.
```bash
export CONTAINER_REGISTRY=ispirt
```

## Data pre-processing and de-identification

The folder ```scenarios/covid/data``` contains three sample training datasets. Acting as training data providers (TDPs) for these datasets, run the following scripts to de-identify the datasets.
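The de-identification itself is done by the preprocessing containers. As a minimal sketch of the general idea (not the actual preprocessing code; the `deidentify` helper is hypothetical), a joining key such as `pk_mobno` can be replaced by a hashed column like the `pk_mobno_hashed` key referenced in the pipeline configuration:

```python
import hashlib

def deidentify(record, key_columns):
    # Replace each joining key with a SHA-256 hash of its value and
    # drop the cleartext value, so records can still be joined on the
    # hashed key without exposing the identifier itself.
    out = dict(record)
    for col in key_columns:
        value = str(out.pop(col)).encode()
        out[col + "_hashed"] = hashlib.sha256(value).hexdigest()
    return out

row = {"pk_mobno": "9876543210", "icmr_test_result": 1}
anon = deidentify(row, ["pk_mobno"])
```

Note that hashing alone is not sufficient de-identification for low-entropy identifiers; the sample's preprocessing steps also drop and generalize other columns, as listed in the pipeline configuration.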
@@ -76,7 +54,7 @@ Assuming you have cleartext access to all the de-identified datasets, you can tr
```bash
./train.sh
```
The script joins the datasets using a configuration defined in [query_config.json](./config/query_config.json) and trains the model using a configuration defined in [model_config.json](./config/model_config.json). If all goes well, you should see output similar to the following output, and the trained model will be saved under the folder `/tmp/output`.
The script joins the datasets and trains the model using a pipeline configuration defined in [pipeline_config.json](./config/pipeline_config.json). If all goes well, you should see output similar to the following output, and the trained model will be saved under the folder `/tmp/output`.

```
docker-train-1 | {'input_dataset_path': '/tmp/sandbox_icmr_cowin_index_without_key_identifiers.csv', 'saved_model_path': '/mnt/remote/model/model.onnx', 'saved_model_optimizer': '/mnt/remote/model/dpsgd_model_opimizer.pth', 'saved_weights_path': '', 'batch_size': 2, 'total_epochs': 5, 'max_grad_norm': 0.1, 'epsilon_threshold': 1.0, 'delta': 0.01, 'sample_size': 60000, 'target_variable': 'icmr_a_icmr_test_result', 'test_train_split': 0.2, 'metrics': ['accuracy', 'precision', 'recall']}
@@ -152,7 +130,7 @@ cd scenarios/covid/data

### Sign and Register Contract

Next, follow instructions [here](./../../external/contract-ledger/README.md) to sign and register a contract the contract service. The registered contract must contain references to the datasets with matching names, keyIDs and Azure Key Vault endpoints used in this sample. A sample contract is provided [here](https://github.com/kapilvgit/contract-ledger/blob/main/demo/contract/contract.json). After signing and registering the contract, retain the contract service URL and sequence number of the contract for the rest of this sample.
Next, follow instructions [here](./../../external/contract-ledger/README.md) to sign and register a contract with the contract service. You can either deploy your own contract service or use a test contract service hosted at ```https://contract-service.westeurope.cloudapp.azure.com:8000```. The registered contract must contain references to the datasets with matching names, keyIDs and Azure Key Vault endpoints used in this sample. A sample contract template for this scenario is provided [here](./contract/contract.json). After updating, signing and registering the contract, retain the contract service URL and sequence number of the contract for the rest of this sample.

### Import encryption keys

20 changes: 0 additions & 20 deletions scenarios/covid/config/model_config.json

This file was deleted.

191 changes: 191 additions & 0 deletions scenarios/covid/config/pipeline_config.json
@@ -0,0 +1,191 @@
{
"pipeline": [
{
"name": "Join",
"config": {
"datasets": [
{
"id": "19517ba8-bab8-11ed-afa1-0242ac120002",
"name": "icmr",
"file": "dp_icmr_standardised_anon.csv",
"select_variables": [
"icmr_test_result",
"fk_genetic_strain",
"test_ct_value",
"sample_genetic_sequenced"
],
"mount_path": "/mnt/remote/icmr/"
},
{
"id": "216d5cc6-bab8-11ed-afa1-0242ac120002",
"name": "cowin",
"file": "dp_cowin_standardised_anon.csv",
"select_variables": [
"age",
"vaccine_name"
],
"mount_path": "/mnt/remote/cowin/"
},
{
"id": "2830a144-bab8-11ed-afa1-0242ac120002",
"name": "index",
"file": "dp_index_standardised_anon.csv",
"select_variables": [
"pasymp",
"dsinfection",
"potravel",
"page"
],
"mount_path": "/mnt/remote/index/"
}
],
"joined_dataset": {
"joined_dataset": "sandbox_icmr_cowin_index_without_key_identifiers.csv",
"joining_query": "select * from icmr, index, cowin where index.pk_mobno_hashed == icmr.pk_mobno_hashed and index.pk_mobno_hashed == cowin.pk_mobno_hashed",
"joining_key": "pk_mobno_hashed",
"model_output_folder": "/tmp/",
"drop_columns": [
"pk_icmrno",
"pk_mobno",
"ref_srfno",
"ref_index_id",
"ref_tr_index_id",
"fk_pname",
"ref_paddress",
"cowin_beneficiary_name"
],
"identifiers": [
"pk_mobno_hashed",
"ref_srfno_hashed",
"pk_icmrno_hashed",
"index_idcpatient",
"index_pname",
"index_paddress",
"index_phcname",
"pk_icmrno_hashed",
"pk_mobno_hashed",
"ref_bucode_hashed",
"cowin_beneficiary_name",
"cowin_d1_vaccinated_at",
"cowin_d2_vaccinated_at",
"pk_beneficiary_id_hashed",
"ref_uhid_hashed",
"pk_mobno_hashed",
"ref_id_verified_hashed",
"ref_index_id_hashed",
"fk_pname_hashed",
"fk_cpatient_hashed",
"ref_tr_index_id_hashed",
"ref_pahospital_code_hashed",
"ref_tpahospital_code_hashed",
"ref_paddress_hashed",
"fk_icmr_labid_hashed",
"index_labcode",
"ref_labid",
"index_pgender",
"index_pstate",
"index_pdistrict",
"index_plocation",
"index_pzone",
"index_pward",
"index_ptaluka",
"index_astatus",
"index_anumber",
"index_adname",
"index_adnumber",
"index_stambulance",
"index_stdate",
"index_audate",
"index_cdate",
"index_pcdate",
"index_adddate",
"index_bmdate",
"index_moddate",
"index_admdate",
"index_movdate",
"index_disdate",
"index_pudate",
"index_cudate",
"index_distcode",
"index_distrefno",
"index_ptype",
"ref_labid_hashed",
"index_fdate",
"index_todate",
"index_apcnumber",
"index_pbtype",
"index_pbquota",
"icmr_pupdate",
"index_bucode",
"index_trbucode",
"index_commstatus",
"index_commby",
"index_dristatus",
"index_driby",
"index_vehnumber",
"index_hosstatus",
"index_hosby",
"index_padone",
"index_padate",
"index_pstatus",
"index_dsummary",
"index_statusuby",
"index_ureason",
"index_usummary",
"index_padmitted",
"index_hcode",
"index_pahospital",
"index_tpahospital",
"index_htype",
"index_pbedcode",
"index_labname",
"index_pid",
"cowin_dose_1_date",
"cowin_d1_vaccinated_by",
"cowin_dose_2_date",
"cowin_d2_vaccinated_by",
"cowin_pupdate",
"cowin_gender",
"index_remarks",
"icmr_a_icmr_test_type"
],
"joined_result_columns": [
"icmr_a_icmr_test_result",
"icmr_a_test_ct_value",
"icmr_a_sample_genetic_sequenced",
"fk_genetic_strain",
"index_pasymp",
"index_dsinfection",
"index_potravel",
"index_page",
"cowin_age",
"cowin_vaccine_name"
]
}
}
},
{
"name": "PrivateTrain",
"config": {
"input_dataset_path": "/tmp/sandbox_icmr_cowin_index_without_key_identifiers.csv",
"saved_model_path": "/mnt/remote/model/model.onnx",
"saved_model_optimizer": "/mnt/remote/model/dpsgd_model_opimizer.pth",
"trained_model_output_path": "/mnt/remote/output/model.onnx",
"saved_weights_path": "",
"batch_size": 2,
"total_epochs": 5,
"max_grad_norm": 0.1,
"epsilon_threshold": 1.0,
"delta": 0.01,
"sample_size": 60000,
"target_variable": "icmr_a_icmr_test_result",
"test_train_split": 0.2,
"metrics": [
"accuracy",
"precision",
"recall"
]
}
}
]
}
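The structure of this file — a top-level `"pipeline"` list of named steps, each carrying its own `"config"` — suggests how the runner processes it. The following is an illustrative sketch only (the real runner lives in the `pytrain` package, and these handler names and signatures are assumptions, not its actual API):

```python
import json

def run_join(config):
    # Placeholder for the dataset-join stage.
    return f"joined {len(config['datasets'])} datasets"

def run_private_train(config):
    # Placeholder for the differentially private training stage.
    return f"trained with batch_size={config['batch_size']}"

# Map step names from the config to their handlers.
HANDLERS = {"Join": run_join, "PrivateTrain": run_private_train}

def run_pipeline(config_text):
    # Walk the "pipeline" list in order, dispatching each step's
    # "config" object to the handler registered for its "name".
    results = []
    for step in json.loads(config_text)["pipeline"]:
        results.append(HANDLERS[step["name"]](step["config"]))
    return results
```

Keeping both the join query and the training hyperparameters (batch size, gradient clipping norm, epsilon/delta budget) in one declarative file lets the contract reference a single pipeline configuration for the whole clean-room run.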
