This repository contains an implementation of private, collaborative, and publicly verifiable training of a logistic regression model using Noir. This project is the result of the NGR Request for Private Shared States using Noir, sponsored by Aztec Labs.
The project contains the following features implemented in the Noir programming language:
- An implementation of fixed-point numbers using quantized arithmetic.
- An implementation of deterministic logistic regression training, for both the binary and the multi-class case.
- Benchmarks and tests of the performance and accuracy of the implementation using the Iris plants and Breast cancer datasets.
First, you need to include the library in your Nargo.toml file as follows:
```toml
[package]
name = "noir_project"
type = "bin"
authors = [""]
compiler_version = ">=0.36.0"

[dependencies]
noir_mpc_ml = { git = "https://github.com/hashcloak/noir-mpc-ml", branch = "master", directory = "lib" }
```
Below, we present an example of how to use the training for a dataset with 30 samples, 4 features, and 3 classes. For this example, suppose that the prover wants to prove that they know a dataset that produces a certain set of parameters, known to the verifier, for a logistic regression model trained with a public number of epochs and a public learning rate. The source code is as follows:
```rust
use noir_mpc_ml::ml::train_multi_class;
use noir_mpc_ml::quantized::Quantized;

fn main(
    // Private training data: 30 samples with 4 features each.
    data: [[Quantized; 4]; 30],
    // Private one-hot labels: one row of 30 entries per class.
    labels: [[Quantized; 30]; 3],
    learning_rate: pub Quantized,
    ratio: pub Quantized,
    epochs: pub u64,
    // Expected model parameters, known to the verifier:
    // a (weights, bias) pair for each of the 3 classes.
    parameters: pub [([Quantized; 4], Quantized); 3],
) {
    let parameters_train = train_multi_class(epochs, data, labels, learning_rate, ratio);
    assert(parameters == parameters_train);
}
```
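The library also covers the two-class case mentioned earlier. The sketch below is hypothetical: it assumes a binary training entry point, here called train, that takes a single label vector and returns one (weights, bias) pair; check the library's ml module for the actual name and signature.

```rust
use noir_mpc_ml::ml::train; // hypothetical name; see the library's ml module
use noir_mpc_ml::quantized::Quantized;

fn main(
    data: [[Quantized; 4]; 30],
    labels: [Quantized; 30], // one quantized 0/1 label per sample
    learning_rate: pub Quantized,
    ratio: pub Quantized,
    epochs: pub u64,
    parameters: pub ([Quantized; 4], Quantized),
) {
    // Assumed signature, mirroring train_multi_class for a single class.
    let (weights, bias) = train(epochs, data, labels, learning_rate, ratio);
    assert(parameters.0 == weights);
    assert(parameters.1 == bias);
}
```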
Some of the concepts in the previous example will be explained in depth later, but we cover the basics here. The Quantized type represents a fixed-point number. To train a model, we use the train_multi_class method, which receives the number of epochs, the features of each sample, the labels, the learning rate, and the ratio, which corresponds to the $1/n$ factor in the gradient-descent update (provided as an input because computing it inside the circuit would be expensive). Notice that the labels are provided in one-hot form: the labels matrix will have a 1 in position $(c, i)$ if the $i$-th sample belongs to class $c$, and a 0 otherwise.
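As an illustration of this one-hot encoding (using 5 samples and 3 classes instead of the 30 and 3 from the example above), the label vector $y = (0, 2, 1, 0, 2)$ is encoded as

$$
\text{labels} =
\begin{pmatrix}
1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1
\end{pmatrix},
$$

where row $c$ has a 1 in the columns of the samples that belong to class $c$.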
We executed benchmarks for the logistic regression library using the Iris and the Wine datasets. To run these benchmarks yourself, go to the benchmarks/ folder, where you will find instructions to execute and test the library.
In the following tables, we present the number of ACIR opcodes and gates for different numbers of epochs and training samples (first table: Iris; second table: Wine). The number of gates is measured using the Noir tooling.
| Epochs | # of training samples | ACIR opcodes | # of gates |
|---|---|---|---|
| 10 | 20 | 573,556 | 854,462 |
| 10 | 60 | 1,507,108 | 2,250,434 |
| 10 | 100 | 2,440,660 | 3,646,936 |
| 20 | 20 | 1,199,518 | 1,788,902 |
| 30 | 20 | 1,825,639 | 2,726,816 |
| Epochs | # of training samples | ACIR opcodes | # of gates |
|---|---|---|---|
| 10 | 20 | 792,313 | 1,168,109 |
| 10 | 60 | 1,761,142 | 2,607,698 |
| 10 | 100 | 2,729,836 | 4,047,862 |
| 20 | 20 | 1,638,775 | 2,417,639 |
| 30 | 20 | 2,485,396 | 3,670,703 |
The following table shows the training time using co-noir for the Iris dataset, on a server with an AMD EPYC processor and 32 GB of RAM.
| Epochs | # of training samples | Training time [sec] |
|---|---|---|
| 10 | 30 | 3,162 |
| 10 | 50 | 4,971 |
| 20 | 30 | 6,105 |
| 20 | 50 | 9,969 |
The fixed-point arithmetic follows the strategy presented in the paper by Catrina and Saxena. In that paper, the authors propose a way to represent fixed-point numbers using field elements for MPC protocols, along with MPC protocols for addition, multiplication, and division. In the context of zero-knowledge proofs, we saw this paper as an opportunity to implement fixed-point arithmetic, given that the primitive data type in Noir is the Field. This allows us to implement the fixed-point data type without relying on Noir's native integer types, which would impose additional overhead on the computation.
In this representation, a fixed-point number is a Field element wrapped in the Quantized struct. The field element represents a fractional number $\tilde{x}$ with $f$ bits of fractional precision, obtained by computing $\bar{x} = \tilde{x} \cdot 2^f$. Adding two Quantized elements corresponds to adding their encodings. However, multiplication requires a truncation, given that multiplying both encodings results in a number with precision $2f$, which must be scaled back down by $2^f$ to recover an encoding with precision $f$.
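To make the encoding concrete, below is a minimal, standalone Noir sketch of the idea. It assumes an illustrative precision of $f = 16$ fractional bits and uses plain field division for the final scaling (exact here only because the example values divide evenly); the precision used by the library and its actual truncation procedure may differ.

```rust
fn main() {
    // Illustrative scale 2^f with f = 16 fractional bits.
    let scale: Field = 65536;
    let a: Field = 98304; // encodes 1.5  (1.5  * 2^16)
    let b: Field = 147456; // encodes 2.25 (2.25 * 2^16)

    // Adding the encodings encodes the sum: 1.5 + 2.25 = 3.75.
    assert(a + b == 245760); // 3.75 * 2^16

    // Multiplying the encodings yields a result with 2f fractional bits:
    // (1.5 * 2^16) * (2.25 * 2^16) = 3.375 * 2^32.
    let product = a * b;

    // Truncate back to f fractional bits by dividing by 2^f. These values
    // are exactly divisible, so field division recovers the integer quotient.
    let truncated = product / scale;
    assert(truncated == 221184); // 3.375 * 2^16
}
```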
The logistic regression training algorithm is implemented using gradient descent. This is an iterative method that updates the parameters of the model in the direction of the negative gradient of the log-loss function.
The algorithm is as follows:
Inputs: the data samples $x_i \in \mathbb{R}^m$ with labels $y_i \in \{0, 1\}$ for $i \in \{1, \dots, n\}$, the learning rate $\alpha$, and the number of epochs $E$.

- Let $w \in \mathbb{R}^m$ and $b \in \mathbb{R}$ be initialized to zero.
- Execute the following steps $E$ times:
  - For each $j \in \{1, \dots, m\}$, update $w_j = w_j - (\alpha / n) \cdot \sum_{i=1}^n [\sigma(w \cdot x_i + b) - y_i] \cdot X_{ij}$.
  - Update $b = b - (\alpha / n) \cdot \sum_{i=1}^n [\sigma(w \cdot x_i + b) - y_i]$.
- Return $w$ and $b$.

Here, $\sigma(z) = 1 / (1 + e^{-z})$ is the sigmoid function, and $X_{ij}$ denotes the $j$-th feature of the $i$-th sample.
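The update rule comes from differentiating the log-loss. As a brief check of the formulas above, the loss and its partial derivatives are:

$$
\mathcal{L}(w, b) = -\frac{1}{n} \sum_{i=1}^n \Big[ y_i \log \sigma(w \cdot x_i + b) + (1 - y_i) \log\big(1 - \sigma(w \cdot x_i + b)\big) \Big],
$$

$$
\frac{\partial \mathcal{L}}{\partial w_j} = \frac{1}{n} \sum_{i=1}^n \big[ \sigma(w \cdot x_i + b) - y_i \big] \, X_{ij},
\qquad
\frac{\partial \mathcal{L}}{\partial b} = \frac{1}{n} \sum_{i=1}^n \big[ \sigma(w \cdot x_i + b) - y_i \big],
$$

using the identity $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$. The updates above are exactly $w_j \leftarrow w_j - \alpha \, \partial \mathcal{L} / \partial w_j$ and $b \leftarrow b - \alpha \, \partial \mathcal{L} / \partial b$.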
We thank Aztec for funding this project.