This repository contains an implementation of private, collaborative, and publicly verifiable training of a logistic regression model using Noir. This project is the result of the NGR Request for Private Shared States using Noir, sponsored by Aztec Labs.
The project contains the following features implemented in the Noir programming language:
- An implementation of fixed-point numbers using quantized arithmetic.
- An implementation of deterministic logistic regression training, for both the binary and the multi-class case.
- Benchmarks and tests of the performance and accuracy of the implementation using the Iris plants and Breast cancer datasets.
First, you need to include the library in your Nargo.toml file as follows:
```toml
[package]
name = "noir_project"
type = "bin"
authors = [""]
compiler_version = ">=0.36.0"

[dependencies]
noir_mpc_ml = { git = "https://github.com/hashcloak/noir-mpc-ml", branch = "master", directory = "lib" }
```
Below, we present an example of how to use the training for a dataset with 30 samples, 4 features, and 3 classes. For this example, suppose that the prover wants to prove that they know a dataset that produces a certain set of parameters, known to the verifier, for a logistic regression model trained with a public number of epochs and a public learning rate. The source code is as follows:
```rust
use noir_mpc_ml::ml::train_multi_class;
use noir_mpc_ml::quantized::Quantized;

fn main(
    // Private training data: 30 samples with 4 features each.
    data: [[Quantized; 4]; 30],
    // Private one-hot labels: one row of 30 entries per class.
    labels: [[Quantized; 30]; 3],
    learning_rate: pub Quantized,
    ratio: pub Quantized,
    epochs: pub u64,
    // Expected model parameters, known to the verifier:
    // a (weights, bias) pair for each of the 3 classes.
    parameters: pub [([Quantized; 4], Quantized); 3],
) {
    let parameters_train = train_multi_class(epochs, data, labels, learning_rate, ratio);
    assert(parameters == parameters_train);
}
```
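The library also covers the two-class case mentioned earlier. The sketch below is hypothetical: it assumes a binary training entry point, here called train, that takes a single label vector and returns one (weights, bias) pair; check the library's ml module for the actual name and signature.

```rust
use noir_mpc_ml::ml::train; // hypothetical name; see the library's ml module
use noir_mpc_ml::quantized::Quantized;

fn main(
    data: [[Quantized; 4]; 30],
    labels: [Quantized; 30], // one quantized 0/1 label per sample
    learning_rate: pub Quantized,
    ratio: pub Quantized,
    epochs: pub u64,
    parameters: pub ([Quantized; 4], Quantized),
) {
    // Assumed signature, mirroring train_multi_class for a single class.
    let (weights, bias) = train(epochs, data, labels, learning_rate, ratio);
    assert(parameters.0 == weights);
    assert(parameters.1 == bias);
}
```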
Some of the concepts in the previous example will be explained in depth later, but we cover the basics here. The Quantized type represents a fixed-point number. To train a model, we use the train_multi_class method, which receives the number of epochs, the features of each sample, the labels, the learning rate, and the ratio, which corresponds to the $1/n$ factor in the gradient-descent update (provided as an input because computing it inside the circuit would be expensive). Notice that the labels are provided in one-hot form: the labels matrix will have a 1 in position $(c, i)$ if the $i$-th sample belongs to class $c$, and a 0 otherwise.
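As an illustration of this one-hot encoding (using 5 samples and 3 classes instead of the 30 and 3 from the example above), the label vector $y = (0, 2, 1, 0, 2)$ is encoded as

$$
\text{labels} =
\begin{pmatrix}
1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1
\end{pmatrix},
$$

where row $c$ has a 1 in the columns of the samples that belong to class $c$.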
We executed benchmarks for the logistic regression library using the Iris and the Wine datasets. To run these benchmarks yourself, go to the benchmarks/ folder, where you will find instructions to execute and test the library.
In the following tables, we present the number of ACIR opcodes and gates for different numbers of epochs and training samples (first table: Iris; second table: Wine). The number of gates is measured using the Noir tooling.
| Epochs | # of training samples | ACIR opcodes | # of gates |
|---|---|---|---|
| 10 | 20 | 573,556 | 854,462 |
| 10 | 60 | 1,507,108 | 2,250,434 |
| 10 | 100 | 2,440,660 | 3,646,936 |
| 20 | 20 | 1,199,518 | 1,788,902 |
| 30 | 20 | 1,825,639 | 2,726,816 |
| Epochs | # of training samples | ACIR opcodes | # of gates |
|---|---|---|---|
| 10 | 20 | 792,313 | 1,168,109 |
| 10 | 60 | 1,761,142 | 2,607,698 |
| 10 | 100 | 2,729,836 | 4,047,862 |
| 20 | 20 | 1,638,775 | 2,417,639 |
| 30 | 20 | 2,485,396 | 3,670,703 |
The following table shows the training time using co-noir for the Iris dataset, on a server with an AMD EPYC processor and 32 GB of RAM.
| Epochs | # of training samples | Training time [sec] |
|---|---|---|
| 10 | 30 | 3,162 |
| 10 | 50 | 4,971 |
| 20 | 30 | 6,105 |
| 20 | 50 | 9,969 |
The fixed-point arithmetic follows the strategy presented in the paper by Catrina and Saxena. In that paper, the authors propose a way to represent fixed-point numbers using field elements for MPC protocols, along with MPC protocols for addition, multiplication, and division. In the context of zero-knowledge proofs, we saw this paper as an opportunity to implement fixed-point arithmetic, given that the primitive data type in Noir is the Field. This allows us to implement the fixed-point data type without relying on Noir's native integer types, which would impose additional overhead on the computation.
In this representation, a fixed-point number is a Field element wrapped in the Quantized struct. The field element represents a fractional number $\tilde{x}$ with $f$ bits of fractional precision, obtained by computing $\bar{x} = \tilde{x} \cdot 2^f$. Adding two Quantized elements corresponds to adding their encodings. However, multiplication requires a truncation, given that multiplying both encodings results in a number with precision $2f$, which must be scaled back down by $2^f$ to recover an encoding with precision $f$.
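To make the encoding concrete, below is a minimal, standalone Noir sketch of the idea. It assumes an illustrative precision of $f = 16$ fractional bits and uses plain field division for the final scaling (exact here only because the example values divide evenly); the precision used by the library and its actual truncation procedure may differ.

```rust
fn main() {
    // Illustrative scale 2^f with f = 16 fractional bits.
    let scale: Field = 65536;
    let a: Field = 98304; // encodes 1.5  (1.5  * 2^16)
    let b: Field = 147456; // encodes 2.25 (2.25 * 2^16)

    // Adding the encodings encodes the sum: 1.5 + 2.25 = 3.75.
    assert(a + b == 245760); // 3.75 * 2^16

    // Multiplying the encodings yields a result with 2f fractional bits:
    // (1.5 * 2^16) * (2.25 * 2^16) = 3.375 * 2^32.
    let product = a * b;

    // Truncate back to f fractional bits by dividing by 2^f. These values
    // are exactly divisible, so field division recovers the integer quotient.
    let truncated = product / scale;
    assert(truncated == 221184); // 3.375 * 2^16
}
```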
The logistic regression training algorithm is implemented using gradient descent. This is an iterative method that updates the parameters of the model in the direction of the negative gradient of the log-loss function.
The algorithm is as follows:
Inputs: the data samples $x_i \in \mathbb{R}^m$ with labels $y_i \in \{0, 1\}$ for $i \in \{1, \dots, n\}$, the learning rate $\alpha$, and the number of epochs $E$.

- Let $w \in \mathbb{R}^m$ and $b \in \mathbb{R}$ be initialized to zero.
- Execute the following steps $E$ times:
  - For each $j \in \{1, \dots, m\}$, update $w_j = w_j - (\alpha / n) \cdot \sum_{i=1}^n [\sigma(w \cdot x_i + b) - y_i] \cdot X_{ij}$.
  - Update $b = b - (\alpha / n) \cdot \sum_{i=1}^n [\sigma(w \cdot x_i + b) - y_i]$.
- Return $w$ and $b$.

Here, $\sigma(z) = 1 / (1 + e^{-z})$ is the sigmoid function, and $X_{ij}$ denotes the $j$-th feature of the $i$-th sample.
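The update rule comes from differentiating the log-loss. As a brief check of the formulas above, the loss and its partial derivatives are:

$$
\mathcal{L}(w, b) = -\frac{1}{n} \sum_{i=1}^n \Big[ y_i \log \sigma(w \cdot x_i + b) + (1 - y_i) \log\big(1 - \sigma(w \cdot x_i + b)\big) \Big],
$$

$$
\frac{\partial \mathcal{L}}{\partial w_j} = \frac{1}{n} \sum_{i=1}^n \big[ \sigma(w \cdot x_i + b) - y_i \big] \, X_{ij},
\qquad
\frac{\partial \mathcal{L}}{\partial b} = \frac{1}{n} \sum_{i=1}^n \big[ \sigma(w \cdot x_i + b) - y_i \big],
$$

using the identity $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$. The updates above are exactly $w_j \leftarrow w_j - \alpha \, \partial \mathcal{L} / \partial w_j$ and $b \leftarrow b - \alpha \, \partial \mathcal{L} / \partial b$.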
We thank Aztec for funding this project.