Commit

chore: improve wording and simply schema
RomanBredehoft committed Jun 7, 2024
1 parent e9f4f70 commit a7b33c6
Showing 1 changed file with 11 additions and 23 deletions.
34 changes: 11 additions & 23 deletions docs/advanced_examples/EncryptedPandas.ipynb
@@ -51,7 +51,7 @@
"source": [
"### User 1\n",
"\n",
"On the first user's side, load the private data using Pandas. For this example, we took the [Tips]( https://www.kaggle.com/code/sanjanabasu/tips-dataset/input) dataset and separated it into two csv files so that: \n",
"On the first user's side, load the private data using Pandas. This example uses the [Tips](https://www.kaggle.com/code/sanjanabasu/tips-dataset/input) dataset. It was split into two CSV files so that: \n",
"- all columns are different, except for column \"index\", representing the initial data-frame's index\n",
"- some indexes are common, some others are not"
]
@@ -198,30 +198,18 @@
"source": [
"In order to be encrypted, string values first need to be mapped to integers (see section below about `get_schema`). By default, this mapping is done automatically. However, for example, the column won't be able to be selected when merging encrypted data-frames. This is because such an operator requires the data-frames' string mapping to match, else values will be mixed up.\n",
"\n",
"This is exactly the case here, as the index column only contains string values, and we thus need to define the mapping ourselves. This mapping will then be shared to the second client (see below) in order to make sure both matches. Other non-integer columns do not require any pre-computed mapping if they are not expected to be selected for merging. All mappings are grouped per column as a dictionary, called \"schema\". \n",
"This is exactly the case here: the index column only contains string values, so the mapping must be defined by the application developer. This mapping will then be shared with the second client (see below) to make sure both mappings match. Other non-integer columns do not require any pre-computed mapping if they are not expected to be selected for merging. All mappings are grouped per column as a dictionary, called \"schema\". \n",
"\n",
"Therefore, let's define our schema:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"schema = {\n",
" \"index\": {\n",
" \"client_1\": 1,\n",
" \"client_2\": 2,\n",
" \"client_3\": 3,\n",
" \"client_4\": 4,\n",
" \"client_5\": 5,\n",
" \"client_6\": 6,\n",
" \"client_7\": 7,\n",
" \"client_8\": 8,\n",
" \"client_9\": 9,\n",
" }\n",
"}"
"schema = {\"index\": {index_value: i + 1 for i, index_value in enumerate(df_left[\"index\"].values)}}"
]
},
{
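Note on the hunk above: the one-line comprehension replaces the hand-written mapping only if the rows of `df_left` appear in the order `client_1` … `client_9`, which is an assumption here since the CSV content is not part of this diff. A minimal sketch of that equivalence, using a plain list as a hypothetical stand-in for `df_left["index"].values`:

```python
# Hypothetical stand-in for df_left["index"].values; the notebook reads the
# real values from a CSV file that is not shown in this diff.
index_values = [f"client_{i}" for i in range(1, 10)]

# Same comprehension as in the commit: each string gets a 1-based integer,
# assigned in order of appearance.
schema = {"index": {index_value: i + 1 for i, index_value in enumerate(index_values)}}

# The hand-written mapping that the commit removed, rebuilt for comparison.
explicit_schema = {"index": {f"client_{i}": i for i in range(1, 10)}}

assert schema == explicit_schema
```

With this approach the mapping covers only the values present in the first user's data-frame, and a repeated index value would keep the integer of its last occurrence. How the library treats strings missing from the schema is not shown in this diff.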
@@ -271,7 +259,7 @@
"- floating points: the values are quantized under a certain precision, and quantization parameters (scale, zero-point) are sent to the server\n",
"- strings: the values are mapped to integers using a dict, which is sent to the server as well\n",
"\n",
"More generally, the quantized values need be within the range currently allowed. This notably means that the number of rows allowed in a data-frame are also limited, as we expect the keys on which to merge to be unique.\n",
"More generally, the quantized values must be within the range currently allowed. This notably means that the number of rows allowed in a data-frame is also limited, as keys on which to merge are expected to be unique.\n",
"\n",
"Once the inputs are quantized and encrypted, the user can print the encrypted data-frame's schema. A schema represents the data-frame's columns as well as their dtype and associated quantization parameters or mappings. "
]
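Note on the hunk above: the float handling it describes (quantization under a fixed precision, with a scale and zero-point kept alongside the data) can be illustrated with a generic uniform quantizer. This is a sketch under stated assumptions, not Concrete ML's implementation: the function names, the 4-bit default, the `[0, 15]` integer range and the sample values are all hypothetical.

```python
def quantize(values, n_bits=4):
    """Map floats to integers in [0, 2**n_bits - 1] (generic uniform quantization)."""
    lo, hi = min(values), max(values)
    q_max = 2**n_bits - 1
    # Guard against a constant column, where hi == lo would give a zero scale.
    scale = (hi - lo) / q_max if hi > lo else 1.0
    zero_point = lo
    quantized = [round((v - zero_point) / scale) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from the integers and the quantization parameters."""
    return [q * scale + zero_point for q in quantized]


# Hypothetical float column.
values = [1.01, 3.5, 10.34, 50.81]
quantized, scale, zero_point = quantize(values)
restored = dequantize(quantized, scale, zero_point)

# Each restored value is off by at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(values, restored))
assert max_error <= scale / 2 + 1e-9
```

The half-step error bound is why a low bit width produces visible differences against a clear Pandas result, and why a comparison has to use a tolerance rather than exact equality.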
@@ -477,7 +465,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Currently, the users need to share the private keys in order to be able to run an encrypted merge. We are currently working on new techniques that would avoid this."
"Currently, the users need to share the private keys in order to be able to run an encrypted merge. Future work will provide new techniques that avoid this."
]
},
{
@@ -502,7 +490,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Encrypt the second user's data-frame. Here, we need to use the same schema used for client 1 in order to make sure that custom mappings are matching.\n",
"Encrypt the second user's data-frame. Here, the same schema used for client 1 is required to make sure that the custom mappings match.\n",
"\n",
"It is possible to get the encrypted data-frame's representation by simply returning the variable."
]
@@ -606,7 +594,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We now chose to run a left join on the encrypted data-frames' common column \"index\" using FHE. This step can take several seconds. "
"The server can now run a left join on the encrypted data-frames' common column \"index\" using FHE. This step can take several seconds. "
]
},
{
@@ -843,7 +831,7 @@
"source": [
"### Concrete ML vs Pandas comparison\n",
"\n",
"As this is only a demo in a notebook, we are able to compute Pandas' expected output (in a non-private setting) and compare it to the result above. "
"For this demo, expected output from Pandas (in a non-private setting) can be computed and compared to the result above. "
]
},
{
@@ -1014,7 +1002,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We can observe slight differences between Pandas and Concrete ML with floating points values. This is only due to quantization artifacts, as we currently only allow a few bits of precision. We can still see that both data-frames are equal under a small float relative tolerance."
"Slight differences can be observed between Pandas and Concrete ML with floating-point values. This is only due to quantization artifacts, as currently only 4 bits of precision are supported. Still, both data-frames are equal under a small float relative tolerance."
]
},
{
@@ -1060,7 +1048,7 @@
"\n",
"#### Future Work\n",
"\n",
"We are currently working on improving the encrypted data-frame feature. In the near future, we are planning on allowing bigger precisions, which would make encrypted data-frames able to handle larger integers, floating points with better precisions and more unique strings values, as well as provide more rows. We will also add support for more encrypted operations on data-frames. Additionally, we are working new techniques that would avoid users having to share a private keys between themselves. "
"In the near future, higher precision will be allowed, which would make encrypted data-frames able to handle larger integers, floating-point values with better precision and more unique string values, as well as more rows. Support for more encrypted operations on data-frames will also be added. While users need to share private keys with the current version of the API, threshold decryption, a multi-party key generation protocol, could allow them to compute on joint data without revealing it to each other."
]
}
],