Commit

docs: add quickstart for multi-table synthesis (#89)
* docs: add quickstart for multi-table synthesis

* fix(linting): code formatting

---------

Co-authored-by: Fabiana Clemente <[email protected]>
Co-authored-by: Azory YData Bot <[email protected]>
3 people authored Feb 20, 2024
1 parent bb5d9c4 commit d7e99ca
Showing 9 changed files with 60 additions and 1 deletion.
Binary files not shown (5 files).
57 changes: 57 additions & 0 deletions docs/get-started/create_database_sd_generator.md
@@ -0,0 +1,57 @@
# How to create your first Relational Database Synthetic Data generator

:fontawesome-brands-youtube:{ .youtube }
Check this quickstart video on <a href="https://youtu.be/40Q56xZbv00?si=T6DMZ-f8mAyPdzf7"><u>how to create your first Relational Database Synthetic Data generator</u></a>.

To generate your first synthetic relational database, you need to have a Multi-Table Dataset already available in your Data Catalog.
Check this tutorial to see how you can <a href="../create_multitable_dataset"><u>add your first dataset to Fabric’s Data Catalog</u></a>.

With your database created as a Datasource, you can now start configuring your Synthetic Data (SD) generator to create a replica of your database.
You can either select **"Synthetic Data"** from your left side menu, or you can select **"Create Synthetic Data"** in your project Home
as shown in the image below.

![Create Synthetic Data](../assets/quickstart/synthetic_data/create_synthetic_data.webp){: style="width:75%"}

You'll be asked to select the dataset you wish to generate synthetic data from and verify the tables you'd like to
include in the synthesis process, validating their data types - *Time-series* or *Tabular*.

!!! Tip "Table data types are relevant for synthetic data quality"
    If some of your tables hold time-series information (meaning there is a time relation between records), it is very important
    that you update your tables' data types accordingly while configuring your synthetic data generator.
    This ensures not only the quality of that particular table, but also the overall database quality and relations.

![Configure the schema](../assets/quickstart/synthetic_data/configure_schema_sd.webp){: style="width:75%"}

All primary keys (PK) and foreign keys (FK) identified from the database schema definition have an anonymization setting created automatically.
By default, a standard incremental integer is used as the anonymization configuration, but you can switch to other pre-defined generation options
or to a regex-based one (where you provide the expected generation pattern).

![Multi-Table Anonymization](../assets/quickstart/synthetic_data/mt_anonymization.webp){: style="width:75%"}
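
To make the default behaviour more concrete, here is a minimal Python sketch of incremental-integer key anonymization. It is not Fabric's implementation: the `customers` and `orders` tables, their columns, and the `build_key_mapping` helper are hypothetical and only illustrate why the same mapping has to be applied to a PK and to every FK that references it.

```python
# Illustrative sketch only, NOT Fabric's implementation.
# Tables, columns and helper names below are hypothetical.

def build_key_mapping(keys):
    """Map each distinct original key to an incremental integer, starting at 1."""
    mapping = {}
    for key in keys:
        if key not in mapping:
            mapping[key] = len(mapping) + 1
    return mapping


customers = [{"customer_id": "C-9f3a"}, {"customer_id": "C-17bd"}]
orders = [
    {"order_id": "O-aa01", "customer_id": "C-17bd"},
    {"order_id": "O-aa02", "customer_id": "C-9f3a"},
    {"order_id": "O-aa03", "customer_id": "C-17bd"},
]

# One mapping per primary key column.
customer_map = build_key_mapping(row["customer_id"] for row in customers)
order_map = build_key_mapping(row["order_id"] for row in orders)

# The PK mapping of the parent table is reused for the FK in the child table,
# so every order still points at the same (now anonymized) customer.
anon_customers = [{"customer_id": customer_map[r["customer_id"]]} for r in customers]
anon_orders = [
    {
        "order_id": order_map[r["order_id"]],
        "customer_id": customer_map[r["customer_id"]],
    }
    for r in orders
]

print(anon_customers)
print(anon_orders)
```

Because both tables share `customer_map`, the relationship between orders and customers survives the anonymization even though the original identifiers are gone.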

Finally, the last step covers the configurations specific to the **Synthetic Data** generator. In this case, we need to
define both the *Display Name* and the *Destination connector*. The *Destination connector* is mandatory and lets you select the database where
the generated synthetic database will be written.
After providing both inputs, finish the process by clicking the **"Save"** button, as shown in the image below.

![Select a connector](../assets/quickstart/synthetic_data/select_connector_sd_samples.webp){: style="width:75%"}

Your **Synthetic Data** generator is now training and listed under **"Synthetic Data"**. While the model is being trained, the *Status* will be
🟡. As soon as the training completes successfully, it will transition to 🟢.
Once the Synthetic Data generator has finished training, you're ready to start generating your first synthetic dataset.
You can start by exploring an overview of the model configurations and even validating the quality of the synthetic data generator from a referential integrity
point of view.

![Synthetic data generator training completed](../assets/quickstart/synthetic_data/mt_sd_trained.webp){: style="width:75%"}
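
If you want to reason about what a referential integrity check means, the sketch below shows the basic idea in Python: every foreign key value in a child table should exist as a primary key in its parent table. This is an assumption-level illustration, not how Fabric computes its report; the `orphaned_foreign_keys` helper and the sample tables are hypothetical.

```python
# Illustrative sketch only, NOT Fabric's quality report.
# Helper name, tables and columns are hypothetical.

def orphaned_foreign_keys(child_rows, fk_column, parent_rows, pk_column):
    """Return the sorted FK values in child_rows with no matching PK in parent_rows."""
    parent_keys = {row[pk_column] for row in parent_rows}
    orphans = {
        row[fk_column]
        for row in child_rows
        if row[fk_column] is not None and row[fk_column] not in parent_keys
    }
    return sorted(orphans)


synthetic_customers = [{"customer_id": 1}, {"customer_id": 2}]
synthetic_orders = [
    {"order_id": 1, "customer_id": 2},
    {"order_id": 2, "customer_id": 1},
    {"order_id": 3, "customer_id": 7},  # deliberately broken reference
]

orphans = orphaned_foreign_keys(
    synthetic_orders, "customer_id", synthetic_customers, "customer_id"
)
print(orphans)
```

An empty result means every child record points at an existing parent record; any value returned identifies a broken relation.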

Next, you can generate synthetic data samples by accessing the *Generation* tab or clicking *"Go to Generation"*.
In this section, you can generate as many synthetic samples as you want.
To do so, define the size of the synthetic database relative to the real one. This ratio is provided as a percentage.
In the example below, we have requested a sample with 100% size, meaning a synthetic database with the same size as the original.

![Generate synthetic data records](../assets/quickstart/synthetic_data/generate_database.webp){: style="width:75%"}
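
As a rough way to anticipate the output size, the percentage can be read as a scaling factor over the original row counts. The Python sketch below assumes each table is scaled proportionally, which this guide does not state explicitly; the `target_row_counts` helper and the row counts are hypothetical.

```python
# Illustrative arithmetic only. Proportional per-table scaling is an assumption,
# and the table names and row counts are made up.

def target_row_counts(original_counts, size_percent):
    """Scale each table's row count by size_percent (100 means same size)."""
    if size_percent <= 0:
        raise ValueError("size_percent must be positive")
    return {
        table: round(rows * size_percent / 100)
        for table, rows in original_counts.items()
    }


original = {"customers": 1000, "orders": 5200}

same_size = target_row_counts(original, 100)
half_size = target_row_counts(original, 50)

print(same_size)
print(half_size)
```

With 100% the expected counts match the original database, while 50% halves every table under the proportional assumption.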

A new line will be shown in your *"Sample History"*, and as soon as the sample generation is completed you will be able to
check the quality of the synthetic data, which is already available in your destination database.

**Congrats!** 🚀 You have now successfully created your first Relational **Synthetic Database** with Fabric.
Get ready for your journey of improved quality data for AI.
2 changes: 1 addition & 1 deletion docs/get-started/create_syntheticdata_generator.md
@@ -4,7 +4,7 @@
Check this quickstart video on <a href="https://youtu.be/GsfggG9PhgE?si=ixlCaesd3cLFOCZm"><u>how to create your first Synthetic Data generator</u></a>.

To generate your first synthetic data, you need to have a Dataset already available in your Data Catalog.
-Check this tutorial to see how you can <a href="upload_csv"><u>add your first dataset to Fabric’s Data Catalog</u></a>.
+Check this tutorial to see how you can <a href="../upload_csv"><u>add your first dataset to Fabric’s Data Catalog</u></a>.

With your first dataset created, you are now able to start the creation of your Synthetic Data generator. You can either
select **"Synthetic Data"** from your left side menu, or you can select **"Create Synthetic Data"** in your project Home
1 change: 1 addition & 0 deletions docs/get-started/index.md
@@ -7,5 +7,6 @@ data quality, data preparation workflows and how you can start leveraging synthe
### 📚 <a href="upload_csv"><u>Create your first Dataset with the Data Catalog</u></a>
### 💾 <a href="create_multitable_dataset"><u>Create your Multi-Table Dataset with the Data Catalog</u></a>
### ⚙️ <a href="create_syntheticdata_generator"><u>Create your first Synthetic Data generator</u></a>
### 🗄️ <a href="create_database_sd_generator"><u>Create a Relational Database Synthetic Data generator</u></a>
### 🧪 <a href="create_lab"><u>Create your first Lab</u></a>
### 🌀 <a href="create_pipeline"><u>Create your first data Pipeline</u></a>
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -12,6 +12,7 @@ nav:
- How to create your first Dataset from a CSV file: "get-started/upload_csv.md"
- How to create your first Relational database in Fabric's Catalog: "get-started/create_multitable_dataset.md"
- How to create your first Synthetic Data generator: "get-started/create_syntheticdata_generator.md"
- How to create your first Synthetic Data generator for databases: "get-started/create_database_sd_generator.md"
- How to create your first Lab: "get-started/create_lab.md"
- How to create your first Pipeline: "get-started/create_pipeline.md"
- Fabric Community: "get-started/fabric_community.md"