docs: add information about pii classification feature #1517

Merged
merged 30 commits into from
Dec 7, 2023
Commits
07d5819
Update duplicates_pandas.py (#1427)
boris-kogan Aug 21, 2023
6ceeead
chore(actions): update sonarsource/sonarqube-scan-action action to v2…
renovate[bot] Aug 30, 2023
d8ffdb1
chore(actions): update actions/checkout action to v4
renovate[bot] Sep 5, 2023
c6f90cb
docs: setup new docs with mkdocs (#1418)
vascoalramos Sep 12, 2023
cb66a7e
chore(actions): update actions/checkout action to v4
renovate[bot] Sep 12, 2023
fcf57b2
fix: remove the duplicated cardinality threshold under categorical an…
ricardodcpereira Sep 18, 2023
58158ef
fix: fixate matplotlib upper version
ricardodcpereira Sep 18, 2023
829acf5
docs: change from `zap` to `sparkles` (#1447)
Anselmoo Sep 19, 2023
fdc0346
fix: template {{ file_name }} error in HTML wrapper (#1380)
jogecodes Sep 20, 2023
1c500d5
feat: add density histogram (#1458)
alexbarros Sep 26, 2023
62b0231
docs: update README.html (#1461)
miriamspsantos Sep 26, 2023
6c23196
fix: bug when creating a new report (#1440)
chrimaho Sep 27, 2023
df76ea7
fix: gen wordcloud only for non-empty cols (#1459)
alexbarros Sep 27, 2023
b7fac9e
fix: table template ignoring text format (#1462)
alexbarros Sep 27, 2023
797c799
fix: to_category misshandling pd.NA (#1464)
alexbarros Sep 27, 2023
07322a5
docs: add 📊 for Key features (#1451)
Anselmoo Sep 27, 2023
6d60670
docs: fix hyperlink - related to package name change (#1457)
martin-kokos Sep 27, 2023
bc12fde
chore(deps): increase numpy upper limit (#1467)
alexbarros Sep 27, 2023
a57e234
chore(deps): fix numba package version, and filter warns (#1468)
alexbarros Sep 27, 2023
9f8bf18
chore(deps): update dependency typeguard to v4 (#1324)
renovate[bot] Oct 4, 2023
5733205
Merge branch 'develop' of https://github.com/ydataai/ydata-profiling …
Dec 6, 2023
2aca483
Merge remote-tracking branch 'origin/develop' into develop
Dec 6, 2023
86a0ea9
docs: update docs with advent of code
Dec 7, 2023
e7801c2
Merge branch 'develop' into docs/advent_code
fabclmnt Dec 7, 2023
5d9d13a
docs: update links for fabric
Dec 7, 2023
479a646
Merge branch 'docs/advent_code' of https://github.com/ydataai/ydata-p…
Dec 7, 2023
bc6dc22
chore(actions): update actions/setup-python action to v5
renovate[bot] Dec 7, 2023
5b7e966
Merge remote-tracking branch 'origin/docs/advent_code' into docs/adve…
Dec 7, 2023
3ea8258
docs: add information about PII classification & management.
Dec 7, 2023
a55590d
Merge branch 'develop' into docs/PII_experience
fabclmnt Dec 7, 2023
13 changes: 8 additions & 5 deletions docs/features/collaborative_data_profiling.md
@@ -1,16 +1,19 @@
# Data Catalog - A collaborative experience to profile datasets & relational databases
# Data Catalog **
A collaborative experience to profile datasets & relational databases

!!! note "Data Catalog with data quality profiling"
!!! info "** YData's Enterprise feature"

This feature is only available for users of [YData Fabric](https://ydata.ai).

[Sign-up Fabric community](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community) to try the **data catalog**
and **collaborative** experience for datasets and database profiling at scale!
[Sign-up Fabric community](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community) to try the **Data catalog**

[YData Fabric](https://ydata.ai/products/fabric) is a Data-Centric AI
development platform. YData Fabric provides all capabilities of
ydata-profiling in a hosted environment combined with a guided UI
experience.

[Fabric's Data Catalog](https://ydata.ai/products/data_catalog)
[Fabric's Data Catalog](https://ydata.ai/products/data_catalog),
a scalable and interactive version of ydata-profiling,
provides a comprehensive and powerful tool designed to enable data
professionals, including data scientists and data engineers, to manage
and understand data within an organization. The Data Catalog acts as a
59 changes: 59 additions & 0 deletions docs/features/pii_identification_management.md
@@ -0,0 +1,59 @@
# Personally identifiable information (PII) identification & management **

!!! info "** YData's Enterprise feature"

This feature is only available for users of [YData Fabric](https://ydata.ai).

[Sign-up Fabric community](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community) and
start your journey into **data management** with automated PII identification.

Personally identifiable information **(PII)** refers to any information that can be used to identify an individual.
This includes, but is not limited to, names, addresses, phone numbers, social security numbers, email addresses,
and financial information. Handling PII correctly is crucial in today's digital age, where data is extensively collected, stored,
and processed.

[YData Fabric Data Catalog](https://ydata.ai/products/data_catalog), a scalable and interactive version of ydata-profiling,
integrates into the data profiling experience an advanced machine learning solution based on a Named Entity Recognition (NER) model
combined with traditional rule-based pattern identification, allowing it to detect PII efficiently.

:fontawesome-brands-youtube:{ .youtube }
<a href="https://www.youtube.com/clip/UgkxBntXvAvCQ6I39Cp2KZRD4Ug9-NPzG1o1"><u>See Fabric's Data Catalog PII identification in action</u></a>.
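
The rule-based half of the approach described above can be illustrated with a minimal sketch. This is not Fabric's implementation: the patterns, the `flag_pii_columns` helper, and the `min_ratio` threshold are all hypothetical names chosen for illustration, and the NER model is out of scope here.

```python
import re

# Hypothetical, deliberately simple patterns. A production detector would
# need far stricter rules and validation than these.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def flag_pii_columns(columns, min_ratio=0.5):
    """Return {column_name: pii_label} for every column in which at least
    `min_ratio` of the non-empty values fully match one pattern."""
    flagged = {}
    for name, values in columns.items():
        non_empty = [v for v in values if v]
        if not non_empty:
            continue
        for label, pattern in PII_PATTERNS.items():
            hits = sum(1 for v in non_empty if pattern.fullmatch(v.strip()))
            if hits / len(non_empty) >= min_ratio:
                flagged[name] = label
                break  # keep the first matching label for this column
    return flagged


# Toy column-oriented data (assumed, not taken from any real dataset).
sample = {
    "contact": ["ana@example.com", "bob@example.org", ""],
    "ssn": ["123-45-6789", "987-65-4321"],
    "city": ["Lisbon", "Porto"],
}
result = flag_pii_columns(sample)
```

Patterns like these only catch values with a rigid format; free-text fields such as names or addresses are where an NER model would presumably be needed.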

## Why Fabric Catalog automated PII identification?

The relevance of automating the identification of PII lies in the need to protect individuals' privacy and comply
with various data protection regulations. Mishandling or unauthorized access to PII can lead to severe consequences
such as identity theft, financial fraud, and breaches of privacy. With the increasing volume of data generated, manual
identification of PII becomes impractical and error-prone.

Additionally, having a robust PII management solution is essential for organizations to establish and maintain
a secure approach to handling sensitive information, fostering trust and adhering to legal requirements.

## Why Fabric to manage dataset PII identification?

Besides automated PII identification, *Fabric Catalog* offers several key benefits in the context of data governance,
privacy compliance and overall data management, through automated data profiling and metadata management:

### Compliance with Privacy Regulations:
Many countries and regions have stringent data protection regulations (such as GDPR, CCPA, or HIPAA)
that require organizations to handle PII responsibly. A dedicated platform ensures that PII is correctly classified,
helping organizations comply with legal requirements and avoid potential penalties.

### Data Profiling for Accuracy:

Data profiling involves analyzing and understanding the structure and content of data. By incorporating data profiling
capabilities into the platform, organizations can ensure accurate identification and classification of PII.
This helps in maintaining the integrity of data and reduces the risk of misclassifications.

### Efficient Management of PII:
As the volume of data continues to grow, manually managing and editing PII classifications becomes impractical.
A platform streamlines this process, making it more efficient and reducing the likelihood of errors.
It allows organizations to keep track of PII across various datasets and systems.

### Facilitating Data Governance:

Data governance involves establishing policies and processes to ensure high data quality, security, and compliance.
A PII management solution enhances data governance efforts by providing a centralized hub for overseeing PII classifications,
metadata, and related policies.


4 changes: 4 additions & 0 deletions docs/features/sensitive_data.md
@@ -56,3 +56,7 @@ pd.read_csv("filename.csv", dtype={"phone": str})
Note that the type detection is hard. That is why
[visions](https://github.com/dylan-profiler/visions), a type system to
help developers solve these cases, was developed.
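
As a small illustration of why type detection is hard, the sketch below shows a phone number losing its leading zero once it is treated as an integer, which is the situation the `dtype={"phone": str}` hint in the hunk header guards against. The `naive_infer` helper is hypothetical and uses only the standard library; it is not how pandas or visions infer types.

```python
import csv
import io


def naive_infer(value: str):
    """Hypothetical naive rule: any all-digit string is treated as an integer."""
    return int(value) if value.isdigit() else value


# A tiny in-memory CSV standing in for "filename.csv".
rows = list(csv.DictReader(io.StringIO("name,phone\nAda,0612345678\n")))

raw_phone = rows[0]["phone"]             # csv keeps every field as text
inferred_phone = naive_infer(raw_phone)  # integer conversion drops the leading zero
```

Keeping the column as text preserves the original value, whereas the integer form can no longer be mapped back to it.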

## Automated PII classification & management

You can find more details about this feature [here](pii_identification_management.md).
24 changes: 13 additions & 11 deletions docs/index.md
@@ -9,14 +9,15 @@ understanding and preparing data for analysis in a single line of code! If you're

!!! tip "Advent of Code - Get featured on ydata-profiling"

*“I want to get into open source, but I don’t know how.”* - Does this sound familiar to you? Have you been wanting to get more involved with open-source software, but no one’s given you an entry point?
*“I want to get into open source, but I don’t know how.”* - Does this sound familiar to you? Have you been wanting to
get more involved with open-source software, but no one’s given you an entry point?

That's why we joined [The Advent of code this year](https://zilliz.com/advent-of-code). Contribute to ydata-profiling and win some 🐼🐼 swag!

How can you be part of it?

- Give us some love with a Github ⭐
- Write an article or create a tutorial like other [members the communit already did.](https://medium.com/@seckindinc/data-profiling-with-python-36497d3a1261)
- Write an article or create a tutorial like other [members of the community already did.](https://medium.com/@seckindinc/data-profiling-with-python-36497d3a1261)
- Feeling adventurous? Contribute with a PR. We have a list of [great issues to get you started.](https://github.com/ydataai/ydata-profiling/issues?q=label%3A%22getting+started+%E2%98%9D%22+)

![ydata-profiling report](_static/img/ydata-profiling.gif)
@@ -55,15 +56,16 @@ YData-profiling can be used to deliver a variety of different applications. The

Check out the [free Community Version](http://ydata.ai/register?utm_source=ydata-profiling&utm_medium=documentation&utm_campaign=YData%20Fabric%20Community).

| Features & functionalities | Description |
|------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|
| [Comparing datasets](features/comparing_datasets.md) | Comparing multiple version of the same dataset |
| [Profiling a Time-Series dataset](features/time_series_datasets.md) | Generating a report for a time-series dataset with a single line of code |
| [Profiling large datasets](features/big_data.md) | Tips on how to prepare data and configure `ydata-profiling` for working with large datasets |
| [Handling sensitive data](features/sensitive_data.md) | Generating reports which are mindful about sensitive data in the input dataset |
| [Dataset metadata and data dictionaries](features/metadata.md) | Complementing the report with dataset details and column-specific data dictionaries |
| [Customizing the report's appearance](features/custom_report_appearance.md ) | Changing the appearance of the report's page and of the contained visualizations |
| [Profiling Databases **](features/collaborative_data_profiling.md) | For a seamless profiling experience in your organization's databases, check [Fabric Data Catalog](https://ydata.ai/products/data_catalog), which allows to consume data from different types of storages such as RDBMs (Azure SQL, PostGreSQL, Oracle, etc.) and object storages (Google Cloud Storage, AWS S3, Snowflake, etc.), among others. |
| Features & functionalities | Description |
|----------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|
| [Comparing datasets](features/comparing_datasets.md) | Comparing multiple versions of the same dataset |
| [Profiling a Time-Series dataset](features/time_series_datasets.md) | Generating a report for a time-series dataset with a single line of code |
| [Profiling large datasets](features/big_data.md) | Tips on how to prepare data and configure `ydata-profiling` for working with large datasets |
| [Handling sensitive data](features/sensitive_data.md) | Generating reports which are mindful about sensitive data in the input dataset |
| [Dataset metadata and data dictionaries](features/metadata.md) | Complementing the report with dataset details and column-specific data dictionaries |
| [Customizing the report's appearance](features/custom_report_appearance.md ) | Changing the appearance of the report's page and of the contained visualizations |
| [Profiling Relational databases **](features/collaborative_data_profiling.md) | For a seamless profiling experience in your organization's databases, check [Fabric Data Catalog](https://ydata.ai/products/data_catalog), which can consume data from different types of storage such as RDBMSs (Azure SQL, PostgreSQL, Oracle, etc.) and object storages (Google Cloud Storage, AWS S3, Snowflake, etc.), among others. |
| [PII classification & management **](features/pii_identification_management.md ) | Automated PII classification and management through a UI experience |

### Tutorials

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -15,6 +15,7 @@ nav:
- Dataset metadata: 'features/metadata.md'
- Datasets catalog **: 'features/collaborative_data_profiling.md'
- Sensitive data: 'features/sensitive_data.md'
- Automated PII classification & management **: 'features/pii_identification_management.md'
- Time-series: 'features/time_series_datasets.md'
- Comparing datasets: 'features/comparing_datasets.md'
- Big data: 'features/big_data.md'