
Docs and cleanup #492

Merged
merged 15 commits on Jan 26, 2024
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -20,6 +20,12 @@
- Fixed bug with checking for converged values in semivalues
[PR #341](https://github.com/appliedAI-Initiative/pyDVL/pull/341)

### Docs

- Add applications of data valuation section, display examples more prominently,
  make all sections visible in the table of contents, and use mkdocs-material cards
  on the home page [PR #492](https://github.com/aai-institute/pyDVL/pull/492)

## 0.8.0 - 🆕 New interfaces, scaling computation, bug fixes and improvements 🎁

### Added
24 changes: 12 additions & 12 deletions CONTRIBUTING.md
@@ -23,7 +23,7 @@ to make your life easier.

Run the following to set up the pre-commit git hook to run before pushes:

```shell script
```shell
pre-commit install --hook-type pre-push
```

@@ -32,15 +32,15 @@ pre-commit install --hook-type pre-push
We strongly suggest using some form of virtual environment for working with the
library. E.g. with venv:

```shell script
```shell
python -m venv ./venv
. venv/bin/activate # `venv\Scripts\activate` in windows
pip install -r requirements-dev.txt -r requirements-docs.txt
```

With conda:

```shell script
```shell
conda create -n pydvl python=3.8
conda activate pydvl
pip install -r requirements-dev.txt -r requirements-docs.txt
@@ -49,7 +49,7 @@ pip install -r requirements-dev.txt -r requirements-docs.txt
A very convenient way of working with your library during development is to
install it in editable mode into your environment by running

```shell script
```shell
pip install -e .
```

@@ -58,7 +58,7 @@ suite) [pandoc](https://pandoc.org/) is required. Except for OSX, it should be installed
automatically as a dependency with `requirements-docs.txt`. Under OSX you can
install pandoc (you'll need at least version 2.11) with:

```shell script
```shell
brew install pandoc
```

@@ -152,11 +152,11 @@ Two important markers are:
To test the notebooks separately, run (see [below](#notebooks) for details):

```shell
tox -e tests -- notebooks/
tox -e notebook-tests
```

To create a package locally, run:
```shell script
```shell
python setup.py sdist bdist_wheel
```

@@ -517,13 +517,13 @@ Then, a new release can be created using the script
`bumpversion` automatically derive the next release version by bumping the patch
part):

```shell script
```shell
build_scripts/release-version.sh 0.1.6
```

To find out how to use the script, pass the `-h` or `--help` flags:

```shell script
```shell
build_scripts/release-version.sh --help
```

@@ -549,7 +549,7 @@ create a new release manually by following these steps:
2. When ready to release: From the develop branch create the release branch and
perform release activities (update changelog, news, ...). For your own
convenience, define an env variable for the release version
```shell script
```shell
export RELEASE_VERSION="vX.Y.Z"
git checkout develop
git branch release/${RELEASE_VERSION} && git checkout release/${RELEASE_VERSION}
@@ -560,7 +560,7 @@ create a new release manually by following these steps:
(the `release` part is ignored but required by bumpversion :rolling_eyes:).
4. Merge the release branch into `master`, tag the merge commit, and push back to the repo.
The CI pipeline publishes the package based on the tagged commit.
```shell script
```shell
git checkout master
git merge --no-ff release/${RELEASE_VERSION}
git tag -a ${RELEASE_VERSION} -m"Release ${RELEASE_VERSION}"
@@ -571,7 +571,7 @@ create a new release manually by following these steps:
always strictly more recent than the last published release version from
`master`.
6. Merge the release branch into `develop`:
```shell script
```shell
git checkout develop
git merge --no-ff release/${RELEASE_VERSION}
git push origin develop
38 changes: 38 additions & 0 deletions build_scripts/copy_contributing_guide.py
@@ -0,0 +1,38 @@
import logging
import os
from pathlib import Path

import mkdocs.plugins

logger = logging.getLogger(__name__)

root_dir = Path(__file__).parent.parent
docs_dir = root_dir / "docs"
contributing_file = root_dir / "CONTRIBUTING.md"
target_filepath = docs_dir / contributing_file.name


@mkdocs.plugins.event_priority(100)
def on_pre_build(config):
    logger.info("Temporarily copying contributing guide to docs directory")
    try:
        if os.path.getmtime(contributing_file) <= os.path.getmtime(target_filepath):
            logger.info(
                f"Contributing guide '{os.fspath(contributing_file)}' hasn't been updated, skipping."
            )
            return
    except FileNotFoundError:
        pass
    logger.info(
        f"Creating symbolic link for '{os.fspath(contributing_file)}' "
        f"at '{os.fspath(target_filepath)}'"
    )
    target_filepath.symlink_to(contributing_file)

    logger.info("Finished copying contributing guide to docs directory")


@mkdocs.plugins.event_priority(-100)
def on_shutdown():
    logger.info("Removing temporary contributing guide in docs directory")
    target_filepath.unlink()
1 change: 1 addition & 0 deletions docs/css/extra.css
@@ -69,6 +69,7 @@ a.autorefs-external:hover::after {
.nt-card-image:focus {
filter: invert(32%) sepia(93%) saturate(1535%) hue-rotate(220deg) brightness(102%) contrast(99%);
}

.md-header__button.md-logo {
padding: 0;
}
22 changes: 22 additions & 0 deletions docs/css/grid-cards.css
@@ -0,0 +1,22 @@
/* Shadow and Hover */
.grid.cards > ul > li {
box-shadow: 0 2px 2px 0 rgb(0 0 0 / 14%), 0 3px 1px -2px rgb(0 0 0 / 20%), 0 1px 5px 0 rgb(0 0 0 / 12%);

&:hover {
transform: scale(1.05);
z-index: 999;
background-color: rgba(0, 0, 0, 0.05);
}
}

[data-md-color-scheme="slate"] {
.grid.cards > ul > li {
box-shadow: 0 2px 2px 0 rgb(4 40 33 / 14%), 0 3px 1px -2px rgb(40 86 94 / 47%), 0 1px 5px 0 rgb(139 252 255 / 64%);

&:hover {
transform: scale(1.05);
z-index: 999;
background-color: rgba(139, 252, 255, 0.05);
}
}
}
1 change: 0 additions & 1 deletion docs/css/neoteroi.css

This file was deleted.

8 changes: 4 additions & 4 deletions docs/getting-started/first-steps.md
@@ -1,11 +1,11 @@
---
title: Getting Started
title: First Steps
alias:
name: getting-started
text: Getting Started
name: first-steps
text: First Steps
---

# Getting started
# First Steps

!!! Warning
Make sure you have read [[installation]] before using the library.
43 changes: 28 additions & 15 deletions docs/index.md
@@ -9,26 +9,39 @@ It runs most of them in parallel either locally or in a cluster and supports
distributed caching of results.

If you're a first time user of pyDVL, we recommend you to go through the
[[getting-started]] and [[installation]] guides.
[[installation]] and [[first-steps]] guides in the Getting Started section.

::cards:: cols=2
<div class="grid cards" markdown>

- title: Installation
content: Steps to install and requirements
url: getting-started/installation.md
- :fontawesome-solid-toolbox:{ .lg .middle } __Installation__

---
Steps to install and requirements

[[installation|:octicons-arrow-right-24: Installation]]

- :fontawesome-solid-scale-unbalanced:{ .lg .middle } __Data valuation__

---

- title: Data valuation
content: >
Basics of data valuation and description of the main algorithms
url: value/

- title: Influence Function
content: >
[[data-valuation|:octicons-arrow-right-24: Data Valuation]]

- :fontawesome-solid-scale-unbalanced-flip:{ .lg .middle } __Influence Function__

---

An introduction to the influence function and its computation with pyDVL
url: influence/

- title: Browse the API
content: Full documentation of the API
url: api/pydvl/
[[influence-values|:octicons-arrow-right-24: Influence Values]]

- :fontawesome-regular-file-code:{ .lg .middle } __API Reference__

---

Full documentation of the API

[:octicons-arrow-right-24: API Reference](api/pydvl/)

::/cards::
</div>
91 changes: 91 additions & 0 deletions docs/value/applications.md
@@ -0,0 +1,91 @@
---
title: Applications of data valuation
---

# Applications of data valuation

Data valuation methods hold promise for improving various aspects
of data engineering and machine learning workflows. When applied judiciously,
these methods can enhance data quality, model performance, and cost-effectiveness.

However, the results can be inconsistent. Values depend strongly on the
training procedure and the performance metric used. For instance, accuracy is
a poor metric for imbalanced sets, and this has a stark effect on data values.
Some models exhibit high variance in certain regimes, and this again has a
detrimental effect on values.

While still an evolving field with methods requiring careful use, data valuation can
be applied across a wide range of data engineering tasks. For a comprehensive
overview, along with concrete examples, please refer to the [Transferlab blog
post]({{ transferlab.website }}blog/data-valuation-applications/) on this topic.

## Data Engineering

Judicious use of data valuation techniques can enhance data quality, model
performance, and the cost-effectiveness of data workflows. Some of the most
promising applications in data engineering include (a short code sketch
follows the list):

- Removing low-value data points can reduce noise and increase model performance.
However, care is needed to avoid overfitting when iteratively retraining on pruned datasets.
- Pruning redundant samples enables more efficient training of large models.
Value-based metrics can determine which data to discard for optimal efficiency gains.
- Computing value scores for unlabeled data points supports efficient active learning.
High-value points can be prioritized for labeling to maximize gains in model performance.
- Analyzing high- and low-value data provides insights to guide targeted data collection
and improve upstream data processes. Low-value points may reveal data issues to address.
- Data value metrics can also help identify irrelevant or duplicated data
when evaluating offerings from data providers.
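
To make the first point concrete, here is a minimal sketch of pruning the
lowest-valued training points with pyDVL and retraining. It assumes the
0.8-era API (`Dataset.from_sklearn`, `Utility`, `compute_shapley_values`,
`ShapleyMode`, `MaxUpdates`); the dataset, model, stopping criterion, and the
10% threshold are illustrative choices, and exact names or signatures may
differ in newer releases.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import MaxUpdates, ShapleyMode, compute_shapley_values

# Wrap a scikit-learn dataset and model into pyDVL's abstractions.
data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
utility = Utility(LogisticRegression(max_iter=1000), data, "accuracy")

# Truncated Monte Carlo Shapley values; the stopping criterion is illustrative.
values = compute_shapley_values(
    utility, mode=ShapleyMode.TruncatedMontecarlo, done=MaxUpdates(100)
)

# Drop the 10% lowest-valued training points and retrain on the rest.
values.sort()  # assumed to sort ascending by value
keep = values.indices[len(values) // 10 :]
model = LogisticRegression(max_iter=1000).fit(data.x_train[keep], data.y_train[keep])
```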

## Model development

Data valuation techniques can provide insights for model debugging and interpretation.
Some useful applications include (a short ranking sketch follows the list):

- Interpretation and debugging: Analyzing the most or least valuable samples
for a class can reveal cases where the model relies on confounding features
instead of true signal. Investigating influential points for misclassified examples
highlights limitations to address.
- Sensitivity/robustness analysis: Prior work shows removing a small fraction
of highly influential data can completely flip model conclusions.
This reveals potential issues with the modeling approach, data collection process,
or intrinsic difficulty of the problem that require further inspection.
Robust models require many points removed before conclusions meaningfully shift.
High sensitivity means conclusions heavily depend on small subsets of data,
indicating deeper problems to resolve.
- Monitoring changes in data value during training provides insights into
model convergence and overfitting.
- Continual learning: in order to avoid forgetting when training on new data,
a subset of previously seen data is presented again. Data valuation helps
in the selection of highly influential samples.
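
As a sketch of the interpretation and debugging point above, and reusing
`values` and `data` from the previous example, one can rank training points by
value and inspect the extremes, which often surface mislabeled or atypical
samples. The attribute names below again assume the 0.8-era `ValuationResult`.

```python
import numpy as np

# Positions of the lowest- and highest-valued points (ascending order).
order = np.argsort(values.values)
lowest, highest = order[:5], order[-5:]

for pos in lowest:
    idx = values.indices[pos]
    print(f"low  value={values.values[pos]:.4f}  label={data.y_train[idx]}")
for pos in highest:
    idx = values.indices[pos]
    print(f"high value={values.values[pos]:.4f}  label={data.y_train[idx]}")
```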

## Attacks

Data valuation techniques have applications in detecting data manipulation and contamination:

- Watermark removal: Points with low value on a correct validation set may be
part of a watermarking mechanism. Removing them can strip a model of its fingerprints.
- Poisoning attacks: Influential points can be shifted to induce large changes
in model estimators. However, the feasibility of such attacks is limited,
and their value for adversarial training is unclear.

Overall, while data valuation techniques show promise for identifying anomalous
or manipulated data, more research is needed to develop robust methods suited
for security applications.

## Data markets

Additionally, one of the motivating applications for the whole field is that of
data markets: a marketplace where data owners can sell their data to interested
parties. In this setting, data valuation can be a key component in determining the
price of data. Market pricing depends on the value addition for buyers
(e.g. improved model performance) and costs/privacy concerns for sellers.

Game-theoretic valuation methods like Shapley values can help assign fair prices,
but have limitations around handling duplicates or adversarial data.
Model-free methods like LAVA [@just_lava_2023] and CRAIG are
particularly well suited for this, as they use the Wasserstein distance between
a vendor's data and the buyer's to determine the value of the former.
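
As a toy illustration of the distance-based idea (not of LAVA itself, which
uses a class-wise Wasserstein distance over features and labels), one can
compare a single feature of a vendor's data against the buyer's reference data
with SciPy; the synthetic data below is purely hypothetical.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(42)

# Hypothetical 1-d feature from the buyer's reference data and two vendors.
buyer = rng.normal(loc=0.0, scale=1.0, size=1000)
vendor_a = rng.normal(loc=0.1, scale=1.0, size=1000)  # close to the buyer's data
vendor_b = rng.normal(loc=2.0, scale=1.5, size=1000)  # shifted away from it

# A smaller distance means the vendor's data looks more like the buyer's,
# which, all else being equal, would suggest a higher value in a marketplace.
print(wasserstein_distance(buyer, vendor_a))  # small
print(wasserstein_distance(buyer, vendor_b))  # larger
```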

However, this is a complex problem that also faces mundane practical obstacles, such as
the fact that data owners may not wish to disclose their data for valuation.
27 changes: 0 additions & 27 deletions docs/value/index.md
@@ -83,33 +83,6 @@ among all samples, failing to identify repeated ones as unnecessary, with e.g. a
zero value.


## Applications of data valuation

Many applications are touted for data valuation, but the results can be
inconsistent. Values have a strong dependency on the training procedure and the
performance metric used. For instance, accuracy is a poor metric for imbalanced
sets and this has a stark effect on data values. Some models exhibit great
variance in some regimes and this again has a detrimental effect on values.

Nevertheless, some of the most promising applications are:

* Cleaning of corrupted data.
* Pruning unnecessary or irrelevant data.
* Repairing mislabeled data.
* Guiding data acquisition and annotation (active learning).
* Anomaly detection and model debugging and interpretation.

Additionally, one of the motivating applications for the whole field is that of
data markets: a marketplace where data owners can sell their data to interested
parties. In this setting, data valuation can be key component to determine the
price of data. Algorithm-agnostic methods like LAVA [@just_lava_2023] are
particularly well suited for this, as they use the Wasserstein distance between
a vendor's data and the buyer's to determine the value of the former.

However, this is a complex problem which can face practical banal problems like
the fact that data owners may not wish to disclose their data for valuation.


## Computing data values

Using pyDVL to compute data values is a simple process that can be broken down