
Docs and cleanup #492

Merged
merged 15 commits on Jan 26, 2024
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -20,6 +20,12 @@
- Fixed bug with checking for converged values in semivalues
[PR #341](https://github.com/appliedAI-Initiative/pyDVL/pull/341)

### Docs

- Add applications of data valuation section, display examples more prominently,
  make all sections visible in the table of contents, and use mkdocs-material cards
  on the home page [PR #492](https://github.com/aai-institute/pyDVL/pull/492)

## 0.8.0 - 🆕 New interfaces, scaling computation, bug fixes and improvements 🎁

### Added
24 changes: 12 additions & 12 deletions CONTRIBUTING.md
@@ -23,7 +23,7 @@ to make your life easier.

Run the following to set up the pre-commit git hook to run before pushes:

```shell script
```shell
pre-commit install --hook-type pre-push
```

@@ -32,15 +32,15 @@ pre-commit install --hook-type pre-push
We strongly suggest using some form of virtual environment for working with the
library. E.g. with venv:

```shell script
```shell
python -m venv ./venv
. venv/bin/activate # `venv\Scripts\activate` in windows
pip install -r requirements-dev.txt -r requirements-docs.txt
```

With conda:

```shell script
```shell
conda create -n pydvl python=3.8
conda activate pydvl
pip install -r requirements-dev.txt -r requirements-docs.txt
@@ -49,7 +49,7 @@ pip install -r requirements-dev.txt -r requirements-docs.txt
A very convenient way of working with your library during development is to
install it in editable mode into your environment by running

```shell script
```shell
pip install -e .
```

@@ -58,7 +58,7 @@ suite) [pandoc](https://pandoc.org/) is required. Except for OSX, it should be installed
automatically as a dependency with `requirements-docs.txt`. Under OSX you can
install pandoc (you'll need at least version 2.11) with:

```shell script
```shell
brew install pandoc
```

@@ -152,11 +152,11 @@ Two important markers are:
To test the notebooks separately, run (see [below](#notebooks) for details):

```shell
tox -e tests -- notebooks/
tox -e notebook-tests
```

To create a package locally, run:
```shell script
```shell
python setup.py sdist bdist_wheel
```

@@ -517,13 +517,13 @@ Then, a new release can be created using the script
`bumpversion` automatically derive the next release version by bumping the patch
part):

```shell script
```shell
build_scripts/release-version.sh 0.1.6
```

To find out how to use the script, pass the `-h` or `--help` flags:

```shell script
```shell
build_scripts/release-version.sh --help
```

@@ -549,7 +549,7 @@ create a new release manually by following these steps:
2. When ready to release: From the develop branch create the release branch and
perform release activities (update changelog, news, ...). For your own
convenience, define an env variable for the release version
```shell script
```shell
export RELEASE_VERSION="vX.Y.Z"
git checkout develop
git branch release/${RELEASE_VERSION} && git checkout release/${RELEASE_VERSION}
@@ -560,7 +560,7 @@ create a new release manually by following these steps:
(the `release` part is ignored but required by bumpversion :rolling_eyes:).
4. Merge the release branch into `master`, tag the merge commit, and push back to the repo.
The CI pipeline publishes the package based on the tagged commit.
```shell script
```shell
git checkout master
git merge --no-ff release/${RELEASE_VERSION}
git tag -a ${RELEASE_VERSION} -m"Release ${RELEASE_VERSION}"
@@ -571,7 +571,7 @@ create a new release manually by following these steps:
always strictly more recent than the last published release version from
`master`.
6. Merge the release branch into `develop`:
```shell script
```shell
git checkout develop
git merge --no-ff release/${RELEASE_VERSION}
git push origin develop
38 changes: 38 additions & 0 deletions build_scripts/copy_contributing_guide.py
@@ -0,0 +1,38 @@
import logging
import os
from pathlib import Path

import mkdocs.plugins

logger = logging.getLogger(__name__)

root_dir = Path(__file__).parent.parent
docs_dir = root_dir / "docs"
contributing_file = root_dir / "CONTRIBUTING.md"
target_filepath = docs_dir / contributing_file.name


@mkdocs.plugins.event_priority(100)
def on_pre_build(config):
    logger.info("Temporarily copying contributing guide to docs directory")
    try:
        if os.path.getmtime(contributing_file) <= os.path.getmtime(target_filepath):
            logger.info(
                f"Contributing guide '{os.fspath(contributing_file)}' hasn't been updated, skipping."
            )
            return
    except FileNotFoundError:
        pass
    logger.info(
        f"Creating symbolic link for '{os.fspath(contributing_file)}' "
        f"at '{os.fspath(target_filepath)}'"
    )
    target_filepath.symlink_to(contributing_file)

    logger.info("Finished copying contributing guide to docs directory")


@mkdocs.plugins.event_priority(-100)
def on_shutdown():
    logger.info("Removing temporary contributing guide in docs directory")
    target_filepath.unlink()
1 change: 1 addition & 0 deletions docs/css/extra.css
@@ -69,6 +69,7 @@ a.autorefs-external:hover::after {
.nt-card-image:focus {
filter: invert(32%) sepia(93%) saturate(1535%) hue-rotate(220deg) brightness(102%) contrast(99%);
}

.md-header__button.md-logo {
padding: 0;
}
22 changes: 22 additions & 0 deletions docs/css/grid-cards.css
@@ -0,0 +1,22 @@
/* Shadow and Hover */
.grid.cards > ul > li {
box-shadow: 0 2px 2px 0 rgb(0 0 0 / 14%), 0 3px 1px -2px rgb(0 0 0 / 20%), 0 1px 5px 0 rgb(0 0 0 / 12%);

&:hover {
transform: scale(1.05);
z-index: 999;
background-color: rgba(0, 0, 0, 0.05);
}
}

[data-md-color-scheme="slate"] {
.grid.cards > ul > li {
box-shadow: 0 2px 2px 0 rgb(4 40 33 / 14%), 0 3px 1px -2px rgb(40 86 94 / 47%), 0 1px 5px 0 rgb(139 252 255 / 64%);

&:hover {
transform: scale(1.05);
z-index: 999;
background-color: rgba(139, 252, 255, 0.05);
}
}
}
1 change: 0 additions & 1 deletion docs/css/neoteroi.css

This file was deleted.

8 changes: 4 additions & 4 deletions docs/getting-started/first-steps.md
@@ -1,11 +1,11 @@
---
title: Getting Started
title: First Steps
alias:
name: getting-started
text: Getting Started
name: first-steps
text: First Steps
---

# Getting started
# First Steps

!!! Warning
Make sure you have read [[installation]] before using the library.
43 changes: 28 additions & 15 deletions docs/index.md
@@ -9,26 +9,39 @@ It runs most of them in parallel either locally or in a cluster and supports
distributed caching of results.

If you're a first time user of pyDVL, we recommend you to go through the
[[getting-started]] and [[installation]] guides.
[[installation]] and [[first-steps]] guides in the Getting Started section.

::cards:: cols=2
<div class="grid cards" markdown>

- title: Installation
content: Steps to install and requirements
url: getting-started/installation.md
- :fontawesome-solid-toolbox:{ .lg .middle } __Installation__

---
Steps to install and requirements

[[installation|:octicons-arrow-right-24: Installation]]

- :fontawesome-solid-scale-unbalanced:{ .lg .middle } __Data valuation__

---

- title: Data valuation
content: >
Basics of data valuation and description of the main algorithms
url: value/

- title: Influence Function
content: >
[[data-valuation|:octicons-arrow-right-24: Data Valuation]]

- :fontawesome-solid-scale-unbalanced-flip:{ .lg .middle } __Influence Function__

---

An introduction to the influence function and its computation with pyDVL
url: influence/

- title: Browse the API
content: Full documentation of the API
url: api/pydvl/
[[influence-values|:octicons-arrow-right-24: Influence Values]]

- :fontawesome-regular-file-code:{ .lg .middle } __API Reference__

---

Full documentation of the API

[:octicons-arrow-right-24: API Reference](api/pydvl/)

::/cards::
</div>
91 changes: 91 additions & 0 deletions docs/value/applications.md
@@ -0,0 +1,91 @@
---
title: Applications of data valuation
---

# Applications of data valuation

Data valuation methods hold promise for improving various aspects
of data engineering and machine learning workflows. When applied judiciously,
these methods can enhance data quality, model performance, and cost-effectiveness.

However, the results can be inconsistent. Values depend strongly on the
training procedure and the performance metric used. For instance, accuracy is
a poor metric for imbalanced sets, and this has a stark effect on data values.
Some models exhibit high variance in certain regimes, and this again has a
detrimental effect on values.

While still an evolving field with methods requiring careful use, data valuation can
be applied across a wide range of data engineering tasks. For a comprehensive
overview, along with concrete examples, please refer to the [Transferlab blog
post]({{ transferlab.website }}blog/data-valuation-applications/) on this topic.

## Data Engineering

Judicious use of data valuation techniques can enhance data quality, model
performance, and the cost-effectiveness of data workflows. Some of the most
promising applications in data engineering include (a short code sketch
follows the list):

- Removing low-value data points can reduce noise and increase model performance.
However, care is needed to avoid overfitting when iteratively retraining on pruned datasets.
- Pruning redundant samples enables more efficient training of large models.
Value-based metrics can determine which data to discard for optimal efficiency gains.
- Computing value scores for unlabeled data points supports efficient active learning.
High-value points can be prioritized for labeling to maximize gains in model performance.
- Analyzing high- and low-value data provides insights to guide targeted data collection
and improve upstream data processes. Low-value points may reveal data issues to address.
- Data value metrics can also help identify irrelevant or duplicated data
when evaluating offerings from data providers.
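
To make the first point concrete, here is a minimal sketch of pruning the
lowest-valued training points with pyDVL and retraining. It assumes the
0.8-era API (`Dataset.from_sklearn`, `Utility`, `compute_shapley_values`,
`ShapleyMode`, `MaxUpdates`); the dataset, model, stopping criterion, and the
10% threshold are illustrative choices, and exact names or signatures may
differ in newer releases.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Utility
from pydvl.value import MaxUpdates, ShapleyMode, compute_shapley_values

# Wrap a scikit-learn dataset and model into pyDVL's abstractions.
data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
utility = Utility(LogisticRegression(max_iter=1000), data, "accuracy")

# Truncated Monte Carlo Shapley values; the stopping criterion is illustrative.
values = compute_shapley_values(
    utility, mode=ShapleyMode.TruncatedMontecarlo, done=MaxUpdates(100)
)

# Drop the 10% lowest-valued training points and retrain on the rest.
values.sort()  # assumed to sort ascending by value
keep = values.indices[len(values) // 10 :]
model = LogisticRegression(max_iter=1000).fit(data.x_train[keep], data.y_train[keep])
```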

## Model development

Data valuation techniques can provide insights for model debugging and interpretation.
Some useful applications include (a short ranking sketch follows the list):

- Interpretation and debugging: Analyzing the most or least valuable samples
for a class can reveal cases where the model relies on confounding features
instead of true signal. Investigating influential points for misclassified examples
highlights limitations to address.
- Sensitivity/robustness analysis: Prior work shows removing a small fraction
of highly influential data can completely flip model conclusions.
This reveals potential issues with the modeling approach, data collection process,
or intrinsic difficulty of the problem that require further inspection.
Robust models require many points removed before conclusions meaningfully shift.
High sensitivity means conclusions heavily depend on small subsets of data,
indicating deeper problems to resolve.
- Monitoring changes in data value during training provides insights into
model convergence and overfitting.
- Continual learning: in order to avoid forgetting when training on new data,
a subset of previously seen data is presented again. Data valuation helps
in the selection of highly influential samples.
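
As a sketch of the interpretation and debugging point above, and reusing
`values` and `data` from the previous example, one can rank training points by
value and inspect the extremes, which often surface mislabeled or atypical
samples. The attribute names below again assume the 0.8-era `ValuationResult`.

```python
import numpy as np

# Positions of the lowest- and highest-valued points (ascending order).
order = np.argsort(values.values)
lowest, highest = order[:5], order[-5:]

for pos in lowest:
    idx = values.indices[pos]
    print(f"low  value={values.values[pos]:.4f}  label={data.y_train[idx]}")
for pos in highest:
    idx = values.indices[pos]
    print(f"high value={values.values[pos]:.4f}  label={data.y_train[idx]}")
```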

## Attacks

Data valuation techniques have applications in detecting data manipulation and contamination:

- Watermark removal: Points with low value on a correct validation set may be
part of a watermarking mechanism. Removing them can strip a model of its fingerprints.
- Poisoning attacks: Influential points can be shifted to induce large changes
in model estimators. However, the feasibility of such attacks is limited,
and their value for adversarial training is unclear.

Overall, while data valuation techniques show promise for identifying anomalous
or manipulated data, more research is needed to develop robust methods suited
for security applications.

## Data markets

Additionally, one of the motivating applications for the whole field is that of
data markets: a marketplace where data owners can sell their data to interested
parties. In this setting, data valuation can be a key component in determining the
price of data. Market pricing depends on the value addition for buyers
(e.g. improved model performance) and costs/privacy concerns for sellers.

Game-theoretic valuation methods like Shapley values can help assign fair prices,
but have limitations around handling duplicates or adversarial data.
Model-free methods like LAVA [@just_lava_2023] and CRAIG are
particularly well suited for this, as they use the Wasserstein distance between
a vendor's data and the buyer's to determine the value of the former.
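
As a toy illustration of the distance-based idea (not of LAVA itself, which
uses a class-wise Wasserstein distance over features and labels), one can
compare a single feature of a vendor's data against the buyer's reference data
with SciPy; the synthetic data below is purely hypothetical.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(42)

# Hypothetical 1-d feature from the buyer's reference data and two vendors.
buyer = rng.normal(loc=0.0, scale=1.0, size=1000)
vendor_a = rng.normal(loc=0.1, scale=1.0, size=1000)  # close to the buyer's data
vendor_b = rng.normal(loc=2.0, scale=1.5, size=1000)  # shifted away from it

# A smaller distance means the vendor's data looks more like the buyer's,
# which, all else being equal, would suggest a higher value in a marketplace.
print(wasserstein_distance(buyer, vendor_a))  # small
print(wasserstein_distance(buyer, vendor_b))  # larger
```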

However, this is a complex problem that also faces mundane practical obstacles, such as
the fact that data owners may not wish to disclose their data for valuation.
27 changes: 0 additions & 27 deletions docs/value/index.md
@@ -83,33 +83,6 @@ among all samples, failing to identify repeated ones as unnecessary, with e.g. a
zero value.


## Applications of data valuation

Many applications are touted for data valuation, but the results can be
inconsistent. Values have a strong dependency on the training procedure and the
performance metric used. For instance, accuracy is a poor metric for imbalanced
sets and this has a stark effect on data values. Some models exhibit great
variance in some regimes and this again has a detrimental effect on values.

Nevertheless, some of the most promising applications are:

* Cleaning of corrupted data.
* Pruning unnecessary or irrelevant data.
* Repairing mislabeled data.
* Guiding data acquisition and annotation (active learning).
* Anomaly detection and model debugging and interpretation.

Additionally, one of the motivating applications for the whole field is that of
data markets: a marketplace where data owners can sell their data to interested
parties. In this setting, data valuation can be key component to determine the
price of data. Algorithm-agnostic methods like LAVA [@just_lava_2023] are
particularly well suited for this, as they use the Wasserstein distance between
a vendor's data and the buyer's to determine the value of the former.

However, this is a complex problem which can face practical banal problems like
the fact that data owners may not wish to disclose their data for valuation.


## Computing data values

Using pyDVL to compute data values is a simple process that can be broken down