refined documentation
TristanThrush committed Sep 19, 2024
1 parent e4c5aee commit 9ed239a
Showing 2 changed files with 9 additions and 6 deletions.
README.md (7 changes: 3 additions & 4 deletions)
@@ -124,7 +124,7 @@ want our algorithm to tell you to train on 300 billion tokens of Wikipedia if you
 have 3 billion tokens, so we should also have the sampling distribution satisfy a
 per-text constraint that prevents the weights from being so high that you will have to
 duplicate data from any text domains. The following code projects our estimate to
-satisfy these constraints, where tau is the vector of per-domain thresholds:
+satisfy these constraints, where `tau` is the vector of per-domain thresholds:

 ```python
 from perplexity_correlations.projection import linear
@@ -161,7 +161,6 @@ it at all or include all of it). We can treat these include/don't include judgements as
 labels for each text:

 ```python
-# Compute the labels
 labels = []
 for weight in projected_estimate:
     labels.append("include" if weight > 0 else "exclude")
@@ -201,7 +200,7 @@ token budget.
 https://tristanthrush.github.io/perplexity-correlations/


-## Development Guidelines
+## Development guidelines

 Install the dev requirements and pre-commit hooks:

@@ -210,7 +209,7 @@ pip install -r requirements-dev.txt
 pre-commit install
 ```

-### Formatting and Linting
+### Formatting and linting

 This project uses [Black](https://black.readthedocs.io/en/stable/) for code formatting
 and [Flake8](https://flake8.pycqa.org/en/latest/) for linting. After installing the
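The README's projection snippet is cut off in this diff, so here is a minimal sketch of how the `linear` projection discussed in the first hunk above might be called. The call signature and the example values of `estimate` and `tau` are assumptions for illustration, not taken from the diff:

```python
import numpy as np

from perplexity_correlations.projection import linear

# Hypothetical inputs: one raw estimate entry and one threshold per text domain.
# tau caps each domain's sampling weight so that no domain has to be duplicated.
estimate = np.array([0.8, -0.2, 0.5, 0.1])
tau = np.array([0.4, 0.4, 0.4, 0.4])

# Assumed signature: project the raw estimate onto the constraint set
# defined by the per-domain thresholds.
projected_estimate = linear(estimate, tau)
```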
perplexity_correlations/estimation/estimation_functions.py (8 changes: 6 additions & 2 deletions)
@@ -242,8 +242,8 @@ def sign_sign(X, y):
     function which we do not have to know.
     This function uses the single-index model parameter estimator from
-    Thrush et al. (2024): https://arxiv.org/abs/2409.05816,
-    which is the U-statistic:
+    Thrush et al.'s (2024) (https://arxiv.org/abs/2409.05816) initial experiments,
+    although the preprint does not document it yet. It is the U-statistic:
     sign(y_g-y_k)*sign(x_g-x_k),
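To make the U-statistic in this docstring concrete, here is a minimal NumPy sketch written directly from the formula above. It is an illustration only, not the library's implementation, and it assumes `X` holds one row per model and one column per domain, with `y` a vector of the corresponding benchmark values:

```python
import numpy as np


def sign_sign_sketch(X, y):
    # Average sign(y_g - y_k) * sign(x_g - x_k) over all pairs g < k,
    # computed independently for each column (domain) of X.
    n = X.shape[0]
    total = np.zeros(X.shape[1])
    for g in range(n):
        for k in range(g + 1, n):
            total += np.sign(y[g] - y[k]) * np.sign(X[g] - X[k])
    return total / (n * (n - 1) / 2)
```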
@@ -323,6 +323,10 @@ def spearmanr(X, y):
     NOTE: This estimator is robust to outliers in X and y.
+    NOTE: The current version of the Thrush et al. paper does not provide the proof
+    that this estimator matches the ranks of the optimal weights in expectation,
+    but we have now proved this.
     Parameters
     ----------
     X : numpy.ndarray
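Judging from the name and the notes above, `spearmanr(X, y)` presumably computes a per-domain Spearman rank correlation with `y`; using only ranks is what makes it robust to outliers. The sketch below is an assumption about that behavior via `scipy.stats.spearmanr`, not the library's actual code:

```python
import numpy as np
from scipy import stats


def spearmanr_sketch(X, y):
    # Spearman rank correlation between each column of X and y.
    return np.array(
        [stats.spearmanr(X[:, j], y).correlation for j in range(X.shape[1])]
    )
```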
