refined documentation
TristanThrush committed Sep 19, 2024
1 parent e4c5aee commit 9ed239a
Showing 2 changed files with 9 additions and 6 deletions.
README.md (7 changes: 3 additions & 4 deletions)
@@ -124,7 +124,7 @@ want our algorithm to tell you to train on 300 billion tokens of Wikipedia if you
 have 3 billion tokens, so we should also have the sampling distribution satisfy a
 per-text constraint that prevents the weights from being so high that you will have to
 duplicate data from any text domains. The following code projects our estimate to
-satisfy these constraints, where tau is the vector of per-domain thresholds:
+satisfy these constraints, where `tau` is the vector of per-domain thresholds:

 ```python
 from perplexity_correlations.projection import linear
@@ -161,7 +161,6 @@ it at all or include all of it). We can treat these include/don't include judgements as
 labels for each text:

 ```python
-# Compute the labels
 labels = []
 for weight in projected_estimate:
     labels.append("include" if weight > 0 else "exclude")
@@ -201,7 +200,7 @@ token budget.
 https://tristanthrush.github.io/perplexity-correlations/


-## Development Guidelines
+## Development guidelines

 Install the dev requirements and pre-commit hooks:

@@ -210,7 +209,7 @@ pip install -r requirements-dev.txt
 pre-commit install
 ```

-### Formatting and Linting
+### Formatting and linting

 This project uses [Black](https://black.readthedocs.io/en/stable/) for code formatting
 and [Flake8](https://flake8.pycqa.org/en/latest/) for linting. After installing the
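The README's projection snippet is cut off in this diff, so here is a minimal sketch of how the `linear` projection discussed in the first hunk above might be called. The call signature and the example values of `estimate` and `tau` are assumptions for illustration, not taken from the diff:

```python
import numpy as np

from perplexity_correlations.projection import linear

# Hypothetical inputs: one raw estimate entry and one threshold per text domain.
# tau caps each domain's sampling weight so that no domain has to be duplicated.
estimate = np.array([0.8, -0.2, 0.5, 0.1])
tau = np.array([0.4, 0.4, 0.4, 0.4])

# Assumed signature: project the raw estimate onto the constraint set
# defined by the per-domain thresholds.
projected_estimate = linear(estimate, tau)
```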
perplexity_correlations/estimation/estimation_functions.py (8 changes: 6 additions & 2 deletions)
@@ -242,8 +242,8 @@ def sign_sign(X, y):
     function which we do not have to know.
     This function uses the single-index model parameter estimator from
-    Thrush et al. (2024): https://arxiv.org/abs/2409.05816,
-    which is the U-statistic:
+    Thrush et al.'s (2024) (https://arxiv.org/abs/2409.05816) initial experiments,
+    although the preprint does not document it yet. It is the U-statistic:
     sign(y_g-y_k)*sign(x_g-x_k),
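To make the U-statistic in this docstring concrete, here is a minimal NumPy sketch written directly from the formula above. It is an illustration only, not the library's implementation, and it assumes `X` holds one row per model and one column per domain, with `y` a vector of the corresponding benchmark values:

```python
import numpy as np


def sign_sign_sketch(X, y):
    # Average sign(y_g - y_k) * sign(x_g - x_k) over all pairs g < k,
    # computed independently for each column (domain) of X.
    n = X.shape[0]
    total = np.zeros(X.shape[1])
    for g in range(n):
        for k in range(g + 1, n):
            total += np.sign(y[g] - y[k]) * np.sign(X[g] - X[k])
    return total / (n * (n - 1) / 2)
```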
@@ -323,6 +323,10 @@ def spearmanr(X, y):
     NOTE: This estimator is robust to outliers in X and y.
+    NOTE: The current version of the Thrush et al. paper does not provide the proof
+    that this estimator matches the ranks of the optimal weights in expectation,
+    but we have now proved this.
     Parameters
     ----------
     X : numpy.ndarray
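Judging from the name and the notes above, `spearmanr(X, y)` presumably computes a per-domain Spearman rank correlation with `y`; using only ranks is what makes it robust to outliers. The sketch below is an assumption about that behavior via `scipy.stats.spearmanr`, not the library's actual code:

```python
import numpy as np
from scipy import stats


def spearmanr_sketch(X, y):
    # Spearman rank correlation between each column of X and y.
    return np.array(
        [stats.spearmanr(X[:, j], y).correlation for j in range(X.shape[1])]
    )
```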
