Scripts to compute statistics #22

Merged
20 commits merged on Aug 18, 2023
24 changes: 24 additions & 0 deletions .github/workflows/CI.yml
@@ -8,6 +8,9 @@ on:
tags:
- '*'
pull_request:
branches:
- main
- master
workflow_dispatch:


@@ -16,6 +19,21 @@ permissions:

jobs:

info:
runs-on: ubuntu-latest
steps:
- name: Echo environment variables
run: |
echo "reference: ${{ github.ref }}"
echo "event name: ${{ github.event_name }}"
echo "run tests: ${{ github.event_name == 'pull_request' || github.event_name == 'push' }}"
echo "is main: ${{ github.ref == 'refs/heads/main' || github.ref == 'refs/heads/master' }}"
echo "is release: ${{ startsWith(github.ref, 'refs/tags/') }}"
echo "commit: ${{ github.sha }}"
echo "-------------------------------------------------------------------------"
echo "Full PR details:"
echo "${{ toJson(github.event.pull_request) }}"

tests:
runs-on: ubuntu-latest
env:
@@ -93,7 +111,10 @@ jobs:
source .venv/bin/activate
pip uninstall -y dolma



build-linux:
if: ${{ github.ref == 'refs/heads/main' || github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/tags/') }}
runs-on: ubuntu-latest
env:
CC: gcc-11
@@ -128,6 +149,7 @@ jobs:
path: dist

build-windows:
if: ${{ github.ref == 'refs/heads/main' || github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/tags/') }}
runs-on: windows-latest
strategy:
matrix:
@@ -151,6 +173,7 @@
path: dist

build-macos:
if: ${{ github.ref == 'refs/heads/main' || github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/tags/') }}
runs-on: macos-latest
strategy:
matrix:
@@ -174,6 +197,7 @@

sdist:
runs-on: ubuntu-latest
if: ${{ github.ref == 'refs/heads/main' || github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/tags/') }}
steps:
- uses: actions/checkout@v3
- name: Build sdist
21 changes: 12 additions & 9 deletions README.md
@@ -1,12 +1,13 @@
# Dolma
<img alt="Dolma's official logo. It's dolma written in yellow, round lowercase letters over a blue background." src="https://github.com/allenai/dolma/blob/main/res/logo.png?raw=true" width="100%">

*Data to feed OLMo's Appetite*

<img alt="Dolma's official logo. It's dolma written in yellow, round lowercase letters over a blue background." src="https://github.com/allenai/dolma/blob/main/res/logo.png?raw=true" width="100%">
Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
It was created as a training corpus for [OLMo](https://allenai.org/olmo), AI2's language model.

Data and tools for generating and inspecting OLMo pre-training data.
Dolma is available for download on the HuggingFace 🤗 Hub: [`huggingface.co/datasets/allenai/dolma`](https://huggingface.co/datasets/allenai/dolma). To access Dolma, users must agree to the terms of the [AI2 ImpACT License for Medium Risk Artifacts](https://allenai.org/licenses/impact-mr).
You can also read more about Dolma in [our announcement](https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64), as well as by consulting its [data sheet](https://drive.google.com/file/d/12gOf5I5RytsD159nSP7iim_5zN31FCXq/view?usp=drive_link).

To get started, install dolma using [pip](https://pypi.org/project/dolma/).
This repository contains tools for generating and inspecting Dolma. To get started, install the Dolma Python library from [PyPI](https://pypi.org/project/dolma/).

```shell
pip install dolma
@@ -28,7 +29,7 @@ For all commands, configurations can be specified from command line, or by passi
dolma -c config.yaml dedupe --dedupe.name "test"
```

### `dolma tag`
### The `tag` command

The tag command is used to run any of the built-in taggers on a set of documents. For example:

@@ -44,7 +45,7 @@ dolma tag \

This command will run the `random_number_v1` tagger on all documents in the specified S3 paths. The results will be written to the `s3://ai2-llm/pretraining-data/sources/common-crawl/test/v0/attributes/sample` and `s3://ai2-llm/pretraining-data/sources/common-crawl/test/v1/attributes/sample` paths.

### `dolma dedupe`
### The `dedupe` command

The dedupe command is used to deduplicate a set of documents at the attribute level using a Bloom filter.
Example configurations can be found in the `tests/config` directory. For example:
@@ -53,14 +54,15 @@ For example configurations, see directory `tests/config`. For example:
dolma dedupe -c tests/config/dedupe-paragraphs.json
```

### `dolma mix`
### The `mix` command

The mix command is used to mix documents from multiple sources, optionally filtering by attributes and/or performing string replacement. Example configurations can be found in the `tests/config` directory. For example:

```shell
dolma mix -c tests/config/mixer.json
```


## Development

Create a conda environment with Python >= 3.8. In this case, we use Python 3.10 and Anaconda to create the environment.
@@ -90,6 +92,7 @@ make test

You can choose to run just the Python or Rust tests by calling `make test-python` or `make test-rust` respectively.


## Citation

If you use this repository, please cite it as:
@@ -98,7 +101,7 @@ If you use this repository, please cite it as:
@software{dolma,
author = {{Soldaini, Luca and Lo, Kyle and Kinney, Rodney and Naik, Aakanksha and Ravichander, Abhilasha and Bhagia, Akshita and Groeneveld, Dirk and Schwenk, Dustin and Magnusson, Ian and Chandu, Khyathi}},
license = {{Apache-2.0}},
title = {{DOLMa}},
title = {{Dolma}},
url = {https://github.com/allenai/dolma}
}
```
63 changes: 63 additions & 0 deletions configs/dedup/pes2o_decontamination.json
@@ -0,0 +1,63 @@
{
"documents": [
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=train/part_id=0/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=train/part_id=1/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=train/part_id=2/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=train/part_id=3/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=train/part_id=4/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=train/part_id=5/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=train/part_id=6/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=train/part_id=7/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=train/part_id=8/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=train/part_id=9/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=train/part_id=0/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=train/part_id=1/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=train/part_id=2/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=train/part_id=3/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=train/part_id=4/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=train/part_id=5/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=train/part_id=6/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=train/part_id=7/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=train/part_id=8/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=train/part_id=9/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=valid/part_id=0/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=valid/part_id=1/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=valid/part_id=2/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=valid/part_id=3/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=valid/part_id=4/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=valid/part_id=5/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=valid/part_id=6/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=valid/part_id=7/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=valid/part_id=8/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2ag/split=valid/part_id=9/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=valid/part_id=0/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=valid/part_id=1/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=valid/part_id=2/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=valid/part_id=3/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=valid/part_id=4/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=valid/part_id=5/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=valid/part_id=6/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=valid/part_id=7/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=valid/part_id=8/*.gz",
"s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents/dataset=s2orc/split=valid/part_id=9/*.gz"
],
"work_dir": {
"input": "/tmp/s2/v3-fos/deduper/input",
"output": "/tmp/s2/v3-fos/deduper/output"
},
"dedupe": {
"name": "decontamination",
"paragraphs": {
"attribute_name": "bff_duplicate_paragraph_spans"
},
"skip_empty": true
},
"bloom_filter": {
"file": "/tmp/decontamination/deduper_decontamination_lucas_20230525.bin",
"size_in_bytes": 8388608,
"read_only": true,
"estimated_doc_count": 3898706,
"desired_false_positive_rate": 0.001
},
"processes": 120
}
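The `documents` list above spells out every `part_id` for both datasets and both splits by hand. As a side note, a short sketch like the following could generate the same forty globs programmatically; the path layout is copied from the config, but the snippet itself is purely illustrative and not part of this PR.

```python
# Illustrative only: rebuilds the `documents` list from the config above.
BASE = "s3://ai2-llm/pretraining-data/sources/s2/v3-fos/documents"

documents = [
    f"{BASE}/dataset={dataset}/split={split}/part_id={part_id}/*.gz"
    for split in ("train", "valid")
    for dataset in ("s2ag", "s2orc")
    for part_id in range(10)
]
assert len(documents) == 40
```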
3 changes: 1 addition & 2 deletions pyproject.toml
@@ -31,11 +31,10 @@ dependencies = [
"numpy"
]
classifiers = [
"Development Status :: 3 - Alpha",
"Development Status :: 4 - Beta",
"Typing :: Typed",
"Programming Language :: Rust",
"Programming Language :: Python :: Implementation :: CPython",
"Programming Language :: Python :: Implementation :: PyPy",
]

[[project.authors]]
3 changes: 3 additions & 0 deletions python/dolma/cli/__main__.py
@@ -7,6 +7,9 @@
from .analyzer import AnalyzerCli
from .deduper import DeduperCli
from .mixer import MixerCli

# must import these to register the resolvers
from .resolvers import * # noqa: F401,F403
from .tagger import ListTaggerCli, TaggerCli

AVAILABLE_COMMANDS = {
36 changes: 36 additions & 0 deletions python/dolma/cli/resolvers.py
@@ -0,0 +1,36 @@
import multiprocessing
from typing import List, TypeVar

from cached_path import cached_path
from omegaconf.omegaconf import OmegaConf as om
from omegaconf.omegaconf import Resolver

from ..core.paths import glob_path

__all__ = ["cache", "glob", "processes"]


C = TypeVar("C", bound=Resolver)


def resolver(resolver: C) -> C:
resolver_name = f"d.{resolver.__name__}"
om.register_new_resolver(resolver_name, resolver, replace=True)
return resolver


@resolver
def cache(path: str) -> str:
return str(cached_path(path))


@resolver
def glob(path: str) -> List[str]:
globbed = list(glob_path(path))
assert len(globbed) > 0, f"Path {path} does not match any files"
return globbed


@resolver
def processes(n: int = 0) -> int:
return max(1, multiprocessing.cpu_count() - n)
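Once this module is imported (which `python/dolma/cli/__main__.py` now does), the three resolvers above are available under the `d.` prefix in any OmegaConf-based config. The following is a minimal sketch of how they could be exercised, assuming the module is importable as `dolma.cli.resolvers`; the config keys and paths are made up for illustration and are not part of the dolma schema.

```python
from omegaconf import OmegaConf

import dolma.cli.resolvers  # noqa: F401  # importing registers d.cache, d.glob, d.processes

# Hypothetical config: keys and the glob pattern are illustrative only.
cfg = OmegaConf.create(
    {
        # d.glob expands the pattern; it asserts that at least one file matches.
        "documents": "${d.glob:tests/data/*.json.gz}",
        # d.processes resolves to max(1, cpu_count() - 1) with this argument.
        "processes": "${d.processes:1}",
    }
)
print(OmegaConf.to_container(cfg, resolve=True))
```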
32 changes: 17 additions & 15 deletions python/dolma/core/analyzer.py
@@ -2,7 +2,6 @@
import re
import shutil
from contextlib import ExitStack
from queue import Queue
from tempfile import TemporaryDirectory
from typing import Dict, List, Optional

@@ -13,18 +12,24 @@
from rich.console import Console
from rich.table import Table

from .binning import BucketsValTracker
from .binning import BaseBucketApi, FixedBucketsValTracker, InferBucketsValTracker
from .data_types import OutputSpec
from .errors import DolmaError
from .parallel import BaseParallelProcessor
from .parallel import BaseParallelProcessor, QueueType
from .paths import glob_path, mkdir_p

NUM_BINS = 100_000
BUFF_SIZE = 1_000


def _make_tracker() -> BucketsValTracker:
return BucketsValTracker(NUM_BINS, BUFF_SIZE)
def _make_tracker(type_: str = "fixed", **kwargs: int) -> BaseBucketApi:
"""Make a tracker of given type. Choose between `infer` or `fixed`"""
if type_ == "infer":
return InferBucketsValTracker(**{"n": NUM_BINS, "b": BUFF_SIZE, **kwargs})
elif type_ == "fixed":
return FixedBucketsValTracker(**{"n": NUM_BINS, **kwargs})
else:
raise ValueError(f"Unknown tracker type {type_}")


class SummarySpec(msgspec.Struct):
@@ -33,24 +38,21 @@ class SummarySpec(msgspec.Struct):
bins: List[float]

@classmethod
def from_tracker(self, name: str, tracker: "BucketsValTracker", n: int) -> "SummarySpec":
def from_tracker(cls, name: str, tracker: "BaseBucketApi", n: int) -> "SummarySpec":
counts, bins = tracker.summarize(n=n)
return SummarySpec(name=name, counts=counts, bins=bins)

def to_tracker(self) -> "BucketsValTracker":
def to_tracker(self) -> "BaseBucketApi":
tracker = _make_tracker()
try:
tracker.add_many(values=self.bins, counts=self.counts)
except ValueError:
breakpoint()
tracker.add_many(values=self.bins, counts=self.counts)
return tracker


class AnalyzerProcessor(BaseParallelProcessor):
@classmethod
def increment_progressbar( # type: ignore
cls,
queue, # queue must be the first argument, and it should be a positional-only argument
queue: QueueType, # queue must be the first argument, and it should be a positional-only argument
/,
files: int = 0,
documents: int = 0,
@@ -66,7 +68,7 @@ def process_single(
cls,
source_path: str,
destination_path: str,
queue: "Queue",
queue: QueueType,
**kwargs,
):
# instantiate a decoder for faster decoding
@@ -79,7 +81,7 @@ def process_single(
name_regex = re.compile(r) if (r := kwargs.get("name_regex", None)) else None

# keep track of the length and score of each attribute
trackers: Dict[str, BucketsValTracker] = {}
trackers: Dict[str, BaseBucketApi] = {}

# interval at which to update the progress bar; will double if queue is too full
update_interval = 1
@@ -142,7 +144,7 @@ def process_single(

def aggregate_summaries(summaries_path: str, num_bins: int = 1000) -> List[SummarySpec]:
# keep track of the length and score of each attribute
trackers: Dict[str, BucketsValTracker] = {}
trackers: Dict[str, BaseBucketApi] = {}

# instantiate a decoder for faster decoding
decoder = Decoder(SummarySpec)
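To make the new tracker plumbing concrete, here is a small round-trip sketch that uses only the calls visible in this diff (`_make_tracker`, `add_many`, `summarize`, and the `SummarySpec` struct). The import path is assumed from the file layout, `_make_tracker` is a private helper used here only for illustration, and the snippet is not part of the change itself.

```python
from dolma.core.analyzer import SummarySpec, _make_tracker

# Build a fixed-bucket tracker and feed it a few (value, count) pairs.
tracker = _make_tracker("fixed")
tracker.add_many(values=[0.1, 0.25, 0.9], counts=[1, 2, 1])

# Summarize into at most 10 bins, wrap in a SummarySpec, then rebuild a tracker.
counts, bins = tracker.summarize(n=10)
summary = SummarySpec(name="score", counts=counts, bins=bins)
restored = summary.to_tracker()
```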