Feat/954 llama cpp #1000

Merged: 34 commits from feat/954_llama-cpp into develop on Jan 9, 2025
Commits
34 commits
c9ed5fd
Support embeddings generation using llama_cpp
bikash119 Sep 24, 2024
c3464bc
Added llama-cpp-python as optional dependency
bikash119 Sep 24, 2024
582ca40
- Added normalize_embeddings argument to allow user to pass if the em…
bikash119 Sep 25, 2024
fba8ada
Update pyproject.toml
bikash119 Sep 26, 2024
e288b31
- Updated test to allow developer to define test model location.
bikash119 Sep 26, 2024
d6d4352
Merge remote-tracking branch 'upstream/develop' into feat/954_llama-cpp
bikash119 Sep 26, 2024
a936a39
- Made the test session scope
bikash119 Sep 26, 2024
316afa0
- Reverted the changes made to model_path
bikash119 Sep 26, 2024
7137883
- Implement test_encode_batch to verify various batch sizes
bikash119 Sep 26, 2024
2d0aa76
- Included LlamaCppEmbeddings in __init__.py
bikash119 Sep 26, 2024
778532f
- Use HF_TOKEN to download model from hub to generate embeddings.
bikash119 Sep 30, 2024
55c3a0d
- Download from hub is now available through mixin
bikash119 Oct 2, 2024
935cdb8
Revert "- Download from hub is now available through mixin"
bikash119 Oct 3, 2024
29a8d56
Revert "- Use HF_TOKEN to download model from hub to generate embeddi…
bikash119 Oct 3, 2024
b40b0d2
- Removed mixin implementation to download the model
bikash119 Oct 3, 2024
b08f3ae
- Additional example added for private / public model
bikash119 Oct 4, 2024
a49363c
- The tests can now be configured to use cpu or gpu based on paramete…
bikash119 Oct 4, 2024
575f48e
- repo_id or model_path : one of the parameters is mandatory
bikash119 Oct 4, 2024
48dce7b
Added description to attribute : model
bikash119 Oct 4, 2024
0e1fb8e
- Fixed examples
bikash119 Oct 4, 2024
f72ef30
Updated examples
bikash119 Oct 4, 2024
8218242
Update src/distilabel/embeddings/llamacpp.py
bikash119 Oct 14, 2024
db00482
Update src/distilabel/embeddings/llamacpp.py
bikash119 Oct 14, 2024
0fb7f15
Update src/distilabel/embeddings/llamacpp.py
bikash119 Oct 14, 2024
155feb2
Updated test to set disable_cuda_device_placement=True when testing f…
bikash119 Oct 14, 2024
b218b44
Merge branch 'develop' into feat/954_llama-cpp
bikash119 Oct 14, 2024
58aa996
Merge branch 'develop' into feat/954_llama-cpp
bikash119 Oct 16, 2024
3659400
testcase will by default load the model to cpu
bikash119 Oct 16, 2024
92481b0
Merge branch 'feat/954_llama-cpp' of github.com:bikash119/distilabel …
bikash119 Oct 16, 2024
ef98d63
Merge branch 'develop' into feat/954_llama-cpp
bikash119 Oct 19, 2024
2258190
Updated import statements to align with new folder structure
bikash119 Oct 26, 2024
da92cc9
example code updated
bikash119 Oct 26, 2024
09dd551
examples fixed
bikash119 Oct 26, 2024
b9c5305
Merge branch 'develop' into feat/954_llama-cpp
bikash119 Dec 2, 2024
1 change: 0 additions & 1 deletion .gitignore
@@ -77,4 +77,3 @@ venv.bak/
# Other
*.log
*.swp
.DS_Store
2 changes: 2 additions & 0 deletions src/distilabel/models/embeddings/__init__.py
@@ -13,6 +13,7 @@
# limitations under the License.

from distilabel.models.embeddings.base import Embeddings
from distilabel.models.embeddings.llamacpp import LlamaCppEmbeddings
from distilabel.models.embeddings.sentence_transformers import (
SentenceTransformerEmbeddings,
)
@@ -22,4 +23,5 @@
"Embeddings",
"SentenceTransformerEmbeddings",
"vLLMEmbeddings",
"LlamaCppEmbeddings",
]
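With this re-export, the new class is importable from the same public path the docstring examples below use. A minimal sanity-check sketch (not part of the diff; it only assumes the two names exported above):

```python
# Sketch: confirm the re-export added in this diff; not part of the PR itself.
from distilabel.models.embeddings import Embeddings, LlamaCppEmbeddings

# The new class plugs into the same `Embeddings` interface as the existing backends.
assert issubclass(LlamaCppEmbeddings, Embeddings)
```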
237 changes: 237 additions & 0 deletions src/distilabel/models/embeddings/llamacpp.py
@@ -0,0 +1,237 @@
# Copyright 2023-present, Argilla, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from pathlib import Path
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Union

from pydantic import Field, PrivateAttr

from distilabel.mixins.runtime_parameters import RuntimeParameter
from distilabel.models.embeddings.base import Embeddings
from distilabel.models.mixins.cuda_device_placement import CudaDevicePlacementMixin

if TYPE_CHECKING:
from llama_cpp import Llama


class LlamaCppEmbeddings(Embeddings, CudaDevicePlacementMixin):
"""`LlamaCpp` library implementation for embedding generation.

Attributes:
model: the name of the GGUF quantized model file, compatible with the
installed version of the `llama.cpp` Python bindings.
model_path: the path to the directory containing the GGUF quantized model, compatible with the
installed version of the `llama.cpp` Python bindings.
repo_id: the Hugging Face Hub repository id.
verbose: whether to print verbose output. Defaults to `False`.
n_gpu_layers: number of layers to run on the GPU. Defaults to `-1` (use the GPU if available).
disable_cuda_device_placement: whether to disable CUDA device placement. Defaults to `True`.
normalize_embeddings: whether to normalize the embeddings. Defaults to `False`.
seed: RNG seed; -1 for a random seed. Defaults to `4294967295`.
n_ctx: text context size; 0 means take it from the model. Defaults to `512`.
n_batch: prompt processing maximum batch size. Defaults to `512`.
extra_kwargs: additional dictionary of keyword arguments that will be passed to the
`Llama` class of `llama_cpp` library. Defaults to `{}`.

Runtime parameters:
- `n_gpu_layers`: the number of layers to use for the GPU. Defaults to `-1`.
- `verbose`: whether to print verbose output. Defaults to `False`.
- `normalize_embeddings`: whether to normalize the embeddings. Defaults to `False`.
- `extra_kwargs`: additional dictionary of keyword arguments that will be passed to the
`Llama` class of `llama_cpp` library. Defaults to `{}`.

References:
- [Offline inference embeddings](https://llama-cpp-python.readthedocs.io/en/stable/#embeddings)

Examples:
Generate sentence embeddings using a local model:

```python
from pathlib import Path
from distilabel.models.embeddings import LlamaCppEmbeddings

# You can follow along with this example by downloading the model with the following
# command, which saves it to your `Downloads` folder:
# curl -L -o ~/Downloads/all-MiniLM-L6-v2-Q2_K.gguf https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-Q2_K.gguf

model_path = "Downloads/"
model = "all-MiniLM-L6-v2-Q2_K.gguf"
embeddings = LlamaCppEmbeddings(
model=model,
model_path=str(Path.home() / model_path),
)

embeddings.load()

results = embeddings.encode(inputs=["distilabel is awesome!", "and Argilla!"])
print(results)
embeddings.unload()
```

Generate sentence embeddings using a HuggingFace Hub model:

```python
from distilabel.models.embeddings import LlamaCppEmbeddings
# To download a private model from the Hub, set the `HF_TOKEN` environment variable.

repo_id = "second-state/All-MiniLM-L6-v2-Embedding-GGUF"
model = "all-MiniLM-L6-v2-Q2_K.gguf"
embeddings = LlamaCppEmbeddings(model=model, repo_id=repo_id)

embeddings.load()

results = embeddings.encode(inputs=["distilabel is awesome!", "and Argilla!"])
print(results)
embeddings.unload()
# [
# [-0.05447685346007347, -0.01623094454407692, ...],
# [4.4889533455716446e-05, 0.044016145169734955, ...],
# ]
```

Generate sentence embeddings on CPU:

```python
from pathlib import Path
from distilabel.models.embeddings import LlamaCppEmbeddings

# You can follow along with this example by downloading the model with the following
# command, which saves it to your `Downloads` folder:
# curl -L -o ~/Downloads/all-MiniLM-L6-v2-Q2_K.gguf https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-Q2_K.gguf

model_path = "Downloads/"
model = "all-MiniLM-L6-v2-Q2_K.gguf"
embeddings = LlamaCppEmbeddings(
model=model,
model_path=str(Path.home() / model_path),
n_gpu_layers=0,
disable_cuda_device_placement=True,
)

embeddings.load()

results = embeddings.encode(inputs=["distilabel is awesome!", "and Argilla!"])
print(results)
embeddings.unload()
# [
# [-0.05447685346007347, -0.01623094454407692, ...],
# [4.4889533455716446e-05, 0.044016145169734955, ...],
# ]
```


"""

model: str = Field(
description="The name of the model to use for embeddings.",
)

model_path: RuntimeParameter[str] = Field(
default=None,
description="The path to the GGUF quantized model, compatible with the installed version of the `llama.cpp` Python bindings.",
)

repo_id: RuntimeParameter[str] = Field(
default=None, description="The Hugging Face Hub repository id.", exclude=True
)

n_gpu_layers: RuntimeParameter[int] = Field(
default=-1,
description="The number of layers that will be loaded in the GPU.",
)

n_ctx: int = 512
n_batch: int = 512
seed: int = 4294967295

normalize_embeddings: RuntimeParameter[bool] = Field(
default=False,
description="Whether to normalize the embeddings.",
)
verbose: RuntimeParameter[bool] = Field(
default=False,
description="Whether to print verbose output from llama.cpp library.",
)
extra_kwargs: Optional[RuntimeParameter[Dict[str, Any]]] = Field(
default_factory=dict,
description="Additional dictionary of keyword arguments that will be passed to the"
" `Llama` class of `llama_cpp` library. See all the supported arguments at: "
"https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__",
)
_model: Optional["Llama"] = PrivateAttr(...)

def load(self) -> None:
"""Loads the `gguf` model using either the path or the Hugging Face Hub repository id."""
super().load()
CudaDevicePlacementMixin.load(self)

try:
from llama_cpp import Llama
except ImportError as ie:
raise ImportError(
"`llama-cpp-python` package is not installed. Please install it using"
" `pip install llama-cpp-python`."
) from ie

if self.repo_id is not None:
# use repo_id to download the model
from huggingface_hub.utils import validate_repo_id

validate_repo_id(self.repo_id)
self._model = Llama.from_pretrained(
repo_id=self.repo_id,
filename=self.model,
n_gpu_layers=self.n_gpu_layers,
seed=self.seed,
n_ctx=self.n_ctx,
n_batch=self.n_batch,
verbose=self.verbose,
embedding=True,
**self.extra_kwargs,
)
elif self.model_path is not None:
self._model = Llama(
model_path=str(Path(self.model_path) / self.model),
n_gpu_layers=self.n_gpu_layers,
seed=self.seed,
n_ctx=self.n_ctx,
n_batch=self.n_batch,
verbose=self.verbose,
embedding=True,
**self.extra_kwargs,
)
else:
raise ValueError("Either 'model_path' or 'repo_id' must be provided")

def unload(self) -> None:
"""Unloads the `gguf` model."""
CudaDevicePlacementMixin.unload(self)
self._model.close()
super().unload()

@property
def model_name(self) -> str:
"""Returns the name of the model."""
return self.model

def encode(self, inputs: List[str]) -> List[List[Union[int, float]]]:
"""Generates embeddings for the provided inputs.

Args:
inputs: a list of texts for which an embedding has to be generated.

Returns:
The generated embeddings.
"""
return self._model.embed(inputs, normalize=self.normalize_embeddings)
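The runtime parameters documented in the class docstring (`normalize_embeddings`, `extra_kwargs`, `n_gpu_layers`) are not shown in its examples. A minimal sketch of setting them at construction time, assuming the same public GGUF repo used above and using `n_threads` (a standard `llama_cpp.Llama.__init__` argument) as the keyword forwarded through `extra_kwargs`:

```python
from distilabel.models.embeddings import LlamaCppEmbeddings

# Sketch only: same public model as in the docstring examples above.
embeddings = LlamaCppEmbeddings(
    model="all-MiniLM-L6-v2-Q2_K.gguf",
    repo_id="second-state/All-MiniLM-L6-v2-Embedding-GGUF",
    normalize_embeddings=True,      # unit-length vectors, handy for cosine similarity
    n_gpu_layers=0,                 # force CPU, mirroring the CPU example above
    disable_cuda_device_placement=True,
    extra_kwargs={"n_threads": 2},  # forwarded to `llama_cpp.Llama`
)
embeddings.load()

results = embeddings.encode(inputs=["distilabel is awesome!", "and Argilla!"])
print(len(results), len(results[0]))  # 2 embeddings, one vector per input
embeddings.unload()
```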
35 changes: 35 additions & 0 deletions tests/unit/conftest.py
@@ -12,7 +12,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import atexit
import os
from typing import TYPE_CHECKING, Any, Dict, List, Union
from urllib.request import urlretrieve

import pytest
from pydantic import PrivateAttr
@@ -126,3 +129,35 @@ class DummyTaskOfflineBatchGeneration(DummyTask):
@pytest.fixture
def dummy_llm() -> AsyncLLM:
return DummyAsyncLLM()


@pytest.fixture(scope="session")
def local_llamacpp_model_path(tmp_path_factory):
"""
Session-scoped fixture that provides the local model path for LlamaCpp testing.

Downloads a small test model to a temporary directory.
The model is downloaded once per test session and cleaned up after all tests.

Args:
tmp_path_factory: Pytest fixture providing a temporary directory factory.

Returns:
str: The path to the temporary directory containing the downloaded LlamaCpp model file.
"""
model_name = "all-MiniLM-L6-v2-Q2_K.gguf"
model_url = f"https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/{model_name}"
tmp_path = tmp_path_factory.getbasetemp()
model_path = tmp_path / model_name

if not model_path.exists():
urlretrieve(model_url, model_path)

def cleanup():
if model_path.exists():
os.remove(model_path)

# Register the cleanup function to be called at exit
atexit.register(cleanup)

return str(tmp_path)
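One of the commits above adds a `test_encode_batch` that exercises different batch sizes. A minimal sketch of how such a test could consume this fixture (the test body below is illustrative and not taken from this PR):

```python
# Illustrative sketch of a test using the fixture above; not part of the diff.
import pytest

from distilabel.models.embeddings import LlamaCppEmbeddings


@pytest.mark.parametrize("batch_size", [1, 2, 5])
def test_encode_batch(local_llamacpp_model_path: str, batch_size: int) -> None:
    embeddings = LlamaCppEmbeddings(
        model="all-MiniLM-L6-v2-Q2_K.gguf",
        model_path=local_llamacpp_model_path,
        n_gpu_layers=0,  # keep the test on CPU
        disable_cuda_device_placement=True,
    )
    embeddings.load()

    inputs = [f"input {i}" for i in range(batch_size)]
    results = embeddings.encode(inputs=inputs)

    # One embedding per input, each a non-empty vector of floats.
    assert len(results) == batch_size
    assert all(len(embedding) > 0 for embedding in results)

    embeddings.unload()
```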