New TruncateTextColumn to truncate the length of texts using the number of tokens or characters #902

Merged: 10 commits into develop from step-truncate on Aug 14, 2024

@plaguss (Contributor) commented on Aug 14, 2024

Description

This PR adds a new TruncateTextColumn step that truncates the text in a column to a maximum length, measured either in tokens or in characters.

When a pipeline starts from a dataset with long texts, the model can fail on inputs that exceed its maximum context length. This step helps avoid such errors:

  • Truncating a row to a given number of tokens:
from distilabel.steps import TruncateTextColumn

trunc = TruncateTextColumn(
    tokenizer="meta-llama/Meta-Llama-3.1-70B-Instruct",
    max_length=4,
    column="text"
)

trunc.load()

result = next(
    trunc.process(
        [
            {"text": "This is a sample text that is longer than 10 characters"}
        ]
    )
)
# result
# [{'text': 'This is a sample'}]
  • Truncating a row to a given number of characters:
from distilabel.steps import TruncateTextColumn

trunc = TruncateTextColumn(max_length=10)

trunc.load()

result = next(
    trunc.process(
        [
            {"text": "This is a sample text that is longer than 10 characters"}
        ]
    )
)
# result
# [{'text': 'This is a '}]
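
To make the two modes concrete, here is a minimal plain-Python sketch of the idea. It is not distilabel's implementation: `truncate_text` is a hypothetical helper, and the whitespace tokenizer is an assumed stand-in for a real model tokenizer, so token boundaries will differ from those of an actual model.

```python
from typing import Callable, List, Optional


def truncate_text(
    text: str,
    max_length: int,
    tokenize: Optional[Callable[[str], List[str]]] = None,
    detokenize: Optional[Callable[[List[str]], str]] = None,
) -> str:
    """Hypothetical helper (not part of distilabel) illustrating both modes."""
    if tokenize is None:
        # Character mode: keep only the first `max_length` characters.
        return text[:max_length]
    if detokenize is None:
        raise ValueError("detokenize is required when tokenize is given")
    # Token mode: keep the first `max_length` tokens, then rebuild the string.
    return detokenize(tokenize(text)[:max_length])


def whitespace_tokenize(text: str) -> List[str]:
    # Assumed stand-in tokenizer: splits on whitespace only.
    return text.split()


def whitespace_detokenize(tokens: List[str]) -> str:
    # Inverse of the stand-in tokenizer: joins tokens with single spaces.
    return " ".join(tokens)


sample = "This is a sample text that is longer than 10 characters"
by_chars = truncate_text(sample, max_length=10)
by_tokens = truncate_text(sample, 4, whitespace_tokenize, whitespace_detokenize)
```

With this stand-in tokenizer the two calls happen to reproduce the outputs shown above; a subword tokenizer would generally cut at different points.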

@plaguss plaguss added the enhancement New feature or request label Aug 14, 2024
@plaguss plaguss added this to the 1.4.0 milestone Aug 14, 2024
@plaguss plaguss requested a review from gabrielmbmb August 14, 2024 08:45
@plaguss plaguss self-assigned this Aug 14, 2024

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-902/


codspeed-hq bot commented Aug 14, 2024

CodSpeed Performance Report

Merging #902 will not alter performance

Comparing step-truncate (8dd2790) with develop (c8df5a9)

Summary

✅ 1 untouched benchmarks

Review comments (outdated, resolved) on src/distilabel/steps/__init__.py (one thread) and src/distilabel/steps/truncate.py (two threads).
@gabrielmbmb (Member) commented

Maybe TruncateTextColumn is a better name for this step?

@plaguss plaguss changed the title New TruncateRow to truncate the length of texts using the number of tokens or characters New TruncateTextColumn to truncate the length of texts using the number of tokens or characters Aug 14, 2024
@plaguss plaguss merged commit 4740063 into develop Aug 14, 2024
7 checks passed
@plaguss plaguss deleted the step-truncate branch August 14, 2024 14:47