Option to turn token integers back into text #7

Closed
simonw opened this issue Jun 9, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

@simonw
Owner

simonw commented Jun 9, 2023

The opposite of this:

echo "Show these tokens" | ttok --tokens
# Outputs: 7968 1521 11460 198
@simonw simonw added the enhancement New feature or request label Jun 9, 2023
@simonw
Owner Author

simonw commented Jun 9, 2023

Potential names for this:

  • ttok --text (since it is the opposite of --tokens)
  • ttok --to-text
  • ttok --decode

I think I like --decode the most. It maps to the underlying .decode() method.

I could add --encode as an alias for --tokens for added consistency.
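A minimal sketch of how such an alias could be declared with click, assuming the existing flag is stored under the parameter name `tokens`. The command body is a hypothetical placeholder encoder (word lengths), not ttok's real tokenizer; the point is only that click accepts several option names for one parameter.

```python
import click
from click.testing import CliRunner


@click.command()
@click.argument("words", nargs=-1)
@click.option(
    "--tokens",
    "--encode",  # second spelling of the same flag
    "tokens",  # both names write to this one parameter
    is_flag=True,
    help="Output token integers",
)
def cli(words, tokens):
    if tokens:
        # Hypothetical stand-in for a real encoder: one integer per word.
        click.echo(" ".join(str(len(w)) for w in words))
    else:
        # Without the flag, just report how many items there were.
        click.echo(len(words))


if __name__ == "__main__":
    runner = CliRunner()
    # Both spellings should produce identical output.
    print(runner.invoke(cli, ["--tokens", "show", "me"]).output, end="")
    print(runner.invoke(cli, ["--encode", "show", "me"]).output, end="")
```

Because both names map to one parameter, the rest of the command needs no changes to support the alias.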

@simonw
Owner Author

simonw commented Jun 9, 2023

This should support space-separated integers, comma-separated integers and JSON arrays of integers.

I think I'll just use a \d+ regular expression to parse integers out of the input.
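A small sketch of that idea, using only the standard library. The helper name `parse_token_ids` is hypothetical; it shows that a single `\d+` pattern handles all three input styles without format detection.

```python
import re


def parse_token_ids(text: str) -> list[int]:
    """Pull every run of digits out of the input, ignoring the separators."""
    return [int(match) for match in re.findall(r"\d+", text)]


if __name__ == "__main__":
    samples = [
        "7968 1521 11460 198",       # space separated
        "7968,1521,11460,198",       # comma separated
        "[7968, 1521, 11460, 198]",  # JSON array
    ]
    for sample in samples:
        print(parse_token_ids(sample))
```

The trade-off is that the pattern is deliberately permissive: a minus sign is ignored, so `-5` parses as `5`, and `1.5` parses as two integers, `1` and `5`. That is acceptable if the input is expected to contain only non-negative token IDs.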

@simonw
Owner Author

simonw commented Jun 9, 2023

Got GPT-4 to write this: https://chat.openai.com/share/a3c5da38-bfd0-423d-af7e-dbed7bfe5278

@click.option("--decode", "decode", is_flag=True, help="Decode token integers to text")

# ...

    if decode:
        # Use regex to find all integers in the input text
        tokens = [int(t) for t in re.findall(r'\d+', text)]
        decoded_text = encoding.decode(tokens)
        click.echo(decoded_text)

@simonw
Owner Author

simonw commented Jul 10, 2023

I needed this to help test:

@simonw
Owner Author

simonw commented Jul 10, 2023

Oops, did that work on the wrong branch.

@simonw simonw closed this as completed in ccebd84 Jul 10, 2023
@simonw
Owner Author

simonw commented Jul 10, 2023

$ ttok --tokens show me the tokens
3528 757 279 11460
$ ttok --encode show me the tokens
3528 757 279 11460
$ ttok --decode 3528 757 279 11460
show me the tokens

simonw added a commit that referenced this issue Jul 10, 2023
simonw added a commit that referenced this issue Jul 10, 2023