Build `Index` from regex #125

torymur · 2024-12-12T20:08:16Z

Closes Build index from regex #97
Closes Change vocabulary interface #92
Supersedes PR Port necessary interegular functionality #38 and closes Port necessary interegular functionality #8
Makes irrelevant and closes Create Rust-only index construction function #9
Makes irrelevant and closes Handle regex-to-FSM conversion in Rust #10
Makes irrelevant and closes Port walk_fsm tests to Rust #68
Makes irrelevant and closes Port create_fsm_index_end_to_end tests to Rust #69
Makes irrelevant and closes Move test_guide tests to test_regex #71

Rust

Python

Drop all intermediary structures related to the interegular workflow
Update existing tests to use the new interface of Guide, Index & Vocabulary
Adjust benchmarks to use the new interface

rlouf · 2024-12-13T12:48:40Z

Supersedes #38?

torymur · 2024-12-13T19:53:40Z

Yes, it will

torymur · 2025-01-09T18:40:47Z

tests/fsm/test_statistical.py

            mask: List[int] = [1 if s in allowed else 0 for s in range(1, n_tokens + 1)]
            tokens = model(tokens, mask=mask)
-            state = fsm.get_next_state(state, tokens[-1])
+            allowed = guide.read_next_token(tokens[-1])


@dpsimpson Could you please take a look at this failing test: https://github.com/dottxt-ai/outlines-core/actions/runs/12695987037/job/35389081344

The interface has changed and I updated the test here accordingly, but it needs to be checked. For example, I added a third value to Vocabulary (in place of the eos token that was there before) just to keep the dimensions right, and I suspect that could be incorrect 😅

But overall, since it's statistical, fully understanding the intentions and properly adjusting the expected test values or construction logic goes over my head, so I would really appreciate your help here 🙏
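For anyone reading along, the loop touched by the diff has roughly the following shape. This is only a minimal sketch to make the question concrete: `ToyGuide`, `sample_masked` and `generate` are hypothetical names invented for illustration, not the outlines-core bindings, and the uniform sampler merely stands in for the model used in the real test. Token ids are 1-indexed to match the mask construction in the diff.

```python
import random
from typing import Dict, List


class ToyGuide:
    """Hypothetical stand-in for a guide; NOT the outlines-core API."""

    def __init__(self, transitions: Dict[int, Dict[int, int]], initial: int):
        # transitions[state][token] -> next state
        self.transitions = transitions
        self.state = initial

    def allowed_tokens(self) -> List[int]:
        # Tokens that have an outgoing transition from the current state.
        return sorted(self.transitions.get(self.state, {}))

    def advance(self, token: int) -> List[int]:
        # Consume a token, move to the next state, return the new allowed set.
        self.state = self.transitions[self.state][token]
        return self.allowed_tokens()


def sample_masked(mask: List[int], rng: random.Random) -> int:
    # Uniform choice among tokens whose mask entry is 1 (1-indexed ids).
    candidates = [i + 1 for i, m in enumerate(mask) if m]
    return rng.choice(candidates)


def generate(guide: ToyGuide, n_tokens: int, max_steps: int, rng: random.Random) -> List[int]:
    tokens: List[int] = []
    allowed = guide.allowed_tokens()
    for _ in range(max_steps):
        if not allowed:
            break  # no outgoing transitions: generation is finished
        # Same mask construction as in the diff above.
        mask = [1 if s in allowed else 0 for s in range(1, n_tokens + 1)]
        tokens.append(sample_masked(mask, rng))
        allowed = guide.advance(tokens[-1])
    return tokens
```

In this toy setup the mask must have one entry per token id the sampler can emit, which is the dimension concern mentioned above. Whether the third Vocabulary value in the real test plays the same role as the final token here is exactly what needs checking.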

torymur added the enhancement New feature or request label Dec 12, 2024

torymur force-pushed the index-from-regex-97 branch 2 times, most recently from fa36be9 to 5b8808d Compare December 18, 2024 15:50

torymur force-pushed the index-from-regex-97 branch 3 times, most recently from 88d609b to 52edc6d Compare January 7, 2025 17:54

torymur added 21 commits January 9, 2025 18:07

Build Index from regex

606b460

Test Index from regex in Guide

bdc120d

Use FxHash* as default Hash*

6c5b853

Cleaner from_regex logic

f349404

Use bytes as Token type, more tests for Index

15a85aa

Drop majority of intermediate structures

f02faec

Add PyGuide, use proper types for Index

70b4bc6

Provide basic Guide binding, test it

b477598

Improve Vocabulary python binding, add tests

f3266ee

Non-optional eos_token_id

7edb831

Stabilize vocabulary interface

03e5561

Add tests for Guide

64f0d73

Python vocabulary to accept pretrained params

2ab0007

Correct interface in pyi, reprs for all python bindings

063d1c2

Adjust benchmarks

f65d86f

Drop unused dependencies

30e29ef

Index by ref in Guide

e04e5be

Extend interface of python bindings

7b6781b

Disallow insert of eos token into Vocabulary

1fab872

Stabilize Index interfaces

15a45c0

Use new interface in statistical

bf6e8a6

torymur force-pushed the index-from-regex-97 branch from 1a8831a to bf6e8a6 Compare January 9, 2025 18:12
