better tokenization #162

Merged
drahnr merged 23 commits into master from bernhard-better-tokenization
Mar 18, 2021


drahnr
Owner

@drahnr drahnr commented Mar 17, 2021

What does this PR accomplish?

Replace the current naive tokenization with srx/nlprule-based tokenization.

  • 🩹 Bug Fix
  • 🦚 Feature

Closes #160.

Changes proposed by this PR:

Use nlprule for segmentation.

Notes to reviewer:

This is currently still broken: the Token<'t> ranges are relative to the containing sentence rather than to the input text.

Blocked by bminixhofer/nlprule#53
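
To illustrate the range problem described in the notes above, here is a minimal sketch. The `Token` struct, `tokenize_sentence`, and `rebase` are hypothetical stand-ins written for this example; they are not nlprule's types or API, and the whitespace splitting is only a placeholder for real tokenization. The sketch assumes the byte offset at which each sentence starts in the input text is known.

```rust
use std::ops::Range;

/// Hypothetical token type (NOT nlprule's `Token`): a word and its byte range.
#[derive(Debug, Clone, PartialEq)]
pub struct Token<'t> {
    pub text: &'t str,
    pub range: Range<usize>,
}

/// Split one sentence on whitespace.
/// The returned ranges are relative to `sentence`, not to the full input.
pub fn tokenize_sentence(sentence: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    for (idx, ch) in sentence.char_indices() {
        if ch.is_whitespace() {
            // A whitespace character closes the token in progress, if any.
            if let Some(s) = start.take() {
                tokens.push(Token { text: &sentence[s..idx], range: s..idx });
            }
        } else if start.is_none() {
            start = Some(idx);
        }
    }
    // Flush a token that runs to the end of the sentence.
    if let Some(s) = start {
        tokens.push(Token { text: &sentence[s..], range: s..sentence.len() });
    }
    tokens
}

/// Shift sentence-relative ranges by the sentence's byte offset in the
/// input text, so that they index into the full input again.
pub fn rebase<'t>(tokens: Vec<Token<'t>>, sentence_start: usize) -> Vec<Token<'t>> {
    tokens
        .into_iter()
        .map(|t| Token {
            text: t.text,
            range: (t.range.start + sentence_start)..(t.range.end + sentence_start),
        })
        .collect()
}
```

Without the `rebase` step, a range taken from the second sentence would index into the wrong part of the input text, which is the symptom described above.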

📜 Checklist

  • Works on the ./demo sub directory
  • Test coverage is excellent and passes
  • Documentation is thorough

Collaborator

@KuabeM KuabeM left a comment

I did not test anything myself, but it looks good to me from a fairly high-level code perspective.

@drahnr drahnr force-pushed the bernhard-better-tokenization branch from 459db7d to ec7d74c Compare March 18, 2021 09:27
@drahnr drahnr force-pushed the bernhard-better-tokenization branch from 013351a to e7202b2 Compare March 18, 2021 12:46
@drahnr drahnr force-pushed the bernhard-better-tokenization branch from 3a6008e to 27bd2ba Compare March 18, 2021 15:29
@drahnr drahnr merged commit 8b67660 into master Mar 18, 2021
@drahnr drahnr deleted the bernhard-better-tokenization branch March 18, 2021 16:22
Development

Successfully merging this pull request may close these issues.

Segmentation is too naive