better tokenization #162

drahnr · 2021-03-17T16:10:08Z

What does this PR accomplish?

Replace the current naive tokenization with srx/nlprule-based tokenization.

🩹 Bug Fix
🦚 Feature

Closes #160.

Changes proposed by this PR:

Use nlprule for segmentation.

Notes to reviewer:

Currently this is still dysfunctional, since the Token<'t> ranges are relative to the sentence rather than to the input text.

Blocked by bminixhofer/nlprule#53
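
To illustrate the offset problem described in the note above, here is a minimal sketch of rebasing sentence-relative byte ranges onto the full input. It does not use the nlprule API: `sentence_relative_tokens` is a hypothetical whitespace tokenizer standing in for it, and `rebase` and `input_relative_tokens` are illustrative names, not functions from this PR.

```rust
use std::ops::Range;

/// Shift a sentence-relative byte range so that it indexes into the full input.
fn rebase(sentence_start: usize, relative: &Range<usize>) -> Range<usize> {
    (sentence_start + relative.start)..(sentence_start + relative.end)
}

/// Hypothetical stand-in for a tokenizer (NOT nlprule): yields the byte ranges
/// of whitespace-separated words, relative to the start of `sentence`.
fn sentence_relative_tokens(sentence: &str) -> Vec<Range<usize>> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    for (idx, c) in sentence.char_indices() {
        match (c.is_whitespace(), start) {
            // A word just ended: record it.
            (true, Some(s)) => {
                tokens.push(s..idx);
                start = None;
            }
            // A word just started: remember where.
            (false, None) => start = Some(idx),
            _ => {}
        }
    }
    // Flush a trailing word that runs to the end of the sentence.
    if let Some(s) = start {
        tokens.push(s..sentence.len());
    }
    tokens
}

/// Tokenize each sentence and return ranges relative to `input`.
/// `sentences` holds the byte range of each sentence within `input`.
fn input_relative_tokens(input: &str, sentences: &[Range<usize>]) -> Vec<Range<usize>> {
    sentences
        .iter()
        .flat_map(|sentence| {
            sentence_relative_tokens(&input[sentence.clone()])
                .into_iter()
                // Without this shift, every sentence's tokens would start at 0.
                .map(move |relative| rebase(sentence.start, &relative))
        })
        .collect()
}
```

Without the `rebase` step, the tokens of the second and later sentences would point into the beginning of the input instead of their actual position, which is the kind of mismatch the note describes.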

📜 Checklist

Works on the ./demo sub directory
Test coverage is excellent and passes
Documentation is thorough

KuabeM

I did not test anything myself, but it looks good to me from a fairly high-level code perspective.

Closes #163

drahnr mentioned this pull request Mar 17, 2021

Token as returned by pipe() is relative to the sentence boundaries bminixhofer/nlprule#53

Closed

drahnr requested a review from KuabeM March 17, 2021 17:45

drahnr force-pushed the bernhard-better-tokenization branch from 4671e6d to 459db7d March 17, 2021 17:50

drahnr self-assigned this Mar 17, 2021

KuabeM approved these changes Mar 17, 2021

drahnr added 13 commits March 18, 2021 10:23

fix/literal: cover /*, /**, and /*!

1592f0e

fix/blockcomment: additional tests and logic for block comments

6a949e3

chore: cargo fmt

e56aa25

ci: make timeouts more generous to not fail builds on initial runs

e766459

silly attempt to improve tokenization

eeab31e

chore: bump deps bitflags + ra

1dba163

add extra tokenization split chars option

c2d29c9

drop deprecated languagetool backend

44999a2

remove unused srx crate

4bfd600

always tokenize based on nlprule backend

e47404f

new tokenizer also keeps marks

63639da

chore: cargo update

ee62aa3

chore: cargo fmt

ec7d74c

drahnr force-pushed the bernhard-better-tokenization branch from 459db7d to ec7d74c March 18, 2021 09:27

drahnr added 3 commits March 18, 2021 11:30

fix: rebase fallout

49d1d0b

feat: add --jobs option and improve a few bug on messages

80e4d91

fix: adjust byte offset for new tokenizer

e7202b2

drahnr force-pushed the bernhard-better-tokenization branch from 013351a to e7202b2 March 18, 2021 12:46

drahnr added 6 commits March 18, 2021 14:00

refactor: log messages about thread count / jobs

4378bcd

refactor/args: add logs

1d0d102

chore: cargo format use fs_err where possible

7a3117d

fix/tokenization: ignore known sentence ctrl characters

03a8f85

better logging for traversal

bb4d7f6

fix/traverse: make sure all modules are found

27bd2ba

Closes #163

drahnr force-pushed the bernhard-better-tokenization branch from 3a6008e to 27bd2ba March 18, 2021 15:29

chore: cargo fmt

7bda5a4

drahnr merged commit 8b67660 into master Mar 18, 2021

drahnr deleted the bernhard-better-tokenization branch March 18, 2021 16:22
