Parsing error for citations with defendant 'Thompson' #174
Comments
Any idea how easy this is to solve so that it identifies each?
Per discussion today, this seems to happen when citations appear to overlap. The simple solution here is to find both citations that overlap and then filter out the one that's incomplete.
The problem here is that Thompson is a nominative reporter (with an optional volume), which is why it is detected as a citation. We could detect the overlap and see which one is incomplete, but according to the reporters-db Thompson regex, This can be replicated with other nominative reporters like:
Seems like some post-processing could detect and remedy this. The Thompson one may be technically complete, but it's the same data. So maybe: if there's overlap and the data overlaps too, then...
@mlissner I like your solution: if there is an overlapping span, we can check whether one is a complete citation and the other is nominative, and choose the complete citation.
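The post-processing idea discussed above could look roughly like the sketch below. It is only an illustration: `Candidate`, `overlaps`, and `drop_weaker_overlaps` are hypothetical names, not part of Eyecite's API, and the rule (drop a volume-less nominative match when it overlaps a citation that has a volume) is one possible reading of the proposal.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical stand-in for a parsed citation (not an Eyecite class)."""
    start: int                 # offset where the match begins
    end: int                   # offset just past the match
    volume: Optional[str]
    reporter: str
    page: str
    nominative: bool = False   # assumed flag: reporter is a nominative one


def overlaps(a: Candidate, b: Candidate) -> bool:
    """True if the two character spans share at least one character."""
    return a.start < b.end and b.start < a.end


def is_weaker(c: Candidate) -> bool:
    """A nominative match without a volume is treated as the weaker reading."""
    return c.nominative and c.volume is None


def drop_weaker_overlaps(cands: List[Candidate]) -> List[Candidate]:
    """Drop weaker candidates that overlap a stronger one; keep the rest."""
    return [
        c for c in cands
        if not (
            is_weaker(c)
            and any(overlaps(c, o) and not is_weaker(o)
                    for o in cands if o is not c)
        )
    ]


text = "Shapiro v. Thompson, 394 U. S. 618"
vol_at = text.index("394")
thompson = Candidate(text.index("Thompson"), vol_at + 3,
                     None, "Thompson", "394", nominative=True)
us_cite = Candidate(vol_at, len(text), "394", "U.S.", "618")

result = drop_weaker_overlaps([thompson, us_cite])
```

With these inputs only the `U.S.` candidate survives. A nominative match that does not overlap anything is left untouched, so genuine nominative citations are still reported.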
In issue #3924, we identified a bug in Eyecite's parsing method when the defendant's last name is 'Thompson'.
For example, for the citation
'Shapiro v. Thompson, 394 U. S. 618'
the expected result is
volume: 394, reporter: 'U.S.', page: '618'
but Eyecite returns
volume: None, reporter: 'Thompson', page: '394'
Other examples of inputs that are incorrectly parsed are:
Adams v. Thompson, 560 F. Supp. 894
and
Mozena v. Thompson, 44 A.2d 276.
I've been using the first example to debug this issue, and noticed that Eyecite identifies two tokens within the input string: "Thompson's Unreported Cases (TN)" and "United States Supreme Court Reports". The problem arises because these tokens overlap (both include "394") and Eyecite's tokenize method prioritizes the rightmost token when encountering overlaps, leading to this result.
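To make the overlap concrete, here is a small self-contained sketch. The two patterns are simplified stand-ins written for this example only; they are not the reporters-db regexes and this is not how Eyecite's tokenizer is implemented.

```python
import re

text = "Shapiro v. Thompson, 394 U. S. 618"

# Simplified, hypothetical patterns (assumptions for illustration only).
NOMINATIVE = re.compile(r"Thompson,? (?P<page>\d+)")
FULL = re.compile(r"(?P<volume>\d+) U\. ?S\. (?P<page>\d+)")

nom = NOMINATIVE.search(text)
full = FULL.search(text)

# The shared region is the intersection of the two character spans.
shared_start = max(nom.start(), full.start())
shared_end = min(nom.end(), full.end())
shared = text[shared_start:shared_end] if shared_start < shared_end else ""

print(nom.span(), repr(nom.group()))
print(full.span(), repr(full.group()))
print("shared text:", repr(shared))
```

Both matches claim the characters "394": the first reads them as a page number, the second as a volume.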