Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expand scope of arxiv identifier matcher, and fix some training data annotations #858

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

bnewbold
Copy link
Contributor

The current arxiv.org identifier matcher regex requires an "arXiv:" prefix. This misses some un-ambiguous old-style identifiers in short citations, like these examples;

B.A. Dobrescu, hep-ph/9510424.
K.R. Dienes, C. Kolda and J. March-Russell, hep-ph/9610479.
S.P. Martin, hep-ph/9608224.

This PR extends the regex to allow these cases, by more conservatively matching the "section" prefix, and not allowing whitespace, to reduce the risk of matching on random garbled text. The above examples are added as citation training data.

I also noticed a handful of existing training citations which were missing type="arxiv" on the <idno> tag, so I added those.

There is some possible ambiguity about arxiv vs. arXiv in the TEI data: GROBID output has the capitalized X, while the training data is lower-case. Also, while all (or almost all) the citation training data has these annotations, none of the header <idno> annotations do. Not sure if it would be helpful to add those.

This is related to internetarchive/fatcat#84 and to #275

The simple part of this is allowing 'arxiv:' in addition to 'arXiv:'.

The more complex second part is to conservatively match "old" (pre-2008)
style identifiers which do not have a prefix. The conservative matching
is because there is less confidence that a string is actually an arxiv
identifier without the prefix. Explicit collection prefixes are included
(for those that existed pre-2008), internal whitespace is not allowed,
and the identifier must be separated from other alphabetic strings.
…fiers

These old-style arxiv identifiers have no prefix ("arxiv:"), but are
unambiguously arxiv.org identifiers.
@elonzh
Copy link
Contributor

elonzh commented Nov 13, 2021

FYI, https://github.com/mattbierbaum/arxiv-public-datasets/blob/master/arxiv_public_data/regex_arxiv.py

@kermitt2
Copy link
Owner

@bnewbold Thanks a lot for the improvement ! I think there are small issues in the new regex and I propose some changes in the review after some tests, if you could have a look?

@bnewbold
Copy link
Contributor Author

@kermitt2 thanks for reviewing.

I don't see any proposed changes or review notes here on github. Maybe I am missing something?

@kermitt2 kermitt2 self-requested a review December 16, 2021 07:07
static public final Pattern arXivPattern = Pattern
.compile("(arXiv\\s?(\\.org)?\\s?\\:\\s?\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(arXiv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]*\\s?/\\s?\\d{7}(v\\d+)?)");
.compile("(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s??\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]{3,16}\\s?/\\s?\\d{7}(v\\d+)?)|([^a-zA-Z](math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)");

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing a bit the regex:

  • there is a double ?? in the first component which is a typo I think?
  • better avoid having the captured parenthesis for the sub-term (math|hep|astro|cond|gr|nucl|quat|stat|ph...), we could add a non captured parenthesis
    • I don't understand the [^a-zA-Z] before the (math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin), it makes hep-lat/0509026 failing for example as it expects something before the arXiv identifier?

Suggested regex:

(ar[xX]iv\s?(\.org)?\s?\:\s?\d{4}\s?\.\s?\d{4,5}(v\d+)?)|(ar[xX]iv\s?(\.org)?\s?\:\s?[ a-zA-Z\-\.]{3,16}\s?/\s?\d{7}(v\d+)?)|((?:math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\-bio|q\-fin)[a-zA-Z\-\.]*/\d{7}(v\d+)?)

In java syntax:

(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]{3,16}\\s?/\\s?\\d{7}(v\\d+)?)|((?:math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)

We then have only one captured group for one arXiv identifier.

static public final Pattern arXivPattern = Pattern
.compile("(arXiv\\s?(\\.org)?\\s?\\:\\s?\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(arXiv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]*\\s?/\\s?\\d{7}(v\\d+)?)");
.compile("(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s??\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]{3,16}\\s?/\\s?\\d{7}(v\\d+)?)|([^a-zA-Z](math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)");
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to https://github.com/mattbierbaum/arxiv-public-datasets/blob/master/arxiv_public_data/regex_arxiv.py#L12 we have much more categories possible. Like for the sub-categories, we might want to simply have a free range of characters?
e.g. instead of

(math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*

having simply:

[a-zA-Z\\-\\.]*

I am a bit worry then about the robustness of the regex for ill-formed input wrt. catastrophic backtracking as we have seen elsewhere. Maybe best alternative would be to enumerate all the possibilities?

@kermitt2
Copy link
Owner

Ok sorry I think I did not submit the review, can you see my comments now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants