💫 Train parser and NER with regression objective, to make scores express expected parse quality #881
Was it implemented, and how can we use it? @honnibal
I never shipped the linear-model regression loss because I couldn't get the memory use under control -- the loss produced very non-sparse solutions, and it was taking too many experiments to find the right regularisation. Instead I focussed on the experiments for spaCy 2. Current versions of spaCy 2 support beam-search decoding, which lets you get probabilities by asking how many beam parses the entity occurred in. We don't have a model trained with the beam objective online yet, so the probabilities aren't so well calibrated. You'll have to try and see. Here's a current example.

from collections import defaultdict

import spacy

# Number of alternate analyses to consider. More is slower, and not necessarily
# better -- you need to experiment on your problem.
beam_width = 16
# This clips solutions at each step. We multiply the score of the top-ranked action
# by this value, and use the result as a threshold. This prevents the parser from
# exploring options that look very unlikely, saving a bit of efficiency. Accuracy
# may also improve, because we've trained on the greedy objective.
beam_density = 0.0001

nlp = spacy.load('en_core_web_sm')
with nlp.disable_pipes('ner'):
    docs = list(nlp.pipe(texts))
beams = nlp.entity.beam_parse(docs, beam_width=beam_width, beam_density=beam_density)

for doc, beam in zip(docs, beams):
    entity_scores = defaultdict(float)
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score
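As a follow-up to the example above, here is a minimal sketch of how the aggregated scores could be inspected. The names are taken from the snippet, and the block is assumed to sit inside the `for doc, beam in zip(docs, beams)` loop.

    # Continuing inside the `for doc, beam in zip(docs, beams)` loop:
    for (start, end, label), score in sorted(entity_scores.items(),
                                             key=lambda item: item[1],
                                             reverse=True):
        # `score` is the total beam probability of the parses that contain this
        # candidate entity; start and end are token indices.
        print(start, end, label, round(score, 3))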
Btw, on the off-chance anyone reading this could tell me what's wrong with the regression loss here (spaCy/spacy/syntax/nn_parser.pyx, line 216, commit cdb2d83), I could rerun the regression-loss experiments using the neural network, where the sparsity problem wouldn't be an issue. What we want is an output vector of scores that matches the negated costs from the oracle. The function should be computing the gradient of the loss for this regression problem. Only some actions are valid, and the gradient for an invalid action should always be 0.
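As a sketch of what that gradient computation should look like -- not the actual nn_parser.pyx code, just a NumPy illustration of a squared-error loss over valid actions, with all array names assumed:

import numpy as np

def regression_gradient(scores, costs, is_valid):
    """Gradient of 0.5 * sum((scores - targets)**2) over valid actions only.

    scores:   model outputs, shape (n_actions,)
    costs:    oracle costs (number of new errors per action), shape (n_actions,)
    is_valid: boolean mask of actions that are legal in the current state
    """
    targets = -costs.astype('float64')   # regression target is the negated cost
    d_scores = scores - targets          # dL/dscore for a squared-error loss
    d_scores[~is_valid] = 0.0            # invalid actions get zero gradient
    return d_scores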
Hi! I tried to implement your example in spaCy 2.0.3 but got an error on the line ...
@jorgeaguiar I had a problem in my example -- fixed. Instead of ...
@honnibal thanks! There are still two things that don't seem to add up, though:
Maybe it's easier to explain with some code...
This will yield ...
No apparent connection to the detected entities... Any hints? Thanks!
@jorgeaguiar by adding a little code at the end of your script, like this ...
I get the probability of each entity for each character, like this ... Any suggestions for further processing this result, @honnibal? Thanks!
@Zhenshan-Jin I think I'm getting somewhere now. This works:
The only problem now is that my models are trained with the standard NER objective and, probably because of that, most entities detected with beam search are wrong.
Instead of training on the regression objective or with a beam-search algorithm, a second-pass calibration could help determine the mapping between the scores and precision probabilities. For example, a precision-coverage curve drawn on a test set could tell you that, say, any parse with a score higher than 0.005 has an 80% chance of being correct. In one of my use cases, I would like to set a very high precision target, e.g. 95%, and ignore any example with a parse score lower than the threshold for 95% precision. I might still get pretty good coverage of my data, e.g. 50%, but with high-quality parses.
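A minimal sketch of that second-pass calibration idea -- all names and data are assumed; it just sweeps candidate score thresholds on a labelled test set and keeps the lowest one whose precision meets the target:

def threshold_for_precision(scored_spans, target_precision=0.95):
    """scored_spans: list of (score, is_correct) pairs from a labelled test set.

    Returns the lowest score threshold whose precision on the test set reaches
    the target, or None if no threshold does.
    """
    # Sort by score, highest first, and sweep the threshold downwards.
    ranked = sorted(scored_spans, key=lambda pair: pair[0], reverse=True)
    best_threshold = None
    correct = 0
    for i, (score, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        precision = correct / i
        if precision >= target_precision:
            best_threshold = score   # everything scoring >= this meets the bar
    return best_threshold

Coverage at the returned threshold is then simply the fraction of test spans scoring at or above it.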
Also, the code suggested here produces memory leaks.
@jorgeaguiar thanks for that snippet -- it does seem to work for me most of the time, but sometimes I get negative end values. Any idea how to interpret these? Also, how come you do ...? And @honnibal, does this beam parsing only work on single words, or also on compound words? E.g. for ...
I think we should use end instead of end-1. The only inconsistency in the picture is that this beam decoding shows some ents with probs of 0.9 which are not predicted by the NER model otherwise. Also, there are ones which are predicted but have very low prob. Maybe this has to do with what @jorgeaguiar mentioned:
For me it seems it's working well, but as you said, I am using a custom NER model, trained from a blank one.
How should I do it if I wanted to add those entities above a certain threshold, let's say those with ...?

from spacy.tokens import Span

for pred in preds:
    pick_from_probs = get_probabilities...
    for p in pick_from_probs:
        indexes = [set(range(ent.start, ent.end)) for ent in pred.ents]
        start, end = probs[p][0], probs[p][1]
        # Make sure the new span does not overlap any current one
        if not any([x.intersection(range(start, end)) for x in indexes]):
            span = Span(pred, start, end, label=p)
            pred.ents = pred.ents + (span,)  # IS THIS RIGHT?
I'm going to close this enhancement issue, because the regression objective idea just doesn't work. Confidence-sensitive NER is still a nice idea and we should investigate other ways of achieving it, but the discussion in this issue is old and kind of misleading now.
More and more people have been asking about confidence scores for the parser and NER. The current model can't answer this, so I decided to dust off some almost-complete research from last year to fix this.
This work is almost complete, and should be up on master within a day or two. 🎉 Here's how it works.

Edit: I spoke too soon... The problem was that the regression loss objective I describe here produced extremely non-sparse solutions with the linear model. It should be possible to find a good compromise with L1 regularisation, but I switched efforts to the v2 experiments instead.

Edit 2: spaCy 2 uses neural networks, so the sparsity isn't a problem. But I haven't been able to get the regression loss working well at all. I think something's wrong with my implementation.
v2 now has beam parsing implemented, which supports one way to get quality estimates for parses --- see below. However, I'd like to resume efforts on the regression loss objective. I think there's a bug in the current implementation of this loss function. See below.
Currently the parser and NER are trained with a hinge-loss objective (specifically, using the averaged perceptron update rule). At each word, the model asks "What's the highest scoring action?". It makes its prediction, and then it asks the oracle to assign a cost to each action, where the cost represents the number of new errors that will be introduced if that action is taken. For instance, if we're at the start of an ORG entity, and we perform the action O, we introduce two errors: we miss the entity, and we miss the label. The actions B-PER and U-ORG each introduce one, and the action B-ORG introduces zero. If our predicted action isn't zero-cost, we update the weights such that in future this action will score a bit lower for this example, and the best zero-cost action will score a bit higher.
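To make that concrete, here's a toy sketch of the cost-sensitive update described above. It is illustrative only -- the action names follow the ORG example, the scores are made up, and the weight update is the schematic perceptron-style step, not spaCy's actual implementation.

import numpy as np

# Oracle costs at the start of an ORG entity, as in the example above:
# O misses both the entity and the label (2 errors), B-PER and U-ORG each
# introduce one error, B-ORG introduces none.
actions = ['O', 'B-PER', 'U-ORG', 'B-ORG']
costs = np.array([2, 1, 1, 0])

scores = np.array([1.3, 0.2, 0.8, 0.5])   # assumed model scores for this state
predicted = int(np.argmax(scores))        # highest-scoring action

# Best zero-cost action: the highest-scoring action among those with cost == 0.
zero_cost = np.where(costs == 0)[0]
best_zero_cost = int(zero_cost[np.argmax(scores[zero_cost])])

if costs[predicted] > 0:
    # Schematic perceptron-style update: push the costly prediction down
    # and the best zero-cost action up for this example's features.
    print(f"decrease weights for {actions[predicted]}, "
          f"increase weights for {actions[best_zero_cost]}")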
If we're only looking at the quality of the output parse, this setup performs well. But it means the scores on the actions have no particular interpretation. We don't force them into any useful scale, and we don't train them to reflect the wider parse quality. If the parser is in a bad state, it's not trained to give uniformly lower scores. It's trained to make the best of bad situations.
The changes I'm merging improve this in two ways. They're looking forwards to the spaCy 2.0 neural network models, but they're useful with the current linear models too, so I decided to get them in early.
1. Beam search with global objective
This is the standard solution: use a global objective, so that the parser model is trained to prefer parses that are better overall. Keep N different candidates, and output the best one. This can be used to support confidence by looking at the alternate analyses in the beam. If an entity occurs in every analysis, the NER is more confident it's correct.
2. Optimize the negative cost explicitly (i.e. do numeric regression, not multiclass classification)
This idea has been kicking around for a while. I think a few people have tried it with negative results. It was first raised to me in 2015 by Mark Johnson. I guess to a lot of folks it's obvious.
The idea is this: we have an oracle that tells us the number of errors an action will introduce. Instead of arbitrary high/low scores, we try to make the model output a score that matches the oracle's output. This means that if an action would introduce 2 errors, we want to predict "2". We don't just want it to score lower than some other class that would introduce 0 errors. It's handy to flip the sign on this, so that we're still taking an argmax to choose the action.
In my previous experiments, this regression loss produced parse accuracies that were very slightly worse --- the difference in accuracy was 0.2%. In parsing research, this is indeed a negative result :).
However, this difference in accuracy doesn't matter at all --- and the upside of the regression setup is quite significant! With the regression model, the scores output by the parser have a meaningful interpretation: the sum of the scores is the expected number of errors in the analysis. This is exactly what people are looking for, and it comes with no increase in complexity or run-time. It's just a change to the objective used to train the model.
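As a small numeric illustration of the objective (the numbers are made up; the targets are the negated oracle costs, per the sign flip described above):

# Oracle costs for the ORG example: O=2, B-PER=1, U-ORG=1, B-ORG=0.
costs = {'O': 2, 'B-PER': 1, 'U-ORG': 1, 'B-ORG': 0}

# Regression targets are the negated costs, so argmax still picks the best action.
targets = {action: -cost for action, cost in costs.items()}   # O: -2, ..., B-ORG: 0

# If the trained model's scores track these targets, then summing the (negated)
# scores of the actions actually taken estimates the number of errors in the parse.
chosen_action_scores = [-0.1, 0.0, -1.8]       # assumed scores along one analysis
expected_errors = -sum(chosen_action_scores)   # roughly 1.9 expected errors
print(expected_errors)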