💫 Train parser and NER with regression objective, to make scores express expected parse quality #881
Was it implemented, and how can we use it? @honnibal
I never shipped the linear-model regression loss because I couldn't get the memory use under control -- the loss produced very non-sparse solutions, and it was taking too many experiments to find the right regularisation. Instead I focussed on the experiments for spaCy 2. Current versions of spaCy 2 support beam-search decoding, which lets you get probabilities by asking how many beam parses the entity occurred in. We don't have a model trained with the beam objective online yet, so the probabilities aren't so well calibrated. You'll have to try and see. Here's a current example.

from collections import defaultdict

import spacy

# Number of alternate analyses to consider. More is slower, and not necessarily
# better -- you need to experiment on your problem.
beam_width = 16
# This clips solutions at each step. We multiply the score of the top-ranked action
# by this value, and use the result as a threshold. This prevents the parser from
# exploring options that look very unlikely, saving a bit of efficiency. Accuracy
# may also improve, because we've trained on the greedy objective.
beam_density = 0.0001

nlp = spacy.load('en_core_web_sm')
with nlp.disable_pipes('ner'):
    docs = list(nlp.pipe(texts))
beams = nlp.entity.beam_parse(docs, beam_width=beam_width, beam_density=beam_density)

for doc, beam in zip(docs, beams):
    entity_scores = defaultdict(float)
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score
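As a follow-up to the example above, here is a minimal sketch of how the aggregated scores could be inspected. The names are taken from the snippet, and the block is assumed to sit inside the `for doc, beam in zip(docs, beams)` loop.

    # Continuing inside the `for doc, beam in zip(docs, beams)` loop:
    for (start, end, label), score in sorted(entity_scores.items(),
                                             key=lambda item: item[1],
                                             reverse=True):
        # `score` is the total beam probability of the parses that contain this
        # candidate entity; start and end are token indices.
        print(start, end, label, round(score, 3))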
Btw, on the off-chance anyone reading this could tell me what's wrong with the regression loss here (spaCy/spacy/syntax/nn_parser.pyx, line 216, commit cdb2d83), I could rerun the regression-loss experiments using the neural network, where the sparsity problem wouldn't be an issue. What we want is an output vector of scores that matches the negated costs from the oracle. The function should be computing the gradient of the loss for this regression problem. Only some actions are valid, and the gradient for an invalid action should always be 0.
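As a sketch of what that gradient computation should look like -- not the actual nn_parser.pyx code, just a NumPy illustration of a squared-error loss over valid actions, with all array names assumed:

import numpy as np

def regression_gradient(scores, costs, is_valid):
    """Gradient of 0.5 * sum((scores - targets)**2) over valid actions only.

    scores:   model outputs, shape (n_actions,)
    costs:    oracle costs (number of new errors per action), shape (n_actions,)
    is_valid: boolean mask of actions that are legal in the current state
    """
    targets = -costs.astype('float64')   # regression target is the negated cost
    d_scores = scores - targets          # dL/dscore for a squared-error loss
    d_scores[~is_valid] = 0.0            # invalid actions get zero gradient
    return d_scores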
Hi! I tried to implement your example in spaCy 2.0.3 but got an error on the line ...
@jorgeaguiar I had a problem in my example -- fixed. Instead of ...
@honnibal thanks! There are still two things that don't seem to add up, though:
Maybe it's easier to explain with some code...
This will yield ...
No apparent connection to the detected entities... Any hints? Thanks!
@jorgeaguiar by adding a little code at the end of your script, like this ...
I get the probability of each entity for each character, like this ... Any suggestions for further processing this result, @honnibal? Thanks!
@Zhenshan-Jin I think I'm getting somewhere now. This works:
The only problem now is that my models are trained with the standard NER objective and, probably because of that, most entities detected with beam search are wrong.
Instead of training on the regression objective or with a beam-search algorithm, a second-pass calibration could help determine the mapping between the scores and precision probabilities. For example, a precision-coverage curve drawn on a test set could tell you that, say, any parse with a score higher than 0.005 has an 80% chance of being correct. In one of my use cases, I would like to set a very high precision target, e.g. 95%, and ignore any example with a parse score lower than the threshold for 95% precision. I might still get pretty good coverage of my data, e.g. 50%, but with high-quality parses.
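A minimal sketch of that second-pass calibration idea -- all names and data are assumed; it just sweeps candidate score thresholds on a labelled test set and keeps the lowest one whose precision meets the target:

def threshold_for_precision(scored_spans, target_precision=0.95):
    """scored_spans: list of (score, is_correct) pairs from a labelled test set.

    Returns the lowest score threshold whose precision on the test set reaches
    the target, or None if no threshold does.
    """
    # Sort by score, highest first, and sweep the threshold downwards.
    ranked = sorted(scored_spans, key=lambda pair: pair[0], reverse=True)
    best_threshold = None
    correct = 0
    for i, (score, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        precision = correct / i
        if precision >= target_precision:
            best_threshold = score   # everything scoring >= this meets the bar
    return best_threshold

Coverage at the returned threshold is then simply the fraction of test spans scoring at or above it.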
Also, the code suggested here produces memory leaks.
@jorgeaguiar thanks for that snippet -- it does seem to work for me most of the time, but sometimes I get negative end values. Any idea how to interpret these? Also, how come you do ...? And @honnibal, does this beam parsing only work on single words, or also on compound words? E.g. for ...
I think we should use end instead of end-1. The only inconsistency in the picture is that this beam decoding shows some ents with probs of 0.9 which are not predicted by the NER model otherwise. Also, there are ones which are predicted but have very low prob. Maybe this has to do with what @jorgeaguiar mentioned:
For me it seems it's working well, but as you said, I am using a custom NER model, trained from a blank one.
How should I do it if I wanted to add those entities above a certain threshold, let's say those with ...?

from spacy.tokens import Span

for pred in preds:
    pick_from_probs = get_probabilities...
    for p in pick_from_probs:
        indexes = [set(range(ent.start, ent.end)) for ent in pred.ents]
        start, end = probs[p][0], probs[p][1]
        # Make sure the new span does not overlap any current one
        if not any([x.intersection(range(start, end)) for x in indexes]):
            span = Span(pred, start, end, label=p)
            pred.ents = pred.ents + (span,)  # IS THIS RIGHT?
I'm going to close this enhancement issue, because the regression objective idea just doesn't work. Confidence-sensitive NER is still a nice idea and we should investigate other ways of achieving it, but the discussion in this issue is old and kind of misleading now.
More and more people have been asking about confidence scores for the parser and NER. The current model can't answer this, so I decided to dust off some almost-complete research from last year to fix this.
This work is almost complete, and should be up on master within a day or two. 🎉 Here's how it works.

Edit: I spoke too soon... The problem was that the regression loss objective I describe here produced extremely non-sparse solutions with the linear model. It should be possible to find a good compromise with L1 regularisation, but I switched efforts to the v2 experiments instead.

Edit 2: spaCy 2 uses neural networks, so the sparsity isn't a problem. But I haven't been able to get the regression loss working well at all. I think something's wrong with my implementation.
v2 now has beam parsing implemented, which supports one way to get quality estimates for parses --- see below. However, I'd like to resume efforts on the regression loss objective. I think there's a bug in the current implementation of this loss function. See below.
Currently the parser and NER are trained with a hinge-loss objective (specifically, using the averaged perceptron update rule). At each word, the model asks "What's the highest scoring action?". It makes its prediction, and then it asks the oracle to assign a cost to each action, where the cost represents the number of new errors that will be introduced if that action is taken. For instance, if we're at the start of an ORG entity, and we perform the action O, we introduce two errors: we miss the entity, and we miss the label. The actions B-PER and U-ORG each introduce one, and the action B-ORG introduces zero. If our predicted action isn't zero-cost, we update the weights such that in future this action will score a bit lower for this example, and the best zero-cost action will score a bit higher.
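To make that concrete, here's a toy sketch of the cost-sensitive update described above. It is illustrative only -- the action names follow the ORG example, the scores are made up, and the weight update is the schematic perceptron-style step, not spaCy's actual implementation.

import numpy as np

# Oracle costs at the start of an ORG entity, as in the example above:
# O misses both the entity and the label (2 errors), B-PER and U-ORG each
# introduce one error, B-ORG introduces none.
actions = ['O', 'B-PER', 'U-ORG', 'B-ORG']
costs = np.array([2, 1, 1, 0])

scores = np.array([1.3, 0.2, 0.8, 0.5])   # assumed model scores for this state
predicted = int(np.argmax(scores))        # highest-scoring action

# Best zero-cost action: the highest-scoring action among those with cost == 0.
zero_cost = np.where(costs == 0)[0]
best_zero_cost = int(zero_cost[np.argmax(scores[zero_cost])])

if costs[predicted] > 0:
    # Schematic perceptron-style update: push the costly prediction down
    # and the best zero-cost action up for this example's features.
    print(f"decrease weights for {actions[predicted]}, "
          f"increase weights for {actions[best_zero_cost]}")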
If we're only looking at the quality of the output parse, this setup performs well. But it means the scores on the actions have no particular interpretation. We don't force them into any useful scale, and we don't train them to reflect the wider parse quality. If the parser is in a bad state, it's not trained to give uniformly lower scores. It's trained to make the best of bad situations.
The changes I'm merging improve this in two ways. They're looking forwards to the spaCy 2.0 neural network models, but they're useful with the current linear models too, so I decided to get them in early.
1. Beam search with global objective
This is the standard solution: use a global objective, so that the parser model is trained to prefer parses that are better overall. Keep N different candidates, and output the best one. This can be used to support confidence by looking at the alternate analyses in the beam. If an entity occurs in every analysis, the NER is more confident it's correct.
2. Optimize the negative cost explicitly (i.e. do numeric regression, not multiclass classification)
This idea has been kicking around for a while. I think a few people have tried it with negative results. It was first raised to me in 2015 by Mark Johnson. I guess to a lot of folks it's obvious.
The idea is this: we have an oracle that tells us the number of errors an action will introduce. Instead of arbitrary high/low scores, we try to make the model output a score that matches the oracle's output. This means that if an action would introduce 2 errors, we want to predict "2". We don't just want it to score lower than some other class that would introduce 0 errors. It's handy to flip the sign on this, so that we're still taking an argmax to choose the action.
In my previous experiments, this regression loss produced parse accuracies that were very slightly worse --- the difference in accuracy was 0.2%. In parsing research, this is indeed a negative result :).
However, this difference in accuracy doesn't matter at all --- and the upside of the regression setup is quite significant! With the regression model, the scores output by the parser have a meaningful interpretation: the sum of the scores is the expected number of errors in the analysis. This is exactly what people are looking for, and it comes with no increase in complexity or run-time. It's just a change to the objective used to train the model.
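As a small numeric illustration of the objective (the numbers are made up; the targets are the negated oracle costs, per the sign flip described above):

# Oracle costs for the ORG example: O=2, B-PER=1, U-ORG=1, B-ORG=0.
costs = {'O': 2, 'B-PER': 1, 'U-ORG': 1, 'B-ORG': 0}

# Regression targets are the negated costs, so argmax still picks the best action.
targets = {action: -cost for action, cost in costs.items()}   # O: -2, ..., B-ORG: 0

# If the trained model's scores track these targets, then summing the (negated)
# scores of the actions actually taken estimates the number of errors in the parse.
chosen_action_scores = [-0.1, 0.0, -1.8]       # assumed scores along one analysis
expected_errors = -sum(chosen_action_scores)   # roughly 1.9 expected errors
print(expected_errors)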