
Implement multi-f0 frame level eval metrics #186


@rabitt commented Mar 16, 2016

Implements #185.
The following metrics are implemented for raw and chroma-wrapped multipitch estimates:

  • Accuracy
  • Precision
  • Recall
  • Total Error (E_tot)
  • Substitution Error (E_sub)
  • Miss Error (E_miss)
  • False Alarm Error (E_fa)

@craffel's to-do list:

  • metrics implemented in mir_eval/metric_name.py, plus a validate method which makes sure annotations are in a sane format and an evaluate method which calls all metrics and returns an ordered dict of the results (here: mir_eval/multipitch.py; a usage sketch follows this list)
  • loader in io which loads in MIREX-format data (mir_eval.io.load_ragged_time_series)
  • evaluator in the evaluators folder which loads in data, calls evaluate, and writes out a json results file (evaluators.multipitch_eval.py)
  • Docs everywhere in numpydoc format
  • auto-module hook added to docs/index.rst
  • Unit tests and regression tests for all metrics (tests/test_multipitch.py)
  • regression test data, ideally coming from some real algorithm's output and human-annotated examples (tests/data/multipitch/, test_multipitch.regression_test_evaluate)
  • a generating hook for regression output in tests/generate_data.py
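
For orientation, intended usage of the new module would look roughly like the sketch below (the file paths are placeholders, and the function names are the ones listed above; exact signatures may differ from the merged code):

```python
import mir_eval

# Placeholder paths to MIREX-format multi-f0 files: each row is a timestamp
# followed by a variable number of frequency values.
ref_times, ref_freqs = mir_eval.io.load_ragged_time_series('reference.txt')
est_times, est_freqs = mir_eval.io.load_ragged_time_series('estimate.txt')

# evaluate() validates both annotations and returns an ordered dict with all
# of the metrics listed above, in raw and chroma-wrapped form.
scores = mir_eval.multipitch.evaluate(ref_times, ref_freqs,
                                      est_times, est_freqs)
```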

@rabitt force-pushed the master branch 2 times, most recently from 916d005 to 3cc9535 on March 16, 2016 03:26
An inline review thread on the loader docstring:

filename : str
Path to the annotation file
dtype : function
Data type to apply to columns 1 through n.

A collaborator commented on this snippet:

Is this 0- or 1-indexed?


@rabitt replied:

0-indexed. I've updated the doc to make this clearer.

@coveralls commented

Coverage increased (+0.05%) to 98.117% when pulling 24cc2f3 on rabitt:master into 53c16af on craffel:master.

@rabitt commented Mar 19, 2016

Alright @craffel, as promised, here's a comparison with previous mirex results.

Including the important plot here for reference...

[screenshot: plot of the differences between mir_eval and mirex scores for each metric]

Notice that all of the 'raw' metrics are very close, while the chroma metrics are pretty different. I stared at the mirex implementation for a while and compared it with my version. I don't see any obvious reasons based on the two implementations why they should be so different. One interesting thing to note is that both "False alarm error" and "Miss error" are computed in terms of n_ref and n_est only, and therefore should be identical for chroma and raw metrics. In the mirex scores they are not identical... My guess is that somehow, they compute an n_ref and n_est for chroma, but I don't see that anywhere in the implementation.

@craffel commented Mar 20, 2016

Notice that all of the 'raw' metrics are very close, while the chroma metrics are pretty different. I stared at the mirex implementation for a while and compared it with my version. I don't see any obvious reasons based on the two implementations why they should be so different. One interesting thing to note is that both "False alarm error" and "Miss error" are computed in terms of n_ref and n_est only, and therefore should be identical for chroma and raw metrics. In the mirex scores they are not identical... My guess is that somehow, they compute an n_ref and n_est for chroma, but I don't see that anywhere in the implementation.

It's this, right? I can take a look when I'm done traveling. One thing that has helped me is to port the Java code line-for-line to Python and then compare intermediate values (of course, if you're comfortable enough in Java you can just compare intermediate values without porting). I am assuming the values on your graph are not percentages? Are they just the difference between mir_eval and mirex scores? If so, I would say for sure that 0.1 is much too big; we need to investigate!

@rabitt commented Mar 20, 2016

It's this, right?

That's not the reference implementation I was using - I got the code used for mirex from Li Su, who ran the multi-f0 task last year. The code is based on Zhiyao Duan's code here. The code you pointed to doesn't have all of the measures reported in mirex - in particular, the only chroma score it reports is chroma accuracy, so I don't think it's the same.

I am assuming the values on your graph are not percentages? Are they just the difference between mir_eval and mirex scores? If so, I would say for sure that .1 is much too big, we need to investigate!

Yep, the plot is just mir_eval scores minus mirex scores. I agree that 0.1 is too big, but....

After more digging, I think there is an inconsistency in the mirex results, and that my implementation might be ok after all. All of the metrics are functions of some combination of TP (number of true positives), n_ref (number of reference pitches) and n_est (number of estimated pitches). Breaking down each metric (a sketch of the standard formulas follows this list), they are functions of:

  • Accuracy: TP, n_ref, n_est
  • Precision: TP, n_est
  • Recall: TP, n_ref
  • Total Error: TP, n_ref, n_est
  • Substitution Error: TP, n_ref, n_est
  • False Alarm Error: n_ref, n_est
  • Miss Error: n_ref, n_est
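
For reference, here is a sketch of how these quantities typically combine into the frame-level scores, following the standard MIREX-style definitions (summed over frames; not necessarily byte-for-byte what this PR implements):

```python
import numpy as np

def frame_metrics(TP, n_ref, n_est):
    """Sketch of the standard multi-f0 frame-level metrics.

    TP[t], n_ref[t], n_est[t] are per-frame counts of matched pitches,
    reference pitches, and estimated pitches, respectively.
    """
    TP, n_ref, n_est = (np.asarray(x, dtype=float) for x in (TP, n_ref, n_est))
    precision = TP.sum() / n_est.sum()                            # TP, n_est
    recall = TP.sum() / n_ref.sum()                               # TP, n_ref
    accuracy = TP.sum() / (n_ref.sum() + n_est.sum() - TP.sum())  # TP, n_ref, n_est
    e_sub = (np.minimum(n_ref, n_est) - TP).sum() / n_ref.sum()   # TP, n_ref, n_est
    e_miss = np.maximum(n_ref - n_est, 0).sum() / n_ref.sum()     # n_ref, n_est
    e_fa = np.maximum(n_est - n_ref, 0).sum() / n_ref.sum()       # n_ref, n_est
    e_tot = e_sub + e_miss + e_fa                                 # TP, n_ref, n_est
    return dict(Precision=precision, Recall=recall, Accuracy=accuracy,
                E_sub=e_sub, E_miss=e_miss, E_fa=e_fa, E_tot=e_tot)
```

The chroma variants change only how TP is counted (pitches are wrapped to one octave before matching), which is why E_miss and E_fa should be unaffected.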

The only difference (in my understanding) between chroma metrics and raw metrics is that TP changes. If that's true, then Substitution Error Chroma should be identical to Substitution Error, and similarly the False alarm error for chroma should be the same as the raw false alarm error. The results for the "MIREX" dataset do indeed have E_sub and E_fa as being identical for raw and chroma metrics. For the Su dataset however, (results), the chroma and raw metrics are different for E_sub and E_fa. (The notebook I sent was comparing against the results of the Su dataset.) I looked a few years back and the 2014, 2013, and 2012 results also have E_sub chroma = E_sub, and E_fa chroma = E_fa.

All of this to say, I think the Su 2015 chroma results have some mistake that explains the large difference in metrics.

I should also mention that I'm trying to tie a bow on this and use the "official" implementation for my ismir paper. I know you're traveling and can use the implementation in the branch if I need to, but if this can be merged sooner that would be ideal.

@craffel commented Mar 20, 2016

After more digging, I think there is an inconsistency in the mirex results, and that my implementation might be ok after all.

Ok, we should confirm what the inconsistency in their code is then (potentially with them).

The only difference (in my understanding) between chroma metrics and raw metrics is that TP changes. If that's true, then Substitution Error Chroma should be identical to Substitution Error, and similarly the False alarm error for chroma should be the same as the raw false alarm error.

Do you mean False Alarm Error and Miss Error? It looks like those are the ones which do not depend on TP in your list.

The results for the "MIREX" dataset do indeed have E_sub and E_fa as being identical for raw and chroma metrics.

Can you compare your scores on the MIREX dataset then rather than the Su dataset and see if they are the same? Or are you already doing that?

For the Su dataset however, (results), the chroma and raw metrics are different for E_sub and E_fa.

Yes, that looks like a bug then. You should tell them about it.

All of this to say, I think the Su 2015 chroma results have some mistake that explains the large difference in metrics.

That explains those two, but not "chroma total error", which also differs quite a bit. Ideally we'd know why there too.

I should also mention that I'm trying to tie a bow on this and use the "official" implementation for my ismir paper. I know you're traveling and can use the implementation in the branch if I need to, but if this can be merged sooner that would be ideal.

For sure, I don't want to drag heels but hopefully you understand my being pedantic about making sure there are no unknown differences in implementation. Ideally I'd like to have another code review before merging, preferably by someone else although I bet everyone is in ISMIR deadline mode. I will be able to look it over again before the end of the week, but as I said I'd like to get implementation differences squared away first.

@rabitt commented Mar 21, 2016

After more digging, I think there is an inconsistency in the mirex results, and that my implementation might be ok after all.

Ok, we should confirm what the inconsistency in their code is then (potentially with them).

Ok, I'll drop them a line.

The only difference (in my understanding) between chroma metrics and raw metrics is that TP changes. If that's true, then Substitution Error Chroma should be identical to Substitution Error, and similarly the False alarm error for chroma should be the same as the raw false alarm error.

Do you mean False Alarm Error and Miss Error? It looks like those are the ones which do not depend on TP in your list.

Ah yes, sorry, I did mean False Alarm error and Miss Error.

The results for the "MIREX" dataset do indeed have E_sub and E_fa as being identical for raw and chroma metrics.

Can you compare your scores on the MIREX dataset then rather than the Su dataset and see if they are the same? Or are you already doing that?

I don't have the algorithm's output for MIREX, only for the Su dataset. I'll also ask for the MIREX results.

For the Su dataset however, (results), the chroma and raw metrics are different for E_sub and E_fa.

Yes, that looks like a bug then. You should tell them about it.

Will do.

All of this to say, I think the Su 2015 chroma results have some mistake that explains the large difference in metrics.

That explains those two, but not "chroma total error", which also differs quite a bit. Ideally we'd know why there too.

I think it actually explains all of the differing chroma metrics. If n_ref and n_est are different in their implementation, then all of the other chroma metrics will be different from mine since they all are functions of at least n_ref and n_est.

I should also mention that I'm trying to tie a bow on this and use the "official" implementation for my ismir paper. I know you're traveling and can use the implementation in the branch if I need to, but if this can be merged sooner that would be ideal.

For sure, I don't want to drag heels but hopefully you understand my being pedantic about making sure there are no unknown differences in implementation.

Yep, I do, and appreciate it ;).

Ideally I'd like to have another code review before merging, preferably by someone else although I bet everyone is in ISMIR deadline mode. I will be able to look it over again before the end of the week, but as I said I'd like to get implementation differences squared away first.

Ok, given the current timeline, I'll probably bug the mirex folks after the ismir deadline and we can sort this then.

@craffel commented Mar 22, 2016

I think it actually explains all of the differing chroma metrics. If n_ref and n_est are different in their implementation, then all of the other chroma metrics will be different from mine since they all are functions of at least n_ref and n_est.

Makes sense. Let me know what you find out!

@rabitt commented Apr 4, 2016

@craffel - an update on this:
There was indeed a bug in the 2015 Su results, but the scores are still quite different from the ones I'm getting. After discussing with Li Su, it looks like it's a difference in how the metrics are aggregated for multiple tracks, which brings up a bigger issue:

The mirex scores, at least for this task, are a "meta" metric, created by combining the intermediate scores for each track. For example, in onset detection, the "meta" score would take the number of TP/FP/FN for each track, sum them, and use the new value to compute "meta" P/R/F-measure.
On the other hand, I computed an average of the metrics across tracks.
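
To make the distinction concrete, here is a toy precision example with invented per-track counts:

```python
import numpy as np

# Invented per-track true-positive and estimate counts for two tracks
tp = np.array([90, 10])
n_est = np.array([100, 50])

# Track-level aggregation: score each track, then average the scores
# (as in the comparison above)
track_level = np.mean(tp / n_est)       # (0.90 + 0.20) / 2 = 0.55

# Corpus-level "meta" aggregation: pool the counts, then score once
# (as in the MIREX results)
corpus_level = tp.sum() / n_est.sum()   # 100 / 150 ~= 0.67
```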

We're in the process of digging into the individual track level scores and will see if those are equivalent. If they are, I guess I'm wondering what your stance is on how metrics are aggregated for multiple tracks. Computing metrics the mirex way would require somewhat substantial changes to the code structure overall, wouldn't it?

@craffel commented Apr 4, 2016

Computing metrics the mirex way would require somewhat substantial changes to the code structure overall, wouldn't it?

Good catch. Yes, we assume all evaluations are on a track level, and then aggregated across tracks, as opposed to computing on a corpus level. Unless there is major disagreement from the community, I'd like to keep it that way.

@justinsalamon commented

Unless there is major disagreement from the community, I'd like to keep it that way.

+1

@rabitt commented Apr 5, 2016

Li Su was kind enough to send me the mean-aggregated mirex scores - the comparison is here. Important plot here:
[plot: comparison of mir_eval scores against the mean-aggregated mirex scores, per algorithm and metric]

For 5 out of the 7 algorithms, the results are all within 1/100th of a percent of the mirex scores. For the other two (red and green), the difference is between 1/10th and 1 percentage point. I expect that this is due to the optimal vs. greedy graph matching.
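
For context on why the matching strategy matters, here is a toy single-frame example (invented values, and using scipy's generic bipartite matcher rather than the code in this PR) where a greedy nearest-first pairing under-counts true positives relative to an optimal matching:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

# Invented reference and estimated pitches for one frame (semitone units),
# with a 1-semitone matching tolerance.
ref = np.array([1.0, 2.0])
est = np.array([1.9, 2.9])
tol = 1.0

# Greedy nearest-first: the closest pair (ref 2.0, est 1.9) is taken first,
# leaving est 2.9 with no unused reference within tolerance -> 1 true positive.
#
# Optimal (maximum bipartite) matching pairs est 1.9 with ref 1.0 and
# est 2.9 with ref 2.0, both within tolerance -> 2 true positives.
within_tol = csr_matrix((np.abs(ref[:, None] - est[None, :]) <= tol).astype(int))
matching = maximum_bipartite_matching(within_tol, perm_type='column')
print(int((matching >= 0).sum()))  # 2
```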

Are these scores close enough to pass the Raffel quality control test? :) Or should I dig into the differences a bit more?

@craffel commented Apr 5, 2016

Nice, looking good. If you're willing, usually I like to compute the absolute difference and normalize by the score mean, e.g. |mirex_score - mir_eval_score| / ((mirex_score + mir_eval_score) / 2). I think most of your scores, apart from CB1 and CB2, are at that point within rounding error. It would be great to look at the actual matchings produced for CB1 and CB2 to see if there's a difference between those matchings and greedy matchings (or compute them with a greedy matcher instead), just to isolate that difference, but yeah otherwise it looks good. I would like to do another code review before merging, but am in thesis lockdown mode. It would be ideal to have someone else do it either way. Can you track down someone to do a second one before merging?
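
A worked instance of the normalization described above, with purely hypothetical scores to show the scale of that quantity:

```python
# Hypothetical scores, chosen only to illustrate the normalization
mirex_score, mir_eval_score = 0.700, 0.723
rel_diff = abs(mirex_score - mir_eval_score) / ((mirex_score + mir_eval_score) / 2)
print(rel_diff)  # ~0.032, i.e. roughly a 3% relative difference
```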

@rabitt commented Apr 5, 2016

Nice, looking good. If you're willing, usually I like to compute the absolute difference and normalize by the score mean, e.g. |mirex_score - mir_eval_score| / ((mirex_score + mir_eval_score) / 2)

Here you go:
[plot: absolute differences between mir_eval and mirex scores, normalized by the score mean]

Can you track down someone to do a second one before merging?

Sure thing. Pinging @bmcfee ?

@craffel commented Apr 5, 2016

Here you go:

Thanks! Given the 3%+ difference there, it would be great to get a cursory confirmation that it's because of optimal matching, even if that just means checking that the algorithm output for those algorithms produces many more notes (which would be likely to cause a greedy vs. non-greedy difference), for example.

@rabitt commented Apr 5, 2016

Thanks! Given the 3%+ difference there, it would be great to get a cursory confirmation that it's because of optimal matching, even if that just means checking that the algorithm output for those algorithms produces many more notes (which would be likely to cause a greedy vs. non-greedy difference), for example.

Will do!

r"""Utility function for loading in data from a delimited time series
annotation file with a variable number of columns.
Assumes that column 0 contains time stamps and columns 1 through n contain
values. n may be variable from time stamp to time stamp.

A collaborator commented on this docstring:

A small (3-line) example input here might be helpful.
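
For instance, a file of this kind might look like the following (values made up; each row is a timestamp in seconds followed by zero or more frequency values in Hz):

```
0.00    440.000    220.000
0.01    440.000    220.000    330.000
0.02    446.000
```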

@bmcfee commented Apr 6, 2016

Sure thing. Pinging @bmcfee ?

Just took another pass over it. Looks mostly good -- the things I think need "fixing" are the handling of existing file descriptors, and more specific exception handling. There are also a few dangling comments from earlier reviews that don't seem to have been addressed yet.

@rabitt commented Apr 30, 2016

@craffel Sorry for the delay. Updates:

  1. I think I've incorporated all the comments from your and Brian's code reviews.

  2. I found the reason for the 3% difference. The output from the "CB1" and "CB2" submissions had time stamps starting at 0.0, while the mirex ground truth (and all other submissions) have time stamps starting at 0.01. The mirex eval code assumes the stamps are correct while the mir_eval implementation resamples, resulting in the "correct" alignment with mir_eval and an off-by-one alignment between the reference and estimate in the mirex eval (a small sketch of the shift follows this list).
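
A minimal sketch of that shift (frame times are illustrative; the nearest-frame lookup below stands in for whatever resampling the implementation actually does):

```python
import numpy as np

# Illustrative 10 ms frame grids
ref_times = np.round(np.arange(0.01, 0.06, 0.01), 2)  # 0.01, 0.02, ..., 0.05
est_times = np.round(np.arange(0.00, 0.05, 0.01), 2)  # 0.00, 0.01, ..., 0.04

# Comparing frame i of the estimate against frame i of the reference (as the
# MIREX code effectively does when the estimate starts at 0.0) pairs each
# estimate with the reference frame 10 ms later: an off-by-one alignment.
# Resampling the estimate onto the reference grid first removes the shift:
nearest = np.argmin(np.abs(est_times[None, :] - ref_times[:, None]), axis=1)
print(nearest)  # [1 2 3 4 4]
```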

The "raw" percentage differences between implementations are here:
[plot: raw percentage differences between the implementations, per algorithm and metric]
(Note that the 3% differences are for CB1 and CB2)

Here are the differences again if I shift the estimates for CB1 and CB2 by one sample to match the mirex behavior:
[plot: the same differences after shifting the CB1 and CB2 estimates by one sample]

@craffel commented May 2, 2016

Nice sleuthing! Ok, I am happy to merge this unless there are any pending TODOs we need to take care of that I forgot about.

@rabitt commented May 2, 2016

@craffel I think I covered everything. Just made a second pass through the comments and didn't see anything outstanding.

@craffel commented May 3, 2016

Ok, merged, thanks again!

@craffel closed this May 3, 2016