
Implement multi-f0 frame level eval metrics #186


@rabitt commented Mar 16, 2016

Implements #185.
The following metrics are implemented for raw and chroma-wrapped multipitch estimates:

  • Accuracy
  • Precision
  • Recall
  • Total Error (E_tot)
  • Substitution Error (E_sub)
  • Miss Error (E_miss)
  • False Alarm Error (E_fa)

@craffel's to-do list:

  • metrics implemented in mir_eval/metric_name.py, plus a validate method which makes sure annotations are in a sane format and an evaluate method which calls all metrics and returns an ordered dict of the results (here: mir_eval/multipitch.py; a usage sketch follows this list)
  • loader in io which loads in MIREX-format data (mir_eval.io.load_ragged_time_series)
  • evaluator in the evaluators folder which loads in data, calls evaluate, and writes out a json results file (evaluators.multipitch_eval.py)
  • Docs everywhere in numpydoc format
  • auto-module hook added to docs/index.rst
  • Unit tests and regression tests for all metrics (tests/test_multipitch.py)
  • regression test data, ideally coming from some real algorithm's output and human-annotated examples (tests/data/multipitch/, test_multipitch.regression_test_evaluate)
  • a generating hook for regression output in tests/generate_data.py
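
For orientation, intended usage of the new module would look roughly like the sketch below (the file paths are placeholders, and the function names are the ones listed above; exact signatures may differ from the merged code):

```python
import mir_eval

# Placeholder paths to MIREX-format multi-f0 files: each row is a timestamp
# followed by a variable number of frequency values.
ref_times, ref_freqs = mir_eval.io.load_ragged_time_series('reference.txt')
est_times, est_freqs = mir_eval.io.load_ragged_time_series('estimate.txt')

# evaluate() validates both annotations and returns an ordered dict with all
# of the metrics listed above, in raw and chroma-wrapped form.
scores = mir_eval.multipitch.evaluate(ref_times, ref_freqs,
                                      est_times, est_freqs)
```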

@rabitt force-pushed the master branch 2 times, most recently from 916d005 to 3cc9535 on March 16, 2016 03:26
An inline review thread on the loader docstring:

filename : str
Path to the annotation file
dtype : function
Data type to apply to columns 1 through n.

A collaborator commented on this snippet:

Is this 0- or 1-indexed?


@rabitt replied:

0-indexed. I've updated the doc to make this clearer.

@coveralls commented

Coverage increased (+0.05%) to 98.117% when pulling 24cc2f3 on rabitt:master into 53c16af on craffel:master.

@rabitt commented Mar 19, 2016

Alright @craffel, as promised, here's a comparison with previous mirex results.

Including the important plot here for reference...

[screenshot: plot of the differences between mir_eval and mirex scores for each metric]

Notice that all of the 'raw' metrics are very close, while the chroma metrics are pretty different. I stared at the mirex implementation for a while and compared it with my version. I don't see any obvious reasons based on the two implementations why they should be so different. One interesting thing to note is that both "False alarm error" and "Miss error" are computed in terms of n_ref and n_est only, and therefore should be identical for chroma and raw metrics. In the mirex scores they are not identical... My guess is that somehow, they compute an n_ref and n_est for chroma, but I don't see that anywhere in the implementation.

@craffel commented Mar 20, 2016

Notice that all of the 'raw' metrics are very close, while the chroma metrics are pretty different. I stared at the mirex implementation for a while and compared it with my version. I don't see any obvious reasons based on the two implementations why they should be so different. One interesting thing to note is that both "False alarm error" and "Miss error" are computed in terms of n_ref and n_est only, and therefore should be identical for chroma and raw metrics. In the mirex scores they are not identical... My guess is that somehow, they compute an n_ref and n_est for chroma, but I don't see that anywhere in the implementation.

It's this, right? I can take a look when I'm done traveling. One thing that has helped me is to port the Java code line-for-line to Python and then compare intermediate values (of course, if you're comfortable enough in Java you can just compare intermediate values without porting). I am assuming the values on your graph are not percentages? Are they just the difference between mir_eval and mirex scores? If so, I would say for sure that 0.1 is much too big; we need to investigate!

@rabitt commented Mar 20, 2016

It's this, right?

That's not the reference implementation I was using - I got the code used for mirex from Li Su, who ran the multi-f0 task last year. The code is based on Zhiyao Duan's code here. The code you pointed to doesn't have all of the measures reported in mirex - in particular, the only chroma score it reports is chroma accuracy, so I don't think it's the same.

I am assuming the values on your graph are not percentages? Are they just the difference between mir_eval and mirex scores? If so, I would say for sure that .1 is much too big, we need to investigate!

Yep, the plot is just mir_eval scores minus mirex scores. I agree that 0.1 is too big, but....

After more digging, I think there is an inconsistency in the mirex results, and that my implementation might be ok after all. All of the metrics are functions of some combination of TP (number of true positives), n_ref (number of reference pitches) and n_est (number of estimated pitches). Breaking down each metric (a sketch of the standard formulas follows this list), they are functions of:

  • Accuracy: TP, n_ref, n_est
  • Precision: TP, n_est
  • Recall: TP, n_ref
  • Total Error: TP, n_ref, n_est
  • Substitution Error: TP, n_ref, n_est
  • False Alarm Error: n_ref, n_est
  • Miss Error: n_ref, n_est
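
For reference, here is a sketch of how these quantities typically combine into the frame-level scores, following the standard MIREX-style definitions (summed over frames; not necessarily byte-for-byte what this PR implements):

```python
import numpy as np

def frame_metrics(TP, n_ref, n_est):
    """Sketch of the standard multi-f0 frame-level metrics.

    TP[t], n_ref[t], n_est[t] are per-frame counts of matched pitches,
    reference pitches, and estimated pitches, respectively.
    """
    TP, n_ref, n_est = (np.asarray(x, dtype=float) for x in (TP, n_ref, n_est))
    precision = TP.sum() / n_est.sum()                            # TP, n_est
    recall = TP.sum() / n_ref.sum()                               # TP, n_ref
    accuracy = TP.sum() / (n_ref.sum() + n_est.sum() - TP.sum())  # TP, n_ref, n_est
    e_sub = (np.minimum(n_ref, n_est) - TP).sum() / n_ref.sum()   # TP, n_ref, n_est
    e_miss = np.maximum(n_ref - n_est, 0).sum() / n_ref.sum()     # n_ref, n_est
    e_fa = np.maximum(n_est - n_ref, 0).sum() / n_ref.sum()       # n_ref, n_est
    e_tot = e_sub + e_miss + e_fa                                 # TP, n_ref, n_est
    return dict(Precision=precision, Recall=recall, Accuracy=accuracy,
                E_sub=e_sub, E_miss=e_miss, E_fa=e_fa, E_tot=e_tot)
```

The chroma variants change only how TP is counted (pitches are wrapped to one octave before matching), which is why E_miss and E_fa should be unaffected.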

The only difference (in my understanding) between chroma metrics and raw metrics is that TP changes. If that's true, then Substitution Error Chroma should be identical to Substitution Error, and similarly the False alarm error for chroma should be the same as the raw false alarm error. The results for the "MIREX" dataset do indeed have E_sub and E_fa as being identical for raw and chroma metrics. For the Su dataset however, (results), the chroma and raw metrics are different for E_sub and E_fa. (The notebook I sent was comparing against the results of the Su dataset.) I looked a few years back and the 2014, 2013, and 2012 results also have E_sub chroma = E_sub, and E_fa chroma = E_fa.

All of this to say, I think the Su 2015 chroma results have some mistake that explains the large difference in metrics.

I should also mention that I'm trying to tie a bow on this and use the "official" implementation for my ismir paper. I know you're traveling and can use the implementation in the branch if I need to, but if this can be merged sooner that would be ideal.

@craffel commented Mar 20, 2016

After more digging, I think there is an inconsistency in the mirex results, and that my implementation might be ok after all.

Ok, we should confirm what the inconsistency in their code is then (potentially with them).

The only difference (in my understanding) between chroma metrics and raw metrics is that TP changes. If that's true, then Substitution Error Chroma should be identical to Substitution Error, and similarly the False alarm error for chroma should be the same as the raw false alarm error.

Do you mean False Alarm Error and Miss Error? It looks like those are the ones which do not depend on TP in your list.

The results for the "MIREX" dataset do indeed have E_sub and E_fa as being identical for raw and chroma metrics.

Can you compare your scores on the MIREX dataset then rather than the Su dataset and see if they are the same? Or are you already doing that?

For the Su dataset however, (results), the chroma and raw metrics are different for E_sub and E_fa.

Yes, that looks like a bug then. You should tell them about it.

All of this to say, I think the Su 2015 chroma results have some mistake that explains the large difference in metrics.

That explains those two, but not "chroma total error", which also differs quite a bit. Ideally we'd know why there too.

I should also mention that I'm trying to tie a bow on this and use the "official" implementation for my ismir paper. I know you're traveling and can use the implementation in the branch if I need to, but if this can be merged sooner that would be ideal.

For sure, I don't want to drag heels but hopefully you understand my being pedantic about making sure there are no unknown differences in implementation. Ideally I'd like to have another code review before merging, preferably by someone else although I bet everyone is in ISMIR deadline mode. I will be able to look it over again before the end of the week, but as I said I'd like to get implementation differences squared away first.

@rabitt commented Mar 21, 2016

After more digging, I think there is an inconsistency in the mirex results, and that my implementation might be ok after all.

Ok, we should confirm what the inconsistency in their code is then (potentially with them).

Ok, I'll drop them a line.

The only difference (in my understanding) between chroma metrics and raw metrics is that TP changes. If that's true, then Substitution Error Chroma should be identical to Substitution Error, and similarly the False alarm error for chroma should be the same as the raw false alarm error.

Do you mean False Alarm Error and Miss Error? It looks like those are the ones which do not depend on TP in your list.

Ah yes, sorry, I did mean False Alarm error and Miss Error.

The results for the "MIREX" dataset do indeed have E_sub and E_fa as being identical for raw and chroma metrics.

Can you compare your scores on the MIREX dataset then rather than the Su dataset and see if they are the same? Or are you already doing that?

I don't have the algorithm's output for MIREX, only for the Su dataset. I'll also ask for the MIREX results.

For the Su dataset however, (results), the chroma and raw metrics are different for E_sub and E_fa.

Yes, that looks like a bug then. You should tell them about it.

Will do.

All of this to say, I think the Su 2015 chroma results have some mistake that explains the large difference in metrics.

That explains those two, but not "chroma total error", which also differs quite a bit. Ideally we'd know why there too.

I think it actually explains all of the differing chroma metrics. If n_ref and n_est are different in their implementation, then all of the other chroma metrics will be different from mine since they all are functions of at least n_ref and n_est.

I should also mention that I'm trying to tie a bow on this and use the "official" implementation for my ismir paper. I know you're traveling and can use the implementation in the branch if I need to, but if this can be merged sooner that would be ideal.

For sure, I don't want to drag heels but hopefully you understand my being pedantic about making sure there are no unknown differences in implementation.

Yep, I do, and appreciate it ;).

Ideally I'd like to have another code review before merging, preferably by someone else although I bet everyone is in ISMIR deadline mode. I will be able to look it over again before the end of the week, but as I said I'd like to get implementation differences squared away first.

Ok, given the current timeline, I'll probably bug the mirex folks after the ismir deadline and we can sort this then.

@craffel commented Mar 22, 2016

I think it actually explains all of the differing chroma metrics. If n_ref and n_est are different in their implementation, then all of the other chroma metrics will be different from mine since they all are functions of at least n_ref and n_est.

Makes sense. Let me know what you find out!

@rabitt commented Apr 4, 2016

@craffel - an update on this:
There was indeed a bug in the 2015 Su results, but the scores are still quite different from the ones I'm getting. After discussing with Li Su, it looks like it's a difference in how the metrics are aggregated for multiple tracks, which brings up a bigger issue:

The mirex scores, at least for this task, are a "meta" metric, created by combining the intermediate scores for each track. For example, in onset detection, the "meta" score would take the number of TP/FP/FN for each track, sum them, and use the new value to compute "meta" P/R/F-measure.
On the other hand, I computed an average of the metrics across tracks.
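
To make the distinction concrete, here is a toy precision example with invented per-track counts:

```python
import numpy as np

# Invented per-track true-positive and estimate counts for two tracks
tp = np.array([90, 10])
n_est = np.array([100, 50])

# Track-level aggregation: score each track, then average the scores
# (as in the comparison above)
track_level = np.mean(tp / n_est)       # (0.90 + 0.20) / 2 = 0.55

# Corpus-level "meta" aggregation: pool the counts, then score once
# (as in the MIREX results)
corpus_level = tp.sum() / n_est.sum()   # 100 / 150 ~= 0.67
```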

We're in the process of digging into the individual track level scores and will see if those are equivalent. If they are, I guess I'm wondering what your stance is on how metrics are aggregated for multiple tracks. Computing metrics the mirex way would require somewhat substantial changes to the code structure overall, wouldn't it?

@craffel commented Apr 4, 2016

Computing metrics the mirex way would require somewhat substantial changes to the code structure overall, wouldn't it?

Good catch. Yes, we assume all evaluations are on a track level, and then aggregated across tracks, as opposed to computing on a corpus level. Unless there is major disagreement from the community, I'd like to keep it that way.

@justinsalamon commented

Unless there is major disagreement from the community, I'd like to keep it that way.

+1

@rabitt commented Apr 5, 2016

Li Su was kind enough to send me the mean-aggregated mirex scores - the comparison is here. Important plot here:
[plot: comparison of mir_eval scores against the mean-aggregated mirex scores, per algorithm and metric]

For 5 out of the 7 algorithms, the results are all within 1/100th of a percent of the mirex scores. For the other two (red and green), the difference is between 1/10th and 1 percentage point. I expect that this is due to the optimal vs. greedy graph matching.
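
For context on why the matching strategy matters, here is a toy single-frame example (invented values, and using scipy's generic bipartite matcher rather than the code in this PR) where a greedy nearest-first pairing under-counts true positives relative to an optimal matching:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

# Invented reference and estimated pitches for one frame (semitone units),
# with a 1-semitone matching tolerance.
ref = np.array([1.0, 2.0])
est = np.array([1.9, 2.9])
tol = 1.0

# Greedy nearest-first: the closest pair (ref 2.0, est 1.9) is taken first,
# leaving est 2.9 with no unused reference within tolerance -> 1 true positive.
#
# Optimal (maximum bipartite) matching pairs est 1.9 with ref 1.0 and
# est 2.9 with ref 2.0, both within tolerance -> 2 true positives.
within_tol = csr_matrix((np.abs(ref[:, None] - est[None, :]) <= tol).astype(int))
matching = maximum_bipartite_matching(within_tol, perm_type='column')
print(int((matching >= 0).sum()))  # 2
```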

Are these scores close enough to pass the Raffel quality control test? :) Or should I dig into the differences a bit more?

@craffel commented Apr 5, 2016

Nice, looking good. If you're willing, usually I like to compute the absolute difference and normalize by the score mean, e.g. |mirex_score - mir_eval_score| / ((mirex_score + mir_eval_score) / 2). I think most of your scores, apart from CB1 and CB2, are at that point within rounding error. It would be great to look at the actual matchings produced for CB1 and CB2 to see if there's a difference between those matchings and greedy matchings (or compute them with a greedy matcher instead), just to isolate that difference, but yeah otherwise it looks good. I would like to do another code review before merging, but am in thesis lockdown mode. It would be ideal to have someone else do it either way. Can you track down someone to do a second one before merging?
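
A worked instance of the normalization described above, with purely hypothetical scores to show the scale of that quantity:

```python
# Hypothetical scores, chosen only to illustrate the normalization
mirex_score, mir_eval_score = 0.700, 0.723
rel_diff = abs(mirex_score - mir_eval_score) / ((mirex_score + mir_eval_score) / 2)
print(rel_diff)  # ~0.032, i.e. roughly a 3% relative difference
```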

@rabitt commented Apr 5, 2016

Nice, looking good. If you're willing, usually I like to compute the absolute difference and normalize by the score mean, e.g. |mirex_score - mir_eval_score| / ((mirex_score + mir_eval_score) / 2)

Here you go:
[plot: absolute differences between mir_eval and mirex scores, normalized by the score mean]

Can you track down someone to do a second one before merging?

Sure thing. Pinging @bmcfee ?

@craffel commented Apr 5, 2016

Here you go:

Thanks! Given the 3%+ difference there, it would be great to get a cursory confirmation that it's because of optimal matching, even if that just means checking that the algorithm output for those algorithms produces many more notes (which would be likely to cause a greedy vs. non-greedy difference), for example.

@rabitt commented Apr 5, 2016

Thanks! Given the 3%+ difference there, it would be great to get a cursory confirmation that it's because of optimal matching, even if that just means checking that the algorithm output for those algorithms produces many more notes (which would be likely to cause a greedy vs. non-greedy difference), for example.

Will do!

r"""Utility function for loading in data from a delimited time series
annotation file with a variable number of columns.
Assumes that column 0 contains time stamps and columns 1 through n contain
values. n may be variable from time stamp to time stamp.

A collaborator commented on this docstring:

A small (3-line) example input here might be helpful.
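
For instance, a file of this kind might look like the following (values made up; each row is a timestamp in seconds followed by zero or more frequency values in Hz):

```
0.00    440.000    220.000
0.01    440.000    220.000    330.000
0.02    446.000
```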

@bmcfee commented Apr 6, 2016

Sure thing. Pinging @bmcfee ?

Just took another pass over it. Looks mostly good -- the things I think need "fixing" are the handling of existing file descriptors, and more specific exception handling. There are also a few dangling comments from earlier reviews that don't seem to have been addressed yet.

@rabitt commented Apr 30, 2016

@craffel Sorry for the delay. Updates:

  1. I think I've incorporated all the comments from your and Brian's code reviews.

  2. I found the reason for the 3% difference. The output from the "CB1" and "CB2" submissions had time stamps starting at 0.0, while the mirex ground truth (and all other submissions) have time stamps starting at 0.01. The mirex eval code assumes the stamps are correct while the mir_eval implementation resamples, resulting in the "correct" alignment with mir_eval and an off-by-one alignment between the reference and estimate in the mirex eval (a small sketch of the shift follows this list).
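
A minimal sketch of that shift (frame times are illustrative; the nearest-frame lookup below stands in for whatever resampling the implementation actually does):

```python
import numpy as np

# Illustrative 10 ms frame grids
ref_times = np.round(np.arange(0.01, 0.06, 0.01), 2)  # 0.01, 0.02, ..., 0.05
est_times = np.round(np.arange(0.00, 0.05, 0.01), 2)  # 0.00, 0.01, ..., 0.04

# Comparing frame i of the estimate against frame i of the reference (as the
# MIREX code effectively does when the estimate starts at 0.0) pairs each
# estimate with the reference frame 10 ms later: an off-by-one alignment.
# Resampling the estimate onto the reference grid first removes the shift:
nearest = np.argmin(np.abs(est_times[None, :] - ref_times[:, None]), axis=1)
print(nearest)  # [1 2 3 4 4]
```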

The "raw" percentage differences between implementations are here:
[plot: raw percentage differences between the implementations, per algorithm and metric]
(Note that the 3% differences are for CB1 and CB2)

Here are the differences again if I shift the estimates for CB1 and CB2 by one sample to match the mirex behavior:
[plot: the same differences after shifting the CB1 and CB2 estimates by one sample]

@craffel commented May 2, 2016

Nice sleuthing! Ok, I am happy to merge this unless there are any pending TODOs we need to take care of that I forgot about.

@rabitt commented May 2, 2016

@craffel I think I covered everything. Just made a second pass through the comments and didn't see anything outstanding.

@craffel commented May 3, 2016

Ok, merged, thanks again!

@craffel closed this May 3, 2016