Implement multi-f0 frame level eval metrics #186
Conversation
Force-pushed from 916d005 to 3cc9535.
filename : str
    Path to the annotation file
dtype : function
    Data type to apply to columns 1 through n.
Is this 0- or 1-indexed?
0-indexed. I've updated the doc to make this clearer.
Alright @craffel, as promised, a comparison with previous mirex results. Including the important plot here for reference... Notice that all of the 'raw' metrics are very close, while the chroma metrics are pretty different. I stared at the mirex implementation for a while and compared it with my version. I don't see any obvious reasons based on the two implementations why they should be so different. One interesting thing to note is that both "False alarm error" and "Miss error" are computed in terms of
It's this, right? I can take a look when I am done traveling. One thing that has helped me is to port the Java code line-for-line to Python and then compare intermediate values (of course, if you are comfortable enough in Java you can just compare intermediate values without porting). I am assuming the values on your graph are not percentages? Are they just the difference between mir_eval and mirex scores? If so, I would say for sure that 0.1 is much too big; we need to investigate!
That's not the reference implementation I was using - I got the code used for mirex from Li Su who ran the multi-f0 task last year. The code is based on Zhiyao Duan's code here. The code you pointed to doesn't have all of the measures reported in mirex - in particular the only chroma score it reports is chroma accuracy, so I don't think it's the same.
Yep, the plot is just mir_eval scores minus mirex scores. I agree that 0.1 is too big, but... After more digging, I think there is an inconsistency in the mirex results, and that my implementation might be OK after all. All of the metrics are functions of some combination of
The only difference (in my understanding) between chroma metrics and raw metrics is that
All of this to say, I think the Su 2015 chroma results have some mistake that explains the large difference in metrics. I should also mention that I'm trying to tie a bow on this and use the "official" implementation for my ismir paper. I know you're traveling, and I can use the implementation in the branch if I need to, but if this can be merged sooner that would be ideal.
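For reference, the measures under discussion are conventionally defined (in the Poliner & Ellis style used for frame-level multi-F0 evaluation) from three per-frame counts: reference pitches, estimated pitches, and true positives. The sketch below illustrates how those pieces interlock; it is not necessarily the exact code in this PR.

```python
import numpy as np

def frame_metrics(n_ref, n_est, true_pos):
    """Illustrative frame-level multi-F0 scores from per-frame counts.

    n_ref, n_est, true_pos: arrays giving, for each frame, the number of
    reference pitches, estimated pitches, and correctly matched pitches.
    """
    n_ref = np.asarray(n_ref, dtype=float)
    n_est = np.asarray(n_est, dtype=float)
    true_pos = np.asarray(true_pos, dtype=float)
    n_ref_total = n_ref.sum()

    precision = true_pos.sum() / n_est.sum()
    recall = true_pos.sum() / n_ref_total
    accuracy = true_pos.sum() / (n_ref + n_est - true_pos).sum()

    # Substitution error involves true positives, but miss and false alarm
    # errors depend only on the per-frame reference/estimate counts.
    e_sub = (np.minimum(n_ref, n_est) - true_pos).sum() / n_ref_total
    e_miss = np.maximum(n_ref - n_est, 0).sum() / n_ref_total
    e_fa = np.maximum(n_est - n_ref, 0).sum() / n_ref_total
    # Total error decomposes as e_sub + e_miss + e_fa.
    e_tot = (np.maximum(n_ref, n_est) - true_pos).sum() / n_ref_total

    return precision, recall, accuracy, e_sub, e_miss, e_fa, e_tot
```

Under these conventions the miss and false alarm errors never touch the true-positive count, which is relevant to the observation above about which metrics do and do not differ.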
Ok, we should confirm what the inconsistency in their code is then (potentially with them).
Do you mean False Alarm Error and Miss Error? It looks like those are the ones which do not depend on TP in your list.
Can you compare your scores on the MIREX dataset then rather than the Su dataset and see if they are the same? Or are you already doing that?
Yes, that looks like a bug then. You should tell them about it.
That explains those two, but not "chroma total error", which also differs quite a bit. Ideally we'd know why there too.
For sure, I don't want to drag my heels, but hopefully you understand my being pedantic about making sure there are no unknown differences in implementation. Ideally I'd like to have another code review before merging, preferably by someone else, although I bet everyone is in ISMIR deadline mode. I will be able to look it over again before the end of the week, but as I said I'd like to get the implementation differences squared away first.
Ok, I'll drop them a line.
Ah yes, sorry, I did mean False Alarm error and Miss Error.
I don't have the algorithm's output for MIREX, only for the Su dataset. I'll also ask for the MIREX results.
Will do.
I think it actually explains all of the differing chroma metrics. If
Yep, I do, and appreciate it ;).
Ok, given the current timeline, I'll probably bug the mirex folks after the ismir deadline and we can sort this out then.
Makes sense. Let me know what you find out!
@craffel - an update on this: The mirex scores, at least for this task, are a "meta" metric, created by combining the intermediate scores for each track. For example, in onset detection, the "meta" score would take the number of TP/FP/FN for each track, sum them across tracks, and use the summed counts to compute "meta" P/R/F-measure. We're in the process of digging into the individual track-level scores and will see if those are equivalent. If they are, I guess I'm wondering what your stance is on how metrics are aggregated for multiple tracks. Computing metrics the mirex way would require somewhat substantial changes to the code structure overall, wouldn't it?
Good catch. Yes, we assume all evaluations are on a track level, and then aggregated across tracks, as opposed to computing on a corpus level. Unless there is major disagreement from the community, I'd like to keep it that way.
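To make the two aggregation conventions concrete, here is a small sketch using hypothetical per-track TP/FP/FN counts; the numbers and the `precision` helper are made up for illustration.

```python
import numpy as np

# Hypothetical (TP, FP, FN) counts for three tracks.
track_counts = [(90, 10, 20), (5, 1, 50), (300, 40, 30)]

def precision(tp, fp):
    return tp / float(tp + fp)

# Track-level convention (mir_eval): score each track, then average the scores.
track_level = np.mean([precision(tp, fp) for tp, fp, _ in track_counts])

# Corpus-level "meta" convention (the mirex style described above):
# sum the counts over all tracks first, then compute one score.
total_tp = sum(tp for tp, _, _ in track_counts)
total_fp = sum(fp for _, fp, _ in track_counts)
corpus_level = precision(total_tp, total_fp)

print(track_level, corpus_level)  # the two conventions generally disagree
```

The corpus-level number weights long or dense tracks more heavily, which is why the two styles can diverge noticeably on the same data.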
+1
Li Su was kind enough to send me the mean-aggregated mirex scores - the comparison is here (important plot included). For 5 out of the 7 algorithms, the results are all within 1/100th of a percent of the mirex scores. For the other two (red and green), the difference is between a tenth of a percentage point and one percentage point. I expect that this is due to the optimal vs. greedy graph matching. Are these scores close enough to pass the Raffel quality control test? :) Or should I dig into the differences a bit more?
Nice, looking good. If you're willing, usually I like to compute the absolute difference and normalize by the score mean, e.g.
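A minimal sketch of that normalized comparison, assuming the scores from both implementations are collected into NumPy arrays (the example values are made up):

```python
import numpy as np

# Hypothetical scores for the same metric from the two implementations.
mir_eval_scores = np.array([0.612, 0.548, 0.701])
mirex_scores = np.array([0.610, 0.530, 0.703])

# Absolute difference normalized by the mean of the two scores, so a value
# of 0.03 reads as roughly a 3% relative discrepancy.
score_means = (mir_eval_scores + mirex_scores) / 2.0
relative_diff = np.abs(mir_eval_scores - mirex_scores) / score_means
```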
Sure thing. Pinging @bmcfee?
Thanks! Given the 3%+ difference there, it would be great to get a cursory confirmation that it's because of optimal matching, even if that just means checking that the algorithm output for those algorithms produces many more notes (which would be likely to cause a greedy vs. non-greedy difference), for example.
Will do!
r"""Utility function for loading in data from a delimited time series | ||
annotation file with a variable number of columns. | ||
Assumes that column 0 contains time stamps and columns 1 through n contain | ||
values. n may be variable from time stamp to time stamp. |
A small (3-line) example input here might be helpful.
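For example (a hedged sketch, since the exact delimiter handling and return types are up to the final implementation), an input where each line holds a time stamp followed by a variable number of values, loaded with the new reader:

```python
import tempfile
import mir_eval

# Three lines: two values at t=0.00, one at t=0.01, and none at t=0.02.
example = (
    "0.00 440.0 220.0\n"
    "0.01 330.0\n"
    "0.02\n"
)

with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as handle:
    handle.write(example)
    path = handle.name

# Expected to return the time stamps (column 0) plus, for each time stamp,
# an array of the remaining values on that line (possibly empty).
times, values = mir_eval.io.load_ragged_time_series(path, dtype=float)
```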
Just took another pass over it. Looks mostly good -- the things I think need "fixing" are the handling of existing file descriptors, and more specific exception handling. There are also a few dangling comments from earlier reviews that don't seem to have been addressed yet.
@craffel Sorry for the delay. Updates:
The "raw" percentage differences between implementations are here: Here are the differences again if I shift the estimates for CB1 and CB2 by one sample to match the mirex behavior: |
Nice sleuthing! Ok, I am happy to merge this unless there are any pending TODOs we need to take care of that I forgot about.
@craffel I think I covered everything. Just made a second pass through the comments and didn't see anything outstanding.
Ok, merged, thanks again!
Implements #185.
The following metrics are implemented for raw and chroma-wrapped multipitch estimates (the frame-level measures discussed in the thread above); a usage sketch follows the checklist below.
@craffel's to-do list:
- `mir_eval/multipitch.py`
- `mir_eval.io.load_ragged_time_series`
- `evaluators.multipitch_eval.py`
- `tests/test_multipitch.py`
- `tests/data/multipitch/`, `test_multipitch.regression_test_evaluate`
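A hedged usage sketch of the new module; the file names are placeholders, and the exact `evaluate` signature and score names are assumptions based on this thread rather than a spec.

```python
import mir_eval

# Placeholder annotation files in the ragged time-series format described above:
# one time stamp per line followed by zero or more frequency values in Hz.
ref_times, ref_freqs = mir_eval.io.load_ragged_time_series('ref.txt', dtype=float)
est_times, est_freqs = mir_eval.io.load_ragged_time_series('est.txt', dtype=float)

# Frame-level multipitch scores for one track (precision/recall/accuracy and
# the error measures, plus chroma-wrapped variants, per the discussion above).
scores = mir_eval.multipitch.evaluate(ref_times, ref_freqs, est_times, est_freqs)
for name, value in scores.items():
    print('{}: {:.3f}'.format(name, value))
```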