Could bss_eval return min and max as well? #192
First of all, love this package! 💃

Curious if bss_eval could return the minimum and maximum SDR, SIR and SAR as well as the average? I realize it's based on Emmanuel Vincent's method, but variance seems important in many cases. Let's say the average SIR over a song is 10 dB. Roughly speaking, it could be the case that the chorus, with lots of harmonic and percussive elements, is easy to separate, while the verse, with less instrumentation, is harder. Obviously one could pre-segment the signal, but I figured asking doesn't hurt.

Comments
Hey, thanks for the kind words. Usually we focus on implementing evaluation metrics as they are widely used/specifically defined in research, so extensions like this are a little out of the scope of mir_eval.
Well, most recent papers seem to introduce their own additional evaluation aside from SDR, SIR and SAR, and many still rely on listening tests, but there's also a more recent framework (2011), PEASS, with objective and subjective metrics that promises to be more in line with perceptual quality, by some of the same people who proposed bss_eval originally back in 2006, including Vincent. PEASS has three parts: one with objective metrics, one with subjective metrics derived from listening tests, and one for making new listening tests. Maybe the first part could be a welcome addition to mir_eval?
I'd be pretty content with a minimum SDR, SIR and SAR though, since that at least gives an idea of a lower bound on how poorly a separation method can perform in a given time frame. I think many papers don't report variance because they want to hide the fact that their model is not robust and has adversarial examples, for example when sources have similar energy distributions and the model discards phase information. Let's say a singer is particularly good at staying in tune with the instrument and also happens to display formants similar to the instrument's resonances. In my limited experience, that turns out to be surprisingly problematic. Part of the issue also stems from the fact that bss_eval is often used for speech enhancement as well as music source separation, where it's probably pretty unlikely in practice for the speech to have an energy distribution similar to the background noise. So there's no strong tradition of reporting variance across the test samples, because it's typically negligibly small in many cases (I'm guessing). Since mir_eval focuses on music though, I think the priorities should be different!
Yes, in fact we have a TODO issue (#68) about adding PEASS and other metrics! A PR would be welcome.
For now, unless we have some examples of papers/contests which are using the minimum (or maximum, or variance), I'd prefer to stick to the community standard. It should be pretty straightforward to create a custom version of the function for your own purposes, which is one of the intentions of mir_eval.
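For what it's worth, here's a minimal sketch of what such a custom version might look like: it slices the signals into fixed-size windows by hand and calls mir_eval.separation.bss_eval_sources on each one, then summarizes the per-window scores. The helper name bss_eval_stats and the window/hop sizes are illustrative, not part of mir_eval's API.

```python
import numpy as np
import mir_eval


def bss_eval_stats(reference_sources, estimated_sources,
                   win=10 * 44100, hop=5 * 44100):
    """Summarize bss_eval metrics over sliding windows (hypothetical helper).

    reference_sources, estimated_sources: arrays of shape (nsrc, nsamples).
    Returns {'SDR': (mean, min, max), ...} with per-source statistics.
    """
    n_samples = reference_sources.shape[1]
    sdrs, sirs, sars = [], [], []
    for start in range(0, n_samples - win + 1, hop):
        ref = reference_sources[:, start:start + win]
        est = estimated_sources[:, start:start + win]
        # bss_eval isn't well defined when a reference source is silent,
        # so skip those windows entirely.
        if np.any(np.sum(ref ** 2, axis=1) == 0):
            continue
        # Keep the given source ordering so per-source stats line up
        # across windows.
        sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
            ref, est, compute_permutation=False)
        sdrs.append(sdr)
        sirs.append(sir)
        sars.append(sar)
    stats = {}
    for name, values in (('SDR', sdrs), ('SIR', sirs), ('SAR', sars)):
        values = np.array(values)  # shape (n_windows, nsrc)
        stats[name] = (values.mean(axis=0), values.min(axis=0), values.max(axis=0))
    return stats
```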
Cool! 👍
Sensible! Let me try to nag research teams, MIREX and SiSEC about it, and hopefully I'll get back to this. It seems like bad practice not to estimate worst-case performance over a common test set, and it makes different methods hard to compare for real-world usage.
I'm curious though: the paper that introduced BSS_EVAL mentions "local performance measures" for calculating SDR, SIR and SAR when performance is expected to vary noticeably over time. Essentially it amounts to sliding a window function across the signal, much like the STFT does compared to a plain FFT. This doesn't seem to be provided in the actual implementation though. Correct?
Yes, I don't know of this being implemented/used, although it's useful. @dawenl do you know?
Maybe it's time to jump in here. I will be helping @aliutkus organise the upcoming SISEC evaluation, and we will be releasing a modified version of the matlab bss_eval code together with a python wrapper. Both will be on github in a couple of days. @carlthome As far as I understand your comment: yes, the upcoming bss_eval matlab version also allows you to output the instantaneous SDR values for a given window size. You can have a look in a couple of days. I am also working on a pure python based version of bss_eval.
That would be great, thanks!
@faroit Cool. 😄 Looking forward to the Python wrapper. By the way, do you have any thoughts on why the variance of the local SDR, SIR and SAR measures is so seldom reported in papers (e.g. by Huang, Erdogan, Weninger, the Wangs, etc.)? It seems most go for a point estimate of the average "global" SDR, SIR and SAR over all estimated sources in a test set.
It's been a while since I last touched anything related to source separation, but it seems that everything is solved (or hopefully will be resolved soon)?
Thanks so much @faroit for handling all this python evaluation voodoo magic! The upcoming dsd100 package he's been preparing for a while indeed makes it very practical to test separation stuff in python! Concerning everything being solved: we're not quite there yet, but yeah, the community has been very active and it's a pleasure to see things working quite well now =D
@craffel @carlthome see here. This is what will be used for the upcoming SISEC. |
Has this been covered by the recent enhancements to separation eval (e.g. by using framewise eval and computing max/min by hand)? If so, can you close? |
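For anyone finding this later, a minimal sketch of that workflow, assuming mir_eval's bss_eval_sources_framewise with window and hop given in samples; the frame sizes and the toy signals are illustrative only.

```python
import numpy as np
import mir_eval

# Toy signals just to make the example runnable: 2 sources, 20 s of noise.
fs = 22050
rng = np.random.RandomState(0)
reference_sources = rng.randn(2, 20 * fs)
estimated_sources = reference_sources + 0.1 * rng.randn(2, 20 * fs)

# Framewise metrics, each of shape (nsrc, nframes).
sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources_framewise(
    reference_sources, estimated_sources,
    window=5 * fs,  # 5-second frames
    hop=5 * fs)     # non-overlapping

# Summarize per source, ignoring frames where the metric is undefined
# (frames can come back as NaN, e.g. when a reference frame is silent).
for name, metric in (('SDR', sdr), ('SIR', sir), ('SAR', sar)):
    print(name,
          'mean:', np.nanmean(metric, axis=1),
          'min:', np.nanmin(metric, axis=1),
          'max:', np.nanmax(metric, axis=1))
```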
Yes. Great. |