
Could bss_eval return min and max as well? #192

Closed
carlthome opened this issue May 16, 2016 · 14 comments

Comments

@carlthome
Contributor

First of all, love this package! 💃

Curious whether bss_eval could return the minimum and maximum SDR, SIR, and SAR as well as the average? I realize it's based on Emmanuel Vincent's method, but variance seems important in many cases. Say the average SIR over a song is 10 dB. Roughly speaking, it could be that the chorus, with lots of harmonic and percussive elements, is easy to separate while the verse, with, say, less instrumentation, is harder to separate. Obviously one could pre-segment the signal (see the sketch below), but I figured asking doesn't hurt.
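
For concreteness, a minimal sketch of that pre-segmenting idea, assuming mir_eval.separation.bss_eval_sources and fixed-length, non-overlapping segments (the helper name and aggregation scheme are just illustrative):

```python
import numpy as np
import mir_eval

def segmentwise_bss_eval(reference_sources, estimated_sources, segment_length):
    """Evaluate SDR/SIR/SAR per non-overlapping segment and aggregate.

    reference_sources, estimated_sources: arrays of shape (nsrc, nsamples).
    segment_length: segment size in samples.
    """
    nsamples = reference_sources.shape[1]
    metrics = []  # one (sdr, sir, sar) triple per segment
    for start in range(0, nsamples - segment_length + 1, segment_length):
        stop = start + segment_length
        # bss_eval is undefined when a reference source is silent within a
        # segment, so such segments may need to be skipped in practice.
        sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
            reference_sources[:, start:stop],
            estimated_sources[:, start:stop])
        metrics.append((sdr, sir, sar))
    metrics = np.array(metrics)  # shape (nsegments, 3, nsrc)
    return {'mean': metrics.mean(axis=0),
            'min': metrics.min(axis=0),
            'max': metrics.max(axis=0)}
```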

@craffel
Collaborator

craffel commented May 17, 2016

Hey, thanks for the kind words. Usually we focus on implementing evaluation metrics as they are widely used and specifically defined in research, so extensions like this are a little out of scope for mir_eval unless they start being used in publications or MIREX (or DCASE, as the case may be). Do you know of people reporting min/max too? Otherwise, another goal of mir_eval is to make it very easy to modify the metrics so that they can be customized :)

@carlthome
Contributor Author

carlthome commented May 18, 2016

Well, most recent papers seem to introduce their own additional evaluation besides SDR, SIR, and SAR, and many still rely on listening tests, but there's also a more recent framework (2011), PEASS, with objective and subjective metrics that promise to be more in line with perceptual quality, by some of the same people who originally proposed bss_eval back in 2006, including Vincent.

PEASS has three parts: one with objective metrics, one with subjective metrics derived from listening tests, and one for building new listening tests. Maybe the first part could be a welcome addition to mir_eval?

Similarly to BSS Eval, the distortion signal is decomposed into three components: target distortion, interference, and artifacts. These components are then used to compute four quality scores: OPS (Overall Perceptual Score), TPS (Target-related Perceptual Score), IPS (Interference-related Perceptual Score), and APS (Artifact-related Perceptual Score). These scores correlate better with human assessments than the SDR/ISR/SIR/SAR measures of BSS Eval.

I'd be pretty content with a minimum SDR, SIR, and SAR though, since that at least gives an idea of a lower bound on how poorly a separation method can perform in a given time frame. I think many papers don't report variance because they want to hide the fact that their model is not robust and has adversarial examples, for example when sources have similar energy distributions and the model discards phase information. Say a singer is particularly good at staying in tune with the instrument and also happens to display formants similar to the instrument's resonances. That seems to be surprisingly problematic, in my limited experience.

Part of the issue also stems from the fact that bss_eval is often used for speech enhancement as well as music source separation, where it's probably pretty unlikely in practice for the speech to have an energy distribution similar to the background noise. Thus there's no strong tradition of reporting variance across the test samples, because it's typically negligibly small (I'm guessing). Since mir_eval focuses on music, though, I think the priorities should be different!

@craffel
Collaborator

craffel commented May 18, 2016

> Well, most recent papers seem to introduce their own additional evaluation besides SDR, SIR, and SAR, and many still rely on listening tests, but there's also a more recent framework (2011), PEASS, with objective and subjective metrics that promise to be more in line with perceptual quality, by some of the same people who originally proposed bss_eval back in 2006, including Vincent.

Yes, in fact we have a TODO issue about adding PEASS (and other metrics) #68! A PR would be welcome.

> I'd be pretty content with a minimum SDR, SIR, and SAR though, since that at least gives an idea of a lower bound on how poorly a separation method can perform in a given time frame.

For now, unless we have some examples of papers/contests which are using minimum (or maximum, or variance), I'd prefer to stick to the community standard. It should be pretty straightforward to create a custom version of the function for your own purposes, which is one of the intentions of mir_eval.

@carlthome
Contributor Author

carlthome commented May 18, 2016

> Yes, in fact we have a TODO issue about adding PEASS (and other metrics) #68! A PR would be welcome.

Cool! 👍

> For now, unless we have some examples of papers/contests which are using minimum (or maximum, or variance), I'd prefer to stick to the community standard. It should be pretty straightforward to create a custom version of the function for your own purposes, which is one of the intentions of mir_eval.

Sensible! Let me try to nag research teams, MIREX, and SiSEC about it, and I'll hopefully get back to this. It seems like bad practice not to estimate worst-case performance over a common test set, and it makes different methods hard to compare for real-world usage.

@carlthome
Contributor Author

carlthome commented May 18, 2016

I'm curious though: the paper that introduced BSS_EVAL mentions "local performance measures" for calculating SDR, SIR, and SAR when performance is expected to vary noticeably over time. Essentially it amounts to taking a window function and sliding it across the signal, just like the STFT does with the FFT (a rough sketch follows). This doesn't seem to be provided in the actual implementation though. Correct?
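
To make that concrete, a rough sketch of the idea, assuming mir_eval.separation.bss_eval_sources and a plain window/hop loop (this is just the concept, not the official BSS Eval code):

```python
import numpy as np
import mir_eval

def local_sdr(reference_sources, estimated_sources, window, hop):
    """Per-window SDR curve, returned with shape (nsrc, nframes).

    reference_sources, estimated_sources: arrays of shape (nsrc, nsamples).
    window, hop: frame size and hop size in samples.
    """
    nsamples = reference_sources.shape[1]
    frames = []
    for start in range(0, nsamples - window + 1, hop):
        sdr, _, _, _ = mir_eval.separation.bss_eval_sources(
            reference_sources[:, start:start + window],
            estimated_sources[:, start:start + window])
        frames.append(sdr)
    return np.stack(frames, axis=1)
```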

@craffel
Collaborator

craffel commented May 19, 2016

> This doesn't seem to be provided in the actual implementation though. Correct?

Yes, I don't know of this being implemented/used, although it's useful. @dawenl do you know?

@faroit

faroit commented May 19, 2016

Maybe it's time to jump in here. I will be helping @aliutkus organise the upcoming SiSEC evaluation, and we will be releasing a modified version of the MATLAB bss_eval code together with a Python wrapper. Both will be on GitHub in a couple of days.

@carlthome As far as I understand your comment: yes, the upcoming bss_eval MATLAB version also lets you output the instantaneous SDR values for a given window size. You can have a look in a couple of days.

I am also working on a pure-Python version of bss_eval_images as well as a Cython version; it should be finished soon and could eventually be merged into mir_eval.

@craffel
Collaborator

craffel commented May 19, 2016

> I am also working on a pure-Python version of bss_eval_images as well as a Cython version; it should be finished soon and could eventually be merged into mir_eval.

That would be great, thanks!

@carlthome
Contributor Author

carlthome commented May 19, 2016

@faroit Cool. 😄 Looking forward to the Python wrapper.

Do you have any thoughts on why the variance of the local SDR, SIR, and SAR measures is seldom listed in papers (such as Huang, Erdogan, Weninger, the Wangs, etc.), by the way? Most seem to go for a point estimate of the average "global" SDR, SIR, and SAR over all estimated sources in a test set.

@dawenl
Collaborator

dawenl commented May 19, 2016

It's been a while since I last touched anything related to source separation, but it seems that everything is solved (or hopefully will be resolved soon)?

@aliutkus

Thanks so much @faroit for handling all this Python evaluation voodoo magic! The upcoming dsd100 package he's been preparing for a while makes it very practical to test separation stuff in Python indeed!

Concerning everything being solved, we're not quite there yet, but yeah, the community has been very active and it's a pleasure to see things working quite well now =D

@faroit

faroit commented Jun 2, 2016

> Yes, I don't know of this being implemented/used, although it's useful. @dawenl do you know?

@craffel @carlthome see here. This is what will be used for the upcoming SISEC.

@craffel
Collaborator

craffel commented Aug 19, 2016

Has this been covered by the recent enhancements to separation eval (e.g. by using framewise eval and computing max/min by hand)? If so, can you close?
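
For reference, a minimal sketch of doing exactly that, assuming mir_eval.separation.bss_eval_sources_framewise with window and hop given in samples (the dummy signals are just to make the example self-contained):

```python
import numpy as np
import mir_eval

# Dummy signals: two 4-second sources at 44.1 kHz, with the estimates
# being the references plus a little noise.
rng = np.random.RandomState(0)
reference_sources = rng.randn(2, 44100 * 4)
estimated_sources = reference_sources + 0.1 * rng.randn(2, 44100 * 4)

# 1-second windows with 50% overlap; frames where the metrics are
# undefined (e.g. a silent source) may come back as NaN, hence nanmin/nanmax.
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources_framewise(
    reference_sources, estimated_sources, window=44100, hop=22050)

print('per-source SDR min:', np.nanmin(sdr, axis=1))
print('per-source SDR max:', np.nanmax(sdr, axis=1))
```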

@carlthome
Contributor Author

Yes. Great.
