Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update rates-based statistics to be modular #4608

Open
wants to merge 37 commits into
base: master
Choose a base branch
from

Conversation

GarethCabournDavies
Copy link
Contributor

@GarethCabournDavies GarethCabournDavies commented Jan 24, 2024

Overview of changes

The exponential fit statistics are all similar, with simple factors added or subtracted, as discussed in #4594

I have refactored the ExpFitStatistic to be able to use the different features using a --statistic-features option. These then all use exp_fit as the --ranking-statistic option

The available features are:

Feature Description Notes
phasetd Apply a factor of how the phase, time and amplitude differences match up to what is expected for signals.
kde Apply a factor according to a kernel density estimate of the ratio of signal and noise distributions for each template.
dq Apply a factor according to any data quality channel information.
chirp_mass Apply a reweighting according to the chirp mass of the template.
sensitive_volume Apply a factor of the log of sensitive volume (compared to a median for the template). This means that the statistic takes into account any changes in the detector sensitivity and that, e.g., we expect to see more events in the HL network than HV coincidences.
normalize_fit_rate Normalize the rates fits by the analysis time. This is needed so that the statistic is comparable over chunks of different lengths. This was done for all exp_fit statistics, but is implemented here explicitly so that the exp_fit_csnr statistic can reuse the lognoiserate function from the ExpFitStatistic

I have also removed the different treatment of triggers with sngl_ranking below threshold; this is now required explicitly as --statistic-keywords alpha_below_threshold:6. Again this was so that the exp_fit_csnr statistic can reuse the lognoiserate function from the ExpFitStatistic

In addition, there are some minor changes as well to fix some of the statistics which didn't work at all on previous master. There is a (just for reference) PR at #4607 to show these.

Testing

I have tested all existing (and working) statistics to check that given the appropriate features, the output remains identical.

The testing is done against the codes in #4607, so that we can test statistics against what they should be, rather than what they are.

Initial tests with a very small fraction of the bank have shown that the SNR-like statistics output files have identical hashes. The exp_fit statistics outputs are all the same to within a numpy.isclose test, i.e. ~1e-6 difference for values O(1). But I will add the results of more stringent testing here.

Statistic New statistic Features Keywords pycbc_sngls_findtrigs max stat difference pycbc_coinc_findtrigs max stat difference
quadsum quadsum File hash the same File hash the same
single_ranking_only single_ranking_only -- -- File hash the same File hash the same
phasetd phasetd -- -- File hash the same File hash the same
exp_fit_csnr exp_fit_csnr -- -- File hash the same File hash the same
phasetd_exp_fit_fgbg_norm exp_fit phasetd sensitive_volume normalize_fit_rate alpha_below_thresh:6 9.5e-7 3.8e-6
phasetd_exp_fit_fgbg_bbh_norm exp_fit phasetd sensitive_volume normalize_fit_rate chirp_mass alpha_below_thresh:6 1.9e-6 3.8e-6
phasetd_exp_fit_fgbg_kde exp_fit phasetd sensitive_volume normalize_fit_rate kde alpha_below_thresh:6 1.9e-6 3.8e-6
dq_phasetd_exp_fit_fgbg_norm exp_fit phasetd sensitive_volume normalize_fit_rate dq alpha_below_thresh:6 9.5e-7 3.8e-6
dq_phasetd_exp_fit_fgbg_kde exp_fit phasetd sensitive_volume normalize_fit_rate dq kde alpha_below_thresh:6 1.9e-6 3.8e-6

for feature in opts.statistic_features:
if feature not in _allowed_statistic_features:
err_msg = f"--statistic-feature {feature} not recognised"
raise NotImplementedError(err_msg)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't actually happen due to the argparse choices, but safety is best

**extra_kwargs)
trigger_times = sds['end_time']
stat_t = rank_method.rank_stat_single((ifo, sds))
trigger_times = trigs['end_time'][:][trigger_keep_ids]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some singles objects don't have the end time included

@@ -33,9 +33,18 @@
from .eventmgr_cython import logsignalrateinternals_computepsignalbins
from .eventmgr_cython import logsignalrateinternals_compute2detrate

_allowed_statistic_features = [
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure where is best to describe each feature here to be honest

# Assume best case scenario and use maximum signal rate
s1 -= 2. * self.hist_max
s1[s1 < 0] = 0
return s1 ** 0.5


class ExpFitStatistic(QuadratureSumStatistic):
class ExpFitStatistic(PhaseTDStatistic):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subclassing PhaseTDStatistic here in order to get the phasetd stuff in init

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this then not require the PhaseTD files, which I thought ExpFit statistic might not use if not using the "PhaseTD" feature?

@GarethCabournDavies
Copy link
Contributor Author

The numbers in the comparison table have been updated

I am sure it is not a coincidence that the error in the coincs is double that of the singles, but I think a ~4e-6 difference is not too important given the dynamic range of the statistic.

@GarethCabournDavies
Copy link
Contributor Author

I am adding a description of the statistics to the docs - I am writing it in the pycbc_make_offline_search_workflow documentation at the moment, but that can be moved if requested.

@GarethCabournDavies
Copy link
Contributor Author

Note that I found and fixed a bug in the way that sngl_ranking_ keywords are handled and passed through to the ranking module

@GarethCabournDavies
Copy link
Contributor Author

GarethCabournDavies commented May 21, 2024

I thought it best to check the memory usage, and for pycbc_coinc_findtrigs with 1/140 of the bank and the dq_phasetd_exp_fit_fgbg_norm (and modular equivalent) statistic, we see:
NEW:

	User time (seconds): 3586.64
	System time (seconds): 197.45
	Maximum resident set size (kbytes): 432688

OLD:

	System time (seconds): 184.80
	Maximum resident set size (kbytes): 414844

For the same statistic with pycbc_sngls_findtrigs and 1/1400 of the bank, we get:
NEW:

	User time (seconds): 74.30
	System time (seconds): 13.22
	Maximum resident set size (kbytes): 277556

OLD:

	User time (seconds): 58.19
	System time (seconds): 13.82
	Maximum resident set size (kbytes): 279236

The increase in user time seems to be because of a lower % of CPU (27 vs 75)

So basically this looks like it doesn't change anything with regard to performance, as I would expect

@maxtrevor
Copy link
Contributor

maxtrevor commented May 21, 2024

Noting here that it was suggested on today's call that we should wait to merge this until after creating the new PyCBC Live branch intended for the rest of O4

@GarethCabournDavies
Copy link
Contributor Author

I think this is ready for review now. All tests are passing

I haven't been able to test how this works with Live as well as I would like to; there seems to be a problem with my testing environment

@GarethCabournDavies
Copy link
Contributor Author

Poke @spxiwh post-winter break

Copy link
Contributor

@spxiwh spxiwh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments below, but nothing significant. It's hard to review the bulk of changes here because of the amount of code being moved around, but the sanity checks look convincing. @ahnitz do you any concerns/comments?

- If using the ``chirp_mass`` feature, this chirp mass defines a maximum weighting which can be applied to the statistic.
* - ``sngl_ranking_*``
- This is used to provide the keyword arguments to functions in `the events.ranking module <https://pycbc.org/pycbc/latest/html/_modules/pycbc/events/ranking.html>`_. For example, to use a different psdvar threshold in the newsnr_sgveto_psdvar_threshold function, we would use ``sngl_ranking_psd_var_val_threshold:10``.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouild this documentation like in the middle of explanation of make_offline_search_workflow? Would this be better as a separate documentation page on the stat module itself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've put it into a separate page within a "search" section of the docs page - the rest of this would obviously need building up more to be particularly useful!

pycbc/events/ranking.py Outdated Show resolved Hide resolved
pycbc/events/ranking.py Outdated Show resolved Hide resolved
pycbc/events/ranking.py Outdated Show resolved Hide resolved
pycbc/events/ranking.py Outdated Show resolved Hide resolved
pycbc/events/stat.py Outdated Show resolved Hide resolved
pycbc/events/stat.py Outdated Show resolved Hide resolved
pycbc/events/stat.py Outdated Show resolved Hide resolved
pycbc/events/stat.py Outdated Show resolved Hide resolved
# Assume best case scenario and use maximum signal rate
s1 -= 2. * self.hist_max
s1[s1 < 0] = 0
return s1 ** 0.5


class ExpFitStatistic(QuadratureSumStatistic):
class ExpFitStatistic(PhaseTDStatistic):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this then not require the PhaseTD files, which I thought ExpFit statistic might not use if not using the "PhaseTD" feature?

@GarethCabournDavies
Copy link
Contributor Author

Does this then not require the PhaseTD files, which I thought ExpFit statistic might not use if not using the "PhaseTD" feature?

This will cause a problem if using the pregenerate_hist keyword supplied to PhaseTDStatistic, but from a quick git grep, it doesn't look like that is used anywhere, but it would be good to ensure that works

(for some reason I can't reply directly to the comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants