Commit
improve plot description
Ming-Yan committed Jan 21, 2025
1 parent 21c8694 commit f1ac017
Showing 19 changed files with 130 additions and 79 deletions.
Binary file added docs/_static/figs/example_jetpt.png
Binary file added docs/_static/figs/example_rebin2_jetpt.png
Binary file added docs/_static/figs/example_rebin_jetpt.png
Binary file added docs/_static/figs/example_sample_jetpt.png
Binary file added docs/_static/figs/example_samplesplit_jetpt.png
2 changes: 1 addition & 1 deletion docs/auto.md
@@ -1,4 +1,4 @@
## Automation
# Automation


At the moment the automation is limited by the available computing resources and runs via GitLab CI [autobtv](https://gitlab.cern.ch/cms-analysis/btv/software-and-algorithms/autobtv).
22 changes: 11 additions & 11 deletions docs/developer.md
@@ -1,4 +1,4 @@
## For developers: Add new workflow
# For developers: Add new workflow


The BTV tutorial for the coffea part is under [`notebooks`](https://github.com/cms-btv-pog/BTVNanoCommissioning/tree/master/notebooks) and the template for constructing a new workflow is [`src/BTVNanoCommissioning/workflows/example.py`](https://github.com/cms-btv-pog/BTVNanoCommissioning/blob/master/src/BTVNanoCommissioning/workflows/example.py)
@@ -8,7 +8,7 @@ The BTV tutorial for coffea part is `notebooks/BTV_commissiong_tutorial-coffea.i

Use `example.py` as a template to develop a new workflow.

### 0. Add new workflow info to `workflows/__init__.py`
## 0. Add new workflow info to `workflows/__init__.py`


```python
@@ -25,7 +25,7 @@ workflows["ctag_ttsemilep_sf"] = partial(
```
Notice that if you are working on WP SFs, please put **WP** in the name.
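
For orientation, a registration entry might look like the sketch below; the module, class, and keyword names here are placeholders rather than existing framework code:

```python
# Sketch only: register a hypothetical new workflow in workflows/__init__.py.
# "my_new_workflow", "MyNewProcessor" and the keyword argument are placeholders.
from functools import partial

from BTVNanoCommissioning.workflows.my_new_workflow import (
    NanoProcessor as MyNewProcessor,
)

workflows["my_new_sf"] = MyNewProcessor
# If the same processor serves several channels, bind the option with partial,
# and remember to keep "WP" in the name for working-point SFs:
workflows["my_new_WP_sf"] = partial(MyNewProcessor, selectionModifier="WP")
```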

### 1. Add histogram collections to `utils/histogrammer.py`
## 1. Add histogram collections to `utils/histogrammer.py`

The histograms in this framework use [`hist`](https://hist.readthedocs.io/en/latest/). They can easily be converted to ROOT histograms with `uproot` or to numpy histograms. A quick start guide for hist can be found [here](https://hist.readthedocs.io/en/latest/user-guide/quickstart.html)

@@ -46,7 +46,7 @@ _hist_dict["mujet_pt"] = Hist.Hist(
The kinematic and workflow-specific variables are defined first; the common collections of input variables are then taken from the common definition.
In case you want to add common variables used by all workflows, go to [`helper/definition.py`](#add-new-common-variables)
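
As a rough sketch of this pattern (the binning and names below are illustrative, not the framework's defaults), a kinematic histogram can be declared with `hist` and later written out as a ROOT histogram through `uproot`:

```python
# Sketch: declare a weighted 1D histogram with hist and write it to ROOT via uproot.
# The axis name, range and output file name are illustrative placeholders.
import hist
import uproot

mujet_pt = hist.Hist(
    hist.axis.Regular(50, 0, 300, name="pt", label=r"$p_T$ [GeV]"),
    storage=hist.storage.Weight(),  # keeps sum(w) and sum(w^2) per bin
)

# Fill with dummy values and per-event weights
mujet_pt.fill(pt=[45.0, 80.0, 120.0], weight=[1.0, 1.0, 0.8])

# hist objects can be written directly; uproot converts them to TH1D
with uproot.recreate("example_hists.root") as fout:
    fout["mujet_pt"] = mujet_pt
```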

### 2. Selections: Implemented selections on events (`workflow/`)
## 2. Selections: Implemented selections on events (`workflow/`)

Create `boolean` arrays along the event axis. Also check whether some common selections already exist in `utils/selection.py`

Expand Down Expand Up @@ -83,7 +83,7 @@ if self.selMod=="WcM":
event_level = req_trig & req_lumi & req_jet & req_muon & req_ele & req_leadlep_pt& req_Wc
```
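
For illustration, an event-level requirement is typically built by reducing an object-level mask along the object axis; the cut values below are arbitrary placeholders:

```python
# Sketch: build an event-level boolean array with awkward (placeholder cuts),
# assuming `events` is the NanoAOD events array used by the processor.
import awkward as ak

jet_mask = (events.Jet.pt > 25.0) & (abs(events.Jet.eta) < 2.4)
req_jet = ak.num(events.Jet[jet_mask], axis=1) >= 1  # at least one selected jet
req_jet = ak.fill_none(req_jet, False)               # guard against missing values
```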

### 3. Selected objects: Pruned objects with reduced event_level
## 3. Selected objects: Pruned objects with reduced event_level
Store the selected objects as event-based arrays. The name of a selected object must contain **Sel**; for the muon-enriched jet and the soft muon the names are **MuonJet** and **SoftMu**, and their kinematics are stored. Cross-object variables need their own dedicated entries.

```python
@@ -136,7 +136,7 @@ if self.isArray:
</details>
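
A minimal sketch of the idea described above (the variable names and the `delta_r` helper from coffea's vector behaviors are illustrative assumptions):

```python
# Sketch: prune objects with the event-level mask and attach them to the output
# (names such as pruned_ev, event_jet and soft_muon are placeholders).
pruned_ev = events[event_level]                 # keep only selected events
pruned_ev["SelJet"] = event_jet[event_level]    # name contains "Sel"
pruned_ev["SoftMu"] = soft_muon[event_level]    # soft muon inside the muon-enriched jet
# cross-object variables need their own entry
pruned_ev["dr_SelJet_SoftMu"] = pruned_ev.SelJet.delta_r(pruned_ev.SoftMu)
```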


### 4. Setup CI pipeline `.github/workflow`
## 4. Setup CI pipeline `.github/workflow`

The actions check whether the changes would break the framework. They are collected in `.github/workflow`.
You can simply include a workflow by adding an entry with its name
@@ -187,7 +187,7 @@ You can find the secret configuration in the directory: `Settings>>Secrets>>Ac
</details>


### 5. Refine used MC as input `sample.py`
## 5. Refine used MC as input `sample.py`
The `sample.py` collects the samples (dataset names) used in the workflow. These collections are used to create the dataset json file.
- `data` : data sample (MuonEG, Muon0....)
- `MC`: main MC used for the workflow
@@ -223,8 +223,8 @@ Here's the example for BTA_ttbar
},
```
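
Schematically, a new entry follows the same pattern; the workflow key and dataset names below are placeholders, and the exact fields should mirror the existing entries in `sample.py`:

```python
# Sketch: hypothetical entry in sample.py (key and dataset names are placeholders)
samples = {
    "my_new_sf": {
        "data": ["Muon0", "Muon1"],
        "MC": ["TTto2L2Nu_TuneCP5_13p6TeV_powheg-pythia8"],
    },
}
```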

### Optional changes
#### Add workflow to `scripts/suball.py`
## Optional changes
### Add workflow to `scripts/suball.py`
The `suball.py` summarizes the steps to obtain the result.
In case your task requires running several workflows, you can wrap them as a `dict` of workflows
```python
@@ -244,7 +244,7 @@ scheme = {
],
}
```
#### Add new common variables in `helper/definition.py`
### Add new common variables in `helper/definition.py`

In `definition.py` we collect the axis definition, name, and label of the tagger scores/input variables
```python
@@ -263,7 +263,7 @@ definitions_dict = {
...
}
```
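
For orientation only, one entry could look roughly like the sketch below; the key names are illustrative guesses, so follow the schema of the existing entries in `helper/definition.py`:

```python
# Sketch: hypothetical entry collecting axis definition, display name and label
# (key names below are illustrative, not the framework's actual schema)
definitions_dict = {
    "DeepFlavB": {
        "displayname": "DeepJet b-tag discriminant",
        "manual_ranges": [0.0, 1.0],
        "ylabel_text": "Jets",
    },
}
```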
#### Additional corrections and uncertainty variations not in the framework
### Additional corrections and uncertainty variations not in the framework
The corrections are collected in `utils/correction.py`. There are two types of variation: weight variations (e.g. SFs, UE/PS weights) and object energy scale/resolution variations (JES/JER). Here's an example of adding new corrections

1. Add new info `utils/AK4_parameter.py`
3 changes: 2 additions & 1 deletion docs/index.md
@@ -40,9 +40,10 @@ Currently the available workflows are summarized
installation.md
user.md
developer.md
structure.md
scaleout.md
wf.md
scripts.md
scaleout.md
auto.md
api.rst
```
2 changes: 0 additions & 2 deletions docs/run.md

This file was deleted.

19 changes: 14 additions & 5 deletions docs/scaleout.md
@@ -27,14 +27,14 @@ WJets_inc (Nano_v11)|1183MB |630MB |1180MB|



#### Condor@FNAL (CMSLPC)
#### dask: Condor@FNAL (CMSLPC)
Follow the setup instructions at https://github.com/CoffeaTeam/lpcjobqueue. After starting
the singularity container, run with
```bash
python runner.py --wf ttcom --executor dask/lpc
```

#### Condor@CERN (lxplus)
#### dask: Condor@CERN (lxplus)
Only one port is available per node, so it's possible one has to try different nodes until hitting
one with port `8786` open. Other than that, no additional configuration should be necessary.

@@ -62,7 +62,7 @@ python runner.py --wf ttcom --executor dask/casa
Authentication is handled automatically via a login auth token instead of a proxy. File paths need the xrootd redirector replaced with "xcache"; `runner.py` does this automatically.


#### Condor@DESY
#### parsl/dask with Condor
```bash
python runner.py --wf ttcom --executor dask/condor(parsl/condor)
```
@@ -95,12 +95,21 @@ After executing the command, a new folder will be created, preparing the submiss

::: {admonition} Frequent issues for standalone condor job submission



1. CMS Connect provides a condor interface through which one can submit jobs to all resources available in the CMS Global Pool. See the [WorkBookCMSConnect Twiki](https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookCMSConnect#Requesting_different_Operative_S) for instructions if you are using it for the first time.
2. The submitted jobs require a properly set up X509 proxy to use the XRootD service for accessing and storing data. In the generated `.jdl` file you may see a line configured for this purpose: `use_x509userproxy = true`. If you have not submitted jobs of this kind on lxplus condor before, we recommend adding the line
```bash
export X509_USER_PROXY=$HOME/x509up_u`id -u`
```
to your `.bashrc` and run it so that the proxy file is stored in your AFS folder instead of your `/tmp/USERNAME` folder. For submission on CMS Connect, no specific action is required.
:::


### FAQ for submission

- All jobs held: might indicate an environment setup issue → check the condor err/out files; for parsl jobs the info is in `runinfo/JOBID/submit_scripts/`
- Exits without complaint: might be huge memory consumption:
  - Reduce `--chunk`; the JERC variations in particular are memory intensive
  - Check the memory usage by calling `memory_usage_psutil` (a minimal sketch is given after this list)
- Partially failed/held:
  - The files/site could be temporarily unavailable. If the retries do not work, consider collecting the list of failed files and resubmitting.
  - Errors for certain files → check the failed files and run them locally with `--executor iterative`
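
A minimal sketch of such a memory helper, assuming it is based on `psutil` (the framework's actual implementation may differ):

```python
# Sketch: report the resident memory of the current process in MB with psutil
import os

import psutil


def memory_usage_psutil():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024**2  # MB
```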
86 changes: 58 additions & 28 deletions docs/scripts.md
@@ -1,9 +1,9 @@
## Scripts for preparing input & processing output
# Scripts for preparing input & processing output

Here is a list of scripts that can be used for BTV tasks


### `fetch.py` : create input json
## `fetch.py` : create input json


Use `fetch.py` in the `scripts/` folder to obtain your sample json files. You can create `$input_list`, which can be a list of datasets taken from CMS DAS or dataset names (in which case the campaign needs to be specified explicitly), and create the json containing `dataset_name:[filelist]`. One can specify a local path in that input list for samples not published in CMS DAS.
@@ -13,25 +13,22 @@ The `--whitelist_sites, --blacklist_sites` are considered for fetch dataset if m






### `dump_prescale.py`: Get Prescale weights
## `dump_prescale.py`: Get Prescale weights

:::{caution}
Only works if `/cvmfs` is mounted on the system
:::

Generate prescale weights using `brilcalc`

```python
```bash
python scripts/dump_prescale.py --HLT $HLT --lumi $LUMIMASK
# HLT : put prescaled triggers
# lumi: golden lumi json
```


### Get processed information
## Get processed information

Get the run & luminosity information for the processed events from the coffea output files. When you use `--skipbadfiles`, the submission will ignore files not accessible (or timing out) via `xrootd`. This script helps you dump the processed luminosity into a json file, which can then be processed with the `brilcalc` tool, and provides a list of failed lumi sections by comparing the original json input to the one extracted from the `.coffea` files.

@@ -41,7 +38,7 @@ Get the run & luminosity information for the processed events from the coffea ou
python scripts/dump_processed.py -c $COFFEA_FILES -n $OUTPUT_NAME (-j $ORIGINAL_JSON -t [all,lumi,failed])
```

### `make_template.py`: Store histograms from coffea file
## `make_template.py`: Store histograms from coffea file

Use `scripts/make_template.py` to dump 1D/2D histograms from `.coffea` to `TH1D/TH2D` with hist. MC histograms can be reweighted according to the luminosity value given via `--lumi`. You can also merge several files.

@@ -65,25 +62,62 @@ python scripts/make_template.py -i "testfile/*.coffea" --lumi 7650 -o test.root



### Plotting code
#### data/MC comparisons
:exclamation_mark: If using a wildcard for the input, do not forget the quotation marks! (see 2nd example below)

You can specify `-v all` to plot all the variables in the `coffea` file, or use wildcard options (e.g. `-v "*DeepJet*"` for the input variables containing `DeepJet`)
## Plotting code
### data/MC comparisons

:new: non-uniform rebinning is possible, specify the bins with list of edges `--autorebin 50,80,81,82,83,100.5`
Obtain the data/MC comparisons from the input coffea files by normalizing MC to the corresponding luminosity.
You can specify `-v all` to plot all the variables in the `coffea` file, or use wildcard options (e.g. `-v "*DeepJet*"` for the input variables containing `DeepJet`). Individual variables can also be specified, separated by `,`.

```bash
python scripts/plotdataMC.py -i $COFFEA --lumi $LUMI_IN_invPB -p $WORKFLOW -v $VARIABLE --autorebin $REBIN_OPTION --split $SPLIT_OPTION
python scripts/plotdataMC.py -i a.coffea,b.coffea --lumi 41500 -p ttdilep_sf -v z_mass,z_pt
python scripts/plotdataMC.py -i "test*.coffea" --lumi 41500 -p ttdilep_sf -v z_mass,z_pt # with wildcard option need ""
```

There are a few options provided for the splitting scheme, based on jet flavor or sample.

<div style="display: flex; justify-content: space-around; align-items: center;">
<figure style="text-align: center;">
<img src="_static/figs/example_rebin_jetpt.png" alt="Picture 1" width="300" height="auto" style="display: block; margin: 0 auto" />
<figcaption>Default: split by jet flavor</figcaption>
</figure>

<figure style="text-align: center;">
<img src="_static/figs/example_sample_jetpt.png" alt="Picture 2" width="300" height="auto" style="display: block; margin: 0 auto" />
<figcaption>--split sample: split by MC samples</figcaption>
</figure>

<figure style="text-align: center;">
<img src="_static/figs/example_samplesplit_jetpt.png" alt="Picture 3" width="300" height="auto" style="display: block; margin: 0 auto" />
<figcaption>--split sample: split by MC samples</figcaption>
</figure>

</div>

It also supports rebinning. An integer input merges neighboring bins, e.g. `--rebin 2`. Non-uniform rebinning is also supported: specify the bins with a list of edges, e.g. `--autorebin 30,36,42,48,54,60,66,72,78,84,90,96,102,114,126,144,162,180,210,240,300`

<div style="display: flex; justify-content: space-around; align-items: center;">
<figure style="text-align: center;">
<img src="_static/figs/example_rebin_jetpt.png" alt="Picture 1" width="300" height="auto" style="display: block; margin: 0 auto" />
<figcaption>Default</figcaption>
</figure>
<figure style="text-align: center;">
<img src="_static/figs/example_rebin2_jetpt.png" alt="Picture 1" width="300" height="auto" style="display: block; margin: 0 auto" />
<figcaption>merge neighboring bins</figcaption>
</figure>
<figure style="text-align: center;">
<img src="_static/figs/example_rebin_jetpt.png" alt="Picture 2" width="300" height="auto" style="display: block; margin: 0 auto" />
<figcaption>non-uniform rebin</figcaption>
</figure>
</div>



```
options:
-h, --help show this help message and exit
--lumi LUMI luminosity in /pb
--com COM sqrt(s) in TeV
-p {ttdilep_sf,ttsemilep_sf,ctag_Wc_sf,ctag_DY_sf,ctag_ttsemilep_sf,ctag_ttdilep_sf}, --phase {dilep_sf,ttsemilep_sf,ctag_Wc_sf,ctag_DY_sf,ctag_ttsemilep_sf,ctag_ttdilep_sf}
@@ -112,10 +146,10 @@ options:



#### data/data, MC/MC comparisons
### data/data, MC/MC comparisons

You can specify `-v all` to plot all the variables in the `coffea` file, or use wildcard options (e.g. `-v "*DeepJet*"` for the input variables containing `DeepJet`)
:exclamation_mark: If using a wildcard for the input, do not forget the quotation marks! (see 2nd example below)


```bash
# with merge map, compare ttbar with data
@@ -158,19 +192,15 @@ options:



#### ROCs & efficiency plots
### ROCs & efficiency plots

Extract the ROCs for the different taggers and the efficiencies from the validation workflow

```python
```bash
python scripts/validation_plot.py -i $INPUT_COFFEA -v $VERSION
```






```json
{
"WJets": ["WJetsToLNu_TuneCP5_13p6TeV-madgraphMLM-pythia8"],
@@ -181,15 +211,15 @@ python scripts/validation_plot.py -i $INPUT_COFFEA -v $VERSION
}
```

#### `correlation_plots.py` : get linear correlation from arrays
### `correlation_plots.py` : get linear correlation from arrays

You can perform a study of linear correlations of b-tagging input variables. Additionally, soft muon variables may be added to the study by requesting the `--SMu` argument. If you want to limit the outputs to DeepFlavB, PNetB and RobustParTAK4B only, you can use the `--limit_outputs` option. If you want to use only the set of variables used for tagger training, rather than all input variables, use the option `--limit_inputs`. To limit the number of files read, make use of the option `--max_files`. In case your study requires splitting samples by flavour, use `--flavour_split`. `--split_region_b` performs a sample splitting based on DeepFlavB >/< 0.5.

:::{caution}
For Data/MC comparison purposes pay attention: change the ranking factors (xs/sumw) in L420!
:::

```python
```bash
python correlation_plots.py $input_folder [--max_files $nmax_files --SMu --limit_inputs --limit_outputs --specify_MC --flavour_split --split_region_b]
```

@@ -198,6 +228,6 @@ python correlation_plots.py $input_folder [--max_files $nmax_files --SMu --limit

To further investigate the correlations, one can create the 2D plots of the variables used in this study. Inputs and optional arguments are the same as for the correlation plots study.

```python
```bash
python 2Dhistogramms.py $input_folder [--max_files $nmax_files --SMu --limit_inputs --limit_outputs --specify_MC --flavour_split --split_region_b]
```