diff --git a/part3/commonstatsmethods/index.html b/part3/commonstatsmethods/index.html index 26193316cf6..72d7d76166c 100644 --- a/part3/commonstatsmethods/index.html +++ b/part3/commonstatsmethods/index.html @@ -2479,7 +2479,7 @@

Goodness of fit tests

AD: Compute a goodness-of-fit measure for binned fits using the Anderson-Darling test. It is based on the integral of the difference between the cumulative distribution function and the empirical distribution function over all bins. It also gives the tail ends of the distribution a higher weighting.
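For reference, the standard Anderson-Darling statistic has the form \(A^2 = N \int_{-\infty}^{+\infty} \frac{\left[ F_N(x) - F(x) \right]^2}{F(x)\left(1 - F(x)\right)}\, dF(x)\), where \(F\) is the expected cumulative distribution function, \(F_N\) the empirical distribution function, and \(N\) the number of events (this is the textbook definition; the binned implementation in Combine may differ in detail). The weight \(1/[F(1-F)]\) is what gives the tails of the distribution a higher weighting.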

-

The output tree will contain a branch called limit, which contains the value of the test statistic in each toy. You can make a histogram of this test statistic \(t\). From the distribution that is obtained in this way (\(f(t)\)) and the single value obtained by running on the observed data (\(t_{0}\)) you can calculate the p-value $$p = \int_{t=t_{0}}^{\mathrm{+inf}} f(t) dt $$. Note: in rare cases the test statistic value for the toys can be undefined (for AS and KD). In this case we set the test statistic value to -1. When plotting the test statistic distribution, those toys should be excluded. This is automatically taken care of if you use the GoF collection script in CombineHarvester, which is described below.

+

The output tree will contain a branch called limit, which contains the value of the test statistic in each toy. You can make a histogram of this test statistic \(t\). From the distribution that is obtained in this way (\(f(t)\)) and the single value obtained by running on the observed data (\(t_{0}\)), you can calculate the p-value \(p = \int_{t=t_{0}}^{+\infty} f(t) dt\). Note: in rare cases the test statistic value for the toys can be undefined (for AD and KS). In this case we set the test statistic value to -1. When plotting the test statistic distribution, those toys should be excluded. This is automatically taken care of if you use the GoF collection script in CombineHarvester, which is described below.
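As an illustration, a minimal PyROOT sketch of this p-value calculation is shown below. It assumes the standard Combine output (a TTree named limit containing the limit branch described above); the file names in the usage comment are placeholders for your own toy and observed-data outputs.

    import ROOT

    def gof_pvalue(toy_files, data_file, tree_name="limit"):
        """Fraction of toys with a test statistic at least as large as the observed one."""
        toys = ROOT.TChain(tree_name)
        for f in toy_files:
            toys.Add(f)
        # Exclude toys with an undefined test statistic (stored as -1, see above)
        t_toys = [entry.limit for entry in toys if entry.limit != -1]
        data = ROOT.TChain(tree_name)
        data.Add(data_file)
        t_obs = [entry.limit for entry in data][0]
        return sum(t >= t_obs for t in t_toys) / float(len(t_toys))

    # Example usage (placeholder file names):
    # p = gof_pvalue(["higgsCombine.toys.GoodnessOfFit.mH120.123.root"],
    #                "higgsCombine.observed.GoodnessOfFit.mH120.root")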

When generating toys, the default behavior will be used. See the section on toy generation for options that control how nuisance parameters are generated and fitted in these tests. It is recommended to use frequentist toys (--toysFreq) when running the saturated model, and the default toys for the other two tests.

Further goodness-of-fit methods could be added on request, especially if volunteers are available to code them. The output limit tree will contain the value of the test statistic in each toy (or the data).

diff --git a/search/search_index.json b/search/search_index.json index bf80b179289..031449adbaf 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Introduction","text":"

These pages document Combine, the RooFit/RooStats-based software tool used for statistical analysis within the CMS experiment. Note that while this tool was originally developed in the Higgs Physics Analysis Group (PAG), its usage is now widespread within CMS.

Combine provides a command-line interface to many different statistical techniques, available inside RooFit/RooStats, that are used widely inside CMS.

The package exists on GitHub under https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit

For more information about Git, GitHub and its usage in CMS, see http://cms-sw.github.io/cmssw/faq.html

The code can be checked out from GitHub and compiled on top of a CMSSW release that includes a recent RooFit/RooStats, or via standalone compilation without CMSSW dependencies. See the instructions for installation of Combine below.

"},{"location":"#installation-instructions","title":"Installation instructions","text":"

Installation instructions and recommended versions can be found below. Since v9.0.0, the versioning follows the semantic versioning 2.0.0 standard. Earlier versions are not guaranteed to follow the standard.

"},{"location":"#within-cmssw-recommended-for-cms-users","title":"Within CMSSW (recommended for CMS users)","text":"

The instructions below are for installation within a CMSSW environment. For end users who do not need to commit or do any development, the following recipes should be sufficient. To choose a release version, you can find the latest releases on GitHub at https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/releases

"},{"location":"#combine-v9-recommended-version","title":"Combine v9 - recommended version","text":"

The nominal installation method is inside CMSSW. The current release targets the CMSSW 11_3_X series because this release has both python2 and python3 ROOT bindings, allowing a more gradual migration of user code to python3. Combine is fully python3-compatible and, with some adaptations, can also work in 12_X releases.

CMSSW 11_3_X runs on slc7, which can be set up using apptainer (see detailed instructions):

cmssw-el7\ncmsrel CMSSW_11_3_4\ncd CMSSW_11_3_4/src\ncmsenv\ngit clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit\ncd HiggsAnalysis/CombinedLimit\n

Update to a recommended tag - currently the recommended tag is v9.2.0: see release notes

cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit\ngit fetch origin\ngit checkout v9.2.0\nscramv1 b clean; scramv1 b # always make a clean build\n
"},{"location":"#combine-v8-cmssw_10_2_x-release-series","title":"Combine v8: CMSSW_10_2_X release series","text":"

Setting up the environment (once):

cmssw-el7\ncmsrel CMSSW_10_2_13\ncd CMSSW_10_2_13/src\ncmsenv\ngit clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit\ncd HiggsAnalysis/CombinedLimit\n

Update to a recommended tag - currently the recommended tag is v8.2.0: see release notes

cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit\ngit fetch origin\ngit checkout v8.2.0\nscramv1 b clean; scramv1 b # always make a clean build\n
"},{"location":"#slc6cc7-release-cmssw_8_1_x","title":"SLC6/CC7 release CMSSW_8_1_X","text":"

Setting up the OS using apptainer (see detailed instructions):

# For CC7:\ncmssw-el7\n# For SLC6:\ncmssw-el6\n\ncmsrel CMSSW_8_1_0\ncd CMSSW_8_1_0/src\ncmsenv\ngit clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit\ncd HiggsAnalysis/CombinedLimit\n

Update to a recommended tag - currently the recommended tag for CMSSW_8_1_X is v7.0.13:

cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit\ngit fetch origin\ngit checkout v7.0.13\nscramv1 b clean; scramv1 b # always make a clean build\n
"},{"location":"#oustide-of-cmssw-recommended-for-non-cms-users","title":"Oustide of CMSSW (recommended for non-CMS users)","text":"

Pre-compiled versions of the tool are available as container images from the CMS cloud. These containers can be downloaded and run using Docker. If you have docker running you can pull and run the latest image using,

docker run --name combine -it gitlab-registry.cern.ch/cms-cloud/combine-standalone:latest\n

You will now have the compiled Combine binary available as well as the complete package of tools. The container can be re-started using docker start -i combine.

"},{"location":"#standalone-compilation","title":"Standalone compilation","text":"

The standalone version can be easily compiled using cvmfs as it relies on dependencies that are already installed at /cvmfs/cms.cern.ch/. Access to /cvmfs/cms.cern.ch/ can be obtained from lxplus machines or via CernVM. See CernVM for further details on the latter. In case you do not want to use the cvmfs area, you will need to adapt the locations of the dependencies listed in both the Makefile and env_standalone.sh files.

git clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit\ncd HiggsAnalysis/CombinedLimit/ \n# git checkout <some release>\n. env_standalone.sh\nmake -j 4\n

You will need to source env_standalone.sh each time you want to use the package, or add it to your login environment.

"},{"location":"#standalone-compilation-with-lcg","title":"Standalone compilation with LCG","text":"

For compilation outside of CMSSW, for example to use ROOT versions not yet available in CMSSW, one can compile against LCG releases. The current default is to compile with LCG_102, which contains ROOT 6.26:

git clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit\ncd HiggsAnalysis/CombinedLimit\nsource env_lcg.sh \nmake LCG=1 -j 8\n

To change the LCG version, edit env_lcg.sh.

The resulting binaries can be moved for use in a batch job if the following files are included in the job tarball:

tar -zcf Combine_LCG_env.tar.gz build interface src/classes.h --exclude=obj\n
"},{"location":"#standalone-compilation-with-conda","title":"Standalone compilation with conda","text":"

This recipe works on both Linux and macOS.

git clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit\ncd HiggsAnalysis/CombinedLimit\n\nconda install --name base mamba # faster conda\nmamba env create -f conda_env.yml\n\nconda activate combine\nsource set_conda_env_vars.sh\n# Need to reactivate\nconda deactivate\nconda activate combine\n\nmake CONDA=1 -j 8\n

Using Combine from then on should only require sourcing the conda environment

conda activate combine\n

Note: on OS X, Combine can only accept workspaces, so run text2workspace.py first. This is due to an issue with child processes and LD_LIBRARY_PATH (see note in Makefile)

"},{"location":"#standalone-compilation-with-cernvm","title":"Standalone compilation with CernVM","text":"

Combine, either standalone or not, can be compiled via CVMFS using access to /cvmfs/cms.cern.ch/ obtained using a virtual machine - CernVM. To use CernVM, you should have access to CERN IT resources. If you are a CERN user you can use your account, otherwise you can request a lightweight account. If you have a CERN user account, we strongly suggest you simply run one of the other standalone installations, which are simpler and faster than using a VM.

You should have a working VM on your local machine, compatible with CernVM, such as VirtualBox. All the required software can be downloaded here. At least 2GB of disk space should be reserved on the virtual machine for Combine to work properly and the machine must be contextualized to add the CMS group to CVMFS. A minimal working setup is described below.

  1. Download the CernVM launcher for your operating system, following the instructions available at https://cernvm.readthedocs.io/en/stable/cpt-launch.html#installation

  2. Prepare a CMS context. You can use the CMS open data one already available on GitHub: wget https://raw.githubusercontent.com/cernvm/public-contexts/master/cms-opendata-2011.context

  3. Launch the virtual machine cernvm-launch create --name combine --cpus 2 cms-opendata-2011.context

  4. In the VM, proceed with the installation of Combine

Installation through CernVM is maintained on a best-effort basis and these instructions may not be up to date.

"},{"location":"#what-has-changed-between-tags","title":"What has changed between tags?","text":"

You can generate a diff of any two tags (e.g. v9.1.0 and v9.0.0) by using the following URL:

https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/compare/v9.0.0...v9.1.0

Replace the tag names in the URL with any tags you would like to compare.

"},{"location":"#for-developers","title":"For developers","text":"

We use the Fork and Pull model for development: each user creates a copy of the repository on GitHub, commits their requests there, and then sends pull requests for the administrators to merge.

Prerequisites

  1. Register on GitHub, as needed anyway for CMSSW development: http://cms-sw.github.io/cmssw/faq.html

  2. Register your SSH key on GitHub: https://help.github.com/articles/generating-ssh-keys

  3. Fork the repository to create your copy of it: https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/fork (more documentation at https://help.github.com/articles/fork-a-repo )

You will now be able to browse your fork of the repository from https://github.com/your-github-user-name/HiggsAnalysis-CombinedLimit

We strongly encourage you to contribute any developments you make back to the main repository. See contributing.md for details about contributing.

"},{"location":"#combineharvestercombinetools","title":"CombineHarvester/CombineTools","text":"

CombineTools is an additional tool for submitting Combine jobs to batch systems or crab, which was originally developed in the context of Higgs to tau tau analyses. Since the repository contains a certain amount of analysis-specific code, the following scripts can be used to clone it with a sparse checkout for just the core CombineHarvester/CombineTools subpackage, speeding up the checkout and compile times:

git clone via ssh:

bash <(curl -s https://raw.githubusercontent.com/cms-analysis/CombineHarvester/main/CombineTools/scripts/sparse-checkout-ssh.sh)\n

git clone via https:

bash <(curl -s https://raw.githubusercontent.com/cms-analysis/CombineHarvester/main/CombineTools/scripts/sparse-checkout-https.sh)\n

Make sure to run scram to compile the CombineTools package.

See the CombineHarvester documentation pages for more details on using this tool and additional features available in the full package.

"},{"location":"CernVM/","title":"CernVM","text":""},{"location":"CernVM/#standalone-use-inside-cernvm","title":"Standalone use inside CernVM","text":"

Combine can be used standalone inside CernVM by adding the CMS group to the CVMFS configuration. A minimal CernVM working context setup can be found in the CernVM Marketplace under Experimental/HiggsCombine or at https://cernvm-online.cern.ch/context/view/9ee5960ce4b143f5829e72bbbb26d382. At least 2GB of disk space should be reserved on the virtual machine for Combine to work properly.

"},{"location":"CernVM/#available-machines-for-standalone-combine","title":"Available machines for standalone combine","text":"

The standalone version can be easily compiled via CVMFS as it relies on dependencies which are already installed at /cvmfs/cms.cern.ch/. Access to /cvmfs/cms.cern.ch/ can be obtained from lxplus machines or via CernVM. The only requirement will be to add the CMS group to the CVMFS configuration as shown in the picture

At least 2GB of disk space should be reserved on the virtual machine for combine to work properly. A minimal CernVM working context setup can be found in the CernVM Marketplace under Experimental/HiggsCombine.

To use this predefined context, first launch the CernVM locally (e.g. you can use the .ova with VirtualBox, by downloading it from here and launching the downloaded file). You can click on \"pair an instance of CernVM\" from the cernvm-online dashboard, which displays a PIN. In the VirtualBox terminal, pair the virtual machine with this PIN code (enter it in the terminal as #PIN, e.g. #123456). After this, you will be asked again for a username (use user) and then a password (use hcomb).

In case you do not want to use the cvmfs area, you will need to adapt the location of the dependencies listed in both the Makefile and env_standalone.sh files.

"},{"location":"releaseNotes/","title":"Release notes","text":""},{"location":"releaseNotes/#cmssw-10_2_x-v800","title":"CMSSW 10_2_X - v8.0.0","text":"

This release contains all of the changes listed for v7.0.13 below. In addition:

"},{"location":"releaseNotes/#cmssw-8_1_x-v7013","title":"CMSSW 8_1_X - v7.0.13","text":""},{"location":"part2/bin-wise-stats/","title":"Automatic statistical uncertainties","text":""},{"location":"part2/bin-wise-stats/#introduction","title":"Introduction","text":"

The text2workspace.py script is able to produce a type of workspace, using a set of new histogram classes, in which bin-wise statistical uncertainties are added automatically. This can be built for shape-based datacards where the inputs are in TH1 format. Datacards that use RooDataHists are not supported. The bin errors (i.e. values returned by TH1::GetBinError) are used to model the uncertainties.

By default the script will attempt to assign a single nuisance parameter to scale the sum of the process yields in each bin, constrained by the total uncertainty, instead of requiring separate parameters, one per process. This is sometimes referred to as the Barlow-Beeston-lite approach, and is useful as it minimises the number of parameters required in the maximum likelihood fit. A useful description of this approach may be found in section 5 of this report.

"},{"location":"part2/bin-wise-stats/#usage-instructions","title":"Usage instructions","text":"

The following line should be added at the bottom of the datacard, underneath the systematics, to produce a new-style workspace and optionally enable the automatic bin-wise uncertainties:

[channel] autoMCStats [threshold] [include-signal = 0] [hist-mode = 1]\n

The first string channel should give the name of the channels (bins) in the datacard for which the new histogram classes should be used. The wildcard * is supported for selecting multiple channels in one go. The value of threshold should be set to a value greater than or equal to zero to enable the creation of automatic bin-wise uncertainties, or -1 to use the new histogram classes without these uncertainties. A positive value sets the threshold on the effective number of unweighted events above which the uncertainty will be modeled with the Barlow-Beeston-lite approach described above. Below the threshold an individual uncertainty per-process will be created. The algorithm is described in more detail below.

The last two settings are optional. The first of these, include-signal, has a default value of 0 but can be set to 1 as an alternative. By default, the total nominal yield and uncertainty used to test the threshold excludes signal processes. The reason for this is that typically the initial signal normalization is arbitrary, and could unduly lead to a bin being considered well-populated despite poorly populated background templates. Setting this flag will include the signal processes in the uncertainty analysis. Note that this option only affects the logic for creating a single Barlow-Beeston-lite parameter vs. separate per-process parameters - the uncertainties on all signal processes are always included in the actual model! The second flag changes the way the normalization effect of shape-altering uncertainties is handled. In the default mode (1) the normalization is handled separately from the shape morphing via an asymmetric log-normal term. This is identical to how Combine has always handled shape morphing. When set to 2, the normalization will be adjusted in the shape morphing directly. Unless there is a strong motivation we encourage users to leave this on the default setting.
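For example, the following (illustrative) datacard line enables the automatic bin-wise uncertainties for all channels with a threshold of 10 effective events, keeping the default values of the two optional settings:

    * autoMCStats 10 0 1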

"},{"location":"part2/bin-wise-stats/#description-of-the-algorithm","title":"Description of the algorithm","text":"

When threshold is set to a number of effective unweighted events greater than or equal to zero, denoted \\(n^{\\text{threshold}}\\), the following algorithm is applied to each bin:

  1. Sum the yields \\(n_{i}\\) and uncertainties \\(e_{i}\\) of each background process \\(i\\) in the bin. Note that the \\(n_i\\) and \\(e_i\\) include the nominal effect of any scaling parameters that have been set in the datacard, for example rateParams. \\(n_{\\text{tot}} = \\sum_{i\\,\\in\\,\\text{bkg}}n_i\\), \\(e_{\\text{tot}} = \\sqrt{\\sum_{i\\,\\in\\,\\text{bkg}}e_i^{2}}\\)
  2. If \\(e_{\\text{tot}} = 0\\), the bin is skipped and no parameters are created. If this is the case, it is a good idea to check why there is no uncertainty in the background prediction in this bin!
  3. The effective number of unweighted events is defined as \\(n_{\\text{tot}}^{\\text{eff}} = n_{\\text{tot}}^{2} / e_{\\text{tot}}^{2}\\), rounded to the nearest integer (a worked numerical example is given after this list).
  4. If \\(n_{\\text{tot}}^{\\text{eff}} \\leq n^{\\text{threshold}}\\): separate uncertainties will be created for each process. Processes where \\(e_{i} = 0\\) are skipped. If the number of effective events for a given process is lower than \\(n^{\\text{threshold}}\\) a Poisson-constrained parameter will be created. Otherwise a Gaussian-constrained parameter is used.
  5. If \\(n_{\\text{tot}}^{\\text{eff}} \\gt n^{\\text{threshold}}\\): A single Gaussian-constrained Barlow-Beeston-lite parameter is created that will scale the total yield in the bin.
  6. Note that the values of \\(e_{i}\\), and therefore \\(e_{tot}\\), will be updated automatically in the model whenever the process normalizations change.
  7. A Gaussian-constrained parameter \\(x\\) has a nominal value of zero and scales the yield as \\(n_{\\text{tot}} + x \\cdot e_{\\text{tot}}\\). The Poisson-constrained parameters are expressed as a yield multiplier with nominal value one: \\(n_{\\text{tot}} \\cdot x\\).
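As a short worked example, using the numbers from the first populated bin of the example text2workspace.py output shown below: \\(n_{\\text{tot}} = 0.120983\\) and \\(e_{\\text{tot}} = 0.035333\\), so \\(n_{\\text{tot}}^{\\text{eff}} = 0.120983^{2}/0.035333^{2} \\approx 11.7\\), which rounds to 12. This is above the Poisson cut-off of 10, so a single Gaussian-constrained Barlow-Beeston-lite parameter is created for that bin; the per-event weight \\(\\alpha = n_{\\text{tot}}/n_{\\text{tot}}^{\\text{eff}} = 0.120983/12 \\approx 0.0101\\) matches the alpha value printed in the output.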

The output from text2workspace.py will give details on how each bin has been treated by this algorithm, for example:

Show example output
============================================================\nAnalysing bin errors for: prop_binhtt_et_6_7TeV\nPoisson cut-off: 10\nProcesses excluded for sums: ZH qqH WH ggH\n============================================================\nBin        Contents        Error           Notes\n0          0.000000        0.000000        total sum\n0          0.000000        0.000000        excluding marked processes\n  => Error is zero, ignore\n------------------------------------------------------------\n1          0.120983        0.035333        total sum\n1          0.120983        0.035333        excluding marked processes\n1          12.000000       3.464102        Unweighted events, alpha=0.010082\n  => Total parameter prop_binhtt_et_6_7TeV_bin1[0.00,-7.00,7.00] to be gaussian constrained\n------------------------------------------------------------\n2          0.472198        0.232096        total sum\n2          0.472198        0.232096        excluding marked processes\n2          4.000000        2.000000        Unweighted events, alpha=0.118049\n  => Number of weighted events is below poisson threshold\n    ZH                   0.000000        0.000000\n      => Error is zero, ignore\n  ----------------------------------------------------------\n    W                    0.050606        0.029220\n                         3.000000        1.732051        Unweighted events, alpha=0.016869\n      => Product of prop_binhtt_et_6_7TeV_bin2_W[1.00,0.00,12.15] and const [3] to be poisson constrained\n  ----------------------------------------------------------\n    ZJ                   0.142444        0.140865\n                         1.000000        1.000000        Unweighted events, alpha=0.142444\n      => Product of prop_binhtt_et_6_7TeV_bin2_ZJ[1.00,0.00,30.85] and const [1] to be poisson constrained\n  ----------------------------------------------------------\n"},{"location":"part2/bin-wise-stats/#analytic-minimisation","title":"Analytic minimisation","text":"

One significant advantage of the Barlow-Beeston-lite approach is that the maximum likelihood estimate of each nuisance parameter has a simple analytic form that depends only on \\(n_{\\text{tot}}\\), \\(e_{\\text{tot}}\\) and the observed number of data events in the relevant bin. Therefore, when minimising the negative log-likelihood of the whole model, it is possible to remove these parameters from the fit and set them to their best-fit values automatically. For models with large numbers of bins this can reduce the fit time and increase the fit stability. The analytic minimisation is enabled by default starting in Combine v8.2.0; you can disable it by adding the option --X-rtd MINIMIZER_no_analytic when running Combine.
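For example (illustrative command; replace datacard.root with your own workspace):

    combine -M FitDiagnostics datacard.root --X-rtd MINIMIZER_no_analytic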

\n

The figure below shows a performance comparison of the analytic minimisation as a function of the number of bins in the likelihood function. The real time (in seconds) for a typical minimisation of a binned likelihood is shown as a function of the number of bins when invoking the analytic minimisation of the nuisance parameters versus the default numerical approach.

\n\nShow Comparison\n

"},{"location":"part2/bin-wise-stats/#technical-details","title":"Technical details","text":"

Up until recently text2workspace.py would only construct the PDF for each channel using a RooAddPdf, i.e. each component process is represented by a separate PDF and normalization coefficient. However, in order to model bin-wise statistical uncertainties, the alternative RooRealSumPdf can be more useful, as each process is represented by a RooFit function object instead of a PDF, and we can vary the bin yields directly. As such, a new RooFit histogram class CMSHistFunc is introduced, which offers the same vertical template morphing algorithms offered by the current default histogram PDF, FastVerticalInterpHistPdf2. Accompanying this is the CMSHistErrorPropagator class. This evaluates a sum of CMSHistFunc objects, each multiplied by a coefficient. It is also able to scale the summed yield of each bin to account for bin-wise statistical uncertainty nuisance parameters.

\n\n

Warning

\n

One disadvantage of this new approach comes when evaluating the expectation for individual processes, for example when using the --saveShapes option in the FitDiagnostics mode of Combine. The Barlow-Beeston-lite parameters scale the sum of the process yields directly, so extra work is needed to distribute this total scaling back to each individual process. To achieve this, an additional class CMSHistFuncWrapper has been created; for a given CMSHistFunc, the CMSHistErrorPropagator will distribute an appropriate fraction of the total yield shift to each bin. As a consequence of the extra computation needed to distribute the yield shifts in this way, the evaluation of individual process shapes in --saveShapes can take longer than previously.

"},{"location":"part2/physicsmodels/","title":"Physics Models","text":"

Combine can be run directly on the text-based datacard. However, for more advanced physics models, the internal step to convert the datacard to a binary workspace should be performed by the user. To create a binary workspace starting from a datacard.txt, you can run

text2workspace.py datacard.txt -o workspace.root\n

By default (without the -o option), the binary workspace will be named datacard.root - i.e. the .txt suffix will be replaced by .root.

A full set of options for text2workspace can be found by running text2workspace.py --help.

The default model that will be produced when running text2workspace is one in which all processes identified as signal are multiplied by a common multiplier r. This is all that is needed for simply setting limits or calculating significances.

text2workspace will convert the datacard into a PDF that summarizes the analysis. For example, let's take a look at the data/tutorials/counting/simple-counting-experiment.txt datacard.

# Simple counting experiment, with one signal and one background process\n# Extremely simplified version of the 35/pb H->WW analysis for mH = 200 GeV,\n# for 4th generation exclusion (EWK-10-009, arxiv:1102.5429v1)\nimax 1  number of channels\njmax 1  number of backgrounds\nkmax 2  number of nuisance parameters (sources of systematical uncertainties)\n------------\n# we have just one channel, in which we observe 0 events\nbin         1\nobservation 0\n------------\n# now we list the expected events for signal and all backgrounds in that bin\n# the second 'process' line must have a positive number for backgrounds, and 0 for signal\n# then we list the independent sources of uncertainties, and give their effect (syst. error)\n# on each process and bin\nbin             1      1\nprocess       ggh4G  Bckg\nprocess         0      1\nrate           4.76  1.47\n------------\ndeltaS  lnN    1.20    -    20% uncertainty on signal\ndeltaB  lnN      -   1.50   50% uncertainty on background\n

If we run text2workspace.py on this datacard and take a look at the workspace (w) inside the .root file produced, we will find a number of different objects representing the signal, background, and observed event rates, as well as the nuisance parameters and signal strength \\(r\\). Note that often in the statistics literature, this parameter is referred to as \\(\\mu\\).

From these objects, the necessary PDF has been constructed (named model_s). For this counting experiment we will expect a simple PDF of the form

\\[ p(n_{\\mathrm{obs}}| r,\\nu_{S},\\nu_{B})\\propto \\dfrac{[r\\cdot n_{S}(\\nu_{S})+n_{B}(\\nu_{B})]^{n_{\\mathrm{obs}}} } {n_{\\mathrm{obs}}!}e^{-[r\\cdot n_{S}(\\nu_{S})+n_{B}(\\nu_{B})]} \\cdot e^{-\\frac{1}{2}(\\nu_{S}- y_{S})^{2}} \\cdot e^{-\\frac{1}{2}(\\nu_{B}- y_{B})^{2}} \\]

where the expected signal and background rates are expressed as functions of the nuisance parameters, \\(n_{S}(\\nu_{S}) = 4.76(1+0.2)^{\\nu_{S}}~\\) and \\(~n_{B}(\\nu_{B}) = 1.47(1+0.5)^{\\nu_{B}}\\). The \\(y_{S},~y_{B}\\) are the auxiliary observables. In the code, these will have the same name as the corresponding nuisance parameter, with the extension _In.

The first term represents the usual Poisson expression for observing \\(n_{\\mathrm{obs}}\\) events, while the second two are the Gaussian constraint terms for the nuisance parameters. In this case \\({y_S}={y_B}=0\\), and the widths of both Gaussians are 1.
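As a quick numerical illustration of the scaling above, shifting the signal nuisance parameter to \\(\\nu_{S}=+1\\) gives \\(n_{S}(1) = 4.76\\,(1+0.2)^{1} = 5.71\\), i.e. the \\(+1\\sigma\\) value implied by the 20% uncertainty in the datacard.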

A combination of counting experiments (or a binned shape datacard) will look like a product of PDFs of this kind. For parametric/unbinned analyses, the PDF for each process in each channel is provided instead of using the Poisson terms, and the product runs over the bin counts/events.

"},{"location":"part2/physicsmodels/#model-building","title":"Model building","text":"

For more complex models, PhysicsModels can be produced. To use a different physics model instead of the default one, use the option -P as in

text2workspace.py datacard -P HiggsAnalysis.CombinedLimit.PythonFile:modelName\n

Generic models can be implemented by writing a python class that:

In the case of SM-like Higgs boson measurements, the class should inherit from SMLikeHiggsModel (redefining getHiggsSignalYieldScale), while beyond that one can inherit from PhysicsModel. You can find some examples in PhysicsModel.py.

In the 4-process model (PhysicsModel:floatingXSHiggs), you will see that each of the 4 dominant Higgs boson production modes gets a separate scaling parameter, r_ggH, r_qqH, r_ttH and r_VH (or r_ZH and r_WH), as defined in,

def doParametersOfInterest(self):\n  \"\"\"Create POI and other parameters, and define the POI set.\"\"\"\n  # --- Signal Strength as only POI ---\n  if \"ggH\" in self.modes: self.modelBuilder.doVar(\"r_ggH[1,%s,%s]\" % (self.ggHRange[0], self.ggHRange[1]))\n  if \"qqH\" in self.modes: self.modelBuilder.doVar(\"r_qqH[1,%s,%s]\" % (self.qqHRange[0], self.qqHRange[1]))\n  if \"VH\"  in self.modes: self.modelBuilder.doVar(\"r_VH[1,%s,%s]\"  % (self.VHRange [0], self.VHRange [1]))\n  if \"WH\"  in self.modes: self.modelBuilder.doVar(\"r_WH[1,%s,%s]\"  % (self.WHRange [0], self.WHRange [1]))\n  if \"ZH\"  in self.modes: self.modelBuilder.doVar(\"r_ZH[1,%s,%s]\"  % (self.ZHRange [0], self.ZHRange [1]))\n  if \"ttH\" in self.modes: self.modelBuilder.doVar(\"r_ttH[1,%s,%s]\" % (self.ttHRange[0], self.ttHRange[1]))\n  poi = \",\".join([\"r_\"+m for m in self.modes])\n  if self.pois: poi = self.pois\n  ...\n

The mapping of which POI scales which process is handled via the following function,

def getHiggsSignalYieldScale(self,production,decay, energy):\n  if production == \"ggH\": return (\"r_ggH\" if \"ggH\" in self.modes else 1)\n  if production == \"qqH\": return (\"r_qqH\" if \"qqH\" in self.modes else 1)\n  if production == \"ttH\": return (\"r_ttH\" if \"ttH\" in self.modes else (\"r_ggH\" if self.ttHasggH else 1))\n  if production in [ \"WH\", \"ZH\", \"VH\" ]: return (\"r_VH\" if \"VH\" in self.modes else 1)\n  raise RuntimeError(\"Unknown production mode '%s'\" % production)\n

You should note that text2workspace will look for the python module in PYTHONPATH. If you want to keep your model local, you'll need to add the location of the python file to PYTHONPATH.

A number of models used in the LHC Higgs combination paper can be found in LHCHCGModels.py.

The models can be applied to the datacard by using the -P option, for example -P HiggsAnalysis.CombinedLimit.HiggsCouplings:c7, and others that are defined in HiggsCouplings.py.

Below are some (more generic) example models that also exist in GitHub.

"},{"location":"part2/physicsmodels/#multisignalmodel-ready-made-model-for-multiple-signal-processes","title":"MultiSignalModel ready made model for multiple signal processes","text":"

Combine already contains a model HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel that can be used to assign different signal strengths to multiple processes in a datacard, configurable from the command line.

The model is configured by passing one or more mappings in the form --PO 'map=bin/process:parameter' to text2workspace:

Passing the additional option --PO verbose will set the code to verbose mode, printing out the scaling factors for each process; we encourage the use of this option to make sure that the processes are being scaled correctly.

The MultiSignalModel will define all parameters as parameters of interest, but that can be then changed from the command line, as described in the following subsection.

Some examples, taking as reference the toy datacard test/multiDim/toy-hgg-125.txt:

  $ text2workspace.py -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel  --PO verbose --PO 'map=.*/ggH:r[1,0,10]' --PO 'map=.*/qqH:r' toy-hgg-125.txt -o toy-1d.root\n  [...]\n  Will create a POI  r  with factory  r[1,0,10]\n  Mapping  r  to  ['.*/ggH']  patterns\n  Mapping  r  to  ['.*/qqH']  patterns\n  [...]\n  Will scale  incl/bkg  by  1\n  Will scale  incl/ggH  by  r\n  Will scale  incl/qqH  by  r\n  Will scale  dijet/bkg  by  1\n  Will scale  dijet/ggH  by  r\n  Will scale  dijet/qqH  by  r\n
  $ text2workspace.py -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel  --PO verbose --PO 'map=.*/ggH:r_ggH[1,0,10]' --PO 'map=.*/qqH:r_qqH[1,0,20]' toy-hgg-125.txt -o toy-2d.root\n  [...]\n  Will create a POI  r_ggH  with factory  r_ggH[1,0,10]\n  Mapping  r_ggH  to  ['.*/ggH']  patterns\n  Will create a POI  r_qqH  with factory  r_qqH[1,0,20]\n  Mapping  r_qqH  to  ['.*/qqH']  patterns\n  [...]\n  Will scale  incl/bkg  by  1\n  Will scale  incl/ggH  by  r_ggH\n  Will scale  incl/qqH  by  r_qqH\n  Will scale  dijet/bkg  by  1\n  Will scale  dijet/ggH  by  r_ggH\n  Will scale  dijet/qqH  by  r_qqH\n
  $ text2workspace.py -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel  --PO verbose --PO 'map=.*/ggH:1' --PO 'map=.*/qqH:r_qqH[1,0,20]' toy-hgg-125.txt -o toy-1d-qqH.root\n  [...]\n  Mapping  1  to  ['.*/ggH']  patterns\n  Will create a POI  r_qqH  with factory  r_qqH[1,0,20]\n  Mapping  r_qqH  to  ['.*/qqH']  patterns\n  [...]\n  Will scale  incl/bkg  by  1\n  Will scale  incl/ggH  by  1\n  Will scale  incl/qqH  by  r_qqH\n  Will scale  dijet/bkg  by  1\n  Will scale  dijet/ggH  by  1\n  Will scale  dijet/qqH  by  r_qqH\n
 $ text2workspace.py -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel  --PO verbose --PO 'map=.*/ggH:0' --PO 'map=.*/qqH:r_qqH[1,0,20]' toy-hgg-125.txt -o toy-1d-qqH0-only.root\n [...]\n Mapping  0  to  ['.*/ggH']  patterns\n Will create a POI  r_qqH  with factory  r_qqH[1,0,20]\n Mapping  r_qqH  to  ['.*/qqH']  patterns\n [...]\n Will scale  incl/bkg  by  1\n Will scale  incl/ggH  by  0\n Will scale  incl/qqH  by  r_qqH\n Will scale  dijet/bkg  by  1\n Will scale  dijet/ggH  by  0\n Will scale  dijet/qqH  by  r_qqH\n
"},{"location":"part2/physicsmodels/#two-hypothesis-testing","title":"Two Hypothesis testing","text":"

The PhysicsModel that encodes the signal model above is the twoHypothesisHiggs, which assumes signal processes with suffix _ALT will exist in the datacard. An example of such a datacard can be found under data/benchmarks/simple-counting/twoSignals-3bin-bigBSyst.txt

 $ text2workspace.py twoSignals-3bin-bigBSyst.txt -P HiggsAnalysis.CombinedLimit.HiggsJPC:twoHypothesisHiggs -m 125.7 --PO verbose -o jcp_hww.root\n\n MH (not there before) will be assumed to be 125.7\n Process  S  will get norm  not_x\n Process  S_ALT  will get norm  x\n Process  S  will get norm  not_x\n Process  S_ALT  will get norm  x\n Process  S  will get norm  not_x\n Process  S_ALT  will get norm  x\n

The two processes (S and S_ALT) will get different scaling parameters. The LEP-style likelihood for hypothesis testing can now be used by setting x or not_x to 1 and 0 and comparing the two likelihood evaluations.

"},{"location":"part2/physicsmodels/#signal-background-interference","title":"Signal-background interference","text":"

Since negative probability distribution functions do not exist, the recommended way to implement this is to start from the expression for the individual amplitudes \\(A\\) and the parameter of interest \\(k\\),

\\[ \\mathrm{Yield} = |k * A_{s} + A_{b}|^2 = k^2 * |A_{s}|^2 + k * 2 \\Re(A_{s}^* A_{b}) + |A_{b}|^2 = \\mu * S + \\sqrt{\\mu} * I + B \\]

where

\\(\\mu = k^2, ~S = |A_{s}|^2,~B = |A_b|^2\\) and \\(S+B+I = |A_s + A_b|^2\\).

With some algebra you can work out that,

\\(\\mathrm{Yield} = \\sqrt{\\mu} * \\left[S+B+I\\right] + (\\mu-\\sqrt{\\mu}) * \\left[S\\right] + (1-\\sqrt{\\mu}) * \\left[B\\right]\\)

where square brackets represent the input (histograms as TH1 or RooDataHists) that one needs to provide.
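As a quick check of the identity above, expanding the right-hand side and collecting terms gives \\(\\sqrt{\\mu}\\,(S+B+I) + (\\mu-\\sqrt{\\mu})\\,S + (1-\\sqrt{\\mu})\\,B = \\mu S + \\sqrt{\\mu}\\, I + B\\), which reproduces the expression for the yield given earlier.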

An example of this scheme is implemented in the HiggsWidth model and is completely general, since all of the three components above are strictly positive. In this example, the POI is CMS_zz4l_mu and the equations for the three components are scaled (separately for the qqH and ggH processes) as,

 self.modelBuilder.factory_( \"expr::ggH_s_func(\\\"@0-sqrt(@0)\\\", CMS_zz4l_mu)\")\n self.modelBuilder.factory_(  \"expr::ggH_b_func(\\\"1-sqrt(@0)\\\", CMS_zz4l_mu)\")\n self.modelBuilder.factory_(  \"expr::ggH_sbi_func(\\\"sqrt(@0)\\\", CMS_zz4l_mu)\")\n\n self.modelBuilder.factory_( \"expr::qqH_s_func(\\\"@0-sqrt(@0)\\\", CMS_zz4l_mu)\")\n self.modelBuilder.factory_(  \"expr::qqH_b_func(\\\"1-sqrt(@0)\\\", CMS_zz4l_mu)\")\n self.modelBuilder.factory_(  \"expr::qqH_sbi_func(\\\"sqrt(@0)\\\", CMS_zz4l_mu)\")\n
"},{"location":"part2/physicsmodels/#multi-process-interference","title":"Multi-process interference","text":"

The above formulation can be extended to multiple parameters of interest (POIs). See AnalyticAnomalousCoupling for an example. However, the computational performance scales quadratically with the number of POIs, and can get extremely expensive for 10 or more, as may be encountered often with EFT analyses. To alleviate this issue, an accelerated interference modeling technique is implemented for template-based analyses via the interferenceModel physics model. In this model, each bin yield \\(y\\) is parameterized

\\[ y(\\vec{\\mu}) = y_0 (\\vec{\\mu}^\\top M \\vec{\\mu}) \\]

as a function of the POI vector \\(\\vec{\\mu}\\), a nominal template \\(y_0\\), and a scaling matrix \\(M\\). To see how this parameterization relates to that of the previous section, we can define:

\\[ y_0 = A_b^2, \\qquad M = \\frac{1}{A_b^2} \\begin{bmatrix} |A_s|^2 & \\Re(A_s^* A_b) \\\\ \\Re(A_s A_b^*) & |A_b|^2 \\end{bmatrix}, \\qquad \\vec{\\mu} = \\begin{bmatrix} \\sqrt{\\mu} \\\\ 1 \\end{bmatrix} \\]

which leads to the same parameterization. At present, this technique only works with CMSHistFunc-based workspaces, as these are the most common workspace types encountered and the default when using autoMCStats. To use this model, for each bin find \\(y_0\\) and put it into the datacard as a signal process, then find \\(M\\) and save the lower triangular component as an array in a scaling.json file with a syntax as follows:

[\n  {\n    \"channel\": \"my_channel\",\n    \"process\": \"my_nominal_process\",\n    \"parameters\": [\"sqrt_mu[1,0,2]\", \"Bscaling[1]\"],\n    \"scaling\": [\n      [0.5, 0.1, 1.0],\n      [0.6, 0.2, 1.0],\n      [0.7, 0.3, 1.0]\n    ]\n  }\n]\n

where the parameters are declared using RooFit's factory syntax and each row of the scaling field represents the scaling information of a bin, e.g. if \\(y_0 = |A_b|^2\\) then each row would contain three entries:

\\[ |A_s|^2 / |A_b|^2,\\quad \\Re(A_s^* A_b)/|A_b|^2,\\quad 1 \\]

For several coefficients, one would enumerate as follows:

scaling = []\nfor ibin in range(nbins):\n    binscaling = []\n    for icoef in range(ncoef):\n        for jcoef in range(icoef + 1):\n            binscaling.append(amplitude_squared_for(ibin, icoef, jcoef))\n    scaling.append(binscaling)\n

Then, to construct the workspace, run

text2workspace.py card.txt -P HiggsAnalysis.CombinedLimit.InterferenceModels:interferenceModel \\\n    --PO verbose --PO scalingData=scaling.json\n

For large amounts of scaling data, you can optionally use gzipped json (.json.gz) or pickle (.pkl.gz) files with 2D numpy arrays for the scaling coefficients instead of lists. The function numpy.tril_indices(ncoef) is helpful for extracting the lower triangle of a square matrix.
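A minimal sketch of how one might build such a gzipped pickle file is shown below, assuming the full per-bin scaling matrices are available as a numpy array of shape (nbins, ncoef, ncoef); the dictionary structure mirrors the JSON example above, and the function and file names are placeholders.

    import gzip
    import pickle
    import numpy as np

    def write_scaling(M, channel, process, parameters, outname="scaling.pkl.gz"):
        """M: full symmetric scaling matrices, shape (nbins, ncoef, ncoef)."""
        nbins, ncoef, _ = M.shape
        rows, cols = np.tril_indices(ncoef)   # lower triangle incl. diagonal, same ordering as the loop above
        scaling = M[:, rows, cols]             # shape (nbins, ncoef*(ncoef+1)/2)
        data = [{
            "channel": channel,
            "process": process,
            "parameters": parameters,
            "scaling": scaling,                 # 2D numpy array instead of nested lists
        }]
        with gzip.open(outname, "wb") as f:
            pickle.dump(data, f)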

You could pick any nominal template, and adjust the scaling as appropriate. Generally it is advisable to use a nominal template corresponding to near where you expect the best-fit values of the POIs to be so that the shape systematic effects are well-modeled in that region.

It may be the case that the relative contributions of the terms are themselves a function of the POIs. For example, in VBF di-Higgs production, BSM modifications to the production rate can be parameterized in the \"kappa\" framework via three diagrams, with scaling coefficients \\(\\kappa_V \\kappa_\\lambda\\), \\(\\kappa_V^2\\), and \\(\\kappa_{2V}\\), respectively, that interfere. In that case, you can declare formulas with the factory syntax to represent each amplitude as follows:

[\n  {\n    \"channel\": \"a_vbf_channel\",\n    \"process\": \"VBFHH\",\n    \"parameters\": [\"expr::a0('@0*@1', kv[1,0,2], kl[1,0,2])\", \"expr::a1('@0*@0', kv[1,0,2])\", \"k2v[1,0,2]\"],\n    \"scaling\": [\n      [3.30353674666415, -8.54170982038222, 22.96464188467882, 4.2353483207128, -11.07996258835088, 5.504469544697623],\n      [2.20644332142891, -7.076836641962523, 23.50989689214267, 4.053185685866683, -13.08569222837996, 7.502346155380032]\n    ]\n  }\n]\n

However, you will need to manually specify what the POIs should be when creating the workspace using the POIs= physics option, e.g.

text2workspace.py card.txt -P HiggsAnalysis.CombinedLimit.InterferenceModels:interferenceModel \\\n  --PO scalingData=scaling.json --PO 'POIs=kl[1,0,2]:kv[1,0,2]:k2v[1,0,2]'\n
"},{"location":"part2/settinguptheanalysis/","title":"Preparing the datacard","text":"

The input to Combine, which defines the details of the analysis, is a plain ASCII file we will refer to as datacard. This is true whether the analysis is a simple counting experiment or a shape analysis.

"},{"location":"part2/settinguptheanalysis/#a-simple-counting-experiment","title":"A simple counting experiment","text":"

The file data/tutorials/counting/realistic-counting-experiment.txt shows an example of a counting experiment.

The first lines can be used to add some descriptive information. Those lines must start with a \"#\", and they are not parsed by Combine:

# Simple counting experiment, with one signal and a few background processes\n# Simplified version of the 35/pb H->WW analysis for mH = 160 GeV\n

Following this, one declares the number of observables, imax, that are present in the model used to set limits / extract confidence intervals. The number of observables will typically be the number of channels in a counting experiment. The value * can be specified for imax, which tells Combine to determine the number of observables from the rest of the datacard. In order to better catch mistakes, it is recommended to explicitly specify the value.

imax 1  number of channels\n

This declaration is followed by a specification of the number of background sources to be considered, jmax, and the number of independent sources of systematic uncertainty, kmax:

jmax 3  number of backgrounds\nkmax 5  number of nuisance parameters (sources of systematic uncertainty)\n

In the example there is 1 channel, there are 3 background sources, and there are 5 independent sources of systematic uncertainty.

After providing this information, the following lines describe what is observed in data: the number of events observed in each channel. The first line, starting with bin, defines the label used for each channel. In the example we have 1 channel, labelled bin1, and in the following line, observation, the number of observed events is given: 0 in this example.

# we have just one channel, in which we observe 0 events\nbin bin1\nobservation 0\n

This is followed by information related to the expected number of events, for each bin and process, arranged in (#channels)*(#processes) columns.

bin          bin1     bin1     bin1     bin1\nprocess         ggH  qqWW  ggWW  others\nprocess          0     1     2     3\nrate           1.47  0.63  0.06  0.22\n

If a process does not contribute in a given bin, it can be removed from the datacard, or the rate can be set to 0.

The final section of the datacard describes the systematic uncertainties:

lumi    lnN    1.11    -   1.11    -    lumi affects both signal and gg->WW (mc-driven). lnN = lognormal\nxs_ggH  lnN    1.16    -     -     -    gg->H cross section + signal efficiency + other minor ones.\nWW_norm gmN 4    -   0.16    -     -    WW estimate of 0.64 comes from sidebands: 4 events in sideband times 0.16 (=> ~50% statistical uncertainty)\nxs_ggWW lnN      -     -   1.50    -    50% uncertainty on gg->WW cross section\nbg_others lnN    -     -     -   1.30   30% uncertainty on the rest of the backgrounds\n

In the example, there are 5 uncertainties: lumi (an 11% log-normal uncertainty affecting the signal and gg->WW), xs_ggH (a 16% uncertainty on the signal), WW_norm (a gmN uncertainty on the WW estimate, derived from 4 events in a sideband scaled by 0.16), xs_ggWW (a 50% uncertainty on the gg->WW cross section), and bg_others (a 30% uncertainty on the remaining backgrounds).

"},{"location":"part2/settinguptheanalysis/#shape-analyses","title":"Shape analyses","text":"

The datacard has to be supplemented with two extensions:

The expected shape can be parametric, or not. In the first case the parametric PDFs have to be given as input to the tool. In the latter case, for each channel, histograms have to be provided for the expected shape of each process. The data have to be provided as input as a histogram to perform a binned shape analysis, and as a RooDataSet to perform an unbinned shape analysis.

Warning

If using RooFit-based inputs (RooDataHists/RooDataSets/RooAbsPdfs) then you need to ensure you are using different RooRealVars as the observable in each category entering the statistical analysis. It is possible to use the same RooRealVar if the observable has the same range (and binning if using binned data) in each category, although in most cases it is simpler to avoid doing this.

"},{"location":"part2/settinguptheanalysis/#rates-for-shape-analyses","title":"Rates for shape analyses","text":"

As with the counting experiment, the total nominal rate of a given process must be identified in the rate line of the datacard. However, there are special options for shape-based analyses, as follows:

"},{"location":"part2/settinguptheanalysis/#binned-shape-analyses","title":"Binned shape analyses","text":"

For each channel, histograms have to be provided for the observed shape and for the expected shape of each process.

The Combine tool can take as input histograms saved as TH1, as RooAbsHist in a RooFit workspace (an example of how to create a RooFit workspace and save histograms is available in github), or from a pandas dataframe (example).

The block of lines defining the mapping (first block in the datacard) contains one or more rows of the form
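For reference, these mapping rows follow the general pattern below (a reconstruction based on the concrete examples later in this section; the last field is optional and only needed when shape uncertainties are used):

    shapes <process> <channel> <file> <histogram> [<histogram_with_systematics>]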

In this line,

In addition, user-defined keywords can be used. Any occurrence of $WORD in the datacard will be replaced by VALUE when including the option --keyword-value WORD=VALUE. This option can be repeated multiple times for multiple keywords.

"},{"location":"part2/settinguptheanalysis/#template-shape-uncertainties","title":"Template shape uncertainties","text":"

Shape uncertainties can be taken into account by vertical interpolation of the histograms. The shapes (fraction of events \\(f\\) in each bin) are interpolated using a spline for shifts below +/- 1\u03c3 and linearly outside of that. Specifically, for nuisance parameter values \\(|\\nu|\\leq 1\\)

\\[ f(\\nu) = \\frac{1}{2} \\left( (\\delta^{+}-\\delta^{-})\\nu + \\frac{1}{8}(\\delta^{+}+\\delta^{-})(3\\nu^6 - 10\\nu^4 + 15\\nu^2) \\right) \\]

and for \\(|\\nu|> 1\\) (\\(|\\nu|<-1\\)), \\(f(\\nu)\\) is a straight line with gradient \\(\\delta^{+}\\) (\\(\\delta^{-}\\)), where \\(\\delta^{+}=f(\\nu=1)-f(\\nu=0)\\), and \\(\\delta^{-}=f(\\nu=-1)-f(\\nu=0)\\), derived using the nominal and up/down histograms. This interpolation is designed so that the values of \\(f(\\nu)\\) and its derivatives are continuous for all values of \\(\\nu\\).

The normalizations are interpolated linearly in log scale, just like we do for log-normal uncertainties. If the value in a given bin is negative for some value of \\(\\nu\\), the value will be truncated at 0.

For each shape uncertainty and process/channel affected by it, two additional input shapes have to be provided. These are obtained by shifting the parameter up and down by one standard deviation. When building the likelihood, each shape uncertainty is associated to a nuisance parameter taken from a unit gaussian distribution, which is used to interpolate or extrapolate using the specified histograms.

For each given shape uncertainty, the part of the datacard describing shape uncertainties must contain a row

The effect can be \"-\" or 0 for no effect, 1 for the normal effect, and something different from 1 to test larger or smaller effects (in that case, the unit gaussian is scaled by that factor before using it as parameter for the interpolation).

The datacard in data/tutorials/shapes/simple-shapes-TH1.txt provides an example of how to include shapes in the datacard. In the first block the following line specifies the shape mapping:

shapes * * simple-shapes-TH1.root $PROCESS $PROCESS_$SYSTEMATIC\n

The last block concerns the treatment of the systematic uncertainties that affect shapes. In this case there are two uncertainties with a shape-altering effect.

alpha  shape    -           1   uncertainty on background shape and normalization\nsigma  shape    0.5         -   uncertainty on signal resolution. Assume the histogram is a 2 sigma shift,\n#                                so divide the unit gaussian by 2 before doing the interpolation\n

There are two options for the interpolation algorithm in the \"shape\" uncertainty. Putting shape will result in an interpolation of the fraction of events in each bin. That is, the histograms are first normalized before interpolation. Putting shapeN will instead base the interpolation on the logs of the fraction in each bin. For both shape and shapeN, the total normalization is interpolated using an asymmetric log-normal, so that the effect of the systematic on both the shape and normalization are accounted for. The following image shows a comparison of the two algorithms for the example datacard.

In this case there are two processes, signal and background, and two uncertainties affecting the background (alpha) and signal shapes (sigma). In the ROOT file, two histograms per systematic uncertainty have to be provided; they are the shapes obtained, for the specific process, by shifting the parameter associated with the uncertainty up and down by one standard deviation: background_alphaUp and background_alphaDown, signal_sigmaUp and signal_sigmaDown.

The content of the ROOT file simple-shapes-TH1.root associated with the datacard data/tutorials/shapes/simple-shapes-TH1.txt is:

root [0]\nAttaching file simple-shapes-TH1.root as _file0...\nroot [1] _file0->ls()\nTFile**     simple-shapes-TH1.root\n TFile*     simple-shapes-TH1.root\n  KEY: TH1F signal;1    Histogram of signal__x\n  KEY: TH1F signal_sigmaUp;1    Histogram of signal__x\n  KEY: TH1F signal_sigmaDown;1  Histogram of signal__x\n  KEY: TH1F background;1    Histogram of background__x\n  KEY: TH1F background_alphaUp;1    Histogram of background__x\n  KEY: TH1F background_alphaDown;1  Histogram of background__x\n  KEY: TH1F data_obs;1  Histogram of data_obs__x\n  KEY: TH1F data_sig;1  Histogram of data_sig__x\n

For example, without shape uncertainties there would only be one row with shapes * * shapes.root $CHANNEL/$PROCESS. Then, to give a simple example for two channels (\"e\", \"mu\") with three processes (\"higgs\", \"zz\", \"top\"), the ROOT file contents should look like:

histogram        meaning
e/data_obs       observed data in electron channel
e/higgs          expected shape for higgs in electron channel
e/zz             expected shape for ZZ in electron channel
e/top            expected shape for top in electron channel
mu/data_obs      observed data in muon channel
mu/higgs         expected shape for higgs in muon channel
mu/zz            expected shape for ZZ in muon channel
mu/top           expected shape for top in muon channel

If there is also an uncertainty that affects the shape, e.g. the jet energy scale, shape histograms for the jet energy scale shifted up and down by one sigma need to be included. This could be done by creating a folder for each process and writing a line like

shapes * * shapes.root $CHANNEL/$PROCESS/nominal $CHANNEL/$PROCESS/$SYSTEMATIC

or a postfix can be added to the histogram name:

shapes * * shapes.root $CHANNEL/$PROCESS $CHANNEL/$PROCESS_$SYSTEMATIC

Warning

If you have a nuisance parameter that has shape effects on some processes (using shape) and rate effects on other processes (using lnN) you should use a single line for the systematic uncertainty with shape?. This will tell Combine to first look for Up/Down systematic templates for that process and, if it does not find them, it will interpret the number that you put for the process as a lnN instead.

For a detailed example of a template-based binned analysis, see the H\u2192\u03c4\u03c4 2014 DAS tutorial, or in our Tutorial pages.

"},{"location":"part2/settinguptheanalysis/#unbinned-or-parametric-shape-analyses","title":"Unbinned or parametric shape analyses","text":"

In some cases, it can be convenient to describe the expected signal and background shapes in terms of analytical functions, rather than templates. Typical examples are searches/measurements where the signal is apparent as a narrow peak over a smooth continuum background. In this context, uncertainties affecting the shapes of the signal and backgrounds can be implemented naturally as uncertainties in the parameters of those analytical functions. It is also possible to adopt an agnostic approach in which the parameters of the background model are left freely floating in the fit to the data, i.e. only requiring the background to be well described by a smooth function.

Technically, this is implemented by means of the RooFit package, which allows writing generic probability density functions, and saving them into ROOT files. The PDFs can be either taken from RooFit's standard library of functions (e.g. Gaussians, polynomials, ...) or hand-coded in C++, and combined together to form even more complex shapes.

In the datacard using templates, the column after the file name would have been the name of the histogram. For parametric analysis we need two names to identify the mapping, separated by a colon (:).

shapes process channel shapes.root workspace_name:pdf_name

The first part identifies the name of the input RooWorkspace containing the PDF, and the second part the name of the RooAbsPdf inside it (or, for the observed data, the RooAbsData). It is possible to have multiple input workspaces, just as there can be multiple input ROOT files. You can use any of the usual RooFit pre-defined PDFs for your signal and background models.

Warning

If in your model you are using RooAddPdfs, in which the coefficients are not defined recursively, Combine will not interpret them correctly. You can add the option --X-rtd ADDNLL_RECURSIVE=0 to any Combine command in order to recover the correct interpretation, however we recommend that you instead re-define your PDF so that the coefficients are recursive (as described in the RooAddPdf documentation) and keep the total normalization (i.e the extended term) as a separate object, as in the case of the tutorial datacard.

For example, take a look at the data/tutorials/shapes/simple-shapes-parametric.txt. We see the following line:

shapes * * simple-shapes-parametric_input.root w:$PROCESS\n[...]\nbin          1          1\nprocess      sig    bkg\n

which indicates that the input file simple-shapes-parametric_input.root should contain an input workspace (w) with PDFs named sig and bkg, since these are the names of the two processes in the datacard. Additionally, we expect there to be a data set named data_obs. If we look at the contents of the workspace in data/tutorials/shapes/simple-shapes-parametric_input.root, this is indeed what we see:

root [1] w->Print()\n\nRooWorkspace(w) w contents\n\nvariables\n---------\n(MH,bkg_norm,cc_a0,cc_a1,cc_a2,j,vogian_sigma,vogian_width)\n\np.d.f.s\n-------\nRooChebychev::bkg[ x=j coefList=(cc_a0,cc_a1,cc_a2) ] = 2.6243\nRooVoigtian::sig[ x=j mean=MH width=vogian_width sigma=vogian_sigma ] = 0.000639771\n\ndatasets\n--------\nRooDataSet::data_obs(j)\n

In this datacard, the signal is parameterized in terms of the hypothesized mass (MH). Combine will use this variable, instead of creating its own, and it will be set to the value passed via -m. For this reason, we should add the option -m 30 (or another value within the observable range) when running Combine. You will also see there is a variable named bkg_norm. This is used to normalize the background rate (see the section on Rate parameters below for details).
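
For instance, an asymptotic limit could be computed on this card (an illustrative command; any value of -m within the observable range works) with:

combine -M AsymptoticLimits data/tutorials/shapes/simple-shapes-parametric.txt -m 30\n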

Warning

Combine will not accept RooExtendedPdfs as input. This is to alleviate a bug that led to improper treatment of the normalization when using multiple RooExtendedPdfs to describe a single process. You should instead use RooAbsPdfs and provide the rate as a separate object (see the Rate parameters section).

The part of the datacard related to the systematics can include lines with the syntax
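
name param X Y\n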

These lines encode uncertainties in the parameters of the signal and background PDFs. The parameter is to be assigned a Gaussian uncertainty of Y around its mean value of X. One can change the mean value from 0 to 1 (or any value, if one so chooses) if the parameter in question is multiplicative instead of additive.

In the data/tutorials/shapes/simple-shapes-parametric.txt datacard, there are lines for one such parametric uncertainty,

sigma   param 1.0      0.1\n

meaning there is a parameter in the input workspace called sigma, which should be constrained with a Gaussian centered at 1.0 with a width of 0.1. Note that the exact interpretation of these parameters is left to the user, since the signal PDF is constructed externally by you. All Combine knows is that 1.0 should be the most likely value and 0.1 is its 1\u03c3 uncertainty. Asymmetric uncertainties are written using the syntax -1\u03c3/+1\u03c3 in the datacard, as is the case for lnN uncertainties.

If one wants to specify a parameter that is freely floating across its given range, and not Gaussian constrained, the following syntax is used:
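
name flatParam\n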

Note that this is not strictly necessary for frequentist methods using profiled likelihoods, as Combine will still profile these nuisance parameters when performing fits (as is the case for the simple-shapes-parametric.txt datacard).

Warning

All parameters that are floating or constant in the user's input workspaces will remain floating or constant. Combine will not modify those for you!

A full example of a parametric analysis can be found in this H\u2192\u03b3\u03b3 2014 DAS tutorial or in our Tutorial pages.

"},{"location":"part2/settinguptheanalysis/#caveat-on-using-parametric-pdfs-with-binned-datasets","title":"Caveat on using parametric PDFs with binned datasets","text":"

Users should be aware of a feature that affects the use of parametric PDFs together with binned datasets.

RooFit uses the integral of the PDF, computed analytically (or numerically, but disregarding the binning), to normalize it, but computes the expected event yield in each bin by evaluating the PDF at the bin center. This means that if the variation of the PDF is sizeable within the bin, there is a mismatch between the sum of the event yields per bin and the PDF normalization, which can cause a bias in the fits. More specifically, the bias is present if the contribution of the second derivative of the PDF, integrated over the bin, is not negligible. For linear functions, an evaluation at the bin center is correct. There are two recommended ways to work around this issue:

1. Use narrow bins

It is recommended to use bins that are significantly finer than the characteristic scale of the PDFs. Even in the absence of this feature, this would be advisable. Note that this caveat does not apply to analyses using templates (they are constant across each bin, so there is no bias), or using unbinned datasets.

2. Use a RooParametricShapeBinPdf

Another solution (currently only implemented for 1-dimensional histograms) is to use a custom PDF that performs the correct integrals internally, as in RooParametricShapeBinPdf.

Note that this PDF class now allows parameters that are themselves RooAbsReal objects (i.e. functions of other variables). The integrals are handled internally by calling the underlying PDF's createIntegral() method with named ranges created for each of the bins. This means that if the analytical integrals for the underlying PDF are available, they will be used.

The constructor for this class requires a RooAbsReal (eg any RooAbsPdf) along with a list of RooRealVars (the parameters, excluding the observable \\(x\\)),

RooParametricShapeBinPdf(const char *name, const char *title,  RooAbsReal& _pdf, RooAbsReal& _x, RooArgList& _pars, const TH1 &_shape )\n
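
As a minimal sketch (assuming the Combine libraries are loaded in PyROOT; the variable and object names below are purely illustrative), a RooExponential could be wrapped with a three-bin histogram that defines the binning as follows:

import ROOT\nfrom array import array\n\n# Observable and shape parameter (illustrative names and ranges)\nx = ROOT.RooRealVar('x', 'x', 0.0, 100.0)\np = ROOT.RooRealVar('p', 'slope', -0.02, -1.0, 0.0)\nexp_pdf = ROOT.RooExponential('exp_pdf', 'exp_pdf', x, p)\n\n# Histogram whose (irregular) binning the wrapped PDF will integrate over\nedges = array('d', [0.0, 50.0, 90.0, 100.0])\nshape_hist = ROOT.TH1F('shape_hist', 'binning template', len(edges) - 1, edges)\n\n# Wrap the PDF so that per-bin integrals are used instead of bin-centre values\nbinned_pdf = ROOT.RooParametricShapeBinPdf('binned_pdf', 'binned exponential', exp_pdf, x, ROOT.RooArgList(p), shape_hist)\n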

Below is a comparison of a fit to a binned dataset containing 1000 events with one observable \\(0 \\leq x \\leq 100\\). The fit function is a RooExponential of the form \\(e^{xp}\\).

In the upper plot, the data are binned in 100 evenly-spaced bins, while in the lower plot, there are three irregular bins. The blue lines show the result of the fit when using the RooExponential directly, while the red lines show the result when wrapping the PDF inside a RooParametricShapeBinPdf. In the narrow binned case, the two agree well, while for wide bins, accounting for the integral over the bin yields a better fit.

You should note that using this class will result in slower fits, so you should first decide whether the added accuracy is enough to justify the reduced efficiency.

"},{"location":"part2/settinguptheanalysis/#beyond-simple-datacards","title":"Beyond simple datacards","text":"

Datacards can be extended in order to provide additional functionality and flexibility during runtime. These can also allow for the production of more complicated models and for producing more advanced results.

"},{"location":"part2/settinguptheanalysis/#rate-parameters","title":"Rate parameters","text":"

The overall expected rate of a particular process in a particular bin does not necessarily need to be a fixed quantity. Scale factors can be introduced to modify the rate directly in the datacards for ANY type of analysis. This can be achieved using the directive rateParam in the datacard with the following syntax,

name rateParam bin process initial_value [min,max]\n

The [min,max] argument is optional. If it is not included, Combine will remove the range of this parameter. The directive will produce a new parameter in the model (unless it already exists), which multiplies the rate of that particular process in the given bin by its value.

You can attach the same rateParam to multiple processes/bins by either using a wild card (eg * will match everything, QCD_* will match everything starting with QCD_, etc.) in the name of the bin and/or process, or by repeating the rateParam line in the datacard for different bins/processes with the same name.

Warning

rateParam is not a shortcut to evaluate the post-fit yield of a process, since other nuisance parameters can also change the normalization. E.g., finding that the rateParam best-fit value is 0.9 does not necessarily imply that the process yield is 0.9 times the initial yield. The best approach is to evaluate the yield, taking into account the values of all nuisance parameters, using --saveNormalizations.

This parameter is, by default, freely floating. It is possible to include a Gaussian constraint on any rateParam that is floating (i.e not a formula or spline) by adding a param nuisance line in the datacard with the same name.
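
For example (hypothetical parameter and process names), a floating scale on a process bkg could be constrained to 10% around 1.0 with:

scale_bkg rateParam * bkg 1.0\nscale_bkg param 1.0 0.1\n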

In addition to rate modifiers that are freely floating, modifiers that are functions of other parameters can be included using the following syntax,

name rateParam bin process formula args\n

where args is a comma-separated list of the arguments for the string formula. You can include other nuisance parameters in the formula, including ones that are Gaussian constrained (i.e. via the param directive).

Below is an example datacard that uses the rateParam directive to implement an ABCD-like method in Combine. For a more realistic description of its use for ABCD, see the single-lepton SUSY search implementation described here.

imax 4  number of channels\njmax 0  number of processes -1\nkmax *  number of nuisance parameters (sources of systematical uncertainties)\n-------\nbin                   B      C       D        A\nobservation           50    100      500      10\n-------\nbin                   B      C       D        A\nprocess               bkg    bkg     bkg      bkg\nprocess               1      1       1         1\nrate                  1      1       1         1\n-------\n\nalpha rateParam A bkg (@0*@1/@2) beta,gamma,delta\nbeta  rateParam B bkg 50\ngamma rateParam C bkg 100\ndelta rateParam D bkg 500\n

For more examples of using rateParam (eg for fitting process normalizations in control regions and signal regions simultaneously) see this 2016 CMS tutorial

Finally, any pre-existing RooAbsReal inside some ROOT file with a workspace can be imported using the following:

name rateParam bin process rootfile:workspacename\n

The name should correspond to the name of the object that is being picked up inside the RooWorkspace. A simple example using the SM XS and BR splines available in HiggsAnalysis/CombinedLimit can be found under data/tutorials/rate_params/simple_sm_datacard.txt

"},{"location":"part2/settinguptheanalysis/#extra-arguments","title":"Extra arguments","text":"

If a parameter is intended to be used, and it is not a user-defined param or rateParam, it can be picked up by first issuing an extArg directive before this line in the datacard. The syntax for extArg is:

name extArg rootfile:workspacename\n

The string \":RecycleConflictNodes\" can be added at the end of the final argument (i.e. rootfile:workspacename:RecycleConflictNodes) to apply the corresponding RooFit option when the object is imported into the workspace. It is also possible to simply add a RooRealVar using extArg for use in function rateParams with the following

name extArg init [min,max]\n

Note that the [min,max] argument is optional and if not included, the code will remove the range of this parameter.
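
As an illustrative sketch (hypothetical names), a RooRealVar created via extArg can then be referenced as an argument of a function rateParam:

lumiscale extArg 1.0 [0.9,1.1]\nbkg_norm_scaled rateParam * bkg (@0) lumiscale\n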

"},{"location":"part2/settinguptheanalysis/#manipulation-of-nuisance-parameters","title":"Manipulation of Nuisance parameters","text":"

It can often be useful to modify datacards, or the runtime behavior, without having to modify individual systematic lines. This can be achieved through nuisance parameter modifiers.

"},{"location":"part2/settinguptheanalysis/#nuisance-modifiers","title":"Nuisance modifiers","text":"

If a nuisance parameter needs to be renamed for certain processes/channels, it can be done using a single nuisance edit directive at the end of a datacard

nuisance edit rename process channel oldname newname [options]\n

Note that the wildcard (*) can be used for either a process, a channel, or both. This will have the effect that nuisance parameters affecting a given process/channel will be renamed, thereby de-correlating them between processes/channels. Use the option ifexists to skip/avoid an error if the nuisance parameter is not found. This kind of command will only affect nuisances of the type shape[N], lnN. Instead, if you also want to change the names of param type nuisances, you can use a global version

nuisance edit rename oldname newname\n

which will rename all shape[N], lnN and param nuisances found in one go. You should make sure these commands come after any process/channel specific ones in the datacard. This version does not accept options.
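
For illustration (hypothetical names), the process/channel-specific form could be used to decorrelate a parameter jes in channel ch1 from the other channels:

nuisance edit rename * ch1 jes jes_ch1 ifexists\n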

Other edits are also supported, as follows:

The above edits (excluding the renaming) support nuisance parameters of the types shape[N], lnN, lnU, gmN, param, flatParam, rateParam, or discrete.

"},{"location":"part2/settinguptheanalysis/#groups-of-nuisances","title":"Groups of nuisances","text":"

Often it is desirable to freeze one or more nuisance parameters to check the impact they have on limits, likelihood scans, significances etc.

However, for large groups of nuisance parameters (eg everything associated to theory) it is easier to define nuisance groups in the datacard. The following line in a datacard will, for example, produce a group of nuisance parameters with the group name theory that contains two parameters, QCDscale and pdf.

theory group = QCDscale pdf\n

Multiple groups can be defined in this way. It is also possible to extend nuisance parameter groups in datacards using += in place of =.
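
For example, a (hypothetical) parameter alphaS could be appended to the theory group defined above with:

theory group += alphaS\n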

These groups can be manipulated at runtime (e.g. for freezing all nuisance parameters associated to a group at runtime, see Running the tool). You can find more info on groups of nuisances here

Note that when using the automatic addition of statistical uncertainties (autoMCStats), the corresponding nuisance parameters are created by text2workspace.py and so do not exist in the datacards. It is therefore not possible to add autoMCStats parameters to groups of nuisances in the way described above. However, text2workspace.py will automatically create a group labelled autoMCStats, which contains all autoMCStats parameters.

This group is useful for freezing all parameters created by autoMCStats. For freezing subsets of the parameters, the regular expression features can be used. For example, if the datacard contains two categories, cat_label_1 and cat_label_2, the autoMCStats parameters created for category cat_label_1 alone can be frozen using --freezeParameters 'rgx{prop_bincat_label_1_bin.*}'.

"},{"location":"part2/settinguptheanalysis/#combination-of-multiple-datacards","title":"Combination of multiple datacards","text":"

If you have separate channels, each with their own datacard, it is possible to produce a combined datacard using the script combineCards.py

The syntax is simple: combineCards.py Name1=card1.txt Name2=card2.txt ... > card.txt. If the input datacards had just one bin each, the output channels will be called Name1, Name2, and so on. Otherwise, a prefix Name1_ ... Name2_ will be added to the bin labels in each datacard. The supplied bin names Name1, Name2, etc. must themselves conform to valid C++/python identifier syntax.
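
For instance, two channels with hypothetical card names could be combined as:

combineCards.py ee=card_ee.txt mumu=card_mumu.txt > card_combined.txt\n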

Warning

When combining datacards, you should keep in mind that systematic uncertainties that have different names will be assumed to be uncorrelated, and those with the same name will be assumed 100% correlated. An uncertainty correlated across channels must have the same PDF in all cards (i.e. always lnN, or always gmN with the same N; note that shape and lnN can be interchanged via the shape? directive). Furthermore, when using parametric models, \"parameter\" objects such as RooRealVar, RooAbsReal, and RooAbsCategory (parameters, PDF indices, etc.) with the same name will be assumed to be the same object. If this is not intended, you may encounter unexpected behaviour, such as the order of combining cards having an impact on the results. Make sure that such objects are named differently in your inputs if they represent different things! In contrast, Combine will try to rename other \"shape\" objects (such as PDFs) automatically.

The combineCards.py script will fail if you are trying to combine a shape datacard with a counting datacard. You can however convert a counting datacard into an equivalent shape-based one by adding a line shapes * * FAKE in the datacard after the imax, jmax, and kmax section. Alternatively, you can add the option -S to combineCards.py, which will do this for you while creating the combined datacard.

"},{"location":"part2/settinguptheanalysis/#automatic-production-of-datacards-and-workspaces","title":"Automatic production of datacards and workspaces","text":"

For complicated analyses or cases in which multiple datacards are needed (e.g. optimization studies), you can avoid writing these by hand. The object Datacard defines the analysis and can be created as a python object. The template python script below will produce the same workspace as running text2workspace.py (see the section on Physics Models) on the realistic-counting-experiment.txt datacard.

from HiggsAnalysis.CombinedLimit.DatacardParser import *\nfrom HiggsAnalysis.CombinedLimit.ModelTools import *\nfrom HiggsAnalysis.CombinedLimit.ShapeTools import *\nfrom HiggsAnalysis.CombinedLimit.PhysicsModel import *\n\nfrom sys import exit\nfrom optparse import OptionParser\nparser = OptionParser()\naddDatacardParserOptions(parser)\noptions,args = parser.parse_args()\noptions.bin = True # make a binary workspace\n\nDC = Datacard()\nMB = None\n\n############## Setup the datacard (must be filled in) ###########################\n\nDC.bins =   ['bin1'] # <type 'list'>\nDC.obs =    {'bin1': 0.0} # <type 'dict'>\nDC.processes =  ['ggH', 'qqWW', 'ggWW', 'others'] # <type 'list'>\nDC.signals =    ['ggH'] # <type 'list'>\nDC.isSignal =   {'qqWW': False, 'ggWW': False, 'ggH': True, 'others': False} # <type 'dict'>\nDC.keyline =    [('bin1', 'ggH', True), ('bin1', 'qqWW', False), ('bin1', 'ggWW', False), ('bin1', 'others', False)] # <type 'list'>\nDC.exp =    {'bin1': {'qqWW': 0.63, 'ggWW': 0.06, 'ggH': 1.47, 'others': 0.22}} # <type 'dict'>\nDC.systs =  [('lumi', False, 'lnN', [], {'bin1': {'qqWW': 0.0, 'ggWW': 1.11, 'ggH': 1.11, 'others': 0.0}}), ('xs_ggH', False, 'lnN', [], {'bin1': {'qqWW': 0.0, 'ggWW': 0.0, 'ggH': 1.16, 'others': 0.0}}), ('WW_norm', False, 'gmN', [4], {'bin1': {'qqWW': 0.16, 'ggWW': 0.0, 'ggH': 0.0, 'others': 0.0}}), ('xs_ggWW', False, 'lnN', [], {'bin1': {'qqWW': 0.0, 'ggWW': 1.5, 'ggH': 0.0, 'others': 0.0}}), ('bg_others', False, 'lnN', [], {'bin1': {'qqWW': 0.0, 'ggWW': 0.0, 'ggH': 0.0, 'others': 1.3}})] # <type 'list'>\nDC.shapeMap =   {} # <type 'dict'>\nDC.hasShapes =  False # <type 'bool'>\nDC.flatParamNuisances =  {} # <type 'dict'>\nDC.rateParams =  {} # <type 'dict'>\nDC.extArgs =    {} # <type 'dict'>\nDC.rateParamsOrder  =  set([]) # <type 'set'>\nDC.frozenNuisances  =  set([]) # <type 'set'>\nDC.systematicsShapeMap =  {} # <type 'dict'>\nDC.nuisanceEditLines    =  [] # <type 'list'>\nDC.groups   =  {} # <type 'dict'>\nDC.discretes    =  [] # <type 'list'>\n\n\n###### User defined options #############################################\n\noptions.out      = \"combine_workspace.root\"     # Output workspace name\noptions.fileName = \"./\"             # Path to input ROOT files\noptions.verbose  = \"1\"              # Verbosity\n\n##########################################################################\n\nif DC.hasShapes:\n    MB = ShapeBuilder(DC, options)\nelse:\n    MB = CountingModelBuilder(DC, options)\n\n# Set physics models\nMB.setPhysics(defaultModel)\nMB.doModel()\n

Any existing datacard can be converted into such a template python script by using the --dump-datacard option in text2workspace.py, in case a more complicated template is needed.

Warning

The above is not advised for final results, as this script is not easily combined with other analyses, so it should only be used for internal studies.

For the automatic generation of datacards that are combinable, you should instead use the CombineHarvester package, which includes many features for producing complex datacards in a reliable, automated way.

"},{"location":"part2/settinguptheanalysis/#sanity-checking-the-datacard","title":"Sanity checking the datacard","text":"

For large combinations with multiple channels/processes etc, the .txt file can get unwieldy to read through. There are some simple tools to help check and disseminate the contents of the cards.

In order to get a quick view of the systematic uncertainties included in the datacard, you can use the test/systematicsAnalyzer.py tool. This will produce a list of the systematic uncertainties (normalization and shape), indicating what type they are, which channels/processes they affect and the size of the effect on the normalization (for shape uncertainties, this will just be the overall uncertainty on the normalization).

The default output is a .html file that can be expanded to give more details about the effect of the systematic uncertainty for each channel/process. Add the option --format brief to obtain a simpler summary report direct to the terminal. An example output for the tutorial card data/tutorials/shapes/simple-shapes-TH1.txt is shown below.

$ python test/systematicsAnalyzer.py data/tutorials/shapes/simple-shapes-TH1.txt --all -f html > out.html\n

This will produce the following output in html format:

Nuisance Report\n\nNuisance (types)    Range           Processes             Channels\nlumi (lnN)          1.000 - 1.100   background, signal    bin1(1): signal(1.1), background(1.0)\nalpha (shape)       1.111 - 1.150   background            bin1(1): background(0.900/1.150 (shape))\nbgnorm (lnN)        1.000 - 1.300   background, signal    bin1(1): signal(1.0), background(1.3)\nsigma (shape)       1.000 - 1.000   signal                bin1(1): signal(1.000/1.000 (shape))\n

In case you only have a counting experiment datacard, include the option --noshape.

If you have a datacard that uses several rateParams or a Physics model that includes a complicated product of normalization terms in each process, you can check the values of the normalization (and which objects in the workspace comprise them) using the test/printWorkspaceNormalisations.py tool. As an example, the first few blocks of output for the tutorial card data/tutorials/counting/realistic-multi-channel.txt are given below:

Show example output
\n$ text2workspace.py data/tutorials/shapes/simple-shapes-parametric.txt -m 30\n$ python test/printWorkspaceNormalisations.py data/tutorials/counting/realistic-multi-channel.root                                                                                                           \n\n---------------------------------------------------------------------------\n---------------------------------------------------------------------------\nChannel - mu_tau\n---------------------------------------------------------------------------\n  Top-level normalisation for process ZTT -> n_exp_binmu_tau_proc_ZTT\n  -------------------------------------------------------------------------\nDumping ProcessNormalization n_exp_binmu_tau_proc_ZTT @ 0x6bbb610\n    nominal value: 329\n    log-normals (3):\n         kappa = 1.23, logKappa = 0.207014, theta = tauid = 0\n         kappa = 1.04, logKappa = 0.0392207, theta = ZtoLL = 0\n         kappa = 1.04, logKappa = 0.0392207, theta = effic = 0\n    asymm log-normals (0):\n    other terms (0):\n\n  -------------------------------------------------------------------------\n  default value =  329.0\n---------------------------------------------------------------------------\n  Top-level normalisation for process QCD -> n_exp_binmu_tau_proc_QCD\n  -------------------------------------------------------------------------\nDumping ProcessNormalization n_exp_binmu_tau_proc_QCD @ 0x6bbcaa0\n    nominal value: 259\n    log-normals (1):\n         kappa = 1.1, logKappa = 0.0953102, theta = QCDmu = 0\n    asymm log-normals (0):\n    other terms (0):\n\n  -------------------------------------------------------------------------\n  default value =  259.0\n---------------------------------------------------------------------------\n  Top-level normalisation for process higgs -> n_exp_binmu_tau_proc_higgs\n  -------------------------------------------------------------------------\nDumping ProcessNormalization n_exp_binmu_tau_proc_higgs @ 0x6bc6390\n    nominal value: 0.57\n    log-normals (3):\n         kappa = 1.11, logKappa = 0.10436, theta = lumi = 0\n         kappa = 1.23, logKappa = 0.207014, theta = tauid = 0\n         kappa = 1.04, logKappa = 0.0392207, theta = effic = 0\n    asymm log-normals (0):\n    other terms (1):\n         term r (class RooRealVar), value = 1\n\n  -------------------------------------------------------------------------\n  default value =  0.57\n---------------------------------------------------------------------------\n---------------------------------------------------------------------------\nChannel - e_mu\n---------------------------------------------------------------------------\n  Top-level normalisation for process ZTT -> n_exp_bine_mu_proc_ZTT\n  -------------------------------------------------------------------------\nDumping ProcessNormalization n_exp_bine_mu_proc_ZTT @ 0x6bc8910\n    nominal value: 88\n    log-normals (2):\n         kappa = 1.04, logKappa = 0.0392207, theta = ZtoLL = 0\n         kappa = 1.04, logKappa = 0.0392207, theta = effic = 0\n    asymm log-normals (0):\n    other terms (0):\n\n  -------------------------------------------------------------------------\n  default value =  88.0\n---------------------------------------------------------------------------\n

As you can see, for each channel, a report is given for the top-level rate object in the workspace, for each process contributing to that channel. You can also see the various terms that make up that rate. The default value corresponds to the default values of the parameters in the workspace (i.e. the values set when running text2workspace).

Another example is shown below for the workspace produced from the data/tutorials/shapes/simple-shapes-parametric.txt datacard.

Show example output
\n  text2workspace.py data/tutorials/shapes/simple-shapes-parametric.txt\n  python test/printWorkspaceNormalisations.py data/tutorials/shapes/simple-shapes-parametric.root\n  ...\n\n  ---------------------------------------------------------------------------\n  ---------------------------------------------------------------------------\n  Channel - bin1\n  ---------------------------------------------------------------------------\n    Top-level normalisation for process bkg -> n_exp_final_binbin1_proc_bkg\n    -------------------------------------------------------------------------\n  RooProduct::n_exp_final_binbin1_proc_bkg[ n_exp_binbin1_proc_bkg * shapeBkg_bkg_bin1__norm ] = 521.163\n   ... is a product, which contains  n_exp_binbin1_proc_bkg\n  RooRealVar::n_exp_binbin1_proc_bkg = 1 C  L(-INF - +INF)\n    -------------------------------------------------------------------------\n    default value =  521.163204829\n  ---------------------------------------------------------------------------\n    Top-level normalisation for process sig -> n_exp_binbin1_proc_sig\n    -------------------------------------------------------------------------\n  Dumping ProcessNormalization n_exp_binbin1_proc_sig @ 0x464f700\n      nominal value: 1\n      log-normals (1):\n           kappa = 1.1, logKappa = 0.0953102, theta = lumi = 0\n      asymm log-normals (0):\n      other terms (1):\n           term r (class RooRealVar), value = 1\n\n    -------------------------------------------------------------------------\n    default value =  1.0\n

This tells us that the normalization for the background process, named n_exp_final_binbin1_proc_bkg, is a product of two objects, n_exp_binbin1_proc_bkg * shapeBkg_bkg_bin1__norm. The first object is just from the rate line in the datacard (equal to 1) and the second is a floating parameter. For the signal, the normalisation is called n_exp_binbin1_proc_sig and is a ProcessNormalization object that contains the rate modifications due to the systematic uncertainties. You can see that it also has a \"nominal value\", which again is just the value given in the rate line of the datacard (again equal to 1).

"},{"location":"part3/commonstatsmethods/","title":"Common Statistical Methods","text":"

In this section, the most commonly used statistical methods from Combine will be covered, including specific instructions on how to obtain limits, significances, and likelihood scans. For all of these methods, the assumed parameter of interest (POI) is the overall signal strength \(r\) (i.e. the default PhysicsModel). In general, however, the first POI in the list of POIs (as defined by the PhysicsModel) will be taken instead of r. This may or may not make sense for any particular method, so care must be taken.

This section will assume that you are using the default physics model, unless otherwise specified.

"},{"location":"part3/commonstatsmethods/#asymptotic-frequentist-limits","title":"Asymptotic Frequentist Limits","text":"

The AsymptoticLimits method can be used to quickly compute an estimate of the observed and expected limits, which is accurate when the event yields are not too small and the systematic uncertainties do not play a major role in the result. The limit calculation relies on an asymptotic approximation of the distributions of the LHC test statistic, which is based on a profile likelihood ratio, under the signal and background hypotheses to compute two p-values \\(p_{\\mu}, p_{b}\\) and therefore \\(CL_s=p_{\\mu}/(1-p_{b})\\) (see the FAQ section for a description). This means it is the asymptotic approximation for evaluating limits with frequentist toys using the LHC test statistic for limits. In the definition below, the parameter \\(\\mu=r\\).

This method is the default Combine method: if you call Combine without specifying -M, the AsymptoticLimits method will be run.

A realistic example of a datacard for a counting experiment can be found in the HiggsCombination package: data/tutorials/counting/realistic-counting-experiment.txt

The AsymptoticLimits method can be run using

combine -M AsymptoticLimits realistic-counting-experiment.txt\n

The program will print the limit on the signal strength r (number of signal events / number of expected signal events), e.g. Observed Limit: r < 1.6281 @ 95% CL, the median expected limit Expected 50.0%: r < 2.3281, and the edges of the 68% and 95% ranges for the expected limits.

 <<< Combine >>>\n>>> including systematics\n>>> method used to compute upper limit is AsymptoticLimits\n[...]\n -- AsymptoticLimits ( CLs ) --\nObserved Limit: r < 1.6281\nExpected  2.5%: r < 0.9640\nExpected 16.0%: r < 1.4329\nExpected 50.0%: r < 2.3281\nExpected 84.0%: r < 3.9800\nExpected 97.5%: r < 6.6194\n\nDone in 0.01 min (cpu), 0.01 min (real)\n

By default, the limits are calculated using the CLs prescription, as noted in the output, which takes the ratio of p-values under the signal plus background and background-only hypotheses. This can be changed to the strict p-value \(p_{\mu}\) by using the option --rule CLsplusb (note that CLsplusb is the jargon for calculating the p-value \(p_{\mu}\)). You can also change the confidence level (default is 95%) to 90% using the option --cl 0.9 or any other confidence level. You can find the full list of options for AsymptoticLimits using --help -M AsymptoticLimits.

Warning

You may find that Combine issues a warning that the best fit for the background-only Asimov dataset returns a nonzero value for the signal strength;

WARNING: Best fit of asimov dataset is at r = 0.220944 (0.011047 times rMax), while it should be at zero

If this happens, you should check to make sure that there are no issues with the datacard or the Asimov generation used for your setup. For details on debugging, it is recommended that you follow the simple checks used by the HIG PAG here.

The program will also create a ROOT file higgsCombineTest.AsymptoticLimits.mH120.root containing a ROOT tree limit that contains the limit values and other bookkeeping information. The important columns are limit (the limit value) and quantileExpected (-1 for observed limit, 0.5 for median expected limit, 0.16/0.84 for the edges of the 68% interval band of expected limits, 0.025/0.975 for the 95% band).

$ root -l higgsCombineTest.AsymptoticLimits.mH120.root\nroot [0] limit->Scan(\"*\")\n************************************************************************************************************************************\n*    Row   *     limit *  limitErr *        mh *      syst *      iToy *     iSeed *  iChannel *     t_cpu *    t_real * quantileE *\n************************************************************************************************************************************\n*        0 * 0.9639892 *         0 *       120 *         1 *         0 *    123456 *         0 *         0 *         0 * 0.0250000 *\n*        1 * 1.4329109 *         0 *       120 *         1 *         0 *    123456 *         0 *         0 *         0 * 0.1599999 *\n*        2 *  2.328125 *         0 *       120 *         1 *         0 *    123456 *         0 *         0 *         0 *       0.5 *\n*        3 * 3.9799661 *         0 *       120 *         1 *         0 *    123456 *         0 *         0 *         0 * 0.8399999 *\n*        4 * 6.6194028 *         0 *       120 *         1 *         0 *    123456 *         0 *         0 *         0 * 0.9750000 *\n*        5 * 1.6281188 * 0.0050568 *       120 *         1 *         0 *    123456 *         0 * 0.0035000 * 0.0055123 *        -1 *\n************************************************************************************************************************************\n
"},{"location":"part3/commonstatsmethods/#blind-limits","title":"Blind limits","text":"

The AsymptoticLimits calculation follows the frequentist paradigm for calculating expected limits. This means that the routine will first fit the observed data, conditionally for a fixed value of r, and set the nuisance parameters to the values obtained in the fit for generating the Asimov data set. This means it calculates the post-fit or a-posteriori expected limit. In order to use the pre-fit nuisance parameters (to calculate an a-priori limit), you must add the option --noFitAsimov or --bypassFrequentistFit.

For blinding the results completely (i.e not using the data) you can include the option --run blind.

Warning

While blind limits can also be obtained using -t -1, provided the correct options are passed, we strongly recommend using --run blind.

"},{"location":"part3/commonstatsmethods/#splitting-points","title":"Splitting points","text":"

In case your model is particularly complex, you can perform the asymptotic calculation by determining the value of CLs for a fixed grid of points (in r) and merging the results. This is done by using the option --singlePoint X for multiple values of X, hadd'ing the output files and reading them back in,

combine -M AsymptoticLimits realistic-counting-experiment.txt --singlePoint 0.1 -n 0.1\ncombine -M AsymptoticLimits realistic-counting-experiment.txt --singlePoint 0.2 -n 0.2\ncombine -M AsymptoticLimits realistic-counting-experiment.txt --singlePoint 0.3 -n 0.3\n...\n\nhadd limits.root higgsCombine*.AsymptoticLimits.*\n\ncombine -M AsymptoticLimits realistic-counting-experiment.txt --getLimitFromGrid limits.root\n
"},{"location":"part3/commonstatsmethods/#asymptotic-significances","title":"Asymptotic Significances","text":"

The significance of a result is calculated using a ratio of profiled likelihoods, one in which the signal strength is set to 0 and the other in which it is free to float. The evaluated quantity is \\(-2\\ln[\\mathcal{L}(\\mu=0,\\hat{\\hat{\\nu}}(0))/\\mathcal{L}(\\hat{\\mu},\\hat{\\nu})]\\), in which the nuisance parameters are profiled separately for \\(\\mu=\\hat{\\mu}\\) and \\(\\mu=0\\).

The distribution of this test statistic can be determined using Wilks' theorem provided the number of events is large enough (i.e in the Asymptotic limit). The significance (or p-value) can therefore be calculated very quickly. The Significance method can be used for this.

It is also possible to calculate the ratio of likelihoods between the freely floating signal strength and a fixed signal strength other than 0, by specifying it with the option --signalForSignificance=X.

Info

This calculation assumes that the signal strength can only be positive (i.e we are not interested in negative signal strengths). This behaviour can be altered by including the option --uncapped.

"},{"location":"part3/commonstatsmethods/#compute-the-observed-significance","title":"Compute the observed significance","text":"

The observed significance is calculated using the Significance method, as

combine -M Significance datacard.txt

The printed output will report the significance and the p-value. For example, when using the realistic-counting-experiment.txt datacard, you will see

 <<< Combine >>>\n>>> including systematics\n>>> method used is Significance\n[...]\n -- Significance --\nSignificance: 0\n       (p-value = 0.5)\nDone in 0.00 min (cpu), 0.01 min (real)\n

which is not surprising since 0 events were observed in that datacard.

The output ROOT file will contain the significance value in the branch limit. To store the p-value instead, include the option --pval. The significance and p-value can be converted between one another using the RooStats functions RooStats::PValueToSignificance and RooStats::SignificanceToPValue.
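
As a quick cross-check (a minimal sketch, assuming a ROOT installation with RooStats available):

import ROOT\n\n# one-sided p-value <-> significance conversion\nz = ROOT.RooStats.PValueToSignificance(0.05)   # roughly 1.64\np = ROOT.RooStats.SignificanceToPValue(z)      # roughly 0.05 again\nprint(z, p)\n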

When calculating the significance, you may find it useful to resort to a brute-force fitting algorithm that scans the nll (repeating fits until a certain tolerance is reached), bypassing MINOS, which can be activated with the option bruteForce. This can be tuned using the options setBruteForceAlgo, setBruteForceTypeAndAlgo and setBruteForceTolerance.

"},{"location":"part3/commonstatsmethods/#computing-the-expected-significance","title":"Computing the expected significance","text":"

The expected significance can be computed from an Asimov data set of signal+background. There are two options for this: an a-priori expected significance, computed from a pre-fit Asimov dataset, and an a-posteriori expected significance, computed from a post-fit Asimov dataset.

The a-priori expected significance from the Asimov dataset is calculated as

combine -M Significance datacard.txt -t -1 --expectSignal=1\n

In order to produce the a-posteriori expected significance, just generate a post-fit Asimov data set by adding the option --toysFreq in the command above.
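
That is, with the same assumptions as above:

combine -M Significance datacard.txt -t -1 --expectSignal=1 --toysFreq\n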

The output format is the same as for observed significances: the variable limit in the tree will be filled with the significance (or with the p-value if you also include the option --pvalue).

"},{"location":"part3/commonstatsmethods/#bayesian-limits-and-credible-regions","title":"Bayesian Limits and Credible regions","text":"

Bayesian calculation of limits requires the user to assume a particular prior distribution for the parameter of interest (default r). You can specify the prior using the --prior option; the default is a flat prior in r.

"},{"location":"part3/commonstatsmethods/#computing-the-observed-bayesian-limit-for-simple-models","title":"Computing the observed bayesian limit (for simple models)","text":"

The BayesianSimple method computes a Bayesian limit performing classical numerical integration. This is very fast and accurate, but only works for simple models (a few channels and nuisance parameters).

combine -M BayesianSimple simple-counting-experiment.txt\n[...]\n\n -- BayesianSimple --\nLimit: r < 0.672292 @ 95% CL\nDone in 0.04 min (cpu), 0.05 min (real)\n

The output tree will contain a single entry corresponding to the observed 95% confidence level upper limit. The confidence level can be modified to 100*X% using --cl X.

"},{"location":"part3/commonstatsmethods/#computing-the-observed-bayesian-limit-for-arbitrary-models","title":"Computing the observed bayesian limit (for arbitrary models)","text":"

The MarkovChainMC method computes a Bayesian limit performing a Monte Carlo integration. From the statistical point of view it is identical to the BayesianSimple method, only the technical implementation is different. The method is slower, but can also handle complex models. For this method you can increase the accuracy of the result by increasing the number of Markov Chains, at the expense of a longer running time (option --tries, default is 10). Let's use the realistic counting experiment datacard to test the method.

To use the MarkovChainMC method, users need to specify this method in the command line, together with the options they want to use. For instance, to set the number of times the algorithm will run with different random seeds, use option --tries:

combine -M MarkovChainMC realistic-counting-experiment.txt --tries 100\n[...]\n\n -- MarkovChainMC --\nLimit: r < 2.20438 +/- 0.0144695 @ 95% CL (100 tries)\nAverage chain acceptance: 0.078118\nDone in 0.14 min (cpu), 0.15 min (real)\n

Again, the resulting limit tree will contain the result. You can also save the chains using the option --saveChain, which will then also be included in the output file.

Exclusion regions can be made from the posterior once an ordering principle is defined to decide how to grow the contour (there is an infinite number of possible regions that contain 68% of the posterior pdf). Below is a simple example script that can be used to plot the posterior distribution from these chains and calculate the smallest such region. Note that in this example we are ignoring the burn-in. This can be added by e.g. changing for i in range(mychain.numEntries()): to for i in range(200,mychain.numEntries()): for a burn-in of 200.

Show example script
\nimport ROOT\n\nrmin = 0\nrmax = 30\nnbins = 100\nCL = 0.95\nchains = \"higgsCombineTest.MarkovChainMC.blahblahblah.root\"\n\ndef findSmallestInterval(hist,CL):\n bins = hist.GetNbinsX()\n best_i = 1\n best_j = 1\n bd = bins+1\n val = 0\n for i in range(1,bins+1):\n   integral = hist.GetBinContent(i)\n   for j in range(i+1,bins+2):\n    integral += hist.GetBinContent(j)\n    if integral > CL :\n      val = integral\n      break\n   if integral > CL and  j-i < bd :\n     bd = j-i\n     best_j = j+1\n     best_i = i\n     val = integral\n return hist.GetBinLowEdge(best_i), hist.GetBinLowEdge(best_j), val\n\nfi_MCMC = ROOT.TFile.Open(chains)\n# Sum up all of the chains (or we could take the average limit)\nmychain = 0\nfor k in fi_MCMC.Get(\"toys\").GetListOfKeys():\n    if mychain == 0:\n        mychain = k.ReadObj().GetAsDataSet()\n    else :\n        mychain.append(k.ReadObj().GetAsDataSet())\nhist = ROOT.TH1F(\"h_post\",\";r;posterior probability\",nbins,rmin,rmax)\nfor i in range(mychain.numEntries()):\n#for i in range(200,mychain.numEntries()): burn-in of 200\n  mychain.get(i)\n  hist.Fill(mychain.get(i).getRealValue(\"r\"), mychain.weight())\nhist.Scale(1./hist.Integral())\nhist.SetLineColor(1)\nvl,vu,trueCL = findSmallestInterval(hist,CL)\nhistCL = hist.Clone()\nfor b in range(nbins):\n  if histCL.GetBinLowEdge(b+1) < vl or histCL.GetBinLowEdge(b+2)>vu: histCL.SetBinContent(b+1,0)\nc6a = ROOT.TCanvas()\nhistCL.SetFillColor(ROOT.kAzure-3)\nhistCL.SetFillStyle(1001)\nhist.Draw()\nhistCL.Draw(\"histFsame\")\nhist.Draw(\"histsame\")\nll = ROOT.TLine(vl,0,vl,2*hist.GetBinContent(hist.FindBin(vl))); ll.SetLineColor(2); ll.SetLineWidth(2)\nlu = ROOT.TLine(vu,0,vu,2*hist.GetBinContent(hist.FindBin(vu))); lu.SetLineColor(2); lu.SetLineWidth(2)\nll.Draw()\nlu.Draw()\n\nprint(\" %g %% (%g %%) interval (target)  = %g < r < %g \" % (trueCL, CL, vl, vu))\n

Running the script on the output file produced for the same datacard (including the --saveChain option) will produce the following output

0.950975 % (0.95 %) interval (target)  = 0 < r < 2.2\n

along with a plot of the posterior distribution shown below. This is the same as the output from Combine, but the script can also be used to find lower limits (for example) or credible intervals.

An example to make contours when ordering by probability density can be found in bayesContours.cxx. Note that the implementation is simplistic, with no clever handling of bin sizes nor smoothing of statistical fluctuations.

The MarkovChainMC algorithm has many configurable parameters, and you are encouraged to experiment with those. The default configuration might not be the best for your analysis.

"},{"location":"part3/commonstatsmethods/#iterations-burn-in-tries","title":"Iterations, burn-in, tries","text":"

Three parameters control how the MCMC integration is performed:

"},{"location":"part3/commonstatsmethods/#proposals","title":"Proposals","text":"

The option --proposal controls the way new points are proposed to fill in the MC chain.

If you believe there is something going wrong, e.g. if your chain remains stuck after accepting only a few events, the option --debugProposal can be used to obtain a printout of the first N proposed points. This can help you understand what is happening; for example if you have a region of the phase space with probability zero, the gaus and fit proposal can get stuck there forever.

"},{"location":"part3/commonstatsmethods/#computing-the-expected-bayesian-limit","title":"Computing the expected bayesian limit","text":"

The expected limit is computed by generating many toy MC data sets and computing the limit for each of them. This can be done by passing the option -t. E.g. to run 100 toys with the BayesianSimple method, you can run

combine -M BayesianSimple datacard.txt -t 100\n

The program will print out the mean and median limit, as well as the 68% and 95% quantiles of the distributions of the limits. This time, the output ROOT tree will contain one entry per toy.

For heavier methods (e.g. the MarkovChainMC) you will probably want to split this calculation into multiple jobs. To do this, just run Combine multiple times specifying a smaller number of toys (as low as 1), using a different seed to initialize the random number generator each time. The option -s can be used for this; if you set it to -1, the starting seed will be initialized randomly at the beginning of the job. Finally, you can merge the resulting trees with hadd and look at the distribution in the merged file.
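
A sketch of such a split (the job labels are illustrative; each job uses a random starting seed):

combine -M MarkovChainMC datacard.txt -t 25 -s -1 -n job1\ncombine -M MarkovChainMC datacard.txt -t 25 -s -1 -n job2\ncombine -M MarkovChainMC datacard.txt -t 25 -s -1 -n job3\ncombine -M MarkovChainMC datacard.txt -t 25 -s -1 -n job4\n\nhadd merged.root higgsCombinejob*.MarkovChainMC.*\n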

"},{"location":"part3/commonstatsmethods/#multidimensional-bayesian-credible-regions","title":"Multidimensional bayesian credible regions","text":"

The MarkovChainMC method allows the user to produce the posterior PDF as a function of (in principle) any number of POIs. In order to do so, you first need to create a workspace with more than one parameter, as explained in the physics models section.

For example, let us use the toy datacard data/tutorials/multiDim/toy-hgg-125.txt (counting experiment that vaguely resembles an early H\u2192\u03b3\u03b3 analysis at 125 GeV) and convert the datacard into a workspace with 2 parameters, the ggH and qqH cross sections, using text2workspace.

text2workspace.py data/tutorials/multiDim/toy-hgg-125.txt -P HiggsAnalysis.CombinedLimit.PhysicsModel:floatingXSHiggs --PO modes=ggH,qqH -o workspace.root\n

Now we just run one (or more) MCMC chain(s) and save them in the output tree. By default, the nuisance parameters will be marginalized (integrated) over their PDFs. You can ignore the complaints about not being able to compute an upper limit (since for more than 1D, this is not well-defined),

combine -M MarkovChainMC workspace.root --tries 1 --saveChain -i 1000000 -m 125 -s 12345\n

The output of the Markov Chain is again a RooDataSet of weighted events distributed according to the posterior PDF (after you cut out the burn in part), so it can be used to make histograms or other distributions of the posterior PDF. See as an example bayesPosterior2D.cxx.

Below is an example of the output of the macro,

$ root -l higgsCombineTest.MarkovChainMC....\n.L bayesPosterior2D.cxx\nbayesPosterior2D(\"bayes2D\",\"Posterior PDF\")\n

"},{"location":"part3/commonstatsmethods/#computing-limits-with-toys","title":"Computing Limits with toys","text":"

The HybridNew method is used to compute either the hybrid Bayesian-frequentist limits, popularly known as \"CLs of LEP or Tevatron type\", or the fully frequentist limits, which are currently the method recommended by the LHC Higgs Combination Group. Note that these methods can be resource intensive for complex models.

It is possible to define the criterion used for setting limits using --rule CLs (to use the CLs criterion) or --rule CLsplusb (to calculate the limit using \\(p_{\\mu}\\)) and as always the confidence level desired using --cl=X.

The choice of test statistic can be made via the option --testStat. Different methodologies for the treatment of the nuisance parameters are available. While it is possible to mix different test statistics with different nuisance parameter treatments, we strongly recommend against this. Instead, one should follow one of the following three procedures. Note that the signal strength \(r\) here is given the more common notation \(\mu\).

Warning

The recommended style is the LHC-style. Please note that this method is sensitive to the observation in data since the post-fit (after a fit to the data) values of the nuisance parameters (assuming different values of r) are used when generating the toys. For completely blind limits you can first generate a pre-fit asimov toy data set (described in the toy data generation section) and use that in place of the data. You can use this toy by passing the argument -D toysFileName.root:toys/toy_asimov

While the above shortcuts are the commonly used versions, variations can be tested. The treatment of the nuisances can be changed to the so-called \"Hybrid-Bayesian\" method, which effectively integrates over the nuisance parameters. This is especially relevant when you have very few expected events in your data, and you are using those events to constrain background processes. This can be achieved by setting --generateNuisances=1 --generateExternalMeasurements=0. In case you want to avoid first fitting to the data to choose the nominal values you can additionally pass --fitNuisances=0.
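
Put together, a sketch of such a command (the datacard name is a placeholder; combine these options with your preferred test-statistic settings) would be:

combine datacard.txt -M HybridNew --generateNuisances=1 --generateExternalMeasurements=0 --fitNuisances=0\n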

Warning

If you have unconstrained parameters in your model (rateParam, or if you are using a _norm variable for a PDF) and you want to use the \"Hybrid-Bayesian\" method, you must declare these as flatParam in your datacard. When running text2workspace you must add the option --X-assign-flatParam-prior in the command line. This will create uniform priors for these parameters. These are needed for this method and they would otherwise not get created.

Info

Note that (observed and expected) values of the test statistic stored in the instances of RooStats::HypoTestResult when the option --saveHybridResult is passed are defined without the factor 2. They are therefore twice as small as the values given by the formulas above. This factor is however included automatically by all plotting scripts supplied within the Combine package. If you use your own plotting scripts, you need to make sure to incorporate the factor 2.

"},{"location":"part3/commonstatsmethods/#simple-models","title":"Simple models","text":"

For relatively simple models, the observed and expected limits can be calculated interactively. Since the LHC-style is the recommended set of options for calculating limits using toys, we will use that in this section. However, the same procedure can be followed with the other sets of options.

combine realistic-counting-experiment.txt -M HybridNew --LHCmode LHC-limits\n
Show output
 <<< Combine >>>\n>>> including systematics\n>>> using the Profile Likelihood test statistics modified for upper limits (Q_LHC)\n>>> method used is HybridNew\n>>> random number generator seed is 123456\nComputing results starting from observation (a-posteriori)\nSearch for upper limit to the limit\n  r = 20 +/- 0\n    CLs = 0 +/- 0\n    CLs      = 0 +/- 0\n    CLb      = 0.264 +/- 0.0394263\n    CLsplusb = 0 +/- 0\n\nSearch for lower limit to the limit\nNow doing proper bracketing & bisection\n  r = 10 +/- 10\n    CLs = 0 +/- 0\n    CLs      = 0 +/- 0\n    CLb      = 0.288 +/- 0.0405024\n    CLsplusb = 0 +/- 0\n\n  r = 5 +/- 5\n    CLs = 0 +/- 0\n    CLs      = 0 +/- 0\n    CLb      = 0.152 +/- 0.0321118\n    CLsplusb = 0 +/- 0\n\n  r = 2.5 +/- 2.5\n    CLs = 0.0192308 +/- 0.0139799\n    CLs = 0.02008 +/- 0.0103371\n    CLs = 0.0271712 +/- 0.00999051\n    CLs = 0.0239524 +/- 0.00783634\n    CLs      = 0.0239524 +/- 0.00783634\n    CLb      = 0.208748 +/- 0.0181211\n    CLsplusb = 0.005 +/- 0.00157718\n\n  r = 2.00696 +/- 1.25\n    CLs = 0.0740741 +/- 0.0288829\n    CLs = 0.0730182 +/- 0.0200897\n    CLs = 0.0694474 +/- 0.0166468\n    CLs = 0.0640182 +/- 0.0131693\n    CLs = 0.0595 +/- 0.010864\n    CLs = 0.0650862 +/- 0.0105575\n    CLs = 0.0629286 +/- 0.00966301\n    CLs = 0.0634945 +/- 0.00914091\n    CLs = 0.060914 +/- 0.00852667\n    CLs = 0.06295 +/- 0.00830083\n    CLs = 0.0612758 +/- 0.00778181\n    CLs = 0.0608142 +/- 0.00747001\n    CLs = 0.0587169 +/- 0.00697039\n    CLs = 0.0591432 +/- 0.00678587\n    CLs = 0.0599683 +/- 0.00666966\n    CLs = 0.0574868 +/- 0.00630809\n    CLs = 0.0571451 +/- 0.00608177\n    CLs = 0.0553836 +/- 0.00585531\n    CLs = 0.0531612 +/- 0.0055234\n    CLs = 0.0516837 +/- 0.0052607\n    CLs = 0.0496776 +/- 0.00499783\n    CLs      = 0.0496776 +/- 0.00499783\n    CLb      = 0.216635 +/- 0.00801002\n    CLsplusb = 0.0107619 +/- 0.00100693\n\nTrying to move the interval edges closer\n  r = 1.00348 +/- 0\n    CLs = 0.191176 +/- 0.0459911\n    CLs      = 0.191176 +/- 0.0459911\n    CLb      = 0.272 +/- 0.0398011\n    CLsplusb = 0.052 +/- 0.00992935\n\n  r = 1.50522 +/- 0\n    CLs = 0.125 +/- 0.0444346\n    CLs = 0.09538 +/- 0.0248075\n    CLs = 0.107714 +/- 0.0226712\n    CLs = 0.103711 +/- 0.018789\n    CLs = 0.0845069 +/- 0.0142341\n    CLs = 0.0828468 +/- 0.0126789\n    CLs = 0.0879647 +/- 0.0122332\n    CLs      = 0.0879647 +/- 0.0122332\n    CLb      = 0.211124 +/- 0.0137494\n    CLsplusb = 0.0185714 +/- 0.00228201\n\n  r = 1.75609 +/- 0\n    CLs = 0.0703125 +/- 0.0255807\n    CLs = 0.0595593 +/- 0.0171995\n    CLs = 0.0555271 +/- 0.0137075\n    CLs = 0.0548727 +/- 0.0120557\n    CLs = 0.0527832 +/- 0.0103348\n    CLs = 0.0555828 +/- 0.00998248\n    CLs = 0.0567971 +/- 0.00923449\n    CLs = 0.0581822 +/- 0.00871417\n    CLs = 0.0588835 +/- 0.00836245\n    CLs = 0.0594035 +/- 0.00784761\n    CLs = 0.0590583 +/- 0.00752672\n    CLs = 0.0552067 +/- 0.00695542\n    CLs = 0.0560446 +/- 0.00679746\n    CLs = 0.0548083 +/- 0.0064351\n    CLs = 0.0566998 +/- 0.00627124\n    CLs = 0.0561576 +/- 0.00601888\n    CLs = 0.0551643 +/- 0.00576338\n    CLs = 0.0583584 +/- 0.00582854\n    CLs = 0.0585691 +/- 0.0057078\n    CLs = 0.0599114 +/- 0.00564585\n    CLs = 0.061987 +/- 0.00566905\n    CLs = 0.061836 +/- 0.00549856\n    CLs = 0.0616849 +/- 0.0053773\n    CLs = 0.0605352 +/- 0.00516844\n    CLs = 0.0602028 +/- 0.00502875\n    CLs = 0.058667 +/- 0.00486263\n    CLs      = 0.058667 +/- 0.00486263\n    CLb      = 0.222901 +/- 0.00727258\n    CLsplusb = 0.0130769 +/- 
0.000996375\n\n  r = 2.25348 +/- 0\n    CLs = 0.0192308 +/- 0.0139799\n    CLs = 0.0173103 +/- 0.00886481\n    CLs      = 0.0173103 +/- 0.00886481\n    CLb      = 0.231076 +/- 0.0266062\n    CLsplusb = 0.004 +/- 0.001996\n\n  r = 2.13022 +/- 0\n    CLs = 0.0441176 +/- 0.0190309\n    CLs = 0.0557778 +/- 0.01736\n    CLs = 0.0496461 +/- 0.0132776\n    CLs = 0.0479048 +/- 0.0114407\n    CLs = 0.0419333 +/- 0.00925719\n    CLs = 0.0367934 +/- 0.0077345\n    CLs = 0.0339814 +/- 0.00684844\n    CLs = 0.03438 +/- 0.0064704\n    CLs = 0.0337633 +/- 0.00597315\n    CLs = 0.0321262 +/- 0.00551608\n    CLs      = 0.0321262 +/- 0.00551608\n    CLb      = 0.230342 +/- 0.0118665\n    CLsplusb = 0.0074 +/- 0.00121204\n\n  r = 2.06859 +/- 0\n    CLs = 0.0357143 +/- 0.0217521\n    CLs = 0.0381957 +/- 0.0152597\n    CLs = 0.0368622 +/- 0.0117105\n    CLs = 0.0415097 +/- 0.0106676\n    CLs = 0.0442816 +/- 0.0100457\n    CLs = 0.0376644 +/- 0.00847235\n    CLs = 0.0395133 +/- 0.0080427\n    CLs = 0.0377625 +/- 0.00727262\n    CLs = 0.0364415 +/- 0.00667827\n    CLs = 0.0368015 +/- 0.00628517\n    CLs = 0.0357251 +/- 0.00586442\n    CLs = 0.0341604 +/- 0.00546373\n    CLs = 0.0361935 +/- 0.00549648\n    CLs = 0.0403254 +/- 0.00565172\n    CLs = 0.0408613 +/- 0.00554124\n    CLs = 0.0416682 +/- 0.00539651\n    CLs = 0.0432645 +/- 0.00538062\n    CLs = 0.0435229 +/- 0.00516945\n    CLs = 0.0427647 +/- 0.00501322\n    CLs = 0.0414894 +/- 0.00479711\n    CLs      = 0.0414894 +/- 0.00479711\n    CLb      = 0.202461 +/- 0.00800632\n    CLsplusb = 0.0084 +/- 0.000912658\n\n\n -- HybridNew, before fit --\nLimit: r < 2.00696 +/- 1.25 [1.50522, 2.13022]\nWarning in : Could not create the Migrad minimizer. Try using the minimizer Minuit\nFit to 5 points: 1.91034 +/- 0.0388334\n\n -- Hybrid New --\nLimit: r < 1.91034 +/- 0.0388334 @ 95% CL\nDone in 0.01 min (cpu), 4.09 min (real)\nFailed to delete temporary file roostats-Sprxsw.root: No such file or directory\n\n


The result stored in the limit branch of the output tree will be the upper limit (and its error, stored in limitErr). The default behaviour will be, as above, to search for the upper limit on r. However, the values of \\(p_{\\mu}, p_{b}\\) and CLs can be calculated for a particular value r=X by specifying the option --singlePoint=X. In this case, the value stored in the branch limit will be the value of CLs (or \\(p_{\\mu}\\)) (see the FAQ section).

"},{"location":"part3/commonstatsmethods/#expected-limits","title":"Expected Limits","text":"

For simple models, we can run HybridNew interactively five times to compute the median expected limit and the 68% and 95% central interval boundaries. For this, we use the HybridNew method with the same options as for the observed limit, but add the option --expectedFromGrid=<quantile>. Here, the quantile should be set to 0.5 for the median, 0.84 for the +ve side of the 68% band, 0.16 for the -ve side of the 68% band, 0.975 for the +ve side of the 95% band, and 0.025 for the -ve side of the 95% band.

\n

The output file will contain the value of the quantile in the branch quantileExpected. This branch can therefore be used to separate the points.
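As an illustration, a minimal Python sketch that runs all five quantiles in a loop (the datacard name is only an example; the quantileExpected branch of each output file then tells you which quantile it corresponds to):

# Run HybridNew once per quantile: median expected limit plus the 68%/95% band edges.
import subprocess

for q in ["0.025", "0.16", "0.5", "0.84", "0.975"]:
    subprocess.run(["combine", "datacard.txt", "-M", "HybridNew",
                    "--LHCmode", "LHC-limits", "--expectedFromGrid", q],
                   check=True)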

"},{"location":"part3/commonstatsmethods/#accuracy","title":"Accuracy","text":"

The search for the limit is performed using an adaptive algorithm, terminating when the uncertainty on the estimated limit falls below a given accuracy or when the precision cannot be improved further with the specified options. The options controlling this behaviour are:

\n
    \n
  • rAbsAcc, rRelAcc: define the accuracy on the limit at which the search stops. The default values are 0.1 and 0.05 respectively, meaning that the search is stopped when \u0394r < 0.1 or \u0394r/r < 0.05.
  • \n
  • clsAcc: this determines the absolute accuracy up to which the CLs values are computed when searching for the limit. The default is 0.5%. Raising the accuracy above this value will significantly increase the time needed to run the algorithm, as you need N2 more toys to improve the accuracy by a factor N. You can consider increasing this value if you are computing limits with a larger CL (e.g. 90% or 68%). Note that if you are using the CLsplusb rule, this parameter will control the uncertainty on \\(p_{\\mu}\\) rather than CLs.
  • \n
  • T or toysH: controls the minimum number of toys that are generated for each point. The default value of 500 should be sufficient when computing the limit at 90-95% CL. You can decrease this number if you are computing limits at 68% CL, or increase it if you are using 99% CL.
  • \n
\n

Note, to further improve the accuracy when searching for the upper limit, Combine will also fit an exponential function to several of the points and interpolate to find the crossing.

"},{"location":"part3/commonstatsmethods/#complex-models","title":"Complex models","text":"

For complicated models, it is best to produce a grid of test statistic distributions at various values of the signal strength, and use it to compute the observed and expected limit and central intervals. This approach is convenient for complex models, since the grid of points can be distributed across any number of jobs. In this approach we will store the distributions of the test statistic at different values of the signal strength using the option --saveHybridResult. The distribution at a single value of r=X can be determined by

\n
combine datacard.txt -M HybridNew --LHCmode LHC-limits --singlePoint X --saveToys --saveHybridResult -T 500 --clsAcc 0\n
\n\n

Warning

\n

We have specified the accuracy here by including --clsAcc=0, which turns off adaptive sampling, and specifying the number of toys to be 500 with the -T N option. For complex models, it may be necessary to internally split the toys over a number of instances of HybridNew using the option --iterations I. The total number of toys will be the product I*N.

\n\n

The above can be repeated several times, in parallel, to build the distribution of the test statistic (passing the random seed option -s -1). Once all of the distributions have been calculated, the resulting output files can be merged into one using hadd, and read back to calculate the limit, specifying the merged file with --grid=merged.root.
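As a sketch of this workflow, the loop below drives the grid production and the merging step from Python; the grid values, the number of repetitions per point and the output file pattern are illustrative choices, and the merged file is then read back as shown below.

# Build a grid of test-statistic distributions: repeat each point with random
# seeds to accumulate toys, then merge everything with hadd.
import glob
import subprocess

for r in [1.4, 1.6, 1.8, 2.0, 2.2]:
    for _ in range(4):
        subprocess.run(["combine", "datacard.txt", "-M", "HybridNew",
                        "--LHCmode", "LHC-limits", "--singlePoint", str(r),
                        "--saveToys", "--saveHybridResult", "-T", "500",
                        "--clsAcc", "0", "-s", "-1"],
                       check=True)

subprocess.run(["hadd", "-f", "merged.root"] + glob.glob("higgsCombine*.HybridNew.*.root"),
               check=True)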

\n

The observed limit can be obtained with

\n
combine datacard.txt -M HybridNew --LHCmode LHC-limits --readHybridResults --grid=merged.root\n
\n

and similarly, the median expected and quantiles can be determined using

\n
combine datacard.txt -M HybridNew --LHCmode LHC-limits --readHybridResults --grid=merged.root --expectedFromGrid <quantile>\n
\n

substituting <quantile> with 0.5 for the median, 0.84 for the +ve side of the 68% band, 0.16 for the -ve side of the 68% band, 0.975 for the +ve side of the 95% band, and 0.025 for the -ve side of the 95% band. You should note that Combine will update the grid to improve the accuracy on the extracted limit by default. If you want to avoid this, you can use the option --noUpdateGrid. This will mean only the toys/points you produced in the grid will be used to compute the limit.

\n\n

Warning

\n

Make sure that if you specified a particular mass value (-m or --mass) in the commands for calculating the toys, you also specify the same mass when reading in the grid of distributions.

\n\n

The splitting of the jobs can be left to the user's preference. However, users may wish to use the combineTool for automating this, as described in the section on combineTool for job submission

"},{"location":"part3/commonstatsmethods/#plotting","title":"Plotting","text":"

A plot of the CLs (or \\(p_{\\mu}\\)) as a function of r, which is used to find the crossing, can be produced using the option --plot=limit_scan.png. This can be useful for judging if the chosen grid was sufficient for determining the upper limit.

\n

If we use our realistic-counting-experiment.txt datacard and generate a grid of points \\(r \\in [1.4,2.2]\\) in steps of 0.1, with 5000 toys for each point, the plot of the observed CLs vs r should look like the following,

\n

\n

You should judge in each case whether the limit is accurate given the spacing of the points and the precision of CLs at each point. If it is not sufficient, simply generate more points closer to the limit and/or more toys at each point.

\n

The distributions of the test statistic can also be plotted, at each value in the grid, using

\n
python test/plotTestStatCLs.py --input mygrid.root --poi r --val all --mass MASS\n
\n

The resulting output file will contain a canvas showing the distributions of the test statistic for the background-only and signal+background hypotheses at each value of r. Use --help to see more options for this script.

\n\n

Info

\n

If you used the TEV or LEP style test statistic (using the commands as described above), then you should include the option --doublesided, which will also take care of defining the correct integrals for \\(p_{\\mu}\\) and \\(p_{b}\\). Click on the examples below to see what a typical output of this plotting tool will look like when using the LHC test statistic, or the TEV test statistic.

\n\n\nqLHC test stat example\n

\n\n\nqTEV test stat example\n

"},{"location":"part3/commonstatsmethods/#computing-significances-with-toys","title":"Computing Significances with toys","text":"

Computation of the expected significance with toys is a two-step procedure: first you need to run one or more jobs to construct the expected distribution of the test statistic. As for setting limits, there are a number of different possible configurations for generating toys. However, we will use the most commonly used option,

\n
    \n
  • LHC-style: --LHCmode LHC-significance\n, which is the shortcut for --testStat LHC --generateNuisances=0 --generateExternalMeasurements=1 --fitNuisances=1 --significance
      \n
    • The test statistic is defined using the ratio of likelihoods \\(q_{0} = -2\\ln[\\mathcal{L}(\\mu=0,\\hat{\\hat{\\nu}}(0))/\\mathcal{L}(\\hat{\\mu},\\hat{\\nu})]\\), in which the nuisance parameters are profiled separately for \\(\\mu=\\hat{\\mu}\\) and \\(\\mu=0\\).
    • \n
    • The value of the test statistic is set to 0 when \\(\\hat{\\mu}<0\\)
    • \n
    • For the purposes of toy generation, the nuisance parameters are fixed to their post-fit values from the data assuming no signal, while the constraint terms are randomized for the evaluation of the likelihood.
    • \n
    \n
  • \n
"},{"location":"part3/commonstatsmethods/#observed-significance","title":"Observed significance","text":"

To construct the distribution of the test statistic, the following command should be run as many times as necessary

\n
combine -M HybridNew datacard.txt --LHCmode LHC-significance  --saveToys --fullBToys --saveHybridResult -T toys -i iterations -s seed\n
\n

with different seeds, or using -s -1 for random seeds, then merge all those results into a single ROOT file with hadd. The toys can then be read back into combine using the option --toysFile=input.root --readHybridResult.

\n

The observed significance can be calculated as

\n
combine -M HybridNew datacard.txt --LHCmode LHC-significance --readHybridResult --toysFile=input.root [--pvalue ]\n
\n

where the option --pvalue will replace the result stored in the limit branch of the output tree with the p-value instead of the significance.
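For reference, the p-value and the significance are related through the one-sided Gaussian tail probability; a quick cross-check in Python (the p-value below is just an example number):

# Convert between p-value and significance (one-sided Gaussian tail).
import ROOT

p = 2.87e-7                               # example p-value (roughly 5 sigma)
z = ROOT.Math.normal_quantile_c(p, 1.0)   # significance
print(z, ROOT.Math.normal_cdf_c(z, 1.0))  # and back to the p-value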

"},{"location":"part3/commonstatsmethods/#expected-significance-assuming-some-signal","title":"Expected significance, assuming some signal","text":"

The expected significance, assuming a signal with r=X, can be calculated by including the option --expectSignal X when generating the distribution of the test statistic and using the option --expectedFromGrid=0.5 when calculating the significance for the median. To get the \u00b11\u03c3 bands, use 0.16 and 0.84 instead of 0.5, and so on.

\n

The total number of background toys needs to be large enough to compute the value of the significance, but you need fewer signal toys (especially when you are only computing the median expected significance). For large significances, you can run most of the toys without the --fullBToys option, which will be about a factor 2 faster. Only a small part of the toys needs to be run with that option turned on.

\n

As with calculating limits with toys, these jobs can be submitted to the grid or batch systems with the help of the combineTool, as described in the section on combineTool for job submission

"},{"location":"part3/commonstatsmethods/#goodness-of-fit-tests","title":"Goodness of fit tests","text":"

The GoodnessOfFit method can be used to evaluate how compatible the observed data are with the model PDF.

\n

This method implements several algorithms, and will compute a goodness of fit indicator for the chosen algorithm and the data. The procedure is therefore to first run on the real data

\n
combine -M GoodnessOfFit datacard.txt --algo=<some-algo>\n
\n

and then to run on many toy MC data sets to determine the distribution of the goodness-of-fit indicator

\n
combine -M GoodnessOfFit datacard.txt --algo=<some-algo> -t <number-of-toys> -s <seed>\n
\n

When computing the goodness-of-fit, by default the signal strength is left floating in the fit, so that the measure is independent of the presence or absence of a signal. It is possible to fix the signal strength to some value by passing the option --fixedSignalStrength=<value>.

\n

The following algorithms are implemented:

\n
    \n
  • \n

    saturated: Compute a goodness-of-fit measure for binned fits based on the saturated model, as prescribed by the Statistics Committee (note). This quantity is similar to a chi-square, but can be computed for an arbitrary combination of binned channels with arbitrary constraints.

    \n
  • \n
  • \n

    KS: Compute a goodness-of-fit measure for binned fits using the Kolmogorov-Smirnov test. It is based on the largest difference between the cumulative distribution function and the empirical distribution function of any bin.

    \n
  • \n
  • \n

    AD: Compute a goodness-of-fit measure for binned fits using the Anderson-Darling test. It is based on the integral of the difference between the cumulative distribution function and the empirical distribution function over all bins. It also gives the tail ends of the distribution a higher weighting.

    \n
  • \n
\n

The output tree will contain a branch called limit, which contains the value of the test statistic in each toy. You can make a histogram of this test statistic \\(t\\). From the distribution that is obtained in this way (\\(f(t)\\)) and the single value obtained by running on the observed data (\\(t_{0}\\)) you can calculate the p-value \\(p = \\int_{t=t_{0}}^{\\mathrm{+inf}} f(t) dt\\). Note: in rare cases the test statistic value for the toys can be undefined (for KS and AD). In this case we set the test statistic value to -1. When plotting the test statistic distribution, those toys should be excluded. This is automatically taken care of if you use the GoF collection script in CombineHarvester, which is described below.
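As an illustration, this p-value can be computed directly from the output trees with a short PyROOT sketch; the file names below are only examples (one run on the data, one merged file of toys):

# Fraction of toys with a test statistic at least as large as the observed one,
# skipping toys where the test statistic is undefined (-1).
import ROOT

f_data = ROOT.TFile.Open("higgsCombineData.GoodnessOfFit.mH120.root")
t0 = [entry.limit for entry in f_data.Get("limit")][0]

f_toys = ROOT.TFile.Open("higgsCombineToys.GoodnessOfFit.mH120.123456.root")
toys = [entry.limit for entry in f_toys.Get("limit") if entry.limit != -1]

p_value = sum(t >= t0 for t in toys) / float(len(toys))
print("t0 = %.3f, p-value = %.3f (%d toys)" % (t0, p_value, len(toys)))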

\n

When generating toys, the default behavior will be used. See the section on toy generation for options that control how nuisance parameters are generated and fitted in these tests. It is recommended to use frequentist toys (--toysFreq) when running the saturated model, and the default toys for the other two tests.

\n

Further goodness-of-fit methods could be added on request, especially if volunteers are available to code them.\nThe output limit tree will contain the value of the test statistic in each toy (or the data).

\n\n

Warning

\n

The above algorithms are all concerned with one-sample tests. For two-sample tests, you can follow an example CMS HIN analysis described in this Twiki

"},{"location":"part3/commonstatsmethods/#masking-analysis-regions-in-the-saturated-model","title":"Masking analysis regions in the saturated model","text":"

For analyses that employ a simultaneous fit across signal and control regions, it may be useful to mask one or more analysis regions, either when the likelihood is maximized (fit) or when the test statistic is computed. This can be done by using the options --setParametersForFit and --setParametersForEval, respectively. The former will set parameters before each fit, while the latter is used to set parameters after each fit, but before the NLL is evaluated. Note, of course, that if the parameter in the list is floating, it will still be floating in each fit. Therefore, it will not affect the results when using --setParametersForFit.

\n

A realistic example for a binned shape analysis performed in one signal region and two control samples can be found in this directory of the Combine package Datacards-shape-analysis-multiple-regions.

\n

First of all, one needs to combine the individual datacards to build a single model, and to introduce the channel masking variables as follows:

\n
combineCards.py signal_region.txt dimuon_control_region.txt singlemuon_control_region.txt > combined_card.txt\ntext2workspace.py combined_card.txt --channel-masks\n
\n

More information about the channel masking can be found in this\nsection Channel Masking. The saturated test statistic value for a simultaneous fit across all the analysis regions can be calculated as:

\n
combine -M GoodnessOfFit -d combined_card.root --algo=saturated -n _result_sb\n
\n

In this case, signal and control regions are included in both the fit and in the evaluation of the test statistic, and the signal strength is freely floating. This measures the compatibility between the signal+background fit and the observed data. Moreover, it can be interesting to assess the level of compatibility between the observed data in all the regions and the background prediction obtained by only fitting the control regions (CR-only fit). This can be evaluated as follows:

\n
combine -M GoodnessOfFit -d combined_card.root --algo=saturated -n _result_bonly_CRonly --setParametersForFit mask_ch1=1 --setParametersForEval mask_ch1=0 --freezeParameters r --setParameters r=0\n
\n

where the signal strength is frozen and the signal region is not considered in the fit (--setParametersForFit mask_ch1=1), but it is included in the test statistic computation (--setParametersForEval mask_ch1=0). To show the differences between the two models being tested, one can perform a fit to the data using the FitDiagnostics method as:

\n
combine -M FitDiagnostics -d combined_card.root -n _fit_result --saveShapes --saveWithUncertainties\ncombine -M FitDiagnostics -d combined_card.root -n _fit_CRonly_result --saveShapes --saveWithUncertainties --setParameters mask_ch1=1\n
\n

By taking the total background, the total signal, and the data shapes from the FitDiagnostics output, we can compare the post-fit predictions from the S+B fit (first case) and the CR-only fit (second case) with the observation as reported below:

\n\nFitDiagnostics S+B fit\n

\n\n\nFitDiagnostics CR-only fit\n

\n\n

To compute a p-value for the two results, one needs to compare the observed goodness-of-fit value previously computed with the expected distribution of the test statistic obtained in toys:

\n
    combine -M GoodnessOfFit combined_card.root --algo=saturated -n result_toy_sb --toysFrequentist -t 500\n    combine -M GoodnessOfFit -d combined_card.root --algo=saturated -n _result_bonly_CRonly_toy --setParametersForFit mask_ch1=1 --setParametersForEval mask_ch1=0 --freezeParameters r --setParameters r=0,mask_ch1=1 -t 500 --toysFrequentist\n
\n

where the former gives the result for the S+B model, while the latter gives the test statistic for the CR-only fit. The option --setParameters r=0,mask_ch1=1 is needed to ensure that toys are thrown using the nuisance parameters estimated from the CR-only fit to the data. The comparison between the observation and the expected distribution should look like the following two plots:

\n\nGoodness-of-fit for S+B model\n

\n\n\nGoodness-of-fit for CR-only model\n

"},{"location":"part3/commonstatsmethods/#making-a-plot-of-the-gof-test-statistic-distribution","title":"Making a plot of the GoF test statistic distribution","text":"

If you have also checked out the combineTool, you can use this to run batch jobs or on the grid (see here) and produce a plot of the results. Once the jobs have completed, you can hadd them together and run (e.g for the saturated model),

\n
combineTool.py -M CollectGoodnessOfFit --input data_run.root toys_run.root -m 125.0 -o gof.json\nplotGof.py gof.json --statistic saturated --mass 125.0 -o gof_plot --title-right=\"my label\"\n
"},{"location":"part3/commonstatsmethods/#channel-compatibility","title":"Channel Compatibility","text":"

The ChannelCompatibilityCheck method can be used to evaluate how compatible the measurements of the signal strength from the separate channels of a combination are with each other.

\n

The method performs two fits of the data, first with the nominal model in which all channels are assumed to have the same signal strength modifier \\(r\\), and then another allowing separate signal strengths \\(r_{i}\\) in each channel. A chisquare-like quantity is computed as \\(-2 \\ln \\mathcal{L}(\\mathrm{data}| r)/\\mathcal{L}(\\mathrm{data}|\\{r_{i}\\}_{i=1}^{N_{\\mathrm{chan}}})\\). Just like for the goodness-of-fit indicators, the expected distribution of this quantity under the nominal model can be computed from toy MC data sets.

\n

By default, the signal strength is kept floating in the fit with the nominal model. It can however be fixed to a given value by passing the option --fixedSignalStrength=<value>.

\n

In the default model built from the datacards, the signal strengths in all channels are constrained to be non-negative. One can allow negative signal strengths in the fits by changing the bound on the variable (option --rMin=<value>), which should make the quantity more chisquare-like under the hypothesis of zero signal; this however can create issues in channels with small backgrounds, since total expected yields and PDFs in each channel must be positive.

\n

Optionally, channels can be grouped together by using the option -g <name_fragment>, where <name_fragment> is a string which is common to all channels to be grouped together. The -g option can also be used to set the range for each POI separately via -g <name>=<min>,<max>.

\n

When run with a verbosity of 1, as is the default, the program also prints out the best fit signal strengths in all channels. As the fit to all channels is done simultaneously, the correlation between the other systematic uncertainties is taken into account. Therefore, these results can differ from the ones obtained when fitting each channel separately.

\n

Below is an example output from Combine,

\n
$ combine -M ChannelCompatibilityCheck comb_hww.txt -m 160 -n HWW\n <<< Combine >>>\n>>> including systematics\n>>> method used to compute upper limit is ChannelCompatibilityCheck\n>>> random number generator seed is 123456\n\nSanity checks on the model: OK\nComputing limit starting from observation\n\n--- ChannelCompatibilityCheck ---\nNominal fit : r = 0.3431 -0.1408/+0.1636\nAlternate fit: r = 0.4010 -0.2173/+0.2724 in channel hww_0jsf_shape\nAlternate fit: r = 0.2359 -0.1854/+0.2297 in channel hww_0jof_shape\nAlternate fit: r = 0.7669 -0.4105/+0.5380 in channel hww_1jsf_shape\nAlternate fit: r = 0.3170 -0.3121/+0.3837 in channel hww_1jof_shape\nAlternate fit: r = 0.0000 -0.0000/+0.5129 in channel hww_2j_cut\nChi2-like compatibility variable: 2.16098\nDone in 0.08 min (cpu), 0.08 min (real)\n
\n

The output tree will contain the value of the compatibility (chi-square variable) in the limit branch. If the option --saveFitResult is specified, the output ROOT file also contains two RooFitResult objects fit_nominal and fit_alternate with the results of the two fits.
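Since the toy values of the compatibility variable are written to the same limit branch, a p-value for the observed value can be estimated with a short PyROOT sketch; the file names below are only examples (one run on the data, one run with -t 500):

# Compare the observed chi-square-like value to its distribution in toys.
import ROOT

f_obs = ROOT.TFile.Open("higgsCombineHWW.ChannelCompatibilityCheck.mH160.root")
obs = [entry.limit for entry in f_obs.Get("limit")][0]

f_toys = ROOT.TFile.Open("higgsCombineHWW.ChannelCompatibilityCheck.mH160.123456.root")
toys = [entry.limit for entry in f_toys.Get("limit")]

p_value = sum(t >= obs for t in toys) / float(len(toys))
print("observed = %.2f, p-value = %.3f" % (obs, p_value))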

\n

This can be read and used to extract the best fit value for each channel, and the overall best fit value, using

\n
$ root -l\nTFile* _file0 = TFile::Open(\"higgsCombineTest.ChannelCompatibilityCheck.mH120.root\");\nfit_alternate->floatParsFinal().selectByName(\"*ChannelCompatibilityCheck*\")->Print(\"v\");\nfit_nominal->floatParsFinal().selectByName(\"r\")->Print(\"v\");\n
\n

The macro cccPlot.cxx can be used to produce a comparison plot of the best fit signal strengths from all channels.

"},{"location":"part3/commonstatsmethods/#likelihood-fits-and-scans","title":"Likelihood Fits and Scans","text":"

The MultiDimFit method can be used to perform multi-dimensional fits and likelihood-based scans/contours using models with several parameters of interest.

\n

Taking a toy datacard data/tutorials/multiDim/toy-hgg-125.txt (counting experiment which vaguely resembles an early H\u2192\u03b3\u03b3 analysis at 125 GeV), we need to convert the datacard into a workspace with 2 parameters, the ggH and qqH cross sections:

\n
text2workspace.py toy-hgg-125.txt -m 125 -P HiggsAnalysis.CombinedLimit.PhysicsModel:floatingXSHiggs --PO modes=ggH,qqH\n
\n

A number of different algorithms can be used with the option --algo <algo>,

\n
    \n
  • \n

    none (default): Perform a maximum likelihood fit combine -M MultiDimFit toy-hgg-125.root; The output ROOT tree will contain two columns, one for each parameter, with the fitted values.

    \n
  • \n
  • \n

    singles: Perform a fit of each parameter separately, treating the other parameters of interest as unconstrained nuisance parameters: combine -M MultiDimFit toy-hgg-125.root --algo singles --cl=0.68 . The output ROOT tree will contain two columns, one for each parameter, with the fitted values; there will be one row with the best fit point (and quantileExpected set to -1) and two rows for each fitted parameter, where the corresponding column will contain the maximum and minimum of that parameter in the 68% CL interval, according to a one-dimensional chi-square (i.e. uncertainties on each fitted parameter do not increase when adding other parameters if they are uncorrelated). Note that if you run, for example, with --cminDefaultMinimizerStrategy=0, these uncertainties will be derived from the Hessian, while --cminDefaultMinimizerStrategy=1 will invoke Minos to derive them.

    \n
  • \n
  • \n

    cross: Perform a joint fit of all parameters: combine -M MultiDimFit toy-hgg-125.root --algo=cross --cl=0.68. The output ROOT tree will have one row with the best fit point, and two rows for each parameter, corresponding to the minimum and maximum of that parameter on the likelihood contour corresponding to the specified CL, according to an N-dimensional chi-square (i.e. the uncertainties on each fitted parameter do increase when adding other parameters, even if they are uncorrelated). Note that this method does not produce 1D uncertainties on each parameter, and should not be taken as such.

    \n
  • \n
  • \n

    contour2d: Make a 68% CL contour \u00e0 la minos combine -M MultiDimFit toy-hgg-125.root --algo contour2d --points=20 --cl=0.68. The output will contain values corresponding to the best fit point (with quantileExpected set to -1) and for a set of points on the contour (with quantileExpected set to 1-CL, or something larger than that if the contour hits the boundary of the parameters). Probabilities are computed from the n-dimensional \\(\\chi^{2}\\) distribution. For slow models, this method can be split by running several times with a different number of points, and merging the outputs. The contourPlot.cxx macro can be used to make plots out of this algorithm.

    \n
  • \n
  • \n

    random: Scan N random points and compute the probability out of the profile likelihood ratio combine -M MultiDimFit toy-hgg-125.root --algo random --points=20 --cl=0.68. Again, the best fit will have quantileExpected set to -1, while each random point will have quantileExpected set to the probability given by the profile likelihood ratio at that point.

    \n
  • \n
  • \n

    fixed: Compare the log-likelihood at a fixed point to the best fit. combine -M MultiDimFit toy-hgg-125.root --algo fixed --fixedPointPOIs r=r_fixed,MH=MH_fixed. The output tree will contain the difference in the negative log-likelihood between the points (\\(\\hat{r},\\hat{m}_{H}\\)) and (\\(r_{\\mathrm{fixed}},m_{H,\\mathrm{fixed}}\\)) in the branch deltaNLL.

    \n
  • \n
  • \n

    grid: Scan a fixed grid of points with approximately N points in total. combine -M MultiDimFit toy-hgg-125.root --algo grid --points=10000.

    \n
      \n
    • You can partition the job in multiple tasks by using the options --firstPoint and --lastPoint. For complicated scans, the points can be split as described in the combineTool for job submission section. The output file will contain a column deltaNLL with the difference in negative log-likelihood with respect to the best fit point. Ranges/contours can be evaluated by filling TGraphs or TH2 histograms with these points.
    • \n
    • By default the \"min\" and \"max\" of the POI ranges are not included and the points that are in the scan are centred, e.g. combine -M MultiDimFit --algo grid --rMin 0 --rMax 5 --points 5 will scan at the points \\(r=0.5, 1.5, 2.5, 3.5, 4.5\\). You can include the option --alignEdges 1, which causes the points to be aligned with the end-points of the parameter ranges - e.g. combine -M MultiDimFit --algo grid --rMin 0 --rMax 5 --points 6 --alignEdges 1 will scan at the points \\(r=0, 1, 2, 3, 4, 5\\). Note - the number of points must be increased by 1 to ensure both end points are included.
    • \n
    \n
  • \n
\n

With the algorithms none and singles you can save the RooFitResult from the initial fit using the option --saveFitResult. The fit result is saved into a new file called multidimfit.root.

\n

As usual, any floating nuisance parameters will be profiled. This behaviour can be modified by using the --freezeParameters option.

\n

For most of the methods, for lower-precision results you can turn off the profiling of the nuisance parameters by using the option --fastScan, which for complex models speeds up the process by several orders of magnitude. All nuisance parameters will be kept fixed at the value corresponding to the best fit point.

\n

As an example, let's produce the \\(-2\\Delta\\ln{\\mathcal{L}}\\) scan as a function of r_ggH and r_qqH from the toy H\u2192\u03b3\u03b3 datacard, with the nuisance parameters fixed to their global best fit values.

\n
combine toy-hgg-125.root -M MultiDimFit --algo grid --points 2000 --setParameterRanges r_qqH=0,10:r_ggH=0,4 -m 125 --fastScan\n
\n\nShow output\n
\n <<< Combine >>>\n>>> including systematics\n>>> method used is MultiDimFit\n>>> random number generator seed is 123456\nModelConfig 'ModelConfig' defines more than one parameter of interest. This is not supported in some statistical methods.\nSet Range of Parameter r_qqH To : (0,10)\nSet Range of Parameter r_ggH To : (0,4)\nComputing results starting from observation (a-posteriori)\n POI: r_ggH= 0.88152 -> [0,4]\n POI: r_qqH= 4.68297 -> [0,10]\nPoint 0/2025, (i,j) = (0,0), r_ggH = 0.044444, r_qqH = 0.111111\nPoint 11/2025, (i,j) = (0,11), r_ggH = 0.044444, r_qqH = 2.555556\nPoint 22/2025, (i,j) = (0,22), r_ggH = 0.044444, r_qqH = 5.000000\nPoint 33/2025, (i,j) = (0,33), r_ggH = 0.044444, r_qqH = 7.444444\nPoint 55/2025, (i,j) = (1,10), r_ggH = 0.133333, r_qqH = 2.333333\nPoint 66/2025, (i,j) = (1,21), r_ggH = 0.133333, r_qqH = 4.777778\nPoint 77/2025, (i,j) = (1,32), r_ggH = 0.133333, r_qqH = 7.222222\nPoint 88/2025, (i,j) = (1,43), r_ggH = 0.133333, r_qqH = 9.666667\nPoint 99/2025, (i,j) = (2,9), r_ggH = 0.222222, r_qqH = 2.111111\nPoint 110/2025, (i,j) = (2,20), r_ggH = 0.222222, r_qqH = 4.555556\nPoint 121/2025, (i,j) = (2,31), r_ggH = 0.222222, r_qqH = 7.000000\nPoint 132/2025, (i,j) = (2,42), r_ggH = 0.222222, r_qqH = 9.444444\nPoint 143/2025, (i,j) = (3,8), r_ggH = 0.311111, r_qqH = 1.888889\nPoint 154/2025, (i,j) = (3,19), r_ggH = 0.311111, r_qqH = 4.333333\nPoint 165/2025, (i,j) = (3,30), r_ggH = 0.311111, r_qqH = 6.777778\nPoint 176/2025, (i,j) = (3,41), r_ggH = 0.311111, r_qqH = 9.222222\nPoint 187/2025, (i,j) = (4,7), r_ggH = 0.400000, r_qqH = 1.666667\nPoint 198/2025, (i,j) = (4,18), r_ggH = 0.400000, r_qqH = 4.111111\nPoint 209/2025, (i,j) = (4,29), r_ggH = 0.400000, r_qqH = 6.555556\nPoint 220/2025, (i,j) = (4,40), r_ggH = 0.400000, r_qqH = 9.000000\n[...]\n\nDone in 0.00 min (cpu), 0.02 min (real)\n
\n\n

The scan, along with the best fit point can be drawn using root,

\n
$ root -l higgsCombineTest.MultiDimFit.mH125.root\n\nlimit->Draw(\"2*deltaNLL:r_ggH:r_qqH>>h(44,0,10,44,0,4)\",\"2*deltaNLL<10\",\"prof colz\")\n\nlimit->Draw(\"r_ggH:r_qqH\",\"quantileExpected == -1\",\"P same\")\nTGraph *best_fit = (TGraph*)gROOT->FindObject(\"Graph\")\n\nbest_fit->SetMarkerSize(3); best_fit->SetMarkerStyle(34); best_fit->Draw(\"p same\")\n
\n

\n

To make the full profiled scan, just remove the --fastScan option from the Combine command.

\n

Similarly, 1D scans can be drawn directly from the tree. However, for 1D likelihood scans there is a python script from the CombineHarvester/CombineTools package, plot1DScan.py, that can be used to make plots and extract the crossings of the 2*deltaNLL - e.g. the 1\u03c3/2\u03c3 boundaries.
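For illustration, the sketch below performs the kind of interpolation that plot1DScan.py automates: it reads the scan points and finds where 2*deltaNLL crosses 1. It assumes a grid scan of r_qqH stored with the default output file name; in practice plot1DScan.py should be preferred.

# Linear interpolation of the 1 sigma crossings (2*deltaNLL = 1) of a 1D scan.
import ROOT

f = ROOT.TFile.Open("higgsCombineTest.MultiDimFit.mH125.root")
points = sorted((e.r_qqH, 2 * e.deltaNLL) for e in f.Get("limit")
                if e.quantileExpected != -1)

crossings = []
for (x1, y1), (x2, y2) in zip(points, points[1:]):
    if (y1 - 1.0) * (y2 - 1.0) < 0:  # the curve crosses 2*deltaNLL = 1 in [x1, x2]
        crossings.append(x1 + (1.0 - y1) * (x2 - x1) / (y2 - y1))
print("1 sigma boundaries:", crossings)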

"},{"location":"part3/commonstatsmethods/#useful-options-for-likelihood-scans","title":"Useful options for likelihood scans","text":"

A number of common, useful options (especially for computing likelihood scans with the grid algo) are,

\n
    \n
  • --autoBoundsPOIs arg: Adjust bounds for the POIs if they end up close to the boundary. This can be a comma-separated list of POIs, or \"*\" to get all of them.
  • \n
  • --autoMaxPOIs arg: Adjust maxima for the POIs if they end up close to the boundary. Can be a list of POIs, or \"*\" to get all.
  • \n
  • --autoRange X: Set to any X >= 0 to do the scan in the \\(\\hat{p}\\) \\(\\pm\\) X\u03c3 range, where \\(\\hat{p}\\) and \u03c3 are the best fit parameter value and uncertainty from the initial fit (so it may be fairly approximate). In case you do not trust the estimate of the error from the initial fit, you can just centre the range on the best fit value by using the option --centeredRange X to do the scan in the \\(\\hat{p}\\) \\(\\pm\\) X range centered on the best fit value.
  • \n
  • --squareDistPoiStep: POI step size based on distance from the midpoint ( either (max-min)/2 or the best fit if used with --autoRange or --centeredRange ) rather than linear separation.
  • \n
  • --skipInitialFit: Skip the initial fit (saves time if, for example, a snapshot is loaded from a previous fit)
  • \n
\n

Below is a comparison in a likelihood scan, with 20 points, as a function of r_qqH with our toy-hgg-125.root workspace with and without some of these options. The options added tell Combine to scan more points closer to the minimum (best-fit) than with the default.

\n

\n

You may find it useful to use the --robustFit=1 option to turn on the robust (brute-force) fit for likelihood scans (and other algorithms). You can set the strategy and tolerance when using the --robustFit option using the options --setRobustFitAlgo (default is Minuit2,migrad), --setRobustFitStrategy (default is 0) and --setRobustFitTolerance (default is 0.1). If these options are not set, the defaults (set using cminDefaultMinimizerX options) will be used.

\n

If running --robustFit=1 with the algo singles, you can tune the accuracy of the routine used to find the crossing points of the likelihood using the option --setCrossingTolerance (the default is set to 0.0001)

\n

If you suspect your fits/uncertainties are not stable, you may also try to run a custom HESSE-style calculation of the covariance matrix. This is enabled by running MultiDimFit with the --robustHesse=1 option. A simple example comparing this to the default behaviour for a simple datacard is given here.

\n

For a full list of options use combine -M MultiDimFit --help

"},{"location":"part3/commonstatsmethods/#fitting-only-some-parameters","title":"Fitting only some parameters","text":"

If your model contains more than one parameter of interest, you can still decide to fit a smaller number of them, using the option --parameters (or -P), with a syntax like this:

\n
combine -M MultiDimFit [...] -P poi1 -P poi2 ... --floatOtherPOIs=(0|1)\n
\n

If --floatOtherPOIs is set to 0, the other parameters of interest (POIs), which are not included as a -P option, are kept fixed to their nominal values. If it's set to 1, they are kept floating, which has different consequences depending on algo:

\n
    \n
  • When running with --algo=singles, the other floating POIs are treated as unconstrained nuisance parameters.
  • \n
  • When running with --algo=cross or --algo=contour2d, the other floating POIs are treated as other POIs, and so they increase the number of dimensions of the chi-square.
  • \n
\n

As a result, when running with --floatOtherPOIs set to 1, the uncertainties on each fitted parameter do not depend on the selection of POIs passed to MultiDimFit, but only on the number of parameters of the model.

\n\n

Info

\n

Note that the poi given to the option -P can also be any nuisance parameter. However, by default, the other nuisance parameters are left floating, so in general this does not need to be specified.

\n\n

You can save the values of the other parameters of interest in the output tree by passing the option --saveInactivePOI=1. You can additionally save the post-fit values of any nuisance parameter, function, or discrete index (RooCategory) defined in the workspace using the following options:

\n
    \n
  • --saveSpecifiedNuis=arg1,arg2,... will store the fitted value of any specified constrained nuisance parameter. Use all to save every constrained nuisance parameter. Note that if you want to store the values of flatParams (or floating parameters that are not defined in the datacard) or rateParams, which are unconstrained, you should instead use the generic option --trackParameters as described here.
  • \n
  • --saveSpecifiedFunc=arg1,arg2,... will store the value of any function (eg RooFormulaVar) in the model.
  • \n
  • --saveSpecifiedIndex=arg1,arg2,... will store the index of any RooCategory object - eg a discrete nuisance.
  • \n
"},{"location":"part3/commonstatsmethods/#using-best-fit-snapshots","title":"Using best fit snapshots","text":"

This can be used to save time when performing scans so that the best fit does not need to be repeated. It can also be used to perform scans with some nuisance parameters frozen to their best-fit values. This can be done as follows,

\n
    \n
  • Create a workspace for a floating \\(r,m_{H}\\) fit
  • \n
\n
text2workspace.py hgg_datacard_mva_8TeV_bernsteins.txt -m 125 -P HiggsAnalysis.CombinedLimit.PhysicsModel:floatingHiggsMass --PO higgsMassRange=120,130 -o testmass.root\n
\n
    \n
  • Perform the fit, saving the workspace
  • \n
\n
combine -m 123 -M MultiDimFit --saveWorkspace -n teststep1 testmass.root  --verbose 9\n
\n

Now we can load the best fit \\(\\hat{r},\\hat{m}_{H}\\) and fit for \\(r\\) freezing \\(m_{H}\\) and lumi_8TeV to their best-fit values,

\n
combine -m 123 -M MultiDimFit -d higgsCombineteststep1.MultiDimFit.mH123.root -w w --snapshotName \"MultiDimFit\" -n teststep2  --verbose 9 --freezeParameters MH,lumi_8TeV\n
"},{"location":"part3/commonstatsmethods/#feldman-cousins","title":"Feldman-Cousins","text":"

The Feldman-Cousins (FC) procedure for computing confidence intervals for a generic model is,

\n
    \n
  • use the profile likelihood ratio as the test statistic, \\(q(x) = - 2 \\ln \\mathcal{L}(x,\\hat{\\hat{\\nu}}(x))/\\mathcal{L}(\\hat{x},\\hat{\\nu})\\) where \\(x\\) is a point in the (N-dimensional) parameter space, and \\(\\hat{x}\\) is the point corresponding to the best fit. In this test statistic, the nuisance parameters are profiled, both in the numerator and denominator.
  • \n
  • for each point \\(x\\):
      \n
    • compute the observed test statistic \\(q_{\\mathrm{obs}}(x)\\)
    • \n
    • compute the expected distribution of \\(q(x)\\) under the hypothesis of \\(x\\) as the true value.
    • \n
    • accept the point in the region if \\(p_{x}=P\\left[q(x) > q_{\\mathrm{obs}}(x)| x\\right] > \\alpha\\)
    • \n
    \n
  • \n
\n

Here \\(\\alpha\\) is the pre-specified critical value.

\n

In Combine, you can perform this test on each individual point (param1, param2,...) = (value1,value2,...) by doing,

\n
combine workspace.root -M HybridNew --LHCmode LHC-feldman-cousins --clsAcc 0 --singlePoint  param1=value1,param2=value2,param3=value3,... --saveHybridResult [Other options for toys, iterations etc as with limits]\n
\n

The point belongs to your confidence region if \\(p_{x}\\) is larger than \\(\\alpha\\) (e.g. 0.3173 for a 1\u03c3 region, \\(1-\\alpha=0.6827\\)).
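As a reminder of where these numbers come from, \\(1-\\alpha = \\mathrm{erf}(n/\\sqrt{2})\\) for an n\u03c3 (two-sided Gaussian) region; a quick check in Python:

# Critical values alpha for 1 and 2 sigma regions, reproducing alpha = 0.3173.
from math import erf, sqrt

for n in (1, 2):
    cl = erf(n / sqrt(2.0))
    print("n = %d sigma: 1 - alpha = %.4f, alpha = %.4f" % (n, cl, 1.0 - cl))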

\n\n

Warning

\n

You should not use this method without the option --singlePoint. Although Combine will not complain, the algorithm to find the crossing will only find a single crossing and therefore not find the correct interval. Instead you should calculate the Feldman-Cousins intervals as described above.

"},{"location":"part3/commonstatsmethods/#physical-boundaries","title":"Physical boundaries","text":"

Imposing physical boundaries (such as requiring \\(\\mu>0\\) for a signal strength) is achieved by setting the ranges of the physics model parameters using

\n
--setParameterRanges param1=param1_min,param1_max:param2=param2_min,param2_max ....\n
\n

The boundary is imposed by restricting the parameter range(s) in the fits to those set by the user. Note that this is a trick! The actual fitted value, as one of an ensemble of outcomes, can fall outside of the allowed region, while the boundary should be imposed on the physical parameter. The effect of restricting the parameter value in the fit is such that the test statistic is modified as follows:

\n

\\(q(x) = - 2 \\ln \\mathcal{L}(x,\\hat{\\hat{\\nu}}(x))/\\mathcal{L}(\\hat{x},\\hat{\\nu})\\), if \\(\\hat{x}\\) is contained in the bounded range

\n

and,

\n

\\(q(x) = - 2 \\ln \\mathcal{L}(x,\\hat{\\hat{\\nu}}(x))/\\mathcal{L}(x_{B},\\hat{\\hat{\\nu}}(x_{B}))\\), if \\(\\hat{x}\\) is outside of the bounded range. Here \\(x_{B}\\) and \\(\\hat{\\hat{\\nu}}(x_{B})\\) are the values of \\(x\\) and \\(\\nu\\) which maximise the likelihood excluding values outside of the bounded region for \\(x\\) - typically, \\(x_{B}\\) will be found at one of the boundaries which is imposed. For example, if the boundary \\(x>0\\) is imposed, you will typically expect \\(x_{B}=0\\), when \\(\\hat{x}\\leq 0\\), and \\(x_{B}=\\hat{x}\\) otherwise.

\n

This can sometimes be an issue as Minuit may not know if it has successfully converged when the minimum lies outside of that range. If there is no upper/lower boundary, just set that value to something far from the region of interest.

\n\n

Info

\n

One can also imagine imposing the boundaries by first allowing Minuit to find the minimum in the unrestricted region and then setting the test statistic to that in the case that the minimum lies outside the physical boundary. This would avoid potential issues of convergence. If you are interested in implementing this version in Combine, please contact the development team.

"},{"location":"part3/commonstatsmethods/#extracting-contours-from-results-files","title":"Extracting contours from results files","text":"

As in general for HybridNew, you can split the task into multiple tasks (grid and/or batch) and then merge the outputs with hadd. You can also refer to the combineTool for job submission section for submitting the jobs to the grid/batch or, if you have more than one parameter of interest, see the instructions for running HybridNew on a grid of parameter points on the CombineHarvester - HybridNewGrid documentation.

"},{"location":"part3/commonstatsmethods/#extracting-1d-intervals","title":"Extracting 1D intervals","text":"

For one-dimensional models only, and if the parameter behaves like a cross section, the code is able to interpolate and determine the values of your parameter on the contour (just like it does for the limits). As with limits, read in the grid of points and extract 1D intervals using,

\n
combine workspace.root -M HybridNew --LHCmode LHC-feldman-cousins --readHybridResults --grid=mergedfile.root --cl <1-alpha>\n
\n

The output tree will contain the values of the POI at which the p-value crosses the critical value (\\(\\alpha\\)) - i.e. the boundaries of the confidence intervals.

\n

You can produce a plot of the value of \\(p_{x}\\) vs the parameter of interest \\(x\\) by adding the option --plot <plotname>.

"},{"location":"part3/commonstatsmethods/#extracting-2d-contours","title":"Extracting 2D contours","text":"

There is a tool for extracting 2D contours from the output of HybridNew located in test/makeFCcontour.py. This can be used provided the option --saveHybridResult was included when running HybridNew. It can be run with the usual Combine output files (or several of them) as input,

\n
./test/makeFCcontour.py  toysfile1.root toysfile2.root .... [options] -out outputfile.root\n
\n

To extract 2D contours, the names of each parameter must be given --xvar poi_x --yvar poi_y. The output will be a ROOT file containing a 2D histogram of value of \\(p_{x,y}\\) for each point \\((x,y)\\) which can be used to draw 2D contours. There will also be a histogram containing the number of toys found for each point.

\n

There are several options for reducing the running time, such as setting limits on the region of interest or the minimum number of toys required for a point to be included. Finally, adding the option --storeToys in this script will add histograms for each point to the output file of the test statistic distribution. This will increase the memory usage, as all of the toys will be kept in memory.

"},{"location":"part3/debugging/","title":"Debugging fits","text":"

When a fit fails there are several things you can do to investigate. CMS users can have a look at these slides from a previous Combine tutorial. This section contains a few pointers for some of the methods mentioned in the slides.

"},{"location":"part3/debugging/#analyzing-the-nll-shape-in-each-parameter","title":"Analyzing the NLL shape in each parameter","text":"

The FastScan mode of combineTool.py can be used to analyze the shape of the NLL as a function of each parameter in the fit model. The NLL is evaluated varying a single parameter at a time, the other parameters stay at the default values they have in the workspace. This produces a file with the NLL, plus its first and second derivatives, as a function of each parameter. Discontinuities in the derivatives, particularly if they are close to the minimum of the parameter, can be the source of issues with the fit.

The usage is as follows:

combineTool.py -M FastScan -w workspace.root:w

Note that this will make use of the data in the workspace for evaluating the NLL. To run this on an asimov data set, with r=1 injected, you can do the following:

combine -M GenerateOnly workspace.root -t -1 --saveToys --setParameters r=1\n\ncombineTool.py -M FastScan -w workspace.root:w -d higgsCombineTest.GenerateOnly.mH120.123456.root:toys/toy_asimov\n

higgsCombineTest.GenerateOnly.mH120.123456.root is generated by the first command; if you pass a value for -m or change the default output file name with -n the file name will be different and you should change the combineTool call accordingly.

"},{"location":"part3/nonstandard/","title":"Advanced Use Cases","text":"

This section will cover some of the more specific use cases for Combine that are not necessarily related to the main results of the analysis.

"},{"location":"part3/nonstandard/#fit-diagnostics","title":"Fit Diagnostics","text":"

If you want to diagnose your limits/fit results, you may first want to look at the HIG PAG standard checks, which are applied to all datacards and can be found here.

If you have already found the Higgs boson but it's an exotic one, instead of computing a limit or significance you might want to extract its cross section by performing a maximum-likelihood fit. Alternatively, you might want to know how compatible your data and your model are, e.g. how strongly your nuisance parameters are constrained, to what extent they are correlated, etc. These general diagnostic tools are contained in the method FitDiagnostics.

    combine -M FitDiagnostics datacard.txt\n

The program will print out the result of two fits. The first one is performed with the signal strength r (or the first POI in the list, in models with multiple POIs) set to zero and a second with floating r. The output ROOT tree will contain the best fit value for r and its uncertainty. You will also get a fitDiagnostics.root file containing the following objects:

nuisances_prefit: RooArgSet containing the pre-fit values of the nuisance parameters, and their uncertainties from the external constraint terms only
fit_b: RooFitResult object containing the outcome of the fit of the data with signal strength set to zero
fit_s: RooFitResult object containing the outcome of the fit of the data with floating signal strength
tree_prefit: TTree of pre-fit nuisance parameter values and constraint terms (_In)
tree_fit_sb: TTree of fitted nuisance parameter values and constraint terms (_In) with floating signal strength
tree_fit_b: TTree of fitted nuisance parameter values and constraint terms (_In) with signal strength set to 0
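A minimal PyROOT sketch of inspecting some of these objects (assuming the default fitDiagnostics.root file name and a POI called r):

# Print the best fit signal strength from the s+b fit result and the pre-fit
# nuisance parameter set stored in fitDiagnostics.root.
import ROOT

f = ROOT.TFile.Open("fitDiagnostics.root")
fit_s = f.Get("fit_s")                    # RooFitResult of the fit with floating r
r = fit_s.floatParsFinal().find("r")
print("r = %.3f +/- %.3f" % (r.getVal(), r.getError()))

f.Get("nuisances_prefit").Print("v")      # pre-fit values and constraint uncertainties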

by including the option --plots, you will additionally find the following contained in the ROOT file:

covariance_fit_s: TH2D covariance matrix of the parameters in the fit with floating signal strength
covariance_fit_b: TH2D covariance matrix of the parameters in the fit with signal strength set to zero
category_variable_prefit: RooPlot of the pre-fit PDFs/templates with the data (or toy if running with -t) overlaid
category_variable_fit_b: RooPlot of the PDFs/templates from the background only fit with the data (or toy if running with -t) overlaid
category_variable_fit_s: RooPlot of the PDFs/templates from the signal+background fit with the data (or toy if running with -t) overlaid

There will be one RooPlot object per category in the likelihood, and one per variable if using a multi-dimensional dataset. For each of these additional objects a png file will also be produced.

Info

If you use the option --name, this additional name will be inserted into the file name for this output file.

As well as the values of the constrained nuisance parameters (and their constraints), you will also find branches for the number of \"bad\" nll calls (which you should check is not too large) and the status of the fit fit_status. The fit status is computed as follows

fit_status = 100 * hesse_status + 10 * minos_status +  minuit_summary_status\n

The minuit_summary_status is the usual status from Minuit, details of which can be found here. For the other status values, check these documentation links for the hesse_status and the minos_status.

A fit status of -1 indicates that the fit failed (Minuit summary was not 0 or 1) and hence the fit result is not valid.
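For convenience, the combined status can be unpacked with a small helper; this assumes each sub-status is a single non-negative digit, which holds for the usual return codes:

# Split fit_status back into its components; negative values are the failure code.
def decode_fit_status(fit_status):
    if fit_status < 0:
        return {"failed": True}
    return {"hesse_status": fit_status // 100,
            "minos_status": (fit_status % 100) // 10,
            "minuit_summary_status": fit_status % 10}

print(decode_fit_status(0))    # all sub-statuses 0
print(decode_fit_status(110))  # hesse = 1, minos = 1, minuit = 0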

"},{"location":"part3/nonstandard/#fit-options","title":"Fit options","text":"
  • If you only want to run the signal+background fit, and do not need the output file, you can run with --justFit. In case you would like to run only the signal+background fit but would like to produce the output file, you should use the option --skipBOnlyFit instead.
  • You can use --rMin and --rMax to set the range of the first POI; a range that is not too large compared with the uncertainties you expect from the fit usually gives more stable and accurate results.
  • By default, the uncertainties are computed using MINOS for the first POI and HESSE for all other parameters. For the nuisance parameters the uncertainties will therefore be symmetric. You can run MINOS for all parameters using the option --minos all, or for none of the parameters using --minos none. Note that running MINOS is slower so you should only consider using it if you think the HESSE uncertainties are not accurate.
  • If MINOS or HESSE fails to converge, you can try running with --robustFit=1. This will do a slower, but more robust, likelihood scan, which can be further controlled with the parameter --stepSize (the default value is 0.1, and is relative to the range of the parameter).
  • The strategy and tolerance when using the --robustFit option can be set using the options --setRobustFitAlgo (default is Minuit2,migrad), --setRobustFitStrategy (default is 0) and --setRobustFitTolerance (default is 0.1). If these options are not set, the defaults (set using cminDefaultMinimizerX options) will be used. You can also tune the accuracy of the routine used to find the crossing points of the likelihood using the option --setCrossingTolerance (the default is set to 0.0001)
  • If you find the covariance matrix provided by HESSE is not accurate (i.e. fit_s->Print() reports this was forced positive-definite) then a custom HESSE-style calculation of the covariance matrix can be used instead. This is enabled by running FitDiagnostics with the --robustHesse 1 option. Please note that the status reported by RooFitResult::Print() will contain covariance matrix quality: Unknown, matrix was externally provided when robustHesse is used, this is normal and does not indicate a problem. NB: one feature of the robustHesse algorithm is that if it still cannot calculate a positive-definite covariance matrix it will try to do so by dropping parameters from the hessian matrix before inverting. If this happens it will be reported in the output to the screen.
  • For other fitting options see the generic minimizer options section.
"},{"location":"part3/nonstandard/#fit-parameter-uncertainties","title":"Fit parameter uncertainties","text":"

If you get a warning message when running FitDiagnostics that says Unable to determine uncertainties on all fit parameters, this means the covariance matrix calculated in FitDiagnostics was not correct.

The most common problem is that the covariance matrix is forced positive-definite. In this case the constraints on fit parameters as taken from the covariance matrix are incorrect and should not be used. In particular, if you want to make post-fit plots of the distribution used in the signal extraction fit and are extracting the uncertainties on the signal and background expectations from the covariance matrix, the resulting values will not reflect the truth if the covariance matrix was incorrect. By default if this happens and you passed the --saveWithUncertainties flag when calling FitDiagnostics, this option will be ignored as calculating the uncertainties would lead to incorrect results. This behaviour can be overridden by passing --ignoreCovWarning.

Such problems with the covariance matrix can be caused by a number of things, for example:

  • Parameters being close to their boundaries after the fit.

  • Strong (anti-) correlations between some parameters.

  • A discontinuity in the NLL function or its derivatives at or near the minimum.

If you are aware that your analysis has any of these features you could try resolving these. Setting --cminDefaultMinimizerStrategy 0 can also help with this problem.

"},{"location":"part3/nonstandard/#pre-and-post-fit-nuisance-parameters","title":"Pre- and post-fit nuisance parameters","text":"

It is possible to compare pre-fit and post-fit nuisance parameter values with the script diffNuisances.py. Taking as input a fitDiagnostics.root file, the script will by default print out the parameters that have changed significantly with respect to their initial estimate. For each of those parameters, it will print out the shift in value and the post-fit uncertainty, both normalized to the initial (pre-fit) value. The linear correlation between the parameter and the signal strength will also be printed.

python diffNuisances.py fitDiagnostics.root\n

The script has several options to toggle the thresholds used to decide whether a parameter has changed significantly, to get the printout of the absolute value of the nuisance parameters, and to get the output in another format for use on a webpage or in a note (the supported formats are html, latex, twiki). To print all of the parameters, use the option --all.

By default, the changes in the nuisance parameter values and uncertainties are given relative to their initial (pre-fit) values (usually relative to initial values of 0 and 1 for most nuisance types).

The values in the output will be \\((\\nu-\\nu_{I})/\\sigma_{I}\\) if the nuisance has a pre-fit uncertainty, otherwise they will be \\(\\nu-\\nu_{I}\\) (for example, a flatParam has no pre-fit uncertainty).

The reported uncertainty will be the ratio \\(\\sigma/\\sigma_{I}\\) - i.e the ratio of the post-fit to the pre-fit uncertainty. If there is no pre-fit uncertainty (as for flatParam nuisances), the post-fit uncertainty is shown.
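These default quantities can also be reproduced by hand for a single constrained nuisance parameter from fitDiagnostics.root; in the sketch below the parameter name is only an example and should be replaced by one from your datacard.

# (nu - nu_I)/sigma_I and sigma/sigma_I for one constrained nuisance parameter,
# taken from the pre-fit set and the s+b RooFitResult saved by FitDiagnostics.
import ROOT

f = ROOT.TFile.Open("fitDiagnostics.root")
name = "lumi_8TeV"  # example parameter name

pre = f.Get("nuisances_prefit").find(name)         # pre-fit value and uncertainty
post = f.Get("fit_s").floatParsFinal().find(name)  # post-fit value and uncertainty

shift = (post.getVal() - pre.getVal()) / pre.getError()
ratio = post.getError() / pre.getError()
print("%s: shift = %+.2f, sigma/sigma_I = %.2f" % (name, shift, ratio))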

To print the pre-fit and post-fit values and (asymmetric) uncertainties, rather than the ratios, the option --abs can be used.

Info

We recommend that you include the options --abs and --all to get the full information on all of the parameters (including unconstrained nuisance parameters) at least once when checking your datacards.

If instead of the nuisance parameter values, you wish to report the pulls, you can do so using the option --pullDef X, with X being one of the options listed below. You should note that since the pulls below are only defined when the pre-fit uncertainty exists, nothing will be reported for parameters that have no prior constraint (except in the case of the unconstPullAsym choice as described below). You may want to run without this option and --all to get information about those parameters.

  • relDiffAsymErrs: This is the same as the default output of the tool, except that only constrained parameters (i.e. where the pre-fit uncertainty is defined) are reported. The uncertainty is also reported and calculated as \\(\\sigma/\\sigma_{I}\\).

  • unconstPullAsym: Report the pull as \\(\\frac{\\nu-\\nu_{I}}{\\sigma}\\), where \\(\\nu_{I}\\) and \\(\\sigma\\) are the initial value and post-fit uncertainty of that nuisance parameter. The pull defined in this way will have no error bar, but all nuisance parameters will have a result in this case.

  • compatAsym: The pull is defined as \\(\\frac{\\nu-\\nu_{D}}{\\sqrt{\\sigma^{2}+\\sigma_{D}^{2}}}\\), where \\(\\nu_{D}\\) and \\(\\sigma_{D}\\) are calculated as \\(\\sigma_{D} = (\\frac{1}{\\sigma^{2}} - \\frac{1}{\\sigma_{I}^{2}})^{-1}\\) and \\(\\nu_{D} = \\sigma_{D}(\\frac{\\nu}{\\sigma^{2}} - \\frac{\\nu_{I}}{\\sigma_{I}^{2}})\\). In this expression \\(\\nu_{I}\\) and \\(\\sigma_{I}\\) are the initial value and uncertainty of that nuisance parameter. This can be thought of as a compatibility between the initial measurement (prior) and an imagined measurement where only the data (with no constraint on the nuisance parameter) is used to measure the nuisance parameter. There is no error bar associated with this value.

  • diffPullAsym: The pull is defined as \\(\\frac{\\nu-\\nu_{I}}{\\sqrt{\\sigma_{I}^{2}-\\sigma^{2}}}\\), where \\(\\nu_{I}\\) and \\(\\sigma_{I}\\) are the pre-fit value and uncertainty (from L. Demortier and L. Lyons). If the denominator is close to 0 or the post-fit uncertainty is larger than the pre-fit (usually due to some failure in the calculation), the pull is not defined and the result will be reported as 0 +/- 999.

If using --pullDef, the results for all parameters for which the pull can be calculated will be shown (i.e. --all will be set to true), not just those that have moved by some metric.

This script also has the option -g outputfile.root to produce plots of the fitted values of the nuisance parameters and their post-fit, asymmetric uncertainties. The pulls defined using one of the options above can be plotted instead by also passing --pullDef X. In addition, this will produce a plot showing a comparison between the post-fit and pre-fit (symmetrized) uncertainties on the nuisance parameters.
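
For example (a minimal illustration using only the options described above; the output file name pulls.root is an arbitrary choice), one could run

python diffNuisances.py fitDiagnostics.root --all --abs\npython diffNuisances.py fitDiagnostics.root --all --pullDef relDiffAsymErrs -g pulls.root\n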

Info

In the above options, if an asymmetric uncertainty is associated with the nuisance parameter, then the choice of which uncertainty is used in the definition of the pull will depend on the sign of \\(\\nu-\\nu_{I}\\).

"},{"location":"part3/nonstandard/#normalizations","title":"Normalizations","text":"

For a certain class of models, like those made from datacards for shape-based analysis, the tool can also compute and save the best fit yields of all processes to the output ROOT file. If this feature is turned on with the option --saveNormalizations, the file will also contain three RooArgSet objects norm_prefit, norm_fit_s, and norm_fit_b. These each contain one RooConstVar for each channel xxx and process yyy with name xxx/yyy and value equal to the best fit yield. You can use RooRealVar::getVal and RooRealVar::getError to estimate both the post-fit (or pre-fit) values and uncertainties of these normalizations.
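
As an illustration, the short pyROOT sketch below (assuming fitDiagnostics.root was produced with --saveNormalizations) prints the best fit yields from the signal+background fit:

import ROOT\n\nf = ROOT.TFile.Open(\"fitDiagnostics.root\")\nnorms = f.Get(\"norm_fit_s\")  # RooArgSet with one entry per channel/process, named xxx/yyy\nit = norms.createIterator()\nvar = it.Next()\nwhile var:\n    # getVal() gives the best fit yield; getError() gives its uncertainty when stored (see text)\n    print(var.GetName(), var.getVal())\n    var = it.Next()\n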

The sample pyROOT macro mlfitNormsToText.py can be used to convert the ROOT file into a text table with four columns: channel, process, yield from the signal+background fit, and yield from the background-only fit. To include the uncertainties in the table, add the option --uncertainties.

Warning

Note that when running with multiple toys, the norm_fit_s, norm_fit_b, and norm_prefit objects will be stored for the last toy dataset generated and so may not be useful to you.

Note that this procedure works only for \"extended likelihoods\" like the ones used in shape-based analysis, not for counting experiment datacards. You can however convert a counting experiment datacard to an equivalent shape-based one by adding a line shapes * * FAKE in the datacard after the imax, jmax, kmax lines. Alternatively, you can use combineCards.py countingcard.txt -S > shapecard.txt to do this conversion.

"},{"location":"part3/nonstandard/#per-bin-norms-for-shape-analyses","title":"Per-bin norms for shape analyses","text":"

If you have a shape-based analysis, you can include the option --savePredictionsPerToy. With this option, additional branches will be filled in the three output trees contained in fitDiagnostics.root.

The normalization values for each toy will be stored in the branches inside the TTrees named n_exp[_final]_binxxx_proc_yyy. The _final will only be there if there are systematic uncertainties affecting this process.

Additionally, there will be branches that provide the value of the expected bin content for each process, in each channel. These are named n_exp[_final]_binxxx_proc_yyy_i (where _final will only be in the name if there are systematic uncertainties affecting this process) for channel xxx, process yyy, bin number i. In the case of the post-fit trees (tree_fit_s/b), these will be the expectations from the fitted models, while for the pre-fit tree, they will be the expectation from the generated model (i.e if running toys with -t N and using --genNuisances, they will be randomized for each toy). These can be useful, for example, for calculating correlations/covariances between different bins, in different channels or processes, within the model from toys.
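
As a sketch of how these branches can be used (the channel name signal and process name background below are hypothetical placeholders - substitute the names from your own datacard), the correlation between the post-fit expectations of two bins over the toys can be inspected with something like:

import ROOT\n\nf = ROOT.TFile.Open(\"fitDiagnostics.root\")\ntree = f.Get(\"tree_fit_sb\")\n# scatter plot of the expected yields in bins 1 and 2 of one channel/process across the toys;\n# drop \"_final\" from the branch names if the process has no systematic uncertainties\ntree.Draw(\"n_exp_final_binsignal_proc_background_1:n_exp_final_binsignal_proc_background_2\")\n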

Info

Be aware that for unbinned models, a binning scheme is adopted based on the RooRealVar::getBinning for the observable defining the shape, if it exists, or Combine will adopt some appropriate binning for each observable.

"},{"location":"part3/nonstandard/#plotting","title":"Plotting","text":"

FitDiagnostics can also produce pre- and post-fit plots of the model along with the data. They will be stored in the same directory as fitDiagnostics.root. To obtain these, you have to specify the option --plots, and then optionally specify the names of the signal and background PDFs/templates, e.g. --signalPdfNames='ggH*,vbfH*' and --backgroundPdfNames='*DY*,*WW*,*Top*' (by default, the definitions of signal and background are taken from the datacard). For models with more than 1 observable, a separate projection onto each observable will be produced.

An alternative is to use the option --saveShapes. This will add additional folders in fitDiagnostics.root for each category, with pre- and post-fit distributions of the signals and backgrounds as TH1s, and the data as TGraphAsymmErrors (with Poisson intervals as error bars).

Info

If you want to save post-fit shapes at a specific r value, add the options --customStartingPoint and --skipSBFit, and set the r value. The result will appear in shapes_fit_b, as described below.

Three additional folders (shapes_prefit, shapes_fit_sb and shapes_fit_b ) will contain the following distributions:

  • data: TGraphAsymmErrors containing the observed data (or toy data if using -t). The vertical error bars correspond to the 68% interval for a Poisson distribution centered on the observed count (Garwood intervals), following the recipe provided by the CMS Statistics Committee.
  • $PROCESS (id <= 0): TH1F for each signal process in each channel, named as in the datacard.
  • $PROCESS (id > 0): TH1F for each background process in each channel, named as in the datacard.
  • total_signal: TH1F, sum over the signal components.
  • total_background: TH1F, sum over the background components.
  • total: TH1F, sum over all of the signal and background components.

The above distributions are provided for each channel included in the datacard, in separate subfolders, named as in the datacard: There will be one subfolder per channel.

Warning

The pre-fit signal is evaluated for r=1 by default, but this can be modified using the option --preFitValue.

The distributions and normalizations are guaranteed to give the correct interpretation:

  • For shape datacards whose inputs are TH1, the histograms/data points will have the bin number as the x-axis and the content of each bin will be a number of events.

  • For datacards whose inputs are RooAbsPdf/RooDataHists, the x-axis will correspond to the observable and the bin content will be the PDF density / events divided by the bin width. This means the absolute number of events in a given bin, i, can be obtained from h.GetBinContent(i)*h.GetBinWidth(i) or similar for the data graphs. Note that for unbinned analyses Combine will make a reasonable guess as to an appropriate binning.

Uncertainties on the shapes will be added with the option --saveWithUncertainties. These uncertainties are generated by re-sampling of the fit covariance matrix, thereby accounting for the full correlation between the parameters of the fit.

Warning

It may be tempting to sum up the uncertainties in each bin (in quadrature) to get the total uncertainty on a process. However, this is (usually) incorrect, as doing so would not account for correlations between the bins. Instead you can refer to the uncertainties which will be added to the post-fit normalizations described above.

Additionally, the covariance matrix between bin yields (or yields/bin-widths) in each channel will also be saved as a TH2F named total_covar. If the covariance between all bins across all channels is desired, this can be added using the option --saveOverallShapes. Each folder will now contain additional distributions (and covariance matrices) corresponding to the concatenation of the bins in each channel (and therefore the covariance between every bin in the analysis). The bin labels should make it clear which bin corresponds to which channel.
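
As a sketch (assuming FitDiagnostics was run with --saveShapes --saveWithUncertainties, and with the channel name signal as a hypothetical placeholder), the saved shapes and the per-channel covariance matrix can be read back as follows:

import ROOT\n\nf = ROOT.TFile.Open(\"fitDiagnostics.root\")\nh_bkg = f.Get(\"shapes_fit_b/signal/total_background\")\ncovar = f.Get(\"shapes_fit_b/signal/total_covar\")\nfor i in range(1, h_bkg.GetNbinsX() + 1):\n    # for parametric (RooAbsPdf) inputs the bin content is a density, so multiply by the bin width;\n    # for TH1-based datacards the content is already a number of events\n    print(i, h_bkg.GetBinContent(i) * h_bkg.GetBinWidth(i))\nprint(covar.GetBinContent(1, 2))  # covariance between the yields in bins 1 and 2\n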

"},{"location":"part3/nonstandard/#toy-by-toy-diagnostics","title":"Toy-by-toy diagnostics","text":"

FitDiagnostics can also be used to diagnose the fitting procedure in toy experiments, to identify potentially problematic nuisance parameters when running the full limits/p-values. This can be done by adding the option -t <num toys>. In the output file fitDiagnostics.root, the three TTrees will contain the fitted values of the constrained parameters for each toy, stored as separate entries. It is recommended to use the following options when investigating toys to reduce the running time: --toysFrequentist --noErrors --minos none

The results can be plotted using the macro test/plotParametersFromToys.C

$ root -l\n.L plotParametersFromToys.C+\nplotParametersFromToys(\"fitDiagnosticsToys.root\",\"fitDiagnosticsData.root\",\"workspace.root\",\"r<0\")\n

The first argument is the name of the output file from running with toys, and the second and third (optional) arguments are the name of the file containing the result from a fit to the data and the workspace (created from text2workspace.py). The fourth argument can be used to specify a cut string applied to one of the branches in the tree, which can be used to correlate strange behaviour with specific conditions. The output will be 2 pdf files (tree_fit_(s)b.pdf) and 2 ROOT files (tree_fit_(s)b.root) containing canvases of the fit results of the tool. For details on the output plots, consult AN-2012/317.

"},{"location":"part3/nonstandard/#scaling-constraints","title":"Scaling constraints","text":"

It is possible to scale the constraints on the nuisance parameters when converting the datacard to a workspace (see the section on physics models) with text2workspace.py. This can be useful for projection studies of the analysis to higher luminosities, or with different assumptions about the sizes of certain systematic uncertainties, without changing the datacard by hand.

We consider two kinds of scaling:

  • A constant scaling factor to scale the constraints.
  • A functional scale factor that depends on some other parameters in the workspace, e.g. a luminosity scale parameter (as a rateParam affecting all processes).

In both cases these scalings can be introduced by adding some extra options at the text2workspace.py step.

To add a constant scaling factor we use the option --X-rescale-nuisance, e.g.

text2workspace.py datacard.txt --X-rescale-nuisance '[some regular expression]' 0.5\n

will create the workspace in which every nuisance parameter whose name matches the specified regular expression will have the width of its Gaussian constraint scaled by a factor of 0.5.

Multiple --X-rescale-nuisance options can be specified to set different scalings for different nuisances (note that you actually have to write --X-rescale-nuisance each time as in --X-rescale-nuisance 'theory.*' 0.5 --X-rescale-nuisance 'exp.*' 0.1).

To add a functional scaling factor we use the option --X-nuisance-function, which works in a similar way. Instead of a constant value you should specify a RooFit factory expression.

A typical case would be scaling by \\(1/\\sqrt{L}\\), where \\(L\\) is a luminosity scale factor. For example, assuming there is some parameter in the datacard/workspace called lumiscale,

text2workspace.py datacard.txt --X-nuisance-function '[some regular expression]' 'expr::lumisyst(\"1/sqrt(@0)\",lumiscale[1])'\n

This factory syntax is flexible, but for our use case the typical format will be: expr::[function name](\"[formula]\", [arg0], [arg1], ...). The arg0, arg1 ... are represented in the formula by @0, @1,... placeholders.

Warning

We are playing a slight trick here with the lumiscale parameter. At the point at which text2workspace.py is building these scaling terms the lumiscale for the rateParam has not yet been created. By writing lumiscale[1] we are telling RooFit to create this variable with an initial value of 1, and then later this will be re-used by the rateParam creation.

A similar option, --X-nuisance-group-function, can be used to scale whole groups of nuisance parameters (see groups of nuisances). Instead of a regular expression, just give the group name,

text2workspace.py datacard.txt --X-nuisance-group-function [group name] 'expr::lumisyst(\"1/sqrt(@0)\",lumiscale[1])'\n
"},{"location":"part3/nonstandard/#nuisance-parameter-impacts","title":"Nuisance parameter impacts","text":"

The impact of a nuisance parameter (NP) \u03b8 on a parameter of interest (POI) \u03bc is defined as the shift \u0394\u03bc that is induced as \u03b8 is fixed and brought to its +1\u03c3 or \u22121\u03c3 post-fit values, with all other parameters profiled as normal (see JHEP 01 (2015) 069 for a description of this method).

This is effectively a measure of the correlation between the NP and the POI, and is useful for determining which NPs have the largest effect on the POI uncertainty.

It is possible to use the FitDiagnostics method of Combine with the option --algo impact -P parameter to calculate the impact of a particular nuisance parameter on the parameter(s) of interest. We will use the combineTool.py script to automate the fits (see the combineTool section for how to check out the tool).

We will use an example workspace from the \\(H\\rightarrow\\tau\\tau\\) datacard,

$ cp HiggsAnalysis/CombinedLimit/data/tutorials/htt/125/htt_tt.txt .\n$ text2workspace.py htt_tt.txt -m 125\n

Calculating the impacts is done in a few stages. First we just fit for each POI, using the --doInitialFit option with combineTool.py, and adding the --robustFit 1 option that will be passed through to Combine,

combineTool.py -M Impacts -d htt_tt.root -m 125 --doInitialFit --robustFit 1\n

Have a look at the options related to --robustFit 1, as described in the section on likelihood scans.

Next we perform a similar scan for each nuisance parameter with the --doFits option,

combineTool.py -M Impacts -d htt_tt.root -m 125 --robustFit 1 --doFits\n

Note that this will run approximately 60 scans, and to speed things up the option --parallel X can be given to run X Combine jobs simultaneously. The batch and grid submission methods described in the combineTool for job submission section can also be used.
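
For example, to run four fits at a time locally (the number 4 here is an arbitrary choice),

combineTool.py -M Impacts -d htt_tt.root -m 125 --robustFit 1 --doFits --parallel 4\n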

Once all jobs are completed, the output can be collected and written into a json file:

combineTool.py -M Impacts -d htt_tt.root -m 125 -o impacts.json\n

A plot summarizing the nuisance parameter values and impacts can be made with plotImpacts.py,

plotImpacts.py -i impacts.json -o impacts\n

The first page of the output is shown below. Note that in these figures, the nuisance parameters are labelled as \\(\\theta\\) instead of \\(\\nu\\).

The direction of the +1\u03c3 and -1\u03c3 impacts (i.e. when the NP is moved to its +1\u03c3 or -1\u03c3 values) on the POI indicates whether the parameter is correlated or anti-correlated with it.

For models with multiple POIs, the Combine option --redefineSignalPOIs X,Y,Z... should be specified in all three of the combineTool.py -M Impacts [...] steps above. The final step will produce the impacts.json file which will contain the impacts for all the specified POIs. In the plotImpacts.py script, a particular POI can be specified with --POI X.

Warning

The plot also shows the best fit value of the POI at the top and its uncertainty. You may wish to allow the range to go negative (i.e using --setParameterRanges or --rMin) to avoid getting one-sided impacts!

This script also accepts an optional json-file argument with -t, which can be used to provide a dictionary for renaming parameters. A simple example would be to create a file rename.json,

{\n  \"r\" : \"#mu\"\n}\n

that will rename the POI label on the plot.

Info

Since combineTool accepts the usual options for combine, you can also generate the impacts on an Asimov or toy dataset.
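
For example, one possible sketch of the initial-fit step on an Asimov dataset with an injected signal is shown below; the same -t -1 --expectSignal 1 options would then also be passed to the --doFits and output-collection steps.

combineTool.py -M Impacts -d htt_tt.root -m 125 --doInitialFit --robustFit 1 -t -1 --expectSignal 1\n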

The left panel in the summary plot shows the value of \\((\\nu-\\nu_{0})/\\Delta_{\\nu}\\) where \\(\\nu\\) and \\(\\nu_{0}\\) are the post and pre-fit values of the nuisance parameter and \\(\\Delta_{\\nu}\\) is the pre-fit uncertainty. The asymmetric error bars show the post-fit uncertainty divided by the pre-fit uncertainty meaning that parameters with error bars smaller than \\(\\pm 1\\) are constrained in the fit. The pull will additionally be shown. As with the diffNuisances.py script, the option --pullDef can be used (to modify the definition of the pull that is shown).

"},{"location":"part3/nonstandard/#breakdown-of-uncertainties","title":"Breakdown of uncertainties","text":"

Often you will want to report the breakdown of your total (systematic) uncertainty on a measured parameter due to one or more groups of nuisance parameters. For example, these groups could be theory uncertainties, trigger uncertainties, and so on. The procedure to do this in Combine is to sequentially freeze groups of nuisance parameters and subtract (in quadrature) the resulting uncertainties from the total uncertainty. Below are the steps to do so. We will use the data/tutorials/htt/125/htt_tt.txt datacard for this.

  1. Add groups to the datacard to group nuisance parameters. Nuisance parameters not in groups will be considered as \"rest\" in the later steps. The lines should look like the following and you should add them to the end of the datacard
theory      group = QCDscale_VH QCDscale_ggH1in QCDscale_ggH2in QCDscale_qqH UEPS pdf_gg pdf_qqbar\ncalibration group = CMS_scale_j_8TeV CMS_scale_t_tautau_8TeV CMS_htt_scale_met_8TeV\nefficiency  group = CMS_eff_b_8TeV   CMS_eff_t_tt_8TeV CMS_fake_b_8TeV\n
  2. Create the workspace with text2workspace.py data/tutorials/htt/125/htt_tt.txt -m 125.

  3. Run a fit with all nuisance parameters floating and store the workspace in an output file - combine data/tutorials/htt/125/htt_tt.root -M MultiDimFit --saveWorkspace -n htt.postfit

  4. Run a scan from the post-fit workspace

combine higgsCombinehtt.postfit.MultiDimFit.mH120.root -M MultiDimFit -n htt.total --algo grid --snapshotName MultiDimFit --setParameterRanges r=0,4\n
  5. Run additional scans using the post-fit workspace, sequentially adding another group to the list of groups to freeze
combine higgsCombinehtt.postfit.MultiDimFit.mH120.root -M MultiDimFit --algo grid --snapshotName MultiDimFit --setParameterRanges r=0,4  --freezeNuisanceGroups theory -n htt.freeze_theory\n\ncombine higgsCombinehtt.postfit.MultiDimFit.mH120.root -M MultiDimFit --algo grid --snapshotName MultiDimFit --setParameterRanges r=0,4  --freezeNuisanceGroups theory,calibration -n htt.freeze_theory_calibration\n\ncombine higgsCombinehtt.postfit.MultiDimFit.mH120.root -M MultiDimFit --algo grid --snapshotName MultiDimFit --setParameterRanges r=0,4  --freezeNuisanceGroups theory,calibration,efficiency -n htt.freeze_theory_calibration_efficiency\n
  6. Run one last scan freezing all of the constrained nuisance parameters (this represents the statistical uncertainty only).
combine higgsCombinehtt.postfit.MultiDimFit.mH120.root -M MultiDimFit --algo grid --snapshotName MultiDimFit --setParameterRanges r=0,4  --freezeParameters allConstrainedNuisances -n htt.freeze_all\n
  7. Use the combineTool script plot1DScan.py to report the breakdown of uncertainties.
plot1DScan.py higgsCombinehtt.total.MultiDimFit.mH120.root --main-label \"Total Uncert.\"  --others higgsCombinehtt.freeze_theory.MultiDimFit.mH120.root:\"freeze theory\":4 higgsCombinehtt.freeze_theory_calibration.MultiDimFit.mH120.root:\"freeze theory+calibration\":7 higgsCombinehtt.freeze_theory_calibration_efficiency.MultiDimFit.mH120.root:\"freeze theory+calibration+efficiency\":2 higgsCombinehtt.freeze_all.MultiDimFit.mH120.root:\"stat only\":6  --output breakdown --y-max 10 --y-cut 40 --breakdown \"theory,calibration,efficiency,rest,stat\"\n

The final step calculates the contribution of each group of nuisance parameters as the subtraction in quadrature of each scan from the previous one. This procedure guarantees that the sum in quadrature of the individual components is the same as the total uncertainty.
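
For example, with the groups defined above, the contribution of the theory group is estimated from the total uncertainty and the uncertainty of the scan with the theory group frozen as

\\[ \\sigma_{\\mathrm{theory}} = \\sqrt{\\sigma_{\\mathrm{total}}^{2} - \\sigma_{\\mathrm{freeze\\ theory}}^{2}}, \\]

and similarly for the other groups, each time subtracting (in quadrature) the uncertainty of the scan with one more group frozen.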

The plot below is produced.

Warning

While the above procedure is guaranteed to have the effect that the sum in quadrature of the breakdown will equal the total uncertainty, the order in which you freeze the groups can make a difference due to correlations induced by the fit. You should check whether the answers change significantly when changing the order. We recommend that you start with the largest group (in terms of overall contribution to the uncertainty) first, working down the list in order of the size of the contribution.

"},{"location":"part3/nonstandard/#channel-masking","title":"Channel Masking","text":"

The Combine tool has a number of features for diagnostics and plotting results of fits. It can often be useful to turn off particular channels in a combined analysis to see how constraints/shifts in parameter values can vary. It can also be helpful to plot the post-fit shapes and uncertainties of a particular channel (for example a signal region) without including the constraints from the data in that region.

This can in some cases be achieved by removing a specific datacard when running combineCards.py. However, when doing so, the information about particular nuisance parameters and PDFs in that region will be lost. Instead, it is possible to mask that channel from the likelihood. This is achieved at the text2workspace.py step using the option --channel-masks.

"},{"location":"part3/nonstandard/#example-removing-constraints-from-the-signal-region","title":"Example: removing constraints from the signal region","text":"

We will take the control region example from the rate parameters tutorial from data/tutorials/rate_params/.

The first step is to combine the cards combineCards.py signal=signal_region.txt dimuon=dimuon_control_region.txt singlemuon=singlemuon_control_region.txt > datacard.txt

Note that we use the directive CHANNELNAME=CHANNEL_DATACARD.txt so that the names of the channels are under our control and easier to interpret. Next, we make a workspace and tell Combine to create the parameters used to mask channels

text2workspace.py datacard.txt --channel-masks\n

Now we will try to do a fit ignoring the signal region. We can turn off the signal region by setting the corresponding channel-mask parameter to 1: --setParameters mask_signal=1. Note that text2workspace.py has created a masking parameter for every channel with the naming scheme mask_CHANNELNAME. By default, every masking parameter is set to 0, so that all channels are unmasked.

combine datacard.root -M FitDiagnostics --saveShapes --saveWithUncertainties --setParameters mask_signal=1\n

Warning

There will be a lot of warnings from Combine. These are safe to ignore as they are due to the s+b fit not converging. This is expected as the free signal parameter cannot be constrained because the data in the signal region is being ignored.

We can compare the post-fit background and uncertainties with and without the signal region included by re-running with --setParameters mask_signal=0 (or just removing that option completely). Below is a comparison of the background in the signal region with and without masking the data in the signal region. We take these from the shapes folder shapes_fit_b/signal/total_background in the fitDiagnostics.root output.
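
A minimal pyROOT sketch of such a comparison, assuming the two fitDiagnostics.root outputs have been renamed by hand to fitDiagnostics_masked.root and fitDiagnostics_unmasked.root (hypothetical names):

import ROOT\n\nf_masked = ROOT.TFile.Open(\"fitDiagnostics_masked.root\")\nf_unmasked = ROOT.TFile.Open(\"fitDiagnostics_unmasked.root\")\nh_masked = f_masked.Get(\"shapes_fit_b/signal/total_background\")\nh_unmasked = f_unmasked.Get(\"shapes_fit_b/signal/total_background\")\n\ncanv = ROOT.TCanvas()\nh_masked.SetLineColor(ROOT.kRed)\nh_masked.Draw(\"E\")         # post-fit background with the signal region masked\nh_unmasked.Draw(\"E same\")  # post-fit background with the signal region included\ncanv.SaveAs(\"signal_region_background_comparison.pdf\")\n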

Clearly the background shape is different and much less constrained without including the signal region, as expected. Channel masking can be used with any method in Combine.

"},{"location":"part3/nonstandard/#roomultipdf-conventional-bias-studies","title":"RooMultiPdf conventional bias studies","text":"

Several analyses in CMS use a functional form to describe the background. This functional form is fit to the data. Often however, there is some uncertainty associated with the choice of which background function to use, and this choice will impact the fit results. It is therefore often the case that in these analyses, a bias study is performed. This study will give an indication of the size of the potential bias in the result, given a certain choice of functional form. These studies can be conducted using Combine.

Below is an example script that will produce a workspace based on a simplified Higgs to diphoton (Hgg) analysis with a single category. It will produce the data and PDFs necessary for this example, and you can use it as a basis to construct your own studies.

void makeRooMultiPdfWorkspace(){\n\n   // Load the combine Library\n   gSystem->Load(\"libHiggsAnalysisCombinedLimit.so\");\n\n   // mass variable\n   RooRealVar mass(\"CMS_hgg_mass\",\"m_{#gamma#gamma}\",120,100,180);\n\n\n   // create 3 background pdfs\n   // 1. exponential\n   RooRealVar expo_1(\"expo_1\",\"slope of exponential\",-0.02,-0.1,-0.0001);\n   RooExponential exponential(\"exponential\",\"exponential pdf\",mass,expo_1);\n\n   // 2. polynomial with 2 parameters\n   RooRealVar poly_1(\"poly_1\",\"T1 of chebychev polynomial\",0,-3,3);\n   RooRealVar poly_2(\"poly_2\",\"T2 of chebychev polynomial\",0,-3,3);\n   RooChebychev polynomial(\"polynomial\",\"polynomial pdf\",mass,RooArgList(poly_1,poly_2));\n\n   // 3. A power law function\n   RooRealVar pow_1(\"pow_1\",\"exponent of power law\",-3,-6,-0.0001);\n   RooGenericPdf powerlaw(\"powerlaw\",\"TMath::Power(@0,@1)\",RooArgList(mass,pow_1));\n\n   // Generate some data (lets use the power lay function for it)\n   // Here we are using unbinned data, but binning the data is also fine\n   RooDataSet *data = powerlaw.generate(mass,RooFit::NumEvents(1000));\n\n   // First we fit the pdfs to the data (gives us a sensible starting value of parameters for, e.g - blind limits)\n   exponential.fitTo(*data);   // index 0\n   polynomial.fitTo(*data);   // index 1\n   powerlaw.fitTo(*data);     // index 2\n\n   // Make a plot (data is a toy dataset)\n   RooPlot *plot = mass.frame();   data->plotOn(plot);\n   exponential.plotOn(plot,RooFit::LineColor(kGreen));\n   polynomial.plotOn(plot,RooFit::LineColor(kBlue));\n   powerlaw.plotOn(plot,RooFit::LineColor(kRed));\n   plot->SetTitle(\"PDF fits to toy data\");\n   plot->Draw();\n\n   // Make a RooCategory object. This will control which of the pdfs is \"active\"\n   RooCategory cat(\"pdf_index\",\"Index of Pdf which is active\");\n\n   // Make a RooMultiPdf object. The order of the pdfs will be the order of their index, ie for below\n   // 0 == exponential\n   // 1 == polynomial\n   // 2 == powerlaw\n   RooArgList mypdfs;\n   mypdfs.add(exponential);\n   mypdfs.add(polynomial);\n   mypdfs.add(powerlaw);\n\n   RooMultiPdf multipdf(\"roomultipdf\",\"All Pdfs\",cat,mypdfs);\n   // By default the multipdf will tell combine to add 0.5 to the nll for each parameter (this is the penalty for the discrete profiling method)\n   // It can be changed with\n   //   multipdf.setCorrectionFactor(penalty)\n   // For bias-studies, this isn;t relevant however, so lets just leave the default\n\n   // As usual make an extended term for the background with _norm for freely floating yield\n   RooRealVar norm(\"roomultipdf_norm\",\"Number of background events\",1000,0,10000);\n\n   // We will also produce a signal model for the bias studies\n   RooRealVar sigma(\"sigma\",\"sigma\",1.2); sigma.setConstant(true);\n   RooRealVar MH(\"MH\",\"MH\",125); MH.setConstant(true);\n   RooGaussian signal(\"signal\",\"signal\",mass,MH,sigma);\n\n\n   // Save to a new workspace\n   TFile *fout = new TFile(\"workspace.root\",\"RECREATE\");\n   RooWorkspace wout(\"workspace\",\"workspace\");\n\n   data->SetName(\"data\");\n   wout.import(*data);\n   wout.import(cat);\n   wout.import(norm);\n   wout.import(multipdf);\n   wout.import(signal);\n   wout.Print();\n   wout.Write();\n}\n

The signal is modelled as a simple Gaussian with a width approximately that of the diphoton resolution. For the background there is a choice of 3 functions: an exponential, a power-law, and a 2nd order polynomial. This choice is accessible within Combine through the use of the RooMultiPdf object, which can switch between the functions by setting their associated indices (herein called pdf_index). This (as with all parameters in Combine) can be set via the --setParameters option.

To assess the bias, one can throw toys using one function and fit with another. To do this, only a single datacard is needed: hgg_toy_datacard.txt.

The bias studies are performed in two stages. The first is to generate toys using one of the functions, under some value of the signal strength r (or \\(\\mu\\)). This can be repeated for several values of r and also at different masses, but in this example the Higgs boson mass is fixed to 125 GeV.

    combine hgg_toy_datacard.txt -M GenerateOnly --setParameters pdf_index=0 --toysFrequentist -t 100 --expectSignal 1 --saveToys -m 125 --freezeParameters pdf_index\n

Warning

It is important to freeze pdf_index, otherwise Combine will try to iterate over the index in the frequentist fit.

Now we have 100 toys which were generated with pdf_index=0, i.e. with the exponential function as the background PDF; we are therefore assuming that the exponential is the true function. Note that the option --toysFrequentist is added; this first performs a fit of the PDF, assuming a signal strength of 1, to the data before generating the toys. This is the most natural choice of point from which to throw the toys.

The next step is to fit the toys under a different background PDF hypothesis. This time we set pdf_index to 1, which selects the second-order polynomial, and run fits with the FitDiagnostics method, again freezing pdf_index.

    combine hgg_toy_datacard.txt -M FitDiagnostics  --setParameters pdf_index=1 --toysFile higgsCombineTest.GenerateOnly.mH125.123456.root  -t 100 --rMin -10 --rMax 10 --freezeParameters pdf_index --cminDefaultMinimizerStrategy=0\n

Note how we add the option --cminDefaultMinimizerStrategy=0. This is because we do not need the Hessian, as FitDiagnostics will run MINOS to get the uncertainty on r. If we do not do this, Minuit will think the fit failed as we have parameters (those not attached to the current PDF) for which the likelihood is flat.

Warning

You may get warnings about non-accurate errors such as [WARNING]: Unable to determine uncertainties on all fit parameters in b-only fit - These can be ignored since they are related to the free parameters of the background PDFs which are not active.

In the output file fitDiagnostics.root there is a tree that contains the best fit results under the signal+background hypothesis. One measure of the bias is the pull, defined as the difference between the measured value of \\(\\mu\\) and the generated value (here we used 1), relative to the uncertainty on \\(\\mu\\). The pull distribution can be drawn, and its mean provides an estimate of the bias. In this example, we are averaging the positive and negative uncertainties, but we could do something smarter if the uncertainties are very asymmetric.

root -l fitDiagnostics.root\ntree_fit_sb->Draw(\"(r-1)/(0.5*(rHiErr+rLoErr))>>h(20,-5,5)\")\nh->Fit(\"gaus\")\n

From the fitted Gaussian, we see the mean is at -1.29, which would indicate a bias of 129% of the uncertainty on mu from choosing the polynomial when the true function is an exponential.

"},{"location":"part3/nonstandard/#discrete-profiling","title":"Discrete profiling","text":"

If the discrete nuisance is left floating, it will be profiled by looping through the possible index values and finding the PDF that gives the best fit. This allows for the discrete profiling method to be applied for any method which involves a profiled likelihood (frequentist methods).

Warning

You should be careful since MINOS knows nothing about the discrete nuisances and hence estimations of uncertainties will be incorrect via MINOS. Instead, uncertainties from scans and limits will correctly account for these nuisance parameters. Currently the Bayesian methods will not properly treat the nuisance parameters, so some care should be taken when interpreting Bayesian results.

As an example, we can perform a likelihood scan as a function of the Higgs boson signal strength in the toy Hgg datacard. By leaving the object pdf_index non-constant, at each point in the likelihood scan the PDFs will be iterated over, and the one that gives the lowest value of -2 times the log-likelihood, including the correction factor \\(c\\) (as defined in the paper linked above), will be stored in the output tree. We can also run the scan with each PDF fixed individually, to check that the envelope is achieved. For this, you will need to include the option --X-rtd REMOVE_CONSTANT_ZERO_POINT=1. In this way, we can take a look at the absolute value to compare the curves, if we also include --saveNLL.

For example for a full scan, you can run

    combine -M MultiDimFit -d hgg_toy_datacard.txt --algo grid --setParameterRanges r=-1,3 --cminDefaultMinimizerStrategy 0 --saveNLL -n Envelope -m 125 --setParameters myIndex=-1 --X-rtd REMOVE_CONSTANT_ZERO_POINT=1\n

and for the individual pdf_index set to X,

    combine -M MultiDimFit -d hgg_toy_datacard.txt --algo grid --setParameterRanges r=-1,3 --cminDefaultMinimizerStrategy 0 --saveNLL --freezeParameters pdf_index --setParameters pdf_index=X -n fixed_pdf_X -m 125 --X-rtd REMOVE_CONSTANT_ZERO_POINT=1\n

for X=0,1,2

You can then plot the value of 2*(deltaNLL+nll+nll0) to obtain the absolute value of (twice) the negative log-likelihood, including the correction term for the extra parameters in the different PDFs.

The above output will produce the following scans.

As expected, the curve obtained by allowing the pdf_index to float (labelled \"Envelope\") picks out the best function (maximum corrected likelihood) for each value of the signal strength.

In general, the performance of Combine can be improved when using the discrete profiling method by including the option --X-rtd MINIMIZER_freezeDisassociatedParams. This will stop parameters not associated to the current PDF from floating in the fits. Additionally, you can include the following options:

  • --X-rtd MINIMIZER_multiMin_hideConstants: hide the constant terms in the likelihood when recreating the minimizer
  • --X-rtd MINIMIZER_multiMin_maskConstraints: hide the constraint terms during the discrete minimization process
  • --X-rtd MINIMIZER_multiMin_maskChannels=<choice> mask the channels that are not needed from the NLL:
  • <choice> 1: keeps unmasked all channels that are participating in the discrete minimization.
  • <choice> 2: keeps unmasked only the channel whose index is being scanned at the moment.

You may want to check with the Combine development team if you are using these options, as they are somewhat for expert use.

"},{"location":"part3/nonstandard/#roosplinend-multidimensional-splines","title":"RooSplineND multidimensional splines","text":"

RooSplineND can be used to interpolate from a tree of points to produce a continuous function in N-dimensions. This function can then be used as input to workspaces allowing for parametric rates/cross-sections/efficiencies. It can also be used to up-scale the resolution of likelihood scans (i.e like those produced from Combine) to produce smooth contours.

The spline makes use of a radial basis decomposition to produce a continuous \\(N \\to 1\\) map (function) from \\(M\\) provided sample points. The function of the \\(N\\) variables \\(\\vec{x}\\) is assumed to be of the form,

\\[ f(\\vec{x}) = \\sum_{i=1}^{M}w_{i}\\phi(||\\vec{x}-\\vec{x}_{i}||), \\]

where \\(\\phi(||\\vec{z}||) = e^{-\\frac{||\\vec{z}||}{\\epsilon^{2}}}\\). The distance \\(||.||\\) between two points is given by,

\\[ ||\\vec{x}-\\vec{y}|| = \\sum_{j=1}^{N}(x_{j}-y_{j})^{2}, \\]

if the option rescale=false and,

\\[ ||\\vec{x}-\\vec{y}|| = \\sum_{j=1}^{N} M^{1/N} \\cdot \\left( \\frac{ x_{j}-y_{j} }{ \\mathrm{max_{i=1,M}}(x_{i,j})-\\mathrm{min_{i=1,M}}(x_{i,j}) }\\right)^{2}, \\]

if the option rescale=true. Given the sample points, it is possible to determine the weights \\(w_{i}\\) as the solution of the set of equations,

\\[ \\sum_{i=1}^{M}w_{i}\\phi(||\\vec{x}_{j}-\\vec{x}_{i}||) = f(\\vec{x}_{j}). \\]
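
Equivalently, defining the \\(M \\times M\\) matrix \\(\\Phi_{ji} = \\phi(||\\vec{x}_{j}-\\vec{x}_{i}||)\\) and the vector \\(f_{j} = f(\\vec{x}_{j})\\), the weights solve the linear system,

\\[ \\Phi \\vec{w} = \\vec{f}. \\]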

The solution to this linear system is obtained using the Eigen C++ package.

The typical constructor of the object is as follows:

RooSplineND(const char *name, const char *title, RooArgList &vars, TTree *tree, const char* fName=\"f\", double eps=3., bool rescale=false, std::string cutstring=\"\" ) ;\n

where the arguments are:

  • vars: a RooArgList of RooRealVars representing the \\(N\\) dimensions of the spline. The length of this list determines the dimension \\(N\\) of the spline.
  • tree: a TTree pointer where each entry represents a sample point used to construct the spline. The branch names must correspond to the names of the variables in vars.
  • fName: a string giving the name of the branch to interpret as the target function \\(f\\).
  • eps: the value of \\(\\epsilon\\), which sets the width of the basis functions \\(\\phi\\).
  • rescale: an option to rescale the input sample points so that each variable has roughly the same range (see above in the definition of \\(||.||\\)).
  • cutstring: a string used to remove sample points from the tree. It can be any typical cut string (e.g. \"var1>10 && var2<3\").

The object can be treated as a RooAbsArg; its value for the current values of the parameters is obtained as usual by using the getVal() method.

Warning

You should not include more variable branches than contained in vars in the tree, as the spline will interpret them as additional sample points. You will get a warning if there are two nearby points in the input samples and this will cause a failure in determining the weights. If you cannot create a reduced tree, you can remove entries by using the cutstring.

The following script is an example that produces a 2D spline (N=2) from a set of 400 points (M=400) generated from a function.

Show script
void splinend(){\n   // library containing the RooSplineND\n   gSystem->Load(\"libHiggsAnalysisCombinedLimit.so\");\n\n   TTree *tree = new TTree(\"tree_vals\",\"tree_vals\");\n   float xb,yb,fb;\n\n   tree->Branch(\"f\",&fb,\"f/F\");\n   tree->Branch(\"x\",&xb,\"x/F\");\n   tree->Branch(\"y\",&yb,\"y/F\");\n\n   TRandom3 *r = new TRandom3();\n   int nentries = 20; // just use a regular grid of 20x20=400 points\n\n   double xmin = -3.2;\n   double xmax = 3.2;\n   double ymin = -3.2;\n   double ymax = 3.2;\n\n   for (int n=0;n<nentries;n++){\n    for (int k=0;k<nentries;k++){\n\n      xb=xmin+n*((xmax-xmin)/nentries);\n      yb=ymin+k*((ymax-ymin)/nentries);\n      // Gaussian * cosine function radial in \"F(x^2+y^2)\"\n      double R = (xb*xb)+(yb*yb);\n      fb = 0.1*TMath::Exp(-1*(R)/9)*TMath::Cos(2.5*TMath::Sqrt(R));\n      tree->Fill();\n     }\n   }\n\n   // 2D graph of points in tree\n   TGraph2D *p0 = new TGraph2D();\n   p0->SetMarkerSize(0.8);\n   p0->SetMarkerStyle(20);\n\n   int c0=0;\n   for (int p=0;p<tree->GetEntries();p++){\n        tree->GetEntry(p);\n        p0->SetPoint(c0,xb,yb,fb);\n        c0++;\n        }\n\n\n   // ------------------------------ THIS IS WHERE WE BUILD THE SPLINE ------------------------ //\n   // Create 2 Real-vars, one for each of the parameters of the spline\n   // The variables MUST be named the same as the corresponding branches in the tree\n   RooRealVar x(\"x\",\"x\",0.1,xmin,xmax);\n   RooRealVar y(\"y\",\"y\",0.1,ymin,ymax);\n\n\n   // And the spline - arguments are\n   // Required ->   name, title, arglist of dependants, input tree,\n   // Optional ->  function branch name, interpolation width (tunable parameter), rescale Axis bool, cutstring\n   // The tunable parameter gives the radial basis a \"width\", over which the interpolation will be effectively taken\n\n   // the reascale Axis bool (if true) will first try to rescale the points so that they are of order 1 in range\n   // This can be helpful if for example one dimension is in much larger units than another.\n\n   // The cutstring is just a ROOT string which can be used to apply cuts to the tree in case only a sub-set of the points should be used\n\n   RooArgList args(x,y);\n   RooSplineND *spline = new RooSplineND(\"spline\",\"spline\",args,tree,\"f\",1,true);\n      // ----------------------------------------------------------------------------------------- //\n\n\n   //TGraph *gr = spline->getGraph(\"x\",0.1); // Return 1D graph. Will be a slice of the spline for fixed y generated at steps of 0.1\n\n   // Plot the 2D spline\n   TGraph2D *gr = new TGraph2D();\n   int pt = 0;\n   for (double xx=xmin;xx<xmax;xx+=0.1){\n     for (double yy=xmin;yy<ymax;yy+=0.1){\n        x.setVal(xx);\n        y.setVal(yy);\n        gr->SetPoint(pt,xx,yy,spline->getVal());\n        pt++;\n     }\n   }\n\n   gr->SetTitle(\"\");\n\n   gr->SetLineColor(1);\n   //p0->SetTitle(\"0.1 exp(-(x{^2}+y{^2})/9) #times Cos(2.5#sqrt{x^{2}+y^{2}})\");\n   gr->Draw(\"surf\");\n   gr->GetXaxis()->SetTitle(\"x\");\n   gr->GetYaxis()->SetTitle(\"y\");\n   p0->Draw(\"Pcolsame\");\n\n   //p0->Draw(\"surfsame\");\n   TLegend *leg = new TLegend(0.2,0.82,0.82,0.98);\n   leg->SetFillColor(0);\n   leg->AddEntry(p0,\"0.1 exp(-(x{^2}+y{^2})/9) #times Cos(2.5#sqrt{x^{2}+y^{2}})\",\"p\");\n   leg->AddEntry(gr,\"RooSplineND (N=2) interpolation\",\"L\");\n   leg->Draw();\n}\n

Running the script will produce the following plot. The plot shows the sampled points and the spline produced from them.

"},{"location":"part3/nonstandard/#rooparametrichist-gamman-for-shapes","title":"RooParametricHist gammaN for shapes","text":"

Currently, there is no straightforward implementation of using per-bin gmN-like uncertainties with shape (histogram) analyses. Instead, it is possible to tie control regions (written as datacards) with the signal region using three methods.

For analyses that take the normalization of some process from a control region, it is possible to use either lnU or rateParam directives to float the normalization in a correlated way of some process between two regions. Instead if each bin is intended to be determined via a control region, one can use a number of RooFit histogram PDFs/functions to accomplish this. The example below shows a simple implementation of a RooParametricHist to achieve this.

Copy the script below into a file called examplews.C and create the input workspace using root -l examplews.C...

Show script
void examplews(){\n    // As usual, load the combine library to get access to the RooParametricHist\n    gSystem->Load(\"libHiggsAnalysisCombinedLimit.so\");\n\n    // Output file and workspace\n    TFile *fOut = new TFile(\"param_ws.root\",\"RECREATE\");\n    RooWorkspace wspace(\"wspace\",\"wspace\");\n\n    // better to create the bins rather than use the \"nbins,min,max\" to avoid spurious warning about adding bins with different\n    // ranges in combine - see https://root-forum.cern.ch/t/attempt-to-divide-histograms-with-different-bin-limits/17624/3 for why!\n    const int nbins = 4;\n    double xmin=200.;\n    double xmax=1000.;\n    double xbins[5] = {200.,400.,600.,800.,1000.};\n\n    // A search in a MET tail, define MET as our variable\n\n    RooRealVar met(\"met\",\"E_{T}^{miss}\",200,xmin,xmax);\n    RooArgList vars(met);\n\n\n    // ---------------------------- SIGNAL REGION -------------------------------------------------------------------//\n    // Make a dataset, this will be just four bins in MET.\n    // its easiest to make this from a histogram. Set the contents to \"somehting\"\n    TH1F data_th1(\"data_obs_SR\",\"Data observed in signal region\",nbins,xbins);\n\n    data_th1.SetBinContent(1,100);\n    data_th1.SetBinContent(2,50);\n    data_th1.SetBinContent(3,25);\n    data_th1.SetBinContent(4,10);\n    RooDataHist data_hist(\"data_obs_SR\",\"Data observed\",vars,&data_th1);\n    wspace.import(data_hist);\n\n    // In the signal region, our background process will be freely floating,\n    // Create one parameter per bin representing the yield. (note of course we can have multiple processes like this)\n    RooRealVar bin1(\"bkg_SR_bin1\",\"Background yield in signal region, bin 1\",100,0,500);\n    RooRealVar bin2(\"bkg_SR_bin2\",\"Background yield in signal region, bin 2\",50,0,500);\n    RooRealVar bin3(\"bkg_SR_bin3\",\"Background yield in signal region, bin 3\",25,0,500);\n    RooRealVar bin4(\"bkg_SR_bin4\",\"Background yield in signal region, bin 4\",10,0,500);\n    RooArgList bkg_SR_bins;\n    bkg_SR_bins.add(bin1);\n    bkg_SR_bins.add(bin2);\n    bkg_SR_bins.add(bin3);\n    bkg_SR_bins.add(bin4);\n\n    // Create a RooParametericHist which contains those yields, last argument is just for the binning,\n    // can use the data TH1 for that\n    RooParametricHist p_bkg(\"bkg_SR\", \"Background PDF in signal region\",met,bkg_SR_bins,data_th1);\n    // Always include a _norm term which should be the sum of the yields (thats how combine likes to play with pdfs)\n    RooAddition p_bkg_norm(\"bkg_SR_norm\",\"Total Number of events from background in signal region\",bkg_SR_bins);\n\n    // Every signal region needs a signal\n    TH1F signal_th1(\"signal_SR\",\"Signal expected in signal region\",nbins,xbins);\n\n    signal_th1.SetBinContent(1,1);\n    signal_th1.SetBinContent(2,2);\n    signal_th1.SetBinContent(3,3);\n    signal_th1.SetBinContent(4,8);\n    RooDataHist signal_hist(\"signal\",\"Data observed\",vars,&signal_th1);\n    wspace.import(signal_hist);\n\n    // -------------------------------------------------------------------------------------------------------------//\n    // ---------------------------- CONTROL REGION -----------------------------------------------------------------//\n    TH1F data_CRth1(\"data_obs_CR\",\"Data observed in control region\",nbins,xbins);\n\n    data_CRth1.SetBinContent(1,200);\n    data_CRth1.SetBinContent(2,100);\n    data_CRth1.SetBinContent(3,50);\n    data_CRth1.SetBinContent(4,20);\n\n    RooDataHist 
data_CRhist(\"data_obs_CR\",\"Data observed\",vars,&data_CRth1);\n    wspace.import(data_CRhist);\n\n    // This time, the background process will be dependent on the yields of the background in the signal region.\n    // The transfer factor TF must account for acceptance/efficiency etc differences in the signal to control\n    // In this example lets assume the control region is populated by the same process decaying to clean daughters with 2xBR\n    // compared to the signal region\n\n    // NB You could have a different transfer factor for each bin represented by a completely different RooRealVar\n\n    // We can imagine that the transfer factor could be associated with some uncertainty - lets say a 1% uncertainty due to efficiency and 2% due to acceptance.\n    // We need to make these nuisance parameters ourselves and give them a nominal value of 0\n\n\n    RooRealVar efficiency(\"efficiency\", \"efficiency nuisance parameter\",0);\n    RooRealVar acceptance(\"acceptance\", \"acceptance nuisance parameter\",0);\n\n    // We would need to make the transfer factor a function of those too. Here we've assumed Log-normal effects (i.e the same as putting lnN in the CR datacard)\n    // but note that we could use any function which could be used to parameterise the effect - eg if the systematic is due to some alternate template, we could\n    // use polynomials for example.\n\n\n    RooFormulaVar TF(\"TF\",\"Trasnfer factor\",\"2*TMath::Power(1.01,@0)*TMath::Power(1.02,@1)\",RooArgList(efficiency,acceptance) );\n\n    // Finally, we need to make each bin of the background in the control region a function of the background in the signal and the transfer factor\n    // N_CR = N_SR x TF\n\n    RooFormulaVar CRbin1(\"bkg_CR_bin1\",\"Background yield in control region, bin 1\",\"@0*@1\",RooArgList(TF,bin1));\n    RooFormulaVar CRbin2(\"bkg_CR_bin2\",\"Background yield in control region, bin 2\",\"@0*@1\",RooArgList(TF,bin2));\n    RooFormulaVar CRbin3(\"bkg_CR_bin3\",\"Background yield in control region, bin 3\",\"@0*@1\",RooArgList(TF,bin3));\n    RooFormulaVar CRbin4(\"bkg_CR_bin4\",\"Background yield in control region, bin 4\",\"@0*@1\",RooArgList(TF,bin4));\n\n    RooArgList bkg_CR_bins;\n    bkg_CR_bins.add(CRbin1);\n    bkg_CR_bins.add(CRbin2);\n    bkg_CR_bins.add(CRbin3);\n    bkg_CR_bins.add(CRbin4);\n    RooParametricHist p_CRbkg(\"bkg_CR\", \"Background PDF in control region\",met,bkg_CR_bins,data_th1);\n    RooAddition p_CRbkg_norm(\"bkg_CR_norm\",\"Total Number of events from background in control region\",bkg_CR_bins);\n    // -------------------------------------------------------------------------------------------------------------//\n\n\n    // we can also use the standard interpolation from combine by providing alternative shapes (as RooDataHists)\n    // here we're adding two of them (JES and ISR)\n    TH1F background_up(\"tbkg_CR_JESUp\",\"\",nbins,xbins);\n    background_up.SetBinContent(1,CRbin1.getVal()*1.01);\n    background_up.SetBinContent(2,CRbin2.getVal()*1.02);\n    background_up.SetBinContent(3,CRbin3.getVal()*1.03);\n    background_up.SetBinContent(4,CRbin4.getVal()*1.04);\n    RooDataHist bkg_CRhist_sysUp(\"bkg_CR_JESUp\",\"Bkg sys up\",vars,&background_up);\n    wspace.import(bkg_CRhist_sysUp);\n\n    TH1F background_down(\"bkg_CR_JESDown\",\"\",nbins,xbins);\n    background_down.SetBinContent(1,CRbin1.getVal()*0.90);\n    background_down.SetBinContent(2,CRbin2.getVal()*0.98);\n    background_down.SetBinContent(3,CRbin3.getVal()*0.97);\n    
background_down.SetBinContent(4,CRbin4.getVal()*0.96);\n    RooDataHist bkg_CRhist_sysDown(\"bkg_CR_JESDown\",\"Bkg sys down\",vars,&background_down);\n    wspace.import(bkg_CRhist_sysDown);\n\n    TH1F background_2up(\"tbkg_CR_ISRUp\",\"\",nbins,xbins);\n    background_2up.SetBinContent(1,CRbin1.getVal()*0.85);\n    background_2up.SetBinContent(2,CRbin2.getVal()*0.9);\n    background_2up.SetBinContent(3,CRbin3.getVal()*0.95);\n    background_2up.SetBinContent(4,CRbin4.getVal()*0.99);\n    RooDataHist bkg_CRhist_sys2Up(\"bkg_CR_ISRUp\",\"Bkg sys 2up\",vars,&background_2up);\n    wspace.import(bkg_CRhist_sys2Up);\n\n    TH1F background_2down(\"bkg_CR_ISRDown\",\"\",nbins,xbins);\n    background_2down.SetBinContent(1,CRbin1.getVal()*1.15);\n    background_2down.SetBinContent(2,CRbin2.getVal()*1.1);\n    background_2down.SetBinContent(3,CRbin3.getVal()*1.05);\n    background_2down.SetBinContent(4,CRbin4.getVal()*1.01);\n    RooDataHist bkg_CRhist_sys2Down(\"bkg_CR_ISRDown\",\"Bkg sys 2down\",vars,&background_2down);\n    wspace.import(bkg_CRhist_sys2Down);\n\n    // import the pdfs\n    wspace.import(p_bkg);\n    wspace.import(p_bkg_norm,RooFit::RecycleConflictNodes());\n    wspace.import(p_CRbkg);\n    wspace.import(p_CRbkg_norm,RooFit::RecycleConflictNodes());\n    fOut->cd();\n    wspace.Write();\n\n    // Clean up\n    fOut->Close();\n    fOut->Delete();\n\n\n}\n

We will now discuss what the script is doing. First, the observable for the search is the missing energy, so we create a parameter to represent this observable.

   RooRealVar met(\"met\",\"E_{T}^{miss}\",xmin,xmax);\n

The following lines create a freely floating parameter for each of our bins (in this example, there are only 4 bins, defined for our observable met).

   RooRealVar bin1(\"bkg_SR_bin1\",\"Background yield in signal region, bin 1\",100,0,500);\n   RooRealVar bin2(\"bkg_SR_bin2\",\"Background yield in signal region, bin 2\",50,0,500);\n   RooRealVar bin3(\"bkg_SR_bin3\",\"Background yield in signal region, bin 3\",25,0,500);\n   RooRealVar bin4(\"bkg_SR_bin4\",\"Background yield in signal region, bin 4\",10,0,500);\n\n   RooArgList bkg_SR_bins;\n   bkg_SR_bins.add(bin1);\n   bkg_SR_bins.add(bin2);\n   bkg_SR_bins.add(bin3);\n   bkg_SR_bins.add(bin4);\n

They are put into a list so that we can create a RooParametricHist and its normalisation from that list.

  RooParametricHist p_bkg(\"bkg_SR\", \"Background PDF in signal region\",met,bkg_SR_bins,data_th1);\n\n  RooAddition p_bkg_norm(\"bkg_SR_norm\",\"Total Number of events from background in signal region\",bkg_SR_bins);\n

For the control region, the background process will be dependent on the yields of the background in the signal region through a transfer factor. The transfer factor TF must account for differences in acceptance, efficiency, etc. between the signal region and the control region.

In this example we will assume the control region is populated by the same process decaying to a different final state with twice as large branching fraction as the one in the signal region.

We could imagine that the transfer factor could be associated with some uncertainty - for example a 1% uncertainty due to efficiency and a 2% uncertainty due to acceptance differences. We need to make nuisance parameters ourselves to model this, and give them a nominal value of 0.

   RooRealVar efficiency(\"efficiency\", \"efficiency nuisance parameter\",0);\n   RooRealVar acceptance(\"acceptance\", \"acceptance nuisance parameter\",0);\n

We need to make the transfer factor a function of these parameters, since variations in these uncertainties will lead to variations of the transfer factor. Here we have assumed Log-normal effects (i.e the same as putting lnN in the CR datacard), but we could use any function which could be used to parameterize the effect - for example if the systematic uncertainty is due to some alternate template, we could use polynomials.

   RooFormulaVar TF(\"TF\",\"Trasnfer factor\",\"2*TMath::Power(1.01,@0)*TMath::Power(1.02,@1)\",RooArgList(efficiency,acceptance) );\n

Then, we need to make each bin of the background in the control region a function of the background in the signal region and the transfer factor, i.e. \\(N_{CR} = N_{SR} \\times TF\\).

   RooFormulaVar CRbin1(\"bkg_CR_bin1\",\"Background yield in control region, bin 1\",\"@0*@1\",RooArgList(TF,bin1));\n   RooFormulaVar CRbin2(\"bkg_CR_bin2\",\"Background yield in control region, bin 2\",\"@0*@1\",RooArgList(TF,bin2));\n   RooFormulaVar CRbin3(\"bkg_CR_bin3\",\"Background yield in control region, bin 3\",\"@0*@1\",RooArgList(TF,bin3));\n   RooFormulaVar CRbin4(\"bkg_CR_bin4\",\"Background yield in control region, bin 4\",\"@0*@1\",RooArgList(TF,bin4));\n

As before, we also need to create the RooParametricHist for this process in the control region but this time the bin yields will be the RooFormulaVars we just created instead of freely floating parameters.

   RooArgList bkg_CR_bins;\n   bkg_CR_bins.add(CRbin1);\n   bkg_CR_bins.add(CRbin2);\n   bkg_CR_bins.add(CRbin3);\n   bkg_CR_bins.add(CRbin4);\n\n   RooParametricHist p_CRbkg(\"bkg_CR\", \"Background PDF in control region\",met,bkg_CR_bins,data_th1);\n   RooAddition p_CRbkg_norm(\"bkg_CR_norm\",\"Total Number of events from background in control region\",bkg_CR_bins);\n

Finally, we can also create alternative shape variations (Up/Down) that can be fed to Combine as we do with TH1 or RooDataHist type workspaces. These need to be of type RooDataHist. The example below is for a Jet Energy Scale type shape uncertainty.

   TH1F background_up(\"tbkg_CR_JESUp\",\"\",nbins,xbins);\n   background_up.SetBinContent(1,CRbin1.getVal()*1.01);\n   background_up.SetBinContent(2,CRbin2.getVal()*1.02);\n   background_up.SetBinContent(3,CRbin3.getVal()*1.03);\n   background_up.SetBinContent(4,CRbin4.getVal()*1.04);\n   RooDataHist bkg_CRhist_sysUp(\"bkg_CR_JESUp\",\"Bkg sys up\",vars,&background_up);\n   wspace.import(bkg_CRhist_sysUp);\n\n   TH1F background_down(\"bkg_CR_JESDown\",\"\",nbins,xbins);\n   background_down.SetBinContent(1,CRbin1.getVal()*0.90);\n   background_down.SetBinContent(2,CRbin2.getVal()*0.98);\n   background_down.SetBinContent(3,CRbin3.getVal()*0.97);\n   background_down.SetBinContent(4,CRbin4.getVal()*0.96);\n   RooDataHist bkg_CRhist_sysDown(\"bkg_CR_JESDown\",\"Bkg sys down\",vars,&background_down);\n   wspace.import(bkg_CRhist_sysDown);\n

Below are datacards (for signal and control regions) which can be used in conjunction with the workspace built above. In order to \"use\" the control region, simply combine the two cards as usual using combineCards.py.

Show Signal Region Datacard
Signal Region Datacard -- signal category\n\nimax * number of bins\njmax * number of processes minus 1\nkmax * number of nuisance parameters\n\n---\n\nshapes data_obs signal param_ws.root wspace:data_obs_SR\nshapes background signal param_ws.root wspace:bkg_SR # the background model pdf which is freely floating, note other backgrounds can be added as usual\nshapes signal signal param_ws.root wspace:signal\n\n---\n\nbin signal\nobservation -1\n\n---\n\n# background rate must be taken from _norm param x 1\n\nbin signal signal\nprocess background signal\nprocess 1 0\nrate 1 -1\n\n---\n\n# Normal uncertainties in the signal region\n\nlumi_8TeV lnN - 1.026\n\n# free floating parameters, we do not need to declare them, but its a good idea to\n\nbkg_SR_bin1 flatParam\nbkg_SR_bin2 flatParam\nbkg_SR_bin3 flatParam\nbkg_SR_bin4 flatParam\n
Show Control Region Datacard
\nControl Region Datacard -- control category\n\nimax * number of bins\njmax * number of processes minus 1\nkmax * number of nuisance parameters\n\n---\n\nshapes data_obs control param_ws.root wspace:data_obs_CR\nshapes background control param_ws.root wspace:bkg_CR wspace:bkg_CR_$SYSTEMATIC # the background model pdf which is dependent on that in the SR, note other backgrounds can be added as usual\n\n---\n\nbin control\nobservation -1\n\n---\n\n# background rate must be taken from _norm param x 1\n\nbin control\nprocess background\nprocess 1\nrate 1\n\n---\n\nJES shape 1\nISR shape 1\nefficiency param 0 1\nacceptance param 0 1\n

Note that for the control region, our nuisance parameters appear as param types, so that Combine will correctly constrain them.

If we combine the two cards and fit the result with -M MultiDimFit -v 3 we can see that the parameters that give the rate of background in each bin of the signal region, along with the nuisance parameters and signal strength, are determined by the fit - i.e we have properly included the constraint from the control region, just as with the 1-bin gmN.

\nacceptance = 0.00374312 +/- 0.964632 (limited)\nbkg_SR_bin1 = 99.9922 +/- 5.92062 (limited)\nbkg_SR_bin2 = 49.9951 +/- 4.13535 (limited)\nbkg_SR_bin3 = 24.9915 +/- 2.9267 (limited)\nbkg_SR_bin4 = 9.96478 +/- 2.1348 (limited)\nefficiency = 0.00109195 +/- 0.979334 (limited)\nlumi_8TeV = -0.0025911 +/- 0.994458\nr = 0.00716347 +/- 12.513 (limited)\n\n

The example given here is extremely basic; more complex transfer factors, as well as additional uncertainties, backgrounds, etc. in the cards, are supported as usual.

Danger

If trying to implement parametric uncertainties in this setup (eg on transfer factors) that are correlated with other channels and implemented separately, you MUST normalize the uncertainty effect so that the datacard line can read param name X 1. That is, the uncertainty on this parameter must be 1. Without this, there will be inconsistency with other nuisances of the same name in other channels implemented as shape or lnN.
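
As an illustration only (the names delta_TF and TF_nominal and the 20% effect below are hypothetical and do not appear in the workspace built above), such a normalized parametric uncertainty on a transfer factor can be built with a unit-Gaussian nuisance parameter entering through a RooFormulaVar, so that the corresponding datacard line can read delta_TF param 0 1. A minimal PyROOT sketch:

import ROOT\n\n# unit-Gaussian nuisance parameter, constrained in the datacard as "delta_TF param 0 1"\ndelta_TF = ROOT.RooRealVar('delta_TF', 'TF uncertainty (unit Gaussian)', 0, -5, 5)\nTF_nominal = ROOT.RooRealVar('TF_nominal', 'Nominal transfer factor', 0.1)\n# a 20% effect per unit of delta_TF: the +/-1 sigma variations scale the transfer factor by 1.2 and 1/1.2\nTF = ROOT.RooFormulaVar('TF', 'Transfer factor with uncertainty', '@0*pow(1.2,@1)', ROOT.RooArgList(TF_nominal, delta_TF))\n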

"},{"location":"part3/nonstandard/#look-elsewhere-effect-for-one-parameter","title":"Look-elsewhere effect for one parameter","text":"

In case you see an excess somewhere in your analysis, you can evaluate the look-elsewhere effect (LEE) of that excess. For an explanation of the LEE, take a look at the CMS Statistics Committee Twiki here.

To calculate the look-elsewhere effect for a single parameter (in this case the mass of the resonance), you can follow the instructions below. Note that these instructions assume you have a workspace that is parametric in your resonance mass \\(m\\), otherwise you need to fit each background toy with separate workspaces. We will assume the local significance for your excess is \\(\\sigma\\).

  • Generate background-only toys combine ws.root -M GenerateOnly --toysFrequentist -m 16.5 -t 100 --saveToys --expectSignal=0. The output will be something like higgsCombineTest.GenerateOnly.mH16.5.123456.root.

  • For each toy, calculate the significance for a predefined range (e.g \(m\in [10,35]\) GeV) in steps suitable to the resolution (e.g. 1 GeV). For toy_1 the procedure would be: for i in $(seq 10 35); do combine ws.root -M Significance --redefineSignalPOIs r --freezeParameters MH --setParameters MH=$i -n $i -D higgsCombineTest.GenerateOnly.mH16.5.123456.root:toys/toy_1; done. Calculate the maximum significance over all of these mass points - call this \(\sigma_{max}\).

  • Count how many toys have a maximum significance larger than the local one for your observed excess. This fraction of toys with \\(\\sigma_{max}>\\sigma\\) is the global p-value.
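
A minimal Python sketch of this counting step is shown below; the local significance value (3.0) and the file-naming pattern are hypothetical and assume each toy/mass-point job was labelled with -n so that its output file can be identified:

import glob\nimport ROOT\n\nsigma_local = 3.0   # local significance of the observed excess (example value)\nn_toys, n_exceed = 0, 0\nfor t in range(1, 101):\n    # assumes the jobs were labelled e.g. '-n _toy${t}_m${i}' so each toy/mass point is a separate file\n    files = glob.glob('higgsCombine_toy%d_m*.Significance.*.root' % t)\n    if not files:\n        continue\n    sig_max = 0.0\n    for fname in files:\n        f = ROOT.TFile.Open(fname)\n        tree = f.Get('limit')\n        for entry in tree:\n            # the Significance method stores the significance in the 'limit' branch\n            sig_max = max(sig_max, entry.limit)\n        f.Close()\n    n_toys += 1\n    n_exceed += int(sig_max > sigma_local)\nprint('global p-value = %.3f' % (float(n_exceed) / n_toys))\n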

You can find more tutorials on the LEE here.

"},{"location":"part3/regularisation/","title":"Unfolding & regularization","text":"

This section details how to perform an unfolded cross-section measurement, including regularization, within Combine.

There are many resources available that describe unfolding, including when to use it (or not), and what the common issues surrounding it are. For CMS users, a useful summary is available in the CMS Statistics Committee pages on unfolding. You can also find an overview of unfolding and its usage in Combine in these slides.

The basic idea behind the unfolding technique is to describe the smearing introduced through the reconstruction (e.g. of the particle energy) in a given truth level bin \(x_{i}\) through a linear relationship with the effects in the nearby truth-bins. We can make statements about the probability \(p_{j}\) that an event falling in the truth bin \(x_{i}\) is reconstructed in the bin \(y_{j}\) via the linear relationship,

\\[ y_{obs} = \\tilde{\\boldsymbol{R}}\\cdot x_{true} + b \\]

or, if the truth bins are expressed relative to some particular model, we use the usual signal strength terminology,

\\[ y_{obs} = \\boldsymbol{R}\\cdot \\mu + b \\]

Unfolding aims to find the distribution at truth level \\(x\\), given the observations \\(y\\) at reco-level.

"},{"location":"part3/regularisation/#likelihood-based-unfolding","title":"Likelihood-based unfolding","text":"

Since Combine has access to the full likelihood for any analysis written in the usual datacard format, we will use likelihood-based unfolding throughout - for other approaches, there are many other tools available (eg RooUnfold or TUnfold), which can be used instead.

The benefits of the likelihood-based approach are that,

  • Background subtraction is accounted for directly in the likelihood
  • Systematic uncertainties are accounted for directly during the unfolding as nuisance parameters
  • We can profile the nuisance parameters during the unfolding to make the most of the data available

In practice, one must construct the response matrix and unroll it in the reconstructed bins:

  • First, one derives the truth distribution, e.g. after the generator-level selection only, \\(x_{i}\\).
  • Each reconstructed bin (e.g. each datacard) should describe the contribution from each truth bin - this is how Combine knows about the response matrix \\(\\boldsymbol{R}\\) and folds in the acceptance/efficiency effects as usual.
  • The out-of-acceptance contributions can also be included in the above.
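
As a numerical illustration of the relationship that this unrolling encodes (the response matrix, background, and signal strengths below are made up), the expected reconstructed yields are simply the response matrix folded with the truth-bin signal strengths plus the background:

import numpy as np\n\n# rows = reconstructed bins, columns = truth bins; R[i][j] = expected reco yield in bin i from truth bin j at mu_j = 1\nR = np.array([[50., 10.,  1.],\n              [ 8., 40.,  9.],\n              [ 1.,  7., 30.]])\nb  = np.array([20., 15., 10.])   # expected background per reconstructed bin\nmu = np.array([1.0, 1.2, 0.8])   # signal strengths, one per truth bin (the POIs of the unfolding fit)\n\ny_exp = R.dot(mu) + b            # expected reconstructed yields, to be compared with the observed y_obs\nprint(y_exp)\n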

The model we use for this is then just the usual PhysicsModel:multiSignalModel, where each signal refers to a particular truth level bin. The results can be extracted through a simple maximum-likelihood fit with,

    text2workspace.py -m 125 --X-allow-no-background -o datacard.root datacard.txt\n       -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel --PO map='.*GenBin0.*:r_Bin0[1,-1,20]' --PO map='.*GenBin1.*:r_Bin1[1,-1,20]' --PO map='.*GenBin2.*:r_Bin2[1,-1,20]' --PO map='.*GenBin3.*:r_Bin3[1,-1,20]' --PO map='.*GenBin4.*:r_Bin4[1,-1,20]'\n\n    combine -M MultiDimFit --setParameters=r_Bin0=1,r_Bin1=1,r_Bin2=1,r_Bin3=1,r_Bin4=1 -t -1 -m 125 datacard.root\n    combine -M MultiDimFit --setParameters=r_Bin0=1,r_Bin1=1,r_Bin2=1,r_Bin3=1,r_Bin4=1 -t -1 -m 125 --algo=grid --points=100 -P r_Bin1 --setParameterRanges r_Bin1=0.5,1.5 --floatOtherPOIs=1 datacard.root\n

Notice that one can also perform the so called bin-by-bin unfolding (though it is strongly discouraged, except for testing) with,

    text2workspace.py -m 125 --X-allow-no-background -o datacard.root datacard.txt\n      -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel --PO map='.*RecoBin0.*:r_Bin0[1,-1,20]' --PO map='.*RecoBin1.*:r_Bin1[1,-1,20]' --PO map='.*RecoBin2.*:r_Bin2[1,-1,20]' --PO map='.*RecoBin3.*:r_Bin3[1,-1,20]' --PO map='.*RecoBin4.*:r_Bin4[1,-1,20]'\n

Nuisance parameters can be added to the likelihood function and profiled in the usual way via the datacards. Theory uncertainties on the inclusive cross section are typically not included in unfolded measurements.

The figure below shows a comparison of likelihood-based unfolding and a least-squares based unfolding as implemented in RooUnfold.

Show comparison

"},{"location":"part3/regularisation/#regularization","title":"Regularization","text":"

The main difference with respect to other models with multiple signal contributions is the introduction of Regularization, which is used to stabilize the unfolding process.

An example of unfolding in Combine, with and without regularization, can be found under data/tutorials/regularization.

Running python createWs.py [-r] will create a simple datacard and perform a fit both with and without including regularization.

The simplest way to introduce regularization in the likelihood-based approach is to apply a penalty term, which depends on the values of the truth bins, in the likelihood function (so-called Tikhonov regularization):

\\[ -2\\ln L = -2\\ln L + P(\\vec{x}) \\]

Here, \\(P\\) is a linear operator. There are two different approaches that are supported to construct \\(P\\). If you run python makeModel.py, you will create a more complex datacard with the two regularization schemes implemented. You will need to uncomment the relevant sections of code to activate SVD or TUnfold-type regularization.

Warning

When using any unfolding method with regularization, you must perform studies of the potential bias/coverage properties introduced through the inclusion of regularization, and how strong the associated regularization is. Advice on this can be found in the CMS Statistics Committee pages.

"},{"location":"part3/regularisation/#singular-value-decomposition-svd","title":"Singular Value Decomposition (SVD)","text":"

In the SVD approach - as described in the SVD paper - the penalty term is constructed directly based on the strengths (\\(\\vec{\\mu}=\\{\\mu_{i}\\}_{i=1}^{N}\\)),

\\[ P = \\tau\\left| A\\cdot \\vec{\\mu} \\right|^{2}, \\]

where \\(A\\) is typically the discrete curvature matrix, with

\\[ A = \\begin{bmatrix} 1 & -1 & ... \\\\ 1 & -2 & 1 & ... \\\\ ... \\end{bmatrix} \\]

Penalty terms on the derivatives can also be used. Such a penalty is implemented by adding to the likelihood one constraint for each row of the product \(A\cdot\vec{\mu}\), included as lines in the datacard of the form,

    name constr formula dependents delta\n

where the regularization strength is \(\delta=\frac{1}{\sqrt{\tau}}\) and can either be a fixed value (e.g. by directly putting 0.01) or a modifiable parameter with e.g. delta[0.01].

For example, for 3 bins and a regularization strength of 0.03, the first line would be

    name constr @0-2*@1+@2 r_Bin0,r_Bin1,r_Bin2 0.03\n

Alternative valid syntaxes are

    constr1 constr r_bin0-r_bin1 0.01\n    constr1 constr r_bin0-r_bin1 delta[0.01]\n    constr1 constr r_bin0+r_bin1 r_bin0,r_bin1 0.01\n    constr1 constr r_bin0+r_bin1 {r_bin0,r_bin1} delta[0.01]\n
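
As a numerical illustration of what such constr lines add to the likelihood (the signal-strength values below are made up), the curvature penalty for three truth bins and a regularization strength delta = 0.03 is:

import numpy as np\n\ndelta = 0.03                       # regularization strength used in the constr lines\ntau   = 1.0 / delta**2             # delta = 1/sqrt(tau)\nmu    = np.array([1.1, 0.9, 1.2])  # example signal strengths for 3 truth bins\n\n# discrete curvature: one row per constr line, here the single row giving mu_0 - 2*mu_1 + mu_2\nA = np.array([[1., -2., 1.]])\n\npenalty = tau * np.sum(A.dot(mu)**2)   # P = tau * |A.mu|^2, added to -2 ln L\nprint(penalty)\n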

The figure below shows an example unfolding that uses the \"SVD regularization\" approach, both with the least-squares method (as implemented in RooUnfold) and as a penalty term added to the likelihood in the maximum likelihood approach in Combine.

Show comparison

"},{"location":"part3/regularisation/#tunfold-method","title":"TUnfold method","text":"

The Tikhonov regularization as implemented in TUnfold uses the MC information, or rather the density prediction, as a bias vector. In order to give this information to Combine, a single datacard for each reconstruction-level bin needs to be produced, so that we have access to the proper normalization terms during the minimization. In this case the bias vector is \(\vec{x}_{obs}-\vec{x}_{true}\).

Then one can write a constraint term in the datacard via, for example,

    constr1 constr (r_Bin0-1.)*(shapeSig_GenBin0_RecoBin0__norm+shapeSig_GenBin0_RecoBin1__norm+shapeSig_GenBin0_RecoBin2__norm+shapeSig_GenBin0_RecoBin3__norm+shapeSig_GenBin0_RecoBin4__norm)+(r_Bin2-1.)*(shapeSig_GenBin2_RecoBin0__norm+shapeSig_GenBin2_RecoBin1__norm+shapeSig_GenBin2_RecoBin2__norm+shapeSig_GenBin2_RecoBin3__norm+shapeSig_GenBin2_RecoBin4__norm)-2*(r_Bin1-1.)*(shapeSig_GenBin1_RecoBin0__norm+shapeSig_GenBin1_RecoBin1__norm+shapeSig_GenBin1_RecoBin2__norm+shapeSig_GenBin1_RecoBin3__norm+shapeSig_GenBin1_RecoBin4__norm) {r_Bin0,r_Bin1,r_Bin2,shapeSig_GenBin1_RecoBin0__norm,shapeSig_GenBin0_RecoBin0__norm,shapeSig_GenBin2_RecoBin0__norm,shapeSig_GenBin1_RecoBin1__norm,shapeSig_GenBin0_RecoBin1__norm,shapeSig_GenBin2_RecoBin1__norm,shapeSig_GenBin1_RecoBin2__norm,shapeSig_GenBin0_RecoBin2__norm,shapeSig_GenBin2_RecoBin2__norm,shapeSig_GenBin1_RecoBin3__norm,shapeSig_GenBin0_RecoBin3__norm,shapeSig_GenBin2_RecoBin3__norm,shapeSig_GenBin1_RecoBin4__norm,shapeSig_GenBin0_RecoBin4__norm,shapeSig_GenBin2_RecoBin4__norm} delta[0.03]\n
"},{"location":"part3/runningthetool/","title":"How to run the tool","text":"

The executable Combine provided by the package is used to invoke the tools via the command line. The statistical analysis method, as well as user settings, are also specified on the command line. To see the full list of available options, you can run:

combine --help\n

The option -M is used to choose the statistical evaluation method. There are several groups of statistical methods:

  • Asymptotic likelihood methods:
    • AsymptoticLimits: limits calculated according to the asymptotic formulae in arxiv:1007.1727.
    • Significance: simple profile likelihood approximation, for calculating significances.
  • Bayesian methods:
    • BayesianSimple: performing a classical numerical integration (for simple models only).
    • MarkovChainMC: performing Markov Chain integration, for arbitrarily complex models.
  • Frequentist or hybrid bayesian-frequentist methods:
    • HybridNew: compute modified frequentist limits, significance/p-values and confidence intervals according to several possible prescriptions with toys.
  • Fitting
    • FitDiagnostics: performs maximum likelihood fits to extract the signal rate, and provides diagnostic tools such as pre- and post-fit figures and correlations
    • MultiDimFit: performs maximum likelihood fits and likelihood scans with an arbitrary number of parameters of interest.
  • Miscellaneous other modules that do not compute limits or confidence intervals, but use the same framework:
    • GoodnessOfFit: perform a goodness of fit test for models including shape information. Several GoF tests are implemented.
    • ChannelConsistencyCheck: study the consistency between individual channels in a combination.
    • GenerateOnly: generate random or asimov toy datasets for use as input to other methods

The command help is organized into five parts:

  • The Main options section indicates how to pass the datacard as input to the tool (-d datacardName), how to choose the statistical method (-M MethodName), and how to set the verbosity level -v
  • Under Common statistics options, options common to different statistical methods are given. Examples are --cl, to specify the confidence level (default is 0.95), or -t, to give the number of toy MC extractions required.
  • The Common input-output options section includes, for example, the options to specify the mass hypothesis under study (-m) or to include a specific string in the output filename (--name).
  • Common miscellaneous options.
  • Further method-specific options are available for each method. By passing the method name via the -M option, along with --help, the options for that specific method are shown in addition to the common options.

Not all the available options are discussed in this online documentation; use --help to get the documentation of all options.

"},{"location":"part3/runningthetool/#common-command-line-options","title":"Common command-line options","text":"

There are a number of useful command-line options that can be used to alter the model (or parameters of the model) at run time. The most commonly used, generic options, are:

  • -H: first run a different, faster, algorithm (e.g. the ProfileLikelihood described below) to obtain an approximate indication of the limit, which will allow the precise chosen algorithm to converge more quickly. We strongly recommend using this option when running the MarkovChainMC, HybridNew or FeldmanCousins calculators, unless you know in which range your limit lies and you set this range manually (the default is [0, 20])

  • --rMax, --rMin: manually restrict the range of signal strengths to consider. For Bayesian limits with MCMC, a rule of thumb is that rMax should be 3-5 times the limit (a too small value of rMax will bias your limit towards low values, since you are restricting the integration range, while a too large value will bias you to higher limits)

  • --setParameters name=value[,name2=value2,...] sets the starting values of the parameters, useful e.g. when generating toy MC or when setting the parameters as fixed. This option supports the use of regular expressions by replacing name with rgx{some regular expression}.

  • --setParameterRanges name=min,max[:name2=min2,max2:...] sets the ranges of the parameters (useful e.g. for scans in MultiDimFit, or for Bayesian integration). This option supports the use of regular expressions by replacing name with rgx{some regular expression}.

  • --redefineSignalPOIs name[,name2,...] redefines the set of parameters of interest.

    • If the parameters were constant in the input workspace, they are set to be floating.
    • Nuisance parameters promoted to parameters of interest are removed from the list of nuisances, and thus they are not randomized in methods that randomize nuisances (e.g. HybridNew in non-frequentist mode, or BayesianToyMC, or in toy generation with -t but without --toysFreq). This does not have any impact on algorithms that do not randomize nuisance parameters (e.g. fits, AsymptoticLimits, or HybridNew in frequentist mode) or on algorithms that treat all parameters in the same way (e.g. MarkovChainMC).
    • Note that constraint terms for the nuisances are dropped after promotion to a POI using --redefineSignalPOIs. To produce a likelihood scan for a nuisance parameter using MultiDimFit with --algo grid, you should instead use the --parameters (-P) option, which will not cause the loss of the constraint term when scanning.
    • Parameters of interest of the input workspace that are not selected by this command become unconstrained nuisance parameters, but they are not added to the list of nuisances so they will not be randomized (see above).
  • --freezeParameters name1[,name2,...] Will freeze the parameters with the given names to their set values. This option supports the use of regular expression by replacing name with rgx{some regular expression} for matching to constrained nuisance parameters or var{some regular expression} for matching to any parameter. For example --freezeParameters rgx{CMS_scale_j.*} will freeze all constrained nuisance parameters with the prefix CMS_scale_j, while --freezeParameters var{.*rate_scale} will freeze any parameter (constrained nuisance parameter or otherwise) with the suffix rate_scale.

    • Use the option --freezeParameters allConstrainedNuisances to freeze all nuisance parameters that have a constraint term (i.e not flatParams or rateParams or other freely floating parameters).
    • Similarly, the option --floatParameters name1[,name2,...] sets the parameter(s) floating and also accepts regular expressions.
    • Groups of nuisance parameters (constrained or otherwise), as defined in the datacard, can be frozen using --freezeNuisanceGroups. You can also freeze all nuisances that are not contained in a particular group using a ^ before the group name (--freezeNuisanceGroups=^group_name will freeze everything except nuisance parameters in the group \"group_name\".)
    • All constrained nuisance parameters (not flatParam or rateParam) can be set floating using --floatAllNuisances.

Warning

Note that the floating/freezing options have a priority ordering from lowest to highest as floatParameters < freezeParameters < freezeNuisanceGroups < floatAllNuisances. Options with higher priority will take precedence over those with lower priority.

  • --trackParameters name1[,name2,...] will add a branch to the output tree for each of the named parameters. This option supports the use of regular expressions by replacing name with rgx{some regular expression}

    • The name of the branch will be trackedParam_name.
    • The exact behaviour depends on the method used. For example, when using MultiDimFit with --algo scan, the value of the parameter at each point in the scan will be saved, while for FitDiagnostics, only the value at the end of the fit will be saved.
  • --trackErrors name1[,name2,...] will add a branch to the output tree for the error of each of the named parameters. This option supports the use of regular expressions by replacing name with rgx{some regular expression}

    • The name of the branch will be trackedError_name.
    • The behaviour, in terms of which values are saved, is the same as --trackParameters above.

By default, the data set used by Combine will be the one listed in the datacard. You can tell Combine to use a different data set (for example a toy data set that you generated) by using the option --dataset. The argument should be rootfile.root:workspace:location or rootfile.root:location. In order to use this option, you must first convert your datacard to a binary workspace and use this binary workspace as the input to Combine.

"},{"location":"part3/runningthetool/#generic-minimizer-options","title":"Generic Minimizer Options","text":"

Combine uses its own minimizer class, which is used to steer Minuit (via RooMinimizer), named the CascadeMinimizer. This allows for sequential minimization, which can help in case a particular setting or algorithm fails. The CascadeMinimizer also knows about extra features of Combine such as discrete nuisance parameters.

All of the fits that are performed in Combine's methods use this minimizer. This means that the fits can be tuned using these common options,

  • --cminPoiOnlyFit: First, perform a fit floating only the parameters of interest. This can be useful to find, roughly, where the global minimum is.
  • --cminPreScan: Do a scan before the first minimization.
  • --cminPreFit arg If set to a value N > 0, the minimizer will perform a pre-fit with strategy (N-1), with the nuisance parameters frozen.
    • --cminApproxPreFitTolerance arg: If non-zero, first do a pre-fit with this tolerance (or 10 times the final tolerance, whichever is largest)
    • --cminApproxPreFitStrategy arg: Strategy to use in the pre-fit. The default is strategy 0.
  • --cminDefaultMinimizerType arg: Set the default minimizer type. By default this is set to Minuit2.
  • --cminDefaultMinimizerAlgo arg: Set the default minimizer algorithm. The default algorithm is Migrad.
  • --cminDefaultMinimizerTolerance arg: Set the default minimizer tolerance, the default is 0.1.
  • --cminDefaultMinimizerStrategy arg: Set the default minimizer strategy between 0 (speed), 1 (balance - default), 2 (robustness). The Minuit documentation for this is pretty sparse but in general, 0 means evaluate the function less often, while 2 will waste function calls to get precise answers. An important note is that the Hesse algorithm (for error and correlation estimation) will be run only if the strategy is 1 or 2.
  • --cminFallbackAlgo arg: Provides a list of fallback algorithms, to be used in case the default minimizer fails. You can provide multiple options using the syntax Type[,algo],strategy[:tolerance]: eg --cminFallbackAlgo Minuit2,Simplex,0:0.1 will fall back to the simplex algorithm of Minuit2 with strategy 0 and a tolerance 0.1, while --cminFallbackAlgo Minuit2,1 will use the default algorithm (Migrad) of Minuit2 with strategy 1.
  • --cminSetZeroPoint (0/1): Set the reference of the NLL to 0 when minimizing, this can help faster convergence to the minimum if the NLL itself is large. The default is true (1), set to 0 to turn off.

The allowed combinations of minimizer types and minimizer algorithms are as follows:

| Minimizer type | Minimizer algorithm |
|---|---|
| Minuit | Migrad, Simplex, Combined, Scan |
| Minuit2 | Migrad, Simplex, Combined, Scan |
| GSLMultiMin | ConjugateFR, ConjugatePR, BFGS, BFGS2, SteepestDescent |

You can find details about these in the Minuit2 documentation here.

More of these options can be found in the Cascade Minimizer options section when running --help.

"},{"location":"part3/runningthetool/#output-from-combine","title":"Output from combine","text":"

Most methods will print the results of the computation to the screen. However, in addition, Combine will also produce a root file containing a tree called limit with these results. The name of this file will be of the format,

higgsCombineTest.MethodName.mH$MASS.[word$WORD].root\n

where $WORD is any user defined keyword from the datacard which has been set to a particular value.

A few command-line options can be used to control this output:

  • The option -n allows you to specify part of the name of the root file. e.g. if you pass -n HWW the root file will be called higgsCombineHWW.... instead of higgsCombineTest
  • The option -m allows you to specify the (Higgs boson) mass hypothesis, which gets written in the filename and in the output tree. This simplifies the bookkeeping, as it becomes possible to merge multiple trees corresponding to different (Higgs boson) masses using hadd. Quantities can then be plotted as a function of the mass. The default value is m=120.
  • The option -s can be used to specify the seed (eg -s 12345) used in toy generation. If this option is given, the name of the file will be extended by this seed, eg higgsCombineTest.AsymptoticLimits.mH120.12345.root
  • The option --keyword-value allows you to specify the value of a keyword in the datacard such that $WORD (in the datacard) will be given the value of VALUE in the command --keyword-value WORD=VALUE, eg higgsCombineTest.AsymptoticLimits.mH120.WORDVALUE.12345.root

The output file will contain a TDirectory named toys, which will be empty if no toys are generated (see below for details) and a TTree called limit with the following branches;

| Branch name | Type | Description |
|---|---|---|
| limit | Double_t | Main result of combine run, with method-dependent meaning |
| limitErr | Double_t | Estimated uncertainty on the result |
| mh | Double_t | Value of MH, specified with -m option |
| iToy | Int_t | Toy number identifier if running with -t |
| iSeed | Int_t | Seed specified with -s |
| t_cpu | Float_t | Estimated CPU time for algorithm |
| t_real | Float_t | Estimated real time for algorithm |
| quantileExpected | Float_t | Quantile identifier for methods that calculate expected (quantiles) and observed results (eg conversions from \(\Delta\ln L\) values), with method-dependent meaning. Negative values are reserved for entries that do not relate to quantiles of a calculation, with the default being set to -1 (usually meaning the observed result). |

The value of any user-defined keyword $WORD that is set using --keyword-value as described above will also be included as a branch with type string named WORD. The option can be repeated multiple times for multiple keywords.

In some cases, the precise meanings of the branches will depend on the method being used. In this case, it will be specified in this documentation.
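
As an example of reading this tree back, the observed and expected limits from an AsymptoticLimits output (the file name below just follows the default naming described above; adapt it to your own output) can be printed with a few lines of PyROOT:

import ROOT\n\nf = ROOT.TFile.Open('higgsCombineTest.AsymptoticLimits.mH120.root')  # adapt to your file name\ntree = f.Get('limit')\nfor entry in tree:\n    # quantileExpected = -1 is the observed result; 0.025, 0.16, 0.5, 0.84, 0.975 are the expected quantiles\n    print('quantileExpected = %6.3f : limit = %.4f +/- %.4f' % (entry.quantileExpected, entry.limit, entry.limitErr))\nf.Close()\n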

"},{"location":"part3/runningthetool/#toy-data-generation","title":"Toy data generation","text":"

By default, each of the methods described so far will be run using the observed data as the input. In several cases (as detailed below), it is useful to run the tool using toy datasets, including Asimov data sets.

The option -t is used to tell Combine to first generate one or more toy data sets, which will be used instead of the observed data. There are two versions,

  • -t N with N > 0. Combine will generate N toy datasets from the model and re-run the method once per toy. The seed for the toy generation can be modified with the option -s (use -s -1 for a random seed). The output file will contain one entry in the tree for each of these toys.

  • -t -1 will produce an Asimov data set, in which statistical fluctuations are suppressed. The procedure for generating this Asimov data set depends on the type of analysis you are using. More details are given below.

Warning

The default values of the nuisance parameters (or any parameter) are used to generate the toy. This means that if, for example, you are using parametric shapes and the parameters inside the workspace are set to arbitrary values, those arbitrary values will be used to generate the toy. This behaviour can be modified through the use of the option --setParameters x=value_x,y=value_y..., which will set the values of the parameters (x and y) before toy generation. You can also load a snapshot from a previous fit to set the nuisance parameters to their post-fit values (see below).

The output file will contain the toys (as RooDataSets for the observables, including global observables) in the toys directory if the option --saveToys is provided. If you include this option, the limit TTree in the output will have an entry corresponding to the state of the POI used for the generation of the toy, with the value of quantileExpected set to -2.

The branches that are created by methods like MultiDimFit will not show the values used to generate the toy. If you also want the TTree to show the values of the POIs used to generate the toy, you should add additional branches using the --trackParameters option as described in the common command-line options section above. These branches will behave as expected when adding the option --saveToys.

Warning

For statistical methods that make use of toys (including HybridNew, MarkovChainMC and running with -t N), the results of repeated Combine commands will not be identical when using the datacard as the input. This is due to a feature in the tool that allows one to run concurrent commands that do not interfere with one another. In order to produce reproducible results with toy-based methods, you should first convert the datacard to a binary workspace using text2workspace.py and then use the resulting file as input to the Combine commands

"},{"location":"part3/runningthetool/#asimov-datasets","title":"Asimov datasets","text":"

If you are using either -t -1 or AsymptoticLimits, Combine will calculate results based on an Asimov data set.

  • For counting experiments, the Asimov data set will just be the total number of expected events (given the values of the nuisance parameters and POIs of the model)

  • For shape analyses with templates, the Asimov data set will be constructed as a histogram using the same binning that is defined for your analysis.

  • If your model uses parametric shapes, there are some options as to what Asimov data set to produce. By default, Combine will produce the Asimov data set as a histogram using the binning that is associated with each observable (ie as set using RooRealVar::setBins). If this binning does not exist, Combine will guess a suitable binning - it is therefore best to use RooRealVar::setBins to associate a binning with each observable, even if your data is unbinned, if you intend to use Asimov data sets.

You can also ask Combine to use a Pseudo-Asimov dataset, which is created from many weighted unbinned events.

Setting --X-rtd TMCSO_AdaptivePseudoAsimov=\(\beta\) with \(\beta>0\) will trigger the internal logic that decides whether to produce a Pseudo-Asimov dataset. This logic is as follows (a sketch is given after the list below);

  1. For each observable in your dataset, the number of bins, \\(n_{b}\\) is determined either from the value of RooRealVar::getBins, if it exists, or assumed to be 100.

  2. If \\(N_{b}=\\prod_{b}n_{b}>5000\\), the number of expected events \\(N_{ev}\\) is determined. Note if you are combining multiple channels, \\(N_{ev}\\) refers to the number of expected events in a single channel. The logic is separate for each channel. If \\(N_{ev}/N_{b}<0.01\\) then a Pseudo-Asimov data set is created with the number of events equal to \\(\\beta \\cdot \\mathrm{max}\\{100*N_{ev},1000\\}\\). If \\(N_{ev}/N_{b}\\geq 0.01\\) , then a normal Asimov data set is produced.

  3. If \\(N_{b}\\leq 5000\\) then a normal Asimov data set will be produced

The production of a Pseudo-Asimov data set can be forced by using the option --X-rtd TMCSO_PseudoAsimov=X where X>0 will determine the number of weighted events for the Pseudo-Asimov data set. You should try different values of X, since larger values lead to more events in the Pseudo-Asimov data set, resulting in higher precision. However, in general, the fit will be slower.

You can turn off the internal logic by setting --X-rtd TMCSO_AdaptivePseudoAsimov=0 --X-rtd TMCSO_PseudoAsimov=0, thereby forcing histograms to be generated.

Info

If you set --X-rtd TMCSO_PseudoAsimov=X with X>0 and also turn on --X-rtd TMCSO_AdaptivePseudoAsimov=\\(\\beta\\), with \\(\\beta>0\\), the internal logic will be used, but this time the default will be to generate Pseudo-Asimov data sets, rather than the standard Asimov ones.

"},{"location":"part3/runningthetool/#nuisance-parameter-generation","title":"Nuisance parameter generation","text":"

The default method of handling systematics is to generate random values for the nuisance parameters according to their prior PDFs, centred around their default values (see above), before generating the data. The unconstrained nuisance parameters (eg flatParam or rateParam), or those with flat priors, are not randomized before the data generation. If you wish to also randomize these parameters, you must declare them as flatParam in your datacard and, when running text2workspace, you must add the option --X-assign-flatParam-prior to the command line.

The following options define how the toys will be generated,

  • --toysNoSystematics the nuisance parameters in each toy are not randomized when generating the toy data sets - i.e their nominal values are used to generate the data. Note that for methods which profile (fit) the nuisances, the parameters are still floating when evaluating the likelihood.

  • --toysFrequentist the nuisance parameters in each toy are set to their nominal values which are obtained after first fitting to the observed data, with the POIs fixed, before generating the toy data sets. For evaluating likelihoods, the constraint terms are instead randomized within their PDFs around the post-fit nuisance parameter values.

If you are using toysFrequentist, be aware that the values set by --setParameters will be ignored for the toy generation as the post-fit values will instead be used (except for any parameter that is also a parameter of interest). You can override this behaviour and choose the nominal values for toy generation for any parameter by adding the option --bypassFrequentistFit, which will skip the initial fit to data, or by loading a snapshot (see below).

Warning

For methods such as AsymptoticLimits and HybridNew --LHCmode LHC-limits, the \"nominal\" nuisance parameter values are taken from fits to the data and are, therefore, not \"blind\" to the observed data by default (following the fully frequentist paradigm). See the detailed documentation on these methods for how to run in fully \"blinded\" mode.

"},{"location":"part3/runningthetool/#generate-only","title":"Generate only","text":"

It is also possible to generate the toys first, and then feed them to the methods in Combine. This can be done using -M GenerateOnly --saveToys. The toys can then be read and used with the other methods by specifying --toysFile=higgsCombineTest.GenerateOnly... and using the same options for the toy generation.

Warning

Some methods also use toys within the method itself (eg AsymptoticLimits and HybridNew). For these, you should not specify the toy generation with -t or the options above. Instead, you should follow the method-specific instructions.

"},{"location":"part3/runningthetool/#loading-snapshots","title":"Loading snapshots","text":"

Snapshots from workspaces can be loaded and used in order to generate toys using the option --snapshotName <name of snapshot>. This will first set the parameters to the values in the snapshot, before any other parameter options are set and toys are generated.

See the section on saving post-fit workspaces for creating workspaces with post-fit snapshots from MultiDimFit.

Here are a few examples of calculations with toys from post-fit workspaces using a workspace with \\(r, m_{H}\\) as parameters of interest

  • Throw post-fit toy with b from s+b(floating \\(r,m_{H}\\)) fit, s with r=1.0, m=best fit MH, using nuisance parameter values and constraints re-centered on s+b(floating \\(r,m_{H}\\)) fit values (aka frequentist post-fit expected) and compute post-fit expected r uncertainty profiling MH combine higgsCombinemumhfit.MultiDimFit.mH125.root --snapshotName MultiDimFit -M MultiDimFit --verbose 9 -n randomtest --toysFrequentist --bypassFrequentistFit -t -1 --expectSignal=1 -P r --floatOtherPOIs=1 --algo singles

  • Throw post-fit toy with b from s+b(floating \\(r,m_{H}\\)) fit, s with r=1.0, m=128.0, using nuisance parameter values and constraints re-centered on s+b(floating \\(r,m_{H}\\)) fit values (aka frequentist post-fit expected) and compute post-fit expected significance (with MH fixed at 128 implicitly) combine higgsCombinemumhfit.MultiDimFit.mH125.root -m 128 --snapshotName MultiDimFit -M ProfileLikelihood --significance --verbose 9 -n randomtest --toysFrequentist --bypassFrequentistFit --overrideSnapshotMass -t -1 --expectSignal=1 --redefineSignalPOIs r --freezeParameters MH

  • Throw post-fit toy with b from s+b(floating \\(r,m_{H}\\)) fit, s with r=0.0, using nuisance parameter values and constraints re-centered on s+b(floating \\(r,m_{H}\\)) fit values (aka frequentist post-fit expected) and compute post-fit expected and observed asymptotic limit (with MH fixed at 128 implicitly) combine higgsCombinemumhfit.MultiDimFit.mH125.root -m 128 --snapshotName MultiDimFit -M AsymptoticLimits --verbose 9 -n randomtest --bypassFrequentistFit --overrideSnapshotMass--redefineSignalPOIs r --freezeParameters MH

"},{"location":"part3/runningthetool/#combinetool-for-job-submission","title":"combineTool for job submission","text":"

For longer tasks that cannot be run locally, several methods in Combine can be split to run on a batch system or on the Grid. The splitting and submission is handled using the combineTool (see this getting started section to check out the tool)

"},{"location":"part3/runningthetool/#submission-to-condor","title":"Submission to Condor","text":"

The syntax for running on condor with the tool is

combineTool.py -M ALGO [options] --job-mode condor --sub-opts='CLASSADS' --task-name NAME [--dry-run]\n

with options being the usual list of Combine options. The help option -h will give a list of both Combine and combineTool options. It is possible to use this tool with several different methods from Combine.

The --sub-opts option takes a string with the different ClassAds that you want to set, separated by \\n as argument (e.g. '+JobFlavour=\"espresso\"\\nRequestCpus=1').

The --dry-run option will show what will be run without actually doing so / submitting the jobs.

For example, to generate toys (eg for use with limit setting) users running on lxplus at CERN can use the condor mode:

combineTool.py -d workspace.root -M HybridNew --LHCmode LHC-limits --clsAcc 0  -T 2000 -s -1 --singlePoint 0.2:2.0:0.05 --saveHybridResult -m 125 --job-mode condor --task-name condor-test --sub-opts='+JobFlavour=\"tomorrow\"'\n

The --singlePoint option is over-ridden, so that this will produce a script for each value of the POI in the range 0.2 to 2.0 in steps of 0.05. You can merge multiple points into a script using --merge - e.g adding --merge 10 to the above command will mean that each job contains at most 10 of the values. The scripts are labelled by the --task-name option. They will be submitted directly to condor, adding any options in --sub-opts to the condor submit script. Make sure multiple options are separated by \\n. The jobs will run and produce output in the current directory.

Below is an example for splitting points in a multi-dimensional likelihood scan.

"},{"location":"part3/runningthetool/#splitting-jobs-for-a-multi-dimensional-likelihood-scan","title":"Splitting jobs for a multi-dimensional likelihood scan","text":"

The option --split-points issues the command to split the jobs for MultiDimFit when using --algo grid. The following example will split the jobs such that there are 10 points in each of the jobs, which will be submitted to the workday queue.

combineTool.py datacard.txt -M MultiDimFit --algo grid --points 50 --rMin 0 --rMax 1 --job-mode condor --split-points 10 --sub-opts='+JobFlavour=\"workday\"' --task-name mytask -n mytask\n

Remember, any usual options (such as redefining POIs or freezing parameters) are passed to Combine and can be added to the command line for combineTool.

Info

The option -n NAME should be included to avoid overwriting output files, as the jobs will be run inside the directory from which the command is issued.

"},{"location":"part3/runningthetool/#grid-submission-with-combinetool","title":"Grid submission with combineTool","text":"

For more CPU-intensive tasks, for example determining limits for complex models using toys, it is generally not feasible to compute all the results interactively. Instead, these jobs can be submitted to the Grid.

In this example we will use the HybridNew method of Combine to determine an upper limit for a sub-channel of the Run 1 SM \\(H\\rightarrow\\tau\\tau\\) analysis. For full documentation, see the section on computing limits with toys.

With this model it would take too long to find the limit in one go, so instead we create a set of jobs in which each one throws toys and builds up the test statistic distributions for a fixed value of the signal strength. These jobs can then be submitted to a batch system or to the Grid using crab3. From the set of output distributions it is possible to extract the expected and observed limits.

For this we will use combineTool.py

First we need to build a workspace from the \\(H\\rightarrow\\tau\\tau\\) datacard,

$ text2workspace.py data/tutorials/htt/125/htt_mt.txt -m 125\n$ mv data/tutorials/htt/125/htt_mt.root ./\n

To get an idea of the range of signal strength values we will need to build test-statistic distributions for, we will first use the AsymptoticLimits method of Combine,

$ combine -M Asymptotic htt_mt.root -m 125\n << Combine >>\n[...]\n -- AsymptoticLimits (CLs) --\nObserved Limit: r < 1.7384\nExpected  2.5%: r < 0.4394\nExpected 16.0%: r < 0.5971\nExpected 50.0%: r < 0.8555\nExpected 84.0%: r < 1.2340\nExpected 97.5%: r < 1.7200\n

Based on this, a range of 0.2 to 2.0 should be suitable.

We can use the same command for generating the distribution of test statistics with combineTool. The --singlePoint option is now enhanced to support expressions that generate a set of calls to Combine with different values. The accepted syntax is of the form MIN:MAX:STEPSIZE, and multiple comma-separated expressions can be specified.

The script also adds an option --dry-run, which will not actually call Combine but just print out the commands that would be run, e.g,

combineTool.py -M HybridNew -d htt_mt.root --LHCmode LHC-limits --singlePoint 0.2:2.0:0.2 -T 2000 -s -1 --saveToys --saveHybridResult -m 125 --dry-run\n...\n[DRY-RUN]: combine -d htt_mt.root --LHCmode LHC-limits -T 2000 -s -1 --saveToys --saveHybridResult -M HybridNew -m 125 --singlePoint 0.2 -n .Test.POINT.0.2\n[DRY-RUN]: combine -d htt_mt.root --LHCmode LHC-limits -T 2000 -s -1 --saveToys --saveHybridResult -M HybridNew -m 125 --singlePoint 0.4 -n .Test.POINT.0.4\n[...]\n[DRY-RUN]: combine -d htt_mt.root --LHCmode LHC-limits -T 2000 -s -1 --saveToys --saveHybridResult -M HybridNew -m 125 --singlePoint 2.0 -n .Test.POINT.2.0\n

When the --dry-run option is removed each command will be run in sequence.

"},{"location":"part3/runningthetool/#grid-submission-with-crab3","title":"Grid submission with crab3","text":"

Submission to the grid with crab3 works in a similar way. Before doing so, ensure that the crab3 environment has been sourced in addition to the CMSSW environment. We will use the example of generating a grid of test-statistic distributions for limits.

$ cmsenv; source /cvmfs/cms.cern.ch/crab3/crab.sh\n$ combineTool.py -d htt_mt.root -M HybridNew --LHCmode LHC-limits --clsAcc 0 -T 2000 -s -1 --singlePoint 0.2:2.0:0.05 --saveToys --saveHybridResult -m 125 --job-mode crab3 --task-name grid-test --custom-crab custom_crab.py\n

The option --custom-crab should point to a python file containing a function of the form custom_crab(config) that will be used to modify the default crab configuration. You can use this to set the output site to your local grid site, or modify other options such as the voRole, or the site blacklist/whitelist.

For example

def custom_crab(config):\n  print '>> Customising the crab config'\n  config.Site.storageSite = 'T2_CH_CERN'\n  config.Site.blacklist = ['SOME_SITE', 'SOME_OTHER_SITE']\n

Again it is possible to use the option --dry-run to see what the complete crab config will look like before actually submitting it.

Once submitted, the progress can be monitored using the standard crab commands. When all jobs are completed, copy the output from your site's storage element to the local output folder.

$ crab getoutput -d crab_grid-test\n# Now we have to un-tar the output files\n$ cd crab_grid-test/results/\n$ for f in *.tar; do tar xf $f; done\n$ mv higgsCombine*.root ../../\n$ cd ../../\n

These output files should be combined with hadd, after which we invoke Combine as usual to calculate observed and expected limits from the merged grid.

"},{"location":"part3/simplifiedlikelihood/","title":"Procedure for creating and validating simplified likelihood inputs","text":"

This page gives a brief outline of the creation of (potentially aggregated) predictions and their covariance to facilitate external reinterpretation using the simplified likelihood (SL) approach. Instructions for validating the simplified likelihood method (detailed in the CMS note here and \"The Simplified Likelihood Framework\" paper) are also given.

"},{"location":"part3/simplifiedlikelihood/#requirements","title":"Requirements","text":"

You need an up-to-date version of Combine. Note: you should use the latest release of Combine for the exact commands on this page. You should be using Combine tag v9.0.0 or higher, or the latest version of the 112x branch, to follow these instructions.

You will find the python scripts needed to convert Combine outputs into simplified likelihood inputs under test/simplifiedLikelihoods.

If you're using the 102x branch (not recommended), then you can obtain these scripts from here by running:

curl -s https://raw.githubusercontent.com/nucleosynthesis/work-tools/master/sparse-checkout-SL-ssh.sh > checkoutSL.sh\nbash checkoutSL.sh\nls work-tools/stats-tools\n

If you also want to validate your inputs and perform fits/scans using them, you can use the package SLtools from The Simplified Likelihood Framework paper for this.

git clone https://gitlab.cern.ch/SimplifiedLikelihood/SLtools.git\n
"},{"location":"part3/simplifiedlikelihood/#producing-covariance-for-recasting","title":"Producing covariance for recasting","text":"

Producing the necessary predictions and covariance for recasting varies depending on whether or not control regions are explicitly included in the datacard when running fits. Instructions for cases where the control regions are and are not included are detailed below.

Warning

The instructions below will calculate moments based on the assumption that \\(E[x]=\\hat{x}\\), i.e it will use the maximum likelihood estimators for the yields as the expectation values. If instead you want to use the full definition of the moments, you can run the FitDiagnostics method with the -t option and include --savePredictionsPerToy and remove the other options, which will produce a tree of the toys in the output from which moments can be calculated.

"},{"location":"part3/simplifiedlikelihood/#type-a-control-regions-included-in-datacard","title":"Type A - Control regions included in datacard","text":"

For an example datacard 'datacard.txt' including two signal channels 'Signal1' and 'Signal2', make the workspace including the masking flags

text2workspace.py --channel-masks --X-allow-no-signal --X-allow-no-background datacard.txt -o datacard.root\n

Run the fit making the covariance (output saved as fitDiagnosticsName.root), masking the signal channels. Note that all signal channels must be masked!

combine datacard.root -M FitDiagnostics --saveShapes --saveWithUnc --numToysForShape 2000 --setParameters mask_Signal1=1,mask_Signal2=1 --saveOverall  -n Name\n

Where \"Name\" can be specified by you.

Outputs, including predictions and covariance, will be saved in the folder shapes_fit_b of fitDiagnosticsName.root.

"},{"location":"part3/simplifiedlikelihood/#type-b-control-regions-not-included-in-datacard","title":"Type B - Control regions not included in datacard","text":"

For an example datacard 'datacard.txt' including two signal channels 'Signal1' and 'Signal2', make the workspace

text2workspace.py --X-allow-no-signal --X-allow-no-background datacard.txt -o datacard.root\n

Run the fit making the covariance (output saved as fitDiagnosticsName.root), setting no pre-fit signal contribution. Note we must set --preFitValue 0 in this case since we will be using the pre-fit uncertainties for the covariance calculation and we do not want to include the uncertainties on the signal.

combine datacard.root -M FitDiagnostics --saveShapes --saveWithUnc --numToysForShape 2000 --saveOverall --preFitValue 0   -n Name\n

Where \"Name\" can be specified by you.

Outputs, including predictions and covariance, will be saved in the folder shapes_prefit of fitDiagnosticsName.root.

In order to also extract the signal yields corresponding to r=1 (in case you want to run the validation step later), you also need to produce a second file with the pre-fit value set to 1. For this you do not need to run many toys. To save time you can set --numToysForShape to a low value.

combine datacard.root -M FitDiagnostics --saveShapes --saveWithUnc --numToysForShape 1 --saveOverall --preFitValue 1   -n Name2\n

You should check that the order of the bins in the covariance matrix is as expected.
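
One way to perform this check is to print the axis bin labels of the covariance histogram with PyROOT. The histogram path used below (shapes_prefit/total_covar; shapes_fit_b for a Type A fit) is an assumption; use .ls inside ROOT to confirm the actual name in your fitDiagnostics file:

import ROOT\n\nf = ROOT.TFile.Open('fitDiagnosticsName.root')\n# the path 'shapes_prefit/total_covar' is an assumption - check the contents of your file with f.ls()\ncov = f.Get('shapes_prefit/total_covar')\nfor i in range(1, cov.GetNbinsX() + 1):\n    # the axis labels indicate which channel/bin each row of the covariance corresponds to\n    print(i, cov.GetXaxis().GetBinLabel(i))\nf.Close()\n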

"},{"location":"part3/simplifiedlikelihood/#produce-simplified-likelihood-inputs","title":"Produce simplified likelihood inputs","text":"

Head over to the test/simplifiedLikelihoods directory inside your Combine area. The following instructions depend on whether you are aggregating or not aggregating your signal regions. Choose the instructions for your case.

"},{"location":"part3/simplifiedlikelihood/#not-aggregating","title":"Not Aggregating","text":"

Run the makeLHInputs.py script to prepare the inputs for the simplified likelihood. The filter flag can be used to select only signal regions based on the channel names. To include all channels do not include the filter flag.

The SL input must NOT include any control regions that were not masked in the fit.

If your analysis is Type B (i.e everything in the datacard is a signal region), then you can just run

python makeLHInputs.py -i fitDiagnosticsName.root -o SLinput.root \n

If necessary (i.e as in Type B analyses) you may also need to run the same on the output of the run where the pre-fit value was set to 1.

python makeLHInputs.py -i fitDiagnosticsName2.root -o SLinput2.root \n

If you instead have a Type A analysis (some of the regions are control regions that were used to fit but not masked) then you should add the option --filter SignalName where SignalName is some string that defines the signal regions in your datacards (for example, \"SR\" is a common name for these).

Note: If your signal regions cannot be easily identified by a string, follow the instructions below for aggregating, but define only one channel for each aggregate region. This will maintain the full information and will not actually aggregate any regions.

"},{"location":"part3/simplifiedlikelihood/#aggregating","title":"Aggregating","text":"

If aggregating based on covariance, edit the config file aggregateCFG.py to define aggregate regions based on channel names. Note that wildcards are supported. You can then make likelihood inputs using

python makeLHInputs.py -i fitDiagnosticsName.root -o SLinput.root --config aggregateCFG.py\n

At this point you have the inputs as ROOT files necessary to publish and run the simplified likelihood.

"},{"location":"part3/simplifiedlikelihood/#validating-the-simplified-likelihood-approach","title":"Validating the simplified likelihood approach","text":"

The simplified likelihood relies on several assumptions (detailed in the documentation at the top). To test the validity for your analysis, statistical results between Combine and the simplified likelihood can be compared.

We will use the package SLtools from the Simplified Likelihood Paper for this. The first step is to convert the ROOT files into python configs to run in the tool.

"},{"location":"part3/simplifiedlikelihood/#convert-root-to-python","title":"Convert ROOT to Python","text":"

If you followed the steps above, you have all of the histograms already necessary to generate the python configs. The script test/simplifiedLikelihoods/convertSLRootToPython.py can be used to do the conversion. Just provide the following options when running with python.

  • -O/--outname : The output python file containing the model (default is test.py)
  • -s/--signal : The signal histogram, should be of format file.root:location/to/histogram
  • -b/--background : The background histogram, should be of format file.root:location/to/histogram
  • -d/--data : The data TGraph, should be of format file.root:location/to/graph
  • -c/--covariance : The covariance TH2 histogram, should be of format file.root:location/to/histogram

For example, to get the correct output from a Type B analysis with no aggregating, you can run

python test/simplifiedLikelihoods/convertSLRootToPython.py -O mymodel.py -s SLinput.root:shapes_prefit/total_signal  -b SLinput.root:shapes_prefit/total_M1 -d SLinput.root:shapes_prefit/total_data -c SLinput.root:shapes_prefit/total_M2\n

The output will be a python file with the right format for the SL tool. You can mix different ROOT files for these inputs. Note that the SLtools package also has some tools to convert .yaml-based inputs into the python config for you.

"},{"location":"part3/simplifiedlikelihood/#run-a-likelihood-scan-with-the-sl","title":"Run a likelihood scan with the SL","text":"

If you have checked out the SLtools, you can create a simple python script as the one below to produce a scan of the simplified likelihood from your inputs.

#! /usr/bin/env python\nimport simplike as sl\n\nexec(open(\"mymodel.py\").read())\nslp1 = sl.SLParams(background, covariance, obs=data, sig=signal)\n\nimport numpy as np\nnpoints = 50\nmus = np.arange(-0.5, 2, (2+0.5)/npoints)\ntmus1 = [slp1.tmu(mu) for mu in mus]\nfrom matplotlib import pyplot as plt\nplt.plot(mus,tmus1)\nplt.show()\n

Where the mymodel.py config is a simple python file defined as;

  • data : A python array of observed data, one entry per bin.
  • background : A python array of expected background, one entry per bin.
  • covariance : A python array of the covariance between expected backgrounds. The format is a flat array which is converted into a 2D array inside the tool
  • signal : A python array of the expected signal, one entry per bin. This should be replaced with whichever signal model you are testing.

This mymodel.py can also just be the output of the previous section, converted from the ROOT files for you.

The example below is from the note CMS-NOTE-2017-001

Show example
\nimport numpy\nimport array\n\nname = \"CMS-NOTE-2017-001 dummy model\"\nnbins = 8\ndata = array.array('d',[1964,877,354,182,82,36,15,11])\nbackground = array.array('d',[2006.4,836.4,350.,147.1,62.0,26.2,11.1,4.7])\nsignal = array.array('d',[47,29.4,21.1,14.3,9.4,7.1,4.7,4.3])\ncovariance = array.array('d', [ 18774.2, -2866.97, -5807.3, -4460.52, -2777.25, -1572.97, -846.653, -442.531, -2866.97, 496.273, 900.195, 667.591, 403.92, 222.614, 116.779, 59.5958, -5807.3, 900.195, 1799.56, 1376.77, 854.448, 482.435, 258.92, 134.975, -4460.52, 667.591, 1376.77, 1063.03, 664.527, 377.714, 203.967, 106.926, -2777.25, 403.92, 854.448, 664.527, 417.837, 238.76, 129.55, 68.2075, -1572.97, 222.614, 482.435, 377.714, 238.76, 137.151, 74.7665, 39.5247, -846.653, 116.779, 258.92, 203.967, 129.55, 74.7665, 40.9423, 21.7285, -442.531, 59.5958, 134.975, 106.926, 68.2075, 39.5247, 21.7285, 11.5732])\n"},{"location":"part3/simplifiedlikelihood/#example-using-tutorial-datacard","title":"Example using tutorial datacard","text":"

For this example, we will use the tutorial datacard data/tutorials/longexercise/datacard_part3.txt. This datacard is of Type B since there are no control regions (all regions are signal regions).

\n

First, we will create the binary file (run text2workspace)

\n
text2workspace.py --X-allow-no-signal --X-allow-no-background data/tutorials/longexercise/datacard_part3.txt  -m 200\n
\n

And next, we will generate the covariance between the bins of the background model.

combine data/tutorials/longexercise/datacard_part3.root -M FitDiagnostics --saveShapes --saveWithUnc --numToysForShape 10000 --saveOverall --preFitValue 0   -n SimpleTH1 -m 200\n\ncombine data/tutorials/longexercise/datacard_part3.root -M FitDiagnostics --saveShapes --saveWithUnc --numToysForShape 1 --saveOverall --preFitValue 1   -n SimpleTH1_Signal1 -m 200\n

We will also want to compare our scan to that from the full likelihood, which we can get as usual from Combine.

combine -M MultiDimFit data/tutorials/longexercise/datacard_part3.root --rMin -0.5 --rMax 2 --algo grid -n SimpleTH1 -m 200\n

Next, since we do not plan to aggregate any of the bins, we will follow the instructions for this and pick out the right covariance matrix.

python test/simplifiedLikelihoods/makeLHInputs.py -i fitDiagnosticsSimpleTH1.root -o SLinput.root \n\npython test/simplifiedLikelihoods/makeLHInputs.py -i fitDiagnosticsSimpleTH1_Signal1.root -o SLinput_Signal1.root \n

We now have everything we need to provide the simplified likelihood inputs:

$ root -l SLinput.root\nroot [0] .ls\n\nAttaching file SLinput.root as _file0...\n(TFile *) 0x3667820\nroot [1] .ls\nTFile**         SLinput.root\n TFile*         SLinput.root\n  KEY: TDirectoryFile   shapes_fit_b;1  shapes_fit_b\n  KEY: TDirectoryFile   shapes_prefit;1 shapes_prefit\n  KEY: TDirectoryFile   shapes_fit_s;1  shapes_fit_s\n

We can convert this to a python module that we can use to run a scan with the SLtools package. Note, since we have a Type B datacard, we will be using the pre-fit covariance matrix. Also, this means we want to take the signal from the file where the prefit value of r was 1.

python test/simplifiedLikelihoods/convertSLRootToPython.py -O mymodel.py -s SLinput_Signal1.root:shapes_prefit/total_signal  -b SLinput.root:shapes_prefit/total_M1 -d SLinput.root:shapes_prefit/total_data -c SLinput.root:shapes_prefit/total_M2\n

We can compare the profiled likelihood scans from our simplified likelihood (using the python file we just created) and from the full likelihood (that we created with Combine). For the former, we first need to check out the SLtools package:

git clone https://gitlab.cern.ch/SimplifiedLikelihood/SLtools.git\nmv higgsCombineSimpleTH1.MultiDimFit.mH200.root SLtools/ \nmv mymodel.py SLtools/\ncd SLtools\n

The script below will create a plot of the comparison for us.

#! /usr/bin/env python\nimport simplike as sl\n\nexec(open(\"mymodel.py\").read())\n\nslp1 = sl.SLParams(background, covariance, obs=data, sig=signal)\n\nimport ROOT \nfi = ROOT.TFile.Open(\"higgsCombineSimpleTH1.MultiDimFit.mH200.root\")\ntr = fi.Get(\"limit\")\n\npoints = []\nfor i in range(tr.GetEntries()):\n  tr.GetEntry(i)\n  points.append([tr.r,2*tr.deltaNLL])\npoints.sort()\n\nmus2=[pt[0] for pt in points]\ntmus2=[pt[1] for pt in points]\n\nimport numpy as np\nnpoints = 50\nmus1 = np.arange(-0.5, 2, (2+0.5)/npoints)\ntmus1 = [slp1.tmu(mu) for mu in mus1]\n\nfrom matplotlib import pyplot as plt\nplt.plot(mus1,tmus1,label='simplified likelihood')\nplt.plot(mus2,tmus2,label='full likelihood')\nplt.legend()\nplt.xlabel(\"$\\mu$\")\nplt.ylabel(\"$-2\\Delta \\ln L$\")\n\nplt.savefig(\"compareLH.pdf\")\n

This will produce a figure like the one below.


It is also possible to include the third moment of each bin to improve the precision of the simplified likelihood [ JHEP 64 2019 ]. The necessary information is stored in the outputs from Combine, so you just need to add the option -t SLinput.root:shapes_prefit/total_M3 to the options list for convertSLRootToPython.py to include it in the model file. The third-moment information can then be used in SLtools via sl.SLParams(background, covariance, third_moment, obs=data, sig=signal), as sketched below.
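
As a rough sketch of how this could look (assuming mymodel.py was produced with the -t option, so that it also defines a third_moment array; the variable name follows the SLParams call above), the scan from the previous section only needs the one extra argument:

#! /usr/bin/env python\nimport simplike as sl\n\nexec(open('mymodel.py').read())\n\n# third_moment is only available if the model file was produced with the -t option\nslp3 = sl.SLParams(background, covariance, third_moment, obs=data, sig=signal)\n\nimport numpy as np\nnpoints = 50\nmus = np.arange(-0.5, 2, (2+0.5)/npoints)\ntmus3 = [slp3.tmu(mu) for mu in mus]\n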

"},{"location":"part3/validation/","title":"Validating datacards","text":"

This section covers the main features of the datacard validation tool that helps you spot potential problems with your datacards at an early stage. The tool is implemented in the CombineHarvester/CombineTools subpackage. See the combineTool section of the documentation for checkout instructions.

The datacard validation tool contains a number of checks. It is possible to call subsets of these checks when creating datacards within CombineHarvester. However, for now we will only describe the usage of the validation tool on already existing datacards. If you create your datacards with CombineHarvester and would like to include the checks at the datacard creation stage, please contact us via https://cms-talk.web.cern.ch/c/physics/cat/cat-stats/279.

"},{"location":"part3/validation/#how-to-use-the-tool","title":"How to use the tool","text":"

The basic syntax is:

ValidateDatacards.py datacard.txt\n

This will write the results of the checks to a json file (default: validation.json), and will print a summary to the screen, for example:

================================\n=======Validation results=======\n================================\n>>>There were  7800 warnings of type  'up/down templates vary the yield in the same direction'\n>>>There were  5323 warnings of type  'up/down templates are identical'\n>>>There were no warnings of type  'At least one of the up/down systematic uncertainty templates is empty'\n>>>There were  4406 warnings of type  'Uncertainty has normalisation effect of more than 10.0%'\n>>>There were  8371 warnings of type  'Uncertainty probably has no genuine shape effect'\n>>>There were no warnings of type 'Empty process'\n>>>There were no warnings of type 'Bins of the template empty in background'\n>>>INFO: there were  169  alerts of type  'Small signal process'\n

The meaning of each of these warnings/alerts is discussed below.

The following arguments are possible:

usage: ValidateDatacards.py [-h] [--printLevel PRINTLEVEL] [--readOnly]\n                            [--checkUncertOver CHECKUNCERTOVER]\n                            [--reportSigUnder REPORTSIGUNDER]\n                            [--jsonFile JSONFILE] [--mass MASS]\n                            cards\n\npositional arguments:\n  cards                 Specifies the full path to the datacards to check\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --printLevel PRINTLEVEL, -p PRINTLEVEL\n                        Specify the level of info printing (0-3, default:1)\n  --readOnly            If this is enabled, skip validation and only read the\n                        output json\n  --checkUncertOver CHECKUNCERTOVER, -c CHECKUNCERTOVER\n                        Report uncertainties which have a normalization effect\n                        larger than this fraction (default:0.1)\n  --reportSigUnder REPORTSIGUNDER, -s REPORTSIGUNDER\n                        Report signals contributing less than this fraction of\n                        the total in a channel (default:0.001)\n  --jsonFile JSONFILE   Path to the json file to read/write results from\n                        (default:validation.json)\n  --mass MASS           Signal mass to use (default:*)\n

printLevel adjusts how much information is printed to the screen. When set to 0, the results are only written to the json file, but not to the screen. When set to 1 (default), the number of warnings/alerts of a given type is printed to the screen. Setting this option to 2 prints the same information as level 1, and additionally prints which uncertainties are affected (if the check is related to uncertainties) or which processes are affected (if the check is related only to processes). When printLevel is set to 3, the information from level 2 is printed, and additionally for checks related to uncertainties it prints which processes are affected.

To print information to screen, the script parses the json file that contains the results of the validation checks. Therefore, if you have already run the validation tool and produced this json file, you can simply change the printLevel by re-running the tool with printLevel set to a different value, and enabling the --readOnly option.

The options --checkUncertOver and --reportSigUnder will be described in more detail in the section that discusses the checks for which they are relevant.

Note: the --mass argument should only be set if you normally use it when running Combine, otherwise you can leave it at the default.

The datacard validation tool is primarily intended for shape (histogram) based analyses. However, when running on a parametric model or counting experiment the checks for small signal processes, empty processes, and uncertainties with large normalization effects can still be performed.

"},{"location":"part3/validation/#details-on-checks","title":"Details on checks","text":""},{"location":"part3/validation/#uncertainties-with-large-normalization-effect","title":"Uncertainties with large normalization effect","text":"

This check highlights nuisance parameters that have a normalization effect larger than the fraction set by the option --checkUncertOver. The default value is 0.1, meaning that any uncertainties with a normalization effect larger than 10% are flagged up.

The output file contains the following information for this check:

largeNormEff: {\n  <Uncertainty name>: {\n    <analysis category>: {\n      <process>: {\n        \"value_d\":<value>\n        \"value_u\":<value>\n      } \n    }\n  }\n}\n

Where value_u and value_d are the values of the 'up' and 'down' normalization effects.
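
If you prefer to inspect these results programmatically rather than via the printed summary, the json file can be parsed directly. The minimal sketch below assumes the default validation.json file name and the largeNormEff structure shown above:

import json\n\nwith open('validation.json') as f:\n    results = json.load(f)\n\n# Loop over the nuisance/category/process combinations flagged by this check\nfor nuis, cats in results.get('largeNormEff', {}).items():\n    for cat, procs in cats.items():\n        for proc, vals in procs.items():\n            print(nuis, cat, proc, vals['value_u'], vals['value_d'])\n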

"},{"location":"part3/validation/#at-least-one-of-the-updown-systematic-templates-is-empty","title":"At least one of the Up/Down systematic templates is empty","text":"

For shape uncertainties, this check reports all cases where the up and/or down template(s) are empty, when the nominal template is not.

The output file contains the following information for this check:

emptySystematicShape: {\n  <Uncertainty name>: {\n    <analysis category>: {\n      <process>: {\n        \"value_d\":<value>\n        \"value_u\":<value>\n      } \n    }\n  }\n}\n

Where value_u and value_d are the values of the 'up' and 'down' normalization effects.

"},{"location":"part3/validation/#identical-updown-templates","title":"Identical Up/Down templates","text":"

This check applies to shape uncertainties only, and will highlight cases where the shape uncertainties have identical Up and Down templates (identical in shape and in normalization).

The information given in the output file for this check is:

uncertTemplSame: {\n  <Uncertainty name>: {\n    <analysis category>: {\n      <process>: {\n        \"value_d\":<value>\n        \"value_u\":<value>\n      } \n    }\n  }\n}\n

Where value_u and value_d are the values of the 'up' and 'down' normalization effects.

"},{"location":"part3/validation/#up-and-down-templates-vary-the-yield-in-the-same-direction","title":"Up and Down templates vary the yield in the same direction","text":"

Again, this check only applies to shape uncertainties - it highlights cases where the 'Up' template and the 'Down' template both have the effect of increasing or decreasing the normalization of a process.

The information given in the output file for this check is:

uncertVarySameDirect: {\n  <Uncertainty name>: {\n    <analysis category>: {\n      <process>: {\n        \"value_d\":<value>\n        \"value_u\":<value>\n      } \n    }\n  }\n}\n

Where value_u and value_d are the values of the 'up' and 'down' normalization effects.

"},{"location":"part3/validation/#uncertainty-probably-has-no-genuine-shape-effect","title":"Uncertainty probably has no genuine shape effect","text":"

In this check, applying only to shape uncertainties, the normalized nominal templates are compared with the normalized templates for the 'up' and 'down' systematic variations. The script calculates \(\Sigma_i \frac{2|\text{up}(i) - \text{nominal}(i)|}{|\text{up}(i)| + |\text{nominal}(i)|}\) and \(\Sigma_i \frac{2|\text{down}(i) - \text{nominal}(i)|}{|\text{down}(i)| + |\text{nominal}(i)|}\)

where the sums run over all bins in the histograms, and 'nominal', 'up', and 'down' are the central template and up and down varied templates, all normalized.

If both sums are smaller than 0.001, the uncertainty is flagged up as probably not having a genuine shape effect. This means a 0.1% variation in one bin is enough to avoid being reported, but many smaller variations can also sum to be large enough to pass the threshold. It should be noted that the chosen threshold is somewhat arbitrary: if an uncertainty is flagged up as probably having no genuine shape effect you should take this as a starting point to investigate.
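
The sketch below illustrates the two sums for a single process. It is only meant to clarify the definition (the actual implementation lives in the CombineHarvester validation code); both templates are normalized to unit area first, and bins where both templates are empty are skipped to avoid dividing by zero:

import numpy as np\n\ndef shape_diff(nominal, variation):\n    # Normalize both templates to unit area, then sum the bin-by-bin relative differences\n    nom = np.asarray(nominal, dtype=float)\n    nom = nom / nom.sum()\n    var = np.asarray(variation, dtype=float)\n    var = var / var.sum()\n    denom = np.abs(var) + np.abs(nom)\n    mask = denom > 0\n    return np.sum(2.0 * np.abs(var - nom)[mask] / denom[mask])\n\n# Flagged as having no genuine shape effect if both values are below 0.001\nprint(shape_diff([10, 20, 30], [10.1, 20, 29.9]), shape_diff([10, 20, 30], [9.9, 20, 30.1]))\n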

The information given in the output file for this check is:

smallShapeEff: {\n  <Uncertainty name>: {\n    <analysis category>: {\n      <process>: {\n        \"diff_d\":<value>\n        \"diff_u\":<value>\n      } \n    }\n  }\n}\n

Where diff_d and diff_u are the values of the sums described above for the 'down' variation and the 'up' variation.

"},{"location":"part3/validation/#empty-process","title":"Empty process","text":"

If a process is listed in the datacard, but the yield is 0, it is flagged up by this check.

The information given in the output file for this check is:

emptyProcessShape: {\n  <analysis category>: {\n    <process1>,\n    <process2>,\n    <process3>\n  }\n}\n
"},{"location":"part3/validation/#bins-that-have-signal-but-no-background","title":"Bins that have signal but no background","text":"

For shape-based analyses, this checks whether there are any bins in the nominal templates that have signal contributions, but no background contributions.

The information given in the output file for this check is:

emptyBkgBin: {\n  <analysis category>: {\n    <bin_nr1>,\n    <bin_nr2>,\n    <bin_nr3>\n  }\n}\n
"},{"location":"part3/validation/#small-signal-process","title":"Small signal process","text":"

This reports signal processes that contribute less than the fraction specified by --reportSigUnder (default 0.001 = 0.1%) of the total signal in a given category. This produces an alert, not a warning, as it does not hint at a potential problem. However, in analyses with many signal contributions and with long fitting times, it can be helpful to remove signals from a category in which they do not contribute a significant amount.

The information given in the output file for this check is:

smallSignalProc: {\n  <analysis category>: {\n    <process>: {\n      \"sigrate_tot\":<value>\n      \"procrate\":<value>\n    } \n  }\n}\n

Where sigrate_tot is the total signal yield in the analysis category and procrate is the yield of signal process <process>.

"},{"location":"part3/validation/#what-to-do-in-case-of-a-warning","title":"What to do in case of a warning","text":"

These checks are mostly a tool to help you investigate your datacards: a warning does not necessarily mean there is a mistake in your datacard, but you should use it as a starting point to investigate. Empty processes and empty shape uncertainties connected to nonempty processes will most likely be unintended. The same holds for cases where the 'up' and 'down' shape templates are identical. If there are bins that contain signal but no background contributions, this should be corrected. See the FAQ for more information on that point.

For other checks it depends on the situation whether there is a problem or not. Some examples:

  • An analysis-specific nonclosure uncertainty could well be larger than 10%; a theoretical uncertainty in the ttbar normalization probably should not be.
  • In an analysis with a selection that requires the presence of exactly 1 jet, 'up' and 'down' variations in the jet energy uncertainty could both change the process normalization in the same direction. (But they do not have to!)

As always: think about whether you expect a check to yield a warning in case of your analysis, and if not, investigate to make sure there are no issues.

"},{"location":"part4/usefullinks/","title":"Useful links and further reading","text":""},{"location":"part4/usefullinks/#tutorials-and-reading-material","title":"Tutorials and reading material","text":"

There are several tutorials that have been run over the last few years with instructions and examples for running the Combine tool.

Tutorial Sessions:

  • 1st tutorial 17th Nov 2015.
  • 2nd tutorial 30th Nov 2016.
  • 3rd tutorial 29th Nov 2017
  • 4th tutorial 31st Oct 2018 - Latest for 81x-root606 branch.
  • 5th tutorial 2nd-4th Dec 2019
  • 6th tutorial 14th-16th Dec 2020 - Latest for 102x branch
  • 7th tutorial 3rd Feb 2023 - Uses 113x branch

Worked examples from Higgs analyses using Combine:

  • The CMS DAS at CERN 2014
  • The CMS DAS at DESY 2018

Higgs combinations procedures

  • Conventions to be used when preparing inputs for Higgs combinations

  • CMS AN-2011/298 Procedure for the LHC Higgs boson search combination in summer 2011. This describes in more detail some of the methods used in Combine.

"},{"location":"part4/usefullinks/#citations","title":"Citations","text":"

There is currently no document that can be cited for using the Combine tool itself; however, you can cite the following publications for the procedures we use:

  • Summer 2011 public ATLAS-CMS note for any Frequentist limit setting procedures with toys or Bayesian limits, constructing likelihoods, descriptions of nuisance parameter options (like log-normal (lnN) or gamma (gmN)), and for definitions of test statistics.

  • CCGV paper if you use any of the asymptotic approximations (e.g. with -M AsymptoticLimits or -M Significance) for limits/p-values.

  • If you use the Barlow-Beeston approach to MC stat (bin-by-bin) uncertainties, please cite their paper Barlow-Beeston. You should also cite this note if you use the autoMCStats directive to produce a single parameter per bin.

  • If you use shape uncertainties for template (TH1 or RooDataHist) based datacards, you can cite this note from J. Conway.

  • If you are extracting uncertainties from LH scans - i.e. using \(-2\Delta\ln L=1\) etc. for the 1\(\sigma\) intervals - you can cite either the ATLAS+CMS or CMS Higgs paper.

  • There is also a long list of citation recommendations from the CMS Statistics Committee pages.

"},{"location":"part4/usefullinks/#combine-based-packages","title":"Combine based packages","text":"
  • SWGuideHiggs2TauLimits (Deprecated)

  • ATGCRooStats

  • CombineHarvester

"},{"location":"part4/usefullinks/#contacts","title":"Contacts","text":"
  • CMStalk forum: https://cms-talk.web.cern.ch/c/physics/cat/cat-stats/279
"},{"location":"part4/usefullinks/#cms-statistics-committee","title":"CMS Statistics Committee","text":"
  • You can find much more statistics theory and recommendations on various statistical procedures in the CMS Statistics Committee Twiki Pages
"},{"location":"part4/usefullinks/#faq","title":"FAQ","text":"
  • Why does Combine have trouble with bins that have zero expected contents?
    • If you are computing only upper limits, and your zero-prediction bins are all empty in data, then you can just set the background to a very small value instead of zero as the computation is regular for background going to zero (e.g. a counting experiment with \(B\leq1\) will have essentially the same expected limit and observed limit as one with \(B=0\)). If you are computing anything else, e.g. p-values, or if your zero-prediction bins are not empty in data, you're out of luck, and you should find a way to get a reasonable background prediction there (and set an uncertainty on it, as discussed in the next point)
  • How can an uncertainty be added to a zero quantity?
    • You can put an uncertainty even on a zero event yield if you use a gamma distribution. That is in fact the more proper way of doing it if the prediction of zero comes from the limited size of your MC or data sample used to compute it.
  • Why does changing the observation in data affect my expected limit?
    • The expected limit (if using either the default behaviour of -M AsymptoticLimits or using the LHC-limits style limit setting with toys) uses the post-fit expectation of the background model to generate toys. This means that first the model is fit to the observed data before toy generation. See the sections on blind limits and toy generation to avoid this behavior.
  • How can I deal with an interference term which involves a negative contribution?
    • You will need to set up a specific PhysicsModel to deal with this; however, you can see this section to implement such a model that can incorporate a negative contribution to the physics process
  • How does Combine work?
    • That is not a question that can be answered without someone's head exploding; please try to formulate something specific.
  • What does fit status XYZ mean?
    • Combine reports the fit status in some routines (for example in the FitDiagnostics method). These are typically the status of the last call from Minuit. For details on the meanings of these status codes see the Minuit2Minimizer documentation page.
  • Why does my fit not converge?
    • There are several reasons why some fits may not converge. Often some indication can be obtained from the RooFitResult or the fit status information that is printed when using the --verbose X (with \(X>2\)) option. Sometimes, however, it can be that the likelihood for your data is very unusual. You can get a rough idea about what the likelihood looks like as a function of your parameters (POIs and nuisances) using combineTool.py -M FastScan -w myworkspace.root (use --help for options).
    • We have often seen that fits in Combine using RooCBShape as a parametric function will fail. This is related to an optimization that fails. You can try to fix the problem as described in this issue: issues#347 (i.e. add the option --X-rtd ADDNLL_CBNLL=0).
  • Why does the fit/fits take so long?
    • The minimization routines are common to many methods in Combine. You can tune the fits using the generic optimization command line options described here. For example, setting the default minimizer strategy to 0 can greatly improve the speed, since this avoids running HESSE. In calculations such as AsymptoticLimits, HESSE is not needed and hence this can be done, however, for FitDiagnostics the uncertainties and correlations are part of the output, so using strategy 0 may not be particularly accurate.
  • Why are the results for my counting experiment so slow or unstable?
    • There is a known issue with counting experiments with large numbers of events that will cause unstable fits or even the fit to fail. You can avoid this by creating a \"fake\" shape datacard (see this section from the setting up the datacards page). The simplest way to do this is to run combineCards.py -S mycountingcard.txt > myshapecard.txt. You may still find that your parameter uncertainties are not correct when you have large numbers of events. This can often be fixed using the --robustHesse option. An example of this issue is detailed here.
  • Why do some of my nuisance parameters have uncertainties > 1?
    • When running -M FitDiagnostics you may find that the post-fit uncertainties of the nuisances are \\(> 1\\) (or larger than their pre-fit values). If this is the case, you should first check if the same is true when adding the option --minos all, which will invoke MINOS to scan the likelihood as a function of these parameters to determine the crossing at \\(-2\\times\\Delta\\log\\mathcal{L}=1\\) rather than relying on the estimate from HESSE. However, this is not guaranteed to succeed, in which case you can scan the likelihood yourself using MultiDimFit (see here ) and specifying the option --poi X where X is your nuisance parameter.
  • How can I avoid using the data?
    • For almost all methods, you can use toy data (or an Asimov dataset) in place of the real data for your results to be blind. You should be careful however as in some methods, such as -M AsymptoticLimits or -M HybridNew --LHCmode LHC-limits or any other method using the option --toysFrequentist, the data will be used to determine the most likely nuisance parameter values (to determine the so-called a-posteriori expectation). See the section on toy data generation for details on this.
  • What if my nuisance parameters have correlations which are not 0 or 1?
    • Combine is designed under the assumption that each source of nuisance parameter is uncorrelated with the other sources. If you have a case where some pair (or set) of nuisances have some known correlation structure, you can compute the eigenvectors of their correlation matrix and provide these diagonalised nuisances to Combine. You can also model partial correlations, between different channels or data taking periods, of a given nuisance parameter using the combineTool as described in this page.
  • My nuisances are (artificially) constrained and/or the impact plot show some strange behaviour, especially after including MC statistical uncertainties. What can I do?
    • Depending on the details of the analysis, several solutions can be adopted to mitigate these effects. We advise running the validation tools first, to identify possible redundant shape uncertainties that can be safely eliminated or replaced with lnN ones. Any remaining artificial constraints should be studied. Possible mitigating strategies can be to (a) smooth the templates or (b) adopt some rebinning in order to reduce statistical fluctuations in the templates. A description of possible strategies and effects can be found in this talk by Margaret Eminizer
  • What do CLs, CLs+b and CLb in the code mean?
    • The names CLs+b and CLb that are found within some of the RooStats tools are rather outdated and should instead be referred to as p-values - \(p_{\mu}\) and \(1-p_{b}\), respectively (the relation to CLs is given below). We use the CLs criterion (which itself is not a p-value) often in high energy physics as it is designed to avoid excluding a signal model when the sensitivity is low (and protects against excluding due to underfluctuations in the data). Typically, when excluding a signal model the p-value \(p_{\mu}\) refers to the p-value under the signal+background hypothesis, assuming a particular value of the signal strength (\(\mu\)), while \(p_{b}\) is the p-value under the background-only hypothesis. You can find more details and definitions of the CLs criterion and \(p_{\mu}\) and \(p_{b}\) in section 39.4.2.4 of the 2016 PDG review.
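    For reference, with this notation the CLs quantity used in the limit setting is simply the ratio of these two p-values, \(\mathrm{CL_{s}} = p_{\mu} / (1-p_{b})\).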
"},{"location":"part5/longexercise/","title":"Main Features of Combine (Long Exercises)","text":"

This exercise is designed to give a broad overview of the tools available for statistical analysis in CMS using the combine tool. Combine is a high-level tool for building RooFit/RooStats models and running common statistical methods. We will cover the typical aspects of setting up an analysis and producing the results, as well as look at ways in which we can diagnose issues and get a deeper understanding of the statistical model. This is a long exercise - expect to spend some time on it especially if you are new to Combine. If you get stuck while working through this exercise or have questions specifically about the exercise, you can ask them on this mattermost channel. Finally, we also provide some solutions to some of the questions that are asked as part of the exercise. These are available here.

For the majority of this course we will work with a simplified version of a real analysis, which nonetheless has many features of the full analysis. The analysis is a search for an additional heavy neutral Higgs boson decaying to tau lepton pairs. Such a signature is predicted in many extensions of the standard model, in particular the minimal supersymmetric standard model (MSSM). You can read about the analysis in the paper here. The statistical inference makes use of a variable called the total transverse mass (\(M_{\mathrm{T}}^{\mathrm{tot}}\)) that provides good discrimination between the resonant high-mass signal and the main backgrounds, which have a falling distribution in this high-mass region. The events selected in the analysis are split into several categories which target the main di-tau final states as well as the two main production modes: gluon-fusion (ggH) and b-jet associated production (bbH). One example is given below for the fully-hadronic final state in the b-tag category which targets the bbH signal:

Initially we will start with the simplest analysis possible: a one-bin counting experiment using just the high \\(M_{\\mathrm{T}}^{\\mathrm{tot}}\\) region of this distribution, and from there each section of this exercise will expand on this, introducing a shape-based analysis and adding control regions to constrain the backgrounds.

"},{"location":"part5/longexercise/#background","title":"Background","text":"

You can find a presentation with some more background on likelihoods and extracting confidence intervals here. A presentation that discusses limit setting in more detail can be found here. If you are not yet familiar with these concepts, or would like to refresh your memory, we recommend that you have a look at these presentations before you start with the exercise.

"},{"location":"part5/longexercise/#getting-started","title":"Getting started","text":"

We need to set up a new CMSSW area and checkout the Combine package:

cmsrel CMSSW_11_3_4\ncd CMSSW_11_3_4/src\ncmsenv\ngit clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit\ncd HiggsAnalysis/CombinedLimit\n\ncd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit\ngit fetch origin\ngit checkout v9.0.0\n

We will also make use of another package, CombineHarvester, which contains some high-level tools for working with Combine. The following command will download the repository and check out just the parts of it we need for this tutorial:

bash <(curl -s https://raw.githubusercontent.com/cms-analysis/CombineHarvester/main/CombineTools/scripts/sparse-checkout-https.sh)\n

Now make sure the CMSSW area is compiled:

scramv1 b clean; scramv1 b\n

Now we will move to the working directory for this tutorial, which contains all the inputs needed to run the exercises below:

cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/tutorials/longexercise/\n
"},{"location":"part5/longexercise/#part-1-a-one-bin-counting-experiment","title":"Part 1: A one-bin counting experiment","text":"

Topics covered in this section:

  • A: Computing limits using the asymptotic approximation
  • Advanced section: B: Computing limits with toys

We will begin with a simplified version of a datacard from the MSSM \\(\\phi\\rightarrow\\tau\\tau\\) analysis that has been converted to a one-bin counting experiment, as described above. While the full analysis considers a range of signal mass hypotheses, we will start by considering just one: \\(m_{\\phi}\\)=800GeV. Click the text below to study the datacard (datacard_part1.txt in the longexercise directory):

Show datacard
imax    1 number of bins\njmax    4 number of processes minus 1\nkmax    * number of nuisance parameters\n--------------------------------------------------------------------------------\n--------------------------------------------------------------------------------\nbin          signal_region\nobservation  10.0\n--------------------------------------------------------------------------------\nbin                      signal_region   signal_region   signal_region   signal_region   signal_region\nprocess                  ttbar           diboson         Ztautau         jetFakes        bbHtautau\nprocess                  1               2               3               4               0\nrate                     4.43803         3.18309         3.7804          1.63396         0.711064\n--------------------------------------------------------------------------------\nCMS_eff_b          lnN   1.02            1.02            1.02            -               1.02\nCMS_eff_t          lnN   1.12            1.12            1.12            -               1.12\nCMS_eff_t_highpt   lnN   1.1             1.1             1.1             -               1.1\nacceptance_Ztautau lnN   -               -               1.08            -               -\nacceptance_bbH     lnN   -               -               -               -               1.05\nacceptance_ttbar   lnN   1.005           -               -               -               -\nnorm_jetFakes      lnN   -               -               -               1.2             -\nxsec_diboson       lnN   -               1.05            -               -               -\n

The layout of the datacard is as follows:

  • At the top are the numbers imax, jmax and kmax representing the number of bins, processes and nuisance parameters respectively. Here a \"bin\" can refer to a literal single event count as in this example, or a full distribution we are fitting, in general with many histogram bins, as we will see later. We will refer to both as \"channels\" from now on. It is possible to replace these numbers with * and they will be deduced automatically.
  • The first line starting with bin gives a unique label to each channel, and the following line starting with observation gives the number of events observed in data.
  • In the remaining part of the card there are several columns: each one represents one process in one channel. The first four lines labelled bin, process, process and rate give the channel label, the process label, a process identifier (<=0 for signal, >0 for background) and the number of expected events respectively.
  • The remaining lines describe sources of systematic uncertainty. Each line gives the name of the uncertainty, (which will become the name of the nuisance parameter inside our RooFit model), the type of uncertainty (\"lnN\" = log-normal normalisation uncertainty) and the effect on each process in each channel. E.g. a 20% uncertainty on the yield is written as 1.20.
  • It is also possible to add a hash symbol (#) at the start of a line, which Combine will then ignore when it reads the card.

We can now run Combine directly using this datacard as input. The general format for running Combine is:

combine -M [method] [datacard] [additional options...]\n
"},{"location":"part5/longexercise/#a-computing-limits-using-the-asymptotic-approximation","title":"A: Computing limits using the asymptotic approximation","text":"

As we are searching for a signal process that does not exist in the standard model, it's natural to set an upper limit on the cross section times branching fraction of the process (assuming our dataset does not contain a significant discovery of new physics). Combine has dedicated methods for calculating upper limits. The most commonly used one is AsymptoticLimits, which implements the CLs criterion and uses the profile likelihood ratio as the test statistic. As the name implies, the test statistic distributions are determined analytically in the asymptotic approximation, so there is no need for more time-intensive toy throwing and fitting. Try running the following command:

combine -M AsymptoticLimits datacard_part1.txt -n .part1A\n

You should see the results of the observed and expected limit calculations printed to the screen. Here we have added an extra option, -n .part1A, which is short for --name, and is used to label the output file Combine produces, which in this case will be called higgsCombine.part1A.AsymptoticLimits.mH120.root. The file name depends on the options we ran with, and is of the form: higgsCombine[name].[method].mH[mass].root. The file contains a TTree called limit which stores the numerical values returned by the limit computation. Note that in our case we did not set a signal mass when running Combine (i.e. -m 800), so the output file just uses the default value of 120. This does not affect our result in any way though, just the label that is used on the output file.
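
If you want to read these numbers back yourself rather than relying on the screen output, a minimal PyROOT sketch such as the one below can be used. The branches limit and quantileExpected are the ones stored by Combine; quantileExpected is -1 for the observed limit and equal to the corresponding quantile for the expected values:

import ROOT\n\nf = ROOT.TFile.Open('higgsCombine.part1A.AsymptoticLimits.mH120.root')\ntree = f.Get('limit')\nfor entry in tree:\n    # -1 corresponds to the observed limit, 0.025/0.16/0.5/0.84/0.975 to the expected band\n    print(entry.quantileExpected, entry.limit)\n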

The limits are given on a parameter called r. This is the default parameter of interest (POI) that is added to the model automatically. It is a linear scaling of the normalization of all signal processes given in the datacard, i.e. if \\(s_{i,j}\\) is the nominal number of signal events in channel \\(i\\) for signal process \\(j\\), then the normalization of that signal in the model is given as \\(r\\cdot s_{i,j}(\\vec{\\theta})\\), where \\(\\vec{\\theta}\\) represents the set of nuisance parameters which may also affect the signal normalization. We therefore have some choice in the interpretation of r: for the measurement of a process with a well-defined SM prediction we may enter this as the nominal yield in the datacard, such that \\(r=1\\) corresponds to this SM expectation, whereas for setting limits on BSM processes we may choose the nominal yield to correspond to some cross section, e.g. 1 pb, such that we can interpret the limit as a cross section limit directly. In this example the signal has been normalised to a cross section times branching fraction of 1 fb.

The expected limit is given under the background-only hypothesis. The median value under this hypothesis as well as the quantiles needed to give the 68% and 95% intervals are also calculated. These are all the ingredients needed to produce the standard limit plots you will see in many CMS results, for example the \\(\\sigma \\times \\mathcal{B}\\) limits for the \\(\\text{bb}\\phi\\rightarrow\\tau\\tau\\) process:

In this case we only computed the values for one signal mass hypothesis, indicated by a red dashed line.

Tasks and questions:

  • There are some important uncertainties missing from the datacard above. Add the uncertainty on the luminosity (name: lumi_13TeV) which has a 2.5% effect on all processes (except the jetFakes, which are taken from data), and uncertainties on the inclusive cross sections of the Ztautau and ttbar processes (with names xsec_Ztautau and xsec_ttbar) which are 4% and 6% respectively.
  • Try changing the values of some uncertainties (up or down, or removing them altogether) - how do the expected and observed limits change?
  • Now try changing the number of observed events. The observed limit will naturally change, but the expected does too - why might this be?

There are other command line options we can supply to Combine which will change its behaviour when run. You can see the full set of supported options by doing combine -h. Many options are specific to a given method, but others are more general and are applicable to all methods. Throughout this tutorial we will highlight some of the most useful options you may need to use, for example:

  • The range on the signal strength modifier: --rMin=X and --rMax=Y: In RooFit parameters can optionally have a range specified. The implication of this is that their values cannot be adjusted beyond the limits of this range. The min and max values can be adjusted though, and we might need to do this for our POI r if the order of magnitude of our measurement is different from the default range of [0, 20]. This will be discussed again later in the tutorial.
  • Verbosity: -v X: By default combine does not usually produce much output on the screen other than the main result at the end. However, much more detailed information can be printed by setting -v N with N larger than zero. For example at -v 3 the logs from the minimizer, Minuit, will also be printed. These are very useful for debugging problems with the fit.
"},{"location":"part5/longexercise/#advanced-section-b-computing-limits-with-toys","title":"Advanced section: B: Computing limits with toys","text":"

Now we will look at computing limits without the asymptotic approximation, so instead using toy datasets to determine the test statistic distributions under the signal+background and background-only hypotheses. This can be necessary if we are searching for signal in bins with a small number of events expected. In Combine we will use the HybridNew method to calculate limits using toys. This mode is capable of calculating limits with several different test statistics and with fine-grained control over how the toy datasets are generated internally. To calculate LHC-style profile likelihood limits (i.e. the same as we did with the asymptotic) we set the option --LHCmode LHC-limits. You can read more about the different options in the Combine documentation.

Run the following command:

combine -M HybridNew datacard_part1.txt --LHCmode LHC-limits -n .part1B --saveHybridResult\n

In contrast to AsymptoticLimits this will only determine the observed limit, and will take a few minutes. There will not be much output to the screen while combine is running. You can add the option -v 1 to get a better idea of what is going on. You should see Combine stepping around in r, trying to find the value for which CLs = 0.05, i.e. the 95% CL limit. The --saveHybridResult option will cause the test statistic distributions that are generated at each tested value of r to be saved in the output ROOT file.

To get an expected limit add the option --expectedFromGrid X, where X is the desired quantile, e.g. for the median:

combine -M HybridNew datacard_part1.txt --LHCmode LHC-limits -n .part1B --saveHybridResult --expectedFromGrid 0.500\n

Calculate the median expected limit and the 68% range. The 95% range could also be done, but note it will take much longer to run the 0.025 quantile. While Combine is running you can move on to the next steps below.

Tasks and questions:

  • In contrast to AsymptoticLimits, with HybridNew each limit comes with an uncertainty. What is the origin of this uncertainty?
  • How good is the agreement between the asymptotic and toy-based methods?
  • Why does it take longer to calculate the lower expected quantiles (e.g. 0.025, 0.16)? Think about how the statistical uncertainty on the CLs value depends on Pmu and Pb.

Next plot the test statistic distributions stored in the output file:

python3 $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/test/plotTestStatCLs.py --input higgsCombine.part1B.HybridNew.mH120.root --poi r --val all --mass 120\n

This produces a new ROOT file cls_qmu_distributions.root containing the plots, to save them as pdf/png files run this small script and look at the resulting figures:

python3 printTestStatPlots.py cls_qmu_distributions.root\n
"},{"location":"part5/longexercise/#advanced-section-b-asymptotic-approximation-limitations","title":"Advanced section: B: Asymptotic approximation limitations","text":"

These distributions can be useful in understanding features in the CLs limits, especially in the low statistics regime. To explore this, try reducing the observed and expected yields in the datacard by a factor of 10, and rerun the above steps to compare the observed and expected limits with the asymptotic approach, and plot the test statistic distributions.

Tasks and questions:

  • Is the asymptotic limit still a good approximation?
  • You might notice that the test statistic distributions are not smooth but rather have several \"bump\" structures. Where might this come from? Try reducing the size of the systematic uncertainties to make them more pronounced.

Note that for more complex models the fitting time can increase significantly, making it infeasible to run all the toy-based limits interactively like this. An alternative strategy is documented here.

"},{"location":"part5/longexercise/#part-2-a-shape-based-analysis","title":"Part 2: A shape-based analysis","text":"

Topics covered in this section:

  • A: Setting up the datacard
  • B: Running Combine for a blind analysis
  • C: Using FitDiagnostics
  • D: MC statistical uncertainties
"},{"location":"part5/longexercise/#a-setting-up-the-datacard","title":"A: Setting up the datacard","text":"

Now we move to the next step: instead of a one-bin counting experiment we will fit a binned distribution. In a typical analysis we will produce TH1 histograms of some variable sensitive to the presence of signal: one for the data and one for each signal and background processes. Then we add a few extra lines to the datacard to link the declared processes to these shapes which are saved in a ROOT file, for example:

Show datacard
imax 1\njmax 1\nkmax *\n---------------\nshapes * * simple-shapes-TH1_input.root $PROCESS $PROCESS_$SYSTEMATIC\nshapes signal * simple-shapes-TH1_input.root $PROCESS$MASS $PROCESS$MASS_$SYSTEMATIC\n---------------\nbin bin1\nobservation 85\n------------------------------\nbin             bin1       bin1\nprocess         signal     background\nprocess         0          1\nrate            10         100\n--------------------------------\nlumi     lnN    1.10       1.0\nbgnorm   lnN    1.00       1.3\nalpha  shape    -          1\n

Note that as with the one-bin card, the total nominal rate of a given process must be specified in the rate line of the datacard. This should agree with the value returned by TH1::Integral. However, we can also put a value of -1 and the Integral value will be substituted automatically.

There are two other differences with respect to the one-bin card:

  • A new block of lines at the top defining how channels and processes are mapped to the histograms (more than one line can be used)
  • In the list of systematic uncertainties some are marked as shape instead of lnN

The syntax of the \"shapes\" line is: shapes [process] [channel] [file] [histogram] [histogram_with_systematics]. It is possible to use the * wildcard to map multiple processes and/or channels with one line. The histogram entries can contain the $PROCESS, $CHANNEL and $MASS place-holders which will be substituted when searching for a given (process, channel) combination. The value of $MASS is specified by the -m argument when combine. By default the observed data process name will be data_obs.

Shape uncertainties can be added by supplying two additional histograms for a process, corresponding to the distribution obtained by shifting that parameter up and down by one standard deviation. These shapes will be interpolated (see the template shape uncertainties section for details) for shifts within \\(\\pm1\\sigma\\) and linearly extrapolated beyond. The normalizations are interpolated linearly in log scale just like we do for log-normal uncertainties.
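
As a rough numerical illustration of the last point only (this is a minimal sketch, not the exact Combine implementation, and the per-bin shape interpolation itself uses a smooth function that is not reproduced here), interpolating a normalization linearly in log scale amounts to scaling by a power of the relevant kappa, where theta is the value of the nuisance parameter:

def lognormal_norm(nominal, kappa_up, kappa_down, theta):\n    # theta = +1 gives nominal*kappa_up, theta = -1 gives nominal*kappa_down,\n    # and intermediate values interpolate linearly in log(yield)\n    if theta >= 0:\n        return nominal * kappa_up**theta\n    return nominal * kappa_down**(-theta)\n\nprint(lognormal_norm(100.0, 1.2, 0.9, 0.5))  # ~109.5, i.e. half-way to +20% in log scale\n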

The final argument of the \"shapes\" line above should contain the $SYSTEMATIC place-holder which will be substituted by the systematic name given in the datacard.

In the list of uncertainties the interpretation of the values for shape lines is a bit different from lnN. The effect can be \"-\" or 0 for no effect, 1 for normal effect, and possibly something different from 1 to test larger or smaller effects (in that case, the unit Gaussian is scaled by that factor before using it as parameter for the interpolation).

In this section we will use a datacard corresponding to the full distribution that was shown at the start of section 1, not just the high mass region. Have a look at datacard_part2.txt: this is still currently a one-bin counting experiment, however the yields are much higher since we now consider the full range of \\(M_{\\mathrm{T}}^{\\mathrm{tot}}\\). If you run the asymptotic limit calculation on this you should find the sensitivity is significantly worse than before.

The first task is to convert this to a shape analysis: the file datacard_part2.shapes.root contains all the necessary histograms, including those for the relevant shape systematic uncertainties. Add the relevant shapes lines to the top of the datacard (after the kmax line) to map the processes to the correct TH1s in this file. Hint: you will need a different line for the signal process.

Compared to the counting experiment we must also consider the effect of uncertainties that change the shape of the distribution. Some, like CMS_eff_t_highpt, were present before, as they have both a shape and a normalisation effect. Others are primarily shape effects, so they were not included before.

Add the following shape uncertainties: top_pt_ttbar_shape affecting ttbar, the tau energy scale uncertainties CMS_scale_t_1prong0pi0_13TeV, CMS_scale_t_1prong1pi0_13TeV and CMS_scale_t_3prong0pi0_13TeV affecting all processes except jetFakes, and CMS_eff_t_highpt also affecting the same processes.

Once this is done you can run the asymptotic limit calculation on this datacard. From now on we will convert the text datacard into a RooFit workspace ourselves instead of combine doing it internally every time we run. This is a good idea for more complex analyses since the conversion step can take a notable amount of time. For this we use the text2workspace.py command:

text2workspace.py datacard_part2.txt -m 800 -o workspace_part2.root\n

And then we can use this as input to combine instead of the text datacard:

combine -M AsymptoticLimits workspace_part2.root -m 800\n

Tasks and questions:

  • Verify that the sensitivity of the shape analysis is indeed improved over the counting analysis in the first part.
  • Advanced task: You can open the workspace ROOT file interactively and print the contents: w->Print();. Each process is represented by a PDF object that depends on the shape morphing nuisance parameters. From the workspace, choose a process and shape uncertainty, and make a plot overlaying the nominal shape with different values of the shape morphing nuisance parameter. You can change the value of a parameter with w->var(\"X\")->setVal(Y), and access a particular pdf with w->pdf(\"Z\"). PDF objects in RooFit have a createHistogram method that requires the name of the observable (the variable defining the x-axis) - this is called CMS_th1x in combine datacards. Feel free to ask for help with this!
"},{"location":"part5/longexercise/#b-running-combine-for-a-blind-analysis","title":"B: Running combine for a blind analysis","text":"

Most analyses are developed and optimised while we are \"blind\" to the region of data where we expect our signal to be. With AsymptoticLimits we can choose just to run the expected limit (--run expected), so as not to calculate the observed. However the data is still used, even for the expected, since in the frequentist approach a background-only fit to the data is performed to define the Asimov dataset used to calculate the expected limits. To skip this fit to data and use the pre-fit state of the model the option --run blind or --noFitAsimov can be used. Task: Compare the expected limits calculated with --run expected and --run blind. Why are they different?

A more general way of blinding is to use combine's toy and Asimov dataset generating functionality. You can read more about this here. These options can be used with any method in combine, not just AsymptoticLimits.

Task: Calculate a blind limit by generating a background-only Asimov with the -t -1 option instead of using the AsymptoticLimits specific options. You should find the observed limit is the same as the expected. Then see what happens if you inject a signal into the Asimov dataset using the --expectSignal [X] option.

"},{"location":"part5/longexercise/#c-using-fitdiagnostics","title":"C: Using FitDiagnostics","text":"

We will now explore one of the most commonly used modes of Combine: FitDiagnostics. As well as allowing us to make a measurement of some physical quantity (as opposed to just setting a limit on it), this method is useful to gain additional information about the model and the behaviour of the fit. It performs two fits:

  • A \"background-only\" (b-only) fit: first POI (usually \"r\") fixed to zero
  • A \"signal+background\" (s+b) fit: all POIs are floating

With the s+b fit Combine will report the best-fit value of our signal strength modifier r. As well as the usual output file, a file named fitDiagnosticsTest.root is produced which contains additional information. In particular it includes two RooFitResult objects, one for the b-only and one for the s+b fit, which store the fitted values of all the nuisance parameters (NPs) and POIs as well as estimates of their uncertainties. The covariance matrix from both fits is also included, from which we can learn about the correlations between parameters. Run the FitDiagnostics method on our workspace:

combine -M FitDiagnostics workspace_part2.root -m 800 --rMin -20 --rMax 20\n

Open the resulting fitDiagnosticsTest.root interactively and print the contents of the s+b RooFitResult:

root [1] fit_s->Print()\n
Show output
RooFitResult: minimized FCN value: -2.55338e-05, estimated distance to minimum: 7.54243e-06\n                covariance matrix quality: Full, accurate covariance matrix\n                Status : MINIMIZE=0 HESSE=0\n\n    Floating Parameter    FinalValue +/-  Error\n  --------------------  --------------------------\n             CMS_eff_b   -4.5380e-02 +/-  9.93e-01\n             CMS_eff_t   -2.6311e-01 +/-  7.33e-01\n      CMS_eff_t_highpt   -4.7146e-01 +/-  9.62e-01\n  CMS_scale_t_1prong0pi0_13TeV   -1.5989e-01 +/-  5.93e-01\n  CMS_scale_t_1prong1pi0_13TeV   -1.6426e-01 +/-  4.94e-01\n  CMS_scale_t_3prong0pi0_13TeV   -3.0698e-01 +/-  6.06e-01\n    acceptance_Ztautau   -3.1262e-01 +/-  8.62e-01\n        acceptance_bbH   -2.8676e-05 +/-  1.00e+00\n      acceptance_ttbar    4.9981e-03 +/-  1.00e+00\n            lumi_13TeV   -5.6366e-02 +/-  9.89e-01\n         norm_jetFakes   -9.3327e-02 +/-  2.56e-01\n                     r   -2.7220e+00 +/-  2.59e+00\n    top_pt_ttbar_shape    1.7586e-01 +/-  7.00e-01\n          xsec_Ztautau   -1.6007e-01 +/-  9.66e-01\n          xsec_diboson    3.9758e-02 +/-  1.00e+00\n            xsec_ttbar    5.7794e-02 +/-  9.46e-01\n

There are several useful pieces of information here. At the top the status codes from the fits that were performed are given. In this case we can see that two algorithms were run: MINIMIZE and HESSE, both of which returned a successful status code (0). Both of these are routines in the Minuit2 minimization package - the default minimizer used in RooFit. The first performs the main fit to the data, and the second calculates the covariance matrix at the best-fit point. It is important to always check this second step was successful and the message \"Full, accurate covariance matrix\" is printed, otherwise the parameter uncertainties can be very inaccurate, even if the fit itself was successful.

Underneath this the best-fit values (\\(\\theta\\)) and symmetrised uncertainties for all the floating parameters are given. For all the constrained nuisance parameters a convention is used by which the nominal value (\\(\\theta_I\\)) is zero, corresponding to the mean of a Gaussian constraint PDF with width 1.0, such that the parameter values \\(\\pm 1.0\\) correspond to the \\(\\pm 1\\sigma\\) input uncertainties.

A more useful way of looking at this is to compare the pre- and post-fit values of the parameters, to see how much the fit to data has shifted and constrained these parameters with respect to the input uncertainty. The script diffNuisances.py can be used for this:

python diffNuisances.py fitDiagnosticsTest.root --all\n
Show output
name                                              b-only fit            s+b fit         rho\nCMS_eff_b                                        -0.04, 0.99        -0.05, 0.99       +0.01\nCMS_eff_t                                     * -0.24, 0.73*     * -0.26, 0.73*       +0.06\nCMS_eff_t_highpt                              * -0.56, 0.94*     * -0.47, 0.96*       +0.02\nCMS_scale_t_1prong0pi0_13TeV                  * -0.17, 0.58*     * -0.16, 0.59*       -0.04\nCMS_scale_t_1prong1pi0_13TeV                  ! -0.12, 0.45!     ! -0.16, 0.49!       +0.20\nCMS_scale_t_3prong0pi0_13TeV                  * -0.31, 0.61*     * -0.31, 0.61*       +0.02\nacceptance_Ztautau                            * -0.31, 0.86*     * -0.31, 0.86*       -0.05\nacceptance_bbH                                   +0.00, 1.00        -0.00, 1.00       +0.05\nacceptance_ttbar                                 +0.01, 1.00        +0.00, 1.00       +0.00\nlumi_13TeV                                       -0.05, 0.99        -0.06, 0.99       +0.01\nnorm_jetFakes                                 ! -0.09, 0.26!     ! -0.09, 0.26!       -0.05\ntop_pt_ttbar_shape                            * +0.24, 0.69*     * +0.18, 0.70*       +0.22\nxsec_Ztautau                                     -0.16, 0.97        -0.16, 0.97       -0.02\nxsec_diboson                                     +0.03, 1.00        +0.04, 1.00       -0.02\nxsec_ttbar                                       +0.08, 0.95        +0.06, 0.95       +0.02\n

The numbers in each column are respectively \\(\\frac{\\theta-\\theta_I}{\\sigma_I}\\) (This is often called the pull, but note that this is a misnomer. In this tutorial we will refer to it as the fitted value of the nuisance parameter relative to the input uncertainty. The true pull is defined as discussed under diffPullAsym here ), where \\(\\sigma_I\\) is the input uncertainty; and the ratio of the post-fit to the pre-fit uncertainty \\(\\frac{\\sigma}{\\sigma_I}\\).
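
The same quantities can also be pulled out of the RooFitResult directly. A minimal PyROOT sketch is shown below (diffNuisances.py does this, and more, for you; the parameter name CMS_eff_t in the correlation example is just one of the nuisances in this particular datacard):

import ROOT\n\nf = ROOT.TFile.Open('fitDiagnosticsTest.root')\nfit_s = f.Get('fit_s')\n\npars = fit_s.floatParsFinal()\nfor i in range(pars.getSize()):\n    p = pars.at(i)\n    # For the constrained NPs the pre-fit value is 0 with width 1, so getVal() and getError()\n    # are directly the fitted value relative to the input uncertainty and the constraint\n    print(p.GetName(), round(p.getVal(), 2), round(p.getError(), 2))\n\n# Correlation between the signal strength and one of the nuisance parameters\nprint(fit_s.correlation('r', 'CMS_eff_t'))\n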

Tasks and questions:

  • Which parameter has the largest shift from the nominal value (0) in the fitted value of the nuisance parameter relative to the input uncertainty? Which has the tightest constraint?
  • Should we be concerned when a parameter is more strongly constrained than the input uncertainty (i.e. \\(\\frac{\\sigma}{\\sigma_I}<1.0\\))?
  • Check the fitted values of the nuisance parameters and constraints on a b-only and s+b Asimov dataset instead. This check is required for all analyses in the Higgs PAG. It serves both as a closure test (do we fit exactly the signal strength we input?) and as a way to check whether there are any infeasibly strong constraints while the analysis is still blind (typical example: something has probably gone wrong if we constrain the luminosity uncertainty to 10% of the input!)
  • Advanced task: Sometimes there are problems in the fit model that aren't apparent from only fitting the Asimov dataset, but will appear when fitting randomised data. Follow the exercise on toy-by-toy diagnostics here to explore the tools available for this.
"},{"location":"part5/longexercise/#d-mc-statistical-uncertainties","title":"D: MC statistical uncertainties","text":"

So far there is an important source of uncertainty we have neglected. Our estimates of the backgrounds come either from MC simulation or from sideband regions in data, and in both cases these estimates are subject to a statistical uncertainty on the number of simulated or data events. In principle we should include an independent statistical uncertainty for every bin of every process in our model. It's important to note that Combine/RooFit does not take this into account automatically - statistical fluctuations of the data are implicitly accounted for in the likelihood formalism, but statistical uncertainties in the model must be specified by us.

One way to implement these uncertainties is to create a shape uncertainty for each bin of each process, in which the up and down histograms have the contents of the bin shifted up and down by the \\(1\\sigma\\) uncertainty. However, this makes the likelihood evaluation computationally inefficient, and can lead to a large number of nuisance parameters in more complex models. Instead we will use a feature in Combine called autoMCStats that creates these automatically from the datacard, and uses a technique called \"Barlow-Beeston-lite\" to reduce the number of systematic uncertainties that are created. This works on the assumption that for high MC event counts we can model the uncertainty with a Gaussian distribution. Given that the statistical uncertainties of the different processes are independent, the total uncertainty in a particular bin is just the sum of \\(N\\) individual Gaussian-distributed contributions, which is itself Gaussian. So instead of \\(N\\) nuisance parameters we need only one per bin. This breaks down when the number of events is small and we are not in the Gaussian regime. The autoMCStats tool has a threshold setting on the number of events below which the Barlow-Beeston-lite approach is not used, and instead a Poisson PDF is used to model the per-process uncertainties in that bin.

After reading the full documentation on autoMCStats here, add the corresponding line to your datacard. Start by setting a threshold of 0, i.e. [channel] autoMCStats 0, to force the use of Barlow-Beeston-lite in all bins.
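For example, to apply it to every channel in the card you can use the wildcard syntax (a minimal sketch; an explicit channel name works equally well):

* autoMCStats 0\n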

Tasks and questions:

  • Check how much the cross section measurement and uncertainties change using FitDiagnostics.
  • It is also useful to check how the expected uncertainty changes using an Asimov dataset, say with r=10 injected.
  • Advanced task: See what happens if the Poisson threshold is increased. Based on your results, what threshold would you recommend for this analysis?
"},{"location":"part5/longexercise/#part-3-adding-control-regions","title":"Part 3: Adding control regions","text":"

Topics covered in this section:

  • A: Use of rateParams
  • B: Nuisance parameter impacts
  • C: Post-fit distributions
  • D: Calculating the significance
  • E: Signal strength measurement and uncertainty breakdown
  • F: Use of channel masking

In a modern analysis it is typical for some or all of the backgrounds to be estimated using the data, instead of relying purely on MC simulation. This can take many forms, but a common approach is to use \"control regions\" (CRs) that are pure and/or have higher statistics for a given process. These are defined by event selections that are similar to, but non-overlapping with, the signal region. In our \\(\\phi\\rightarrow\\tau\\tau\\) example the \\(\\text{Z}\\rightarrow\\tau\\tau\\) background normalisation can be calibrated using a \\(\\text{Z}\\rightarrow\\mu\\mu\\) CR, and the \\(\\text{t}\\bar{\\text{t}}\\) background using an \\(e+\\mu\\) CR. By comparing the number of data events in these CRs to our MC expectation we can obtain scale factors to apply to the corresponding backgrounds in the signal region (SR). The idea is that the data will give us a more accurate prediction of the background, with fewer systematic uncertainties. For example, we can remove the cross section and acceptance uncertainties in the SR, since we are no longer using the MC prediction (with a caveat discussed below). While we could simply derive these correction factors and apply them to our signal region datacard, a better way is to include these regions in our fit model and tie the normalisations of the backgrounds in the CR and SR together. This has a number of advantages:

  • Automatically handles the statistical uncertainty due to the number of data events in the CR
  • Allows for the presence of some signal contamination in the CR to be handled correctly
  • The CRs are typically not 100% pure in the background they're meant to control - other backgrounds may be present, with their own systematic uncertainties, some of which may be correlated with the SR or other CRs. Propagating these effects through to the SR \"by hand\" can become very challenging.

In this section we will continue to use the same SR as in the previous one; however, we will switch to a lower signal mass hypothesis, \\(m_{\\phi}=200\\) GeV, as its sensitivity depends more strongly on the background prediction than that of the high-mass signal, so it is better for illustrating the use of CRs. Here the nominal signal (r=1) has been normalised to a cross section of 1 pb.

The SR datacard for the 200 GeV signal is datacard_part3.txt. Two further datacards are provided: datacard_part3_ttbar_cr.txt and datacard_part3_DY_cr.txt, which represent the CRs for the \\(\\text{t}\\bar{\\text{t}}\\) and Drell-Yan processes as described above. The cross section and acceptance uncertainties for these processes have pre-emptively been removed from the SR card. However, we cannot get away with neglecting acceptance effects altogether. We are still implicitly using the MC simulation to predict the ratio of events in the CR and SR, and this ratio will in general carry a theoretical acceptance uncertainty. If the CRs are well chosen, however, this uncertainty should be smaller than the direct acceptance uncertainty in the SR. The uncertainties acceptance_ttbar_cr and acceptance_DY_cr have been added to these datacards to cover this effect. Task: Calculate the ratio of CR to SR events for these two processes, as well as their CR purity, to verify that these are useful CRs.

The next step is to combine these datacards into one, which is done with the combineCards.py script:

combineCards.py signal_region=datacard_part3.txt ttbar_cr=datacard_part3_ttbar_cr.txt DY_cr=datacard_part3_DY_cr.txt &> part3_combined.txt\n

Each argument is of the form [new channel name]=[datacard.txt]. The new datacard is written to the screen by default, so we redirect the output into our new datacard file. The output looks like:

Show datacard
imax 3 number of bins\njmax 8 number of processes minus 1\nkmax 15 number of nuisance parameters\n----------------------------------------------------------------------------------------------------------------------------------\nshapes *              DY_cr          datacard_part3_DY_cr.shapes.root DY_control_region/$PROCESS DY_control_region/$PROCESS_$SYSTEMATIC\nshapes *              signal_region  datacard_part3.shapes.root signal_region/$PROCESS signal_region/$PROCESS_$SYSTEMATIC\nshapes bbHtautau      signal_region  datacard_part3.shapes.root signal_region/bbHtautau$MASS signal_region/bbHtautau$MASS_$SYSTEMATIC\nshapes *              ttbar_cr       datacard_part3_ttbar_cr.shapes.root tt_control_region/$PROCESS tt_control_region/$PROCESS_$SYSTEMATIC\n----------------------------------------------------------------------------------------------------------------------------------\nbin          signal_region  ttbar_cr       DY_cr        \nobservation  3416           79251          365754       \n----------------------------------------------------------------------------------------------------------------------------------\nbin                                               signal_region  signal_region  signal_region  signal_region  signal_region  ttbar_cr       ttbar_cr       ttbar_cr       ttbar_cr       ttbar_cr       DY_cr          DY_cr          DY_cr          DY_cr          DY_cr          DY_cr        \nprocess                                           bbHtautau      ttbar          diboson        Ztautau        jetFakes       W              QCD            ttbar          VV             Ztautau        W              QCD            Zmumu          ttbar          VV             Ztautau      \nprocess                                           0              1              2              3              4              5              6              1              7              3              5              6              8              1              7              3            \nrate                                              198.521        683.017        96.5185        742.649        2048.94        597.336        308.965        67280.4        10589.6        150.025        59.9999        141.725        305423         34341.1        5273.43        115.34       \n----------------------------------------------------------------------------------------------------------------------------------\nCMS_eff_b               lnN                       1.02           1.02           1.02           1.02           -              -              -              -              -              -              -              -              -              -              -              -            \nCMS_eff_e               lnN                       -              -              -              -              -              1.02           -              -              1.02           1.02           -              -              -              -              -              -            \n...\n

The [new channel name]= part of the input arguments is not required, but it gives us control over how the channels in the combined card will be named, otherwise default values like ch1, ch2 etc will be used.

"},{"location":"part5/longexercise/#a-use-of-rateparams","title":"A: Use of rateParams","text":"

We now have a combined datacard that we can run text2workspace.py on and start doing fits; however, there is still one important ingredient missing. Right now the yields of the Ztautau process in the SR and the Zmumu process in the CR are not connected to each other in any way, and similarly for the ttbar processes. In the fit both would be adjusted by the nuisance parameters only, and constrained to the nominal yields. To remedy this we introduce rateParam directives into the datacard. A rateParam is a new free parameter that multiplies the yield of a given process, in the same way that the signal strength r multiplies the signal yield. The syntax of a rateParam line in the datacard is

[name] rateParam [channel] [process] [init] [min,max]\n

where name is the chosen name for the parameter, channel and process specify which (channel, process) combination it should affect, init gives the initial value, and optionally [min,max] specifies the range of the RooRealVar that will be created. The channel and process arguments support the use of the wildcard * to match multiple entries. Task: Add two rateParams, named rate_ttbar and rate_Zll, with nominal values of 1.0 to the end of the combined datacard. The former should affect the ttbar process in all channels, and the latter should affect the Ztautau and Zmumu processes in all channels. Set ranges of [0,5] for both. Note that a rateParam name can be repeated to apply it to multiple processes, e.g.:

rateScale rateParam * procA 1.0\nrateScale rateParam * procB 1.0\n

is perfectly valid and only one rateParam will be created. These parameters will allow the yields to float in the fit without prior constraint (unlike a regular lnN or shape systematic), with the yields in the CRs and SR tied together.
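As a sketch of what the task above asks for (process names taken from the combined card shown earlier - check them against your own card before copying):

rate_ttbar rateParam * ttbar 1.0 [0,5]\nrate_Zll rateParam * Ztautau 1.0 [0,5]\nrate_Zll rateParam * Zmumu 1.0 [0,5]\n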

Tasks and questions:

  • Run text2workspace.py on this combined card (don't forget to set the mass and output name -m 200 -o workspace_part3.root) and then use FitDiagnostics on an Asimov dataset with r=1 to get the expected uncertainty. Suggested command line options: --rMin 0 --rMax 2
  • Using the RooFitResult in the fitDiagnosticsTest.root file, check the post-fit value of the rateParams. To what level are the normalisations of the DY and ttbar processes constrained? (One way to inspect the values is sketched after this list.)
  • To compare to the previous approach of fitting the SR only, with cross section and acceptance uncertainties restored, an additional card is provided: datacard_part3_nocrs.txt. Run the same fit on this card to verify the improvement of the SR+CR approach
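One way to inspect the post-fit rateParam values interactively in ROOT (a sketch: fit_s is the s+b fit result stored by FitDiagnostics, and rate_ttbar/rate_Zll are the names chosen in the task above):

root -l fitDiagnosticsTest.root\nroot [1] fit_s->floatParsFinal().Print(\"v\")\nroot [2] fit_s->floatParsFinal().find(\"rate_ttbar\")->Print()\n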
"},{"location":"part5/longexercise/#b-nuisance-parameter-impacts","title":"B: Nuisance parameter impacts","text":"

It is often useful to examine in detail the effects the systematic uncertainties have on the signal strength measurement. This is often referred to as calculating the \"impact\" of each uncertainty. What this means is to determine the shift in the signal strength, with respect to the best-fit, that is induced if a given nuisance parameter is shifted by its \\(\\pm1\\sigma\\) post-fit uncertainty values. If the signal strength shifts a lot, it tells us that it has a strong dependency on this systematic uncertainty. In fact, what we are measuring here is strongly related to the correlation coefficient between the signal strength and the nuisance parameter. The MultiDimFit method has an algorithm for calculating the impact for a given systematic: --algo impact -P [parameter name], but it is typical to use a higher-level script, combineTool.py (part of the CombineHarvester package you checked out at the beginning) to automatically run the impacts for all parameters. Full documentation on this is given here. There is a three step process for running this. First we perform an initial fit for the signal strength and its uncertainty:

combineTool.py -M Impacts -d workspace_part3.root -m 200 --rMin -1 --rMax 2 --robustFit 1 --doInitialFit\n

Then we run the impacts for all the nuisance parameters:

combineTool.py -M Impacts -d workspace_part3.root -m 200 --rMin -1 --rMax 2 --robustFit 1 --doFits\n

This will take a little bit of time. When finished we collect all the output and convert it to a json file:

combineTool.py -M Impacts -d workspace_part3.root -m 200 --rMin -1 --rMax 2 --robustFit 1 --output impacts.json\n

We can then make a plot showing the fitted values of the nuisance parameters, relative to the input uncertainty, and parameter impacts, sorted by the largest impact:

plotImpacts.py -i impacts.json -o impacts\n

Tasks and questions:

  • Identify the most important uncertainties using the impacts tool.
  • In the plot, some parameters do not show a fitted value of the nuisance parameter relative to the input uncertainty, but rather just a numerical value - why?
"},{"location":"part5/longexercise/#c-post-fit-distributions","title":"C: Post-fit distributions","text":"

Another thing the FitDiagnostics mode can help us with is visualising the distributions we are fitting, and the uncertainties on those distributions, both before the fit is performed (\"pre-fit\") and after (\"post-fit\"). The pre-fit can give us some idea of how well our uncertainties cover any data-MC discrepancy, and the post-fit shows whether any discrepancies remain after the fit to data (as well as possibly letting us see the presence of a significant signal!).

To produce these distributions add the --saveShapes and --saveWithUncertainties options when running FitDiagnostics:

combine -M FitDiagnostics workspace_part3.root -m 200 --rMin -1 --rMax 2 --saveShapes --saveWithUncertainties -n .part3B\n

Combine will produce pre- and post-fit distributions (for fit_s and fit_b) in the fitDiagnostics.part3B.root output file:
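You can get a quick overview of what was saved by opening the file interactively (a sketch; with --saveShapes the distributions are stored in directories named shapes_prefit, shapes_fit_b and shapes_fit_s, each with one subdirectory per channel):

root -l fitDiagnostics.part3B.root\nroot [1] _file0->ls()\nroot [2] shapes_fit_s->ls()\n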

Tasks and questions:

  • Make a plot showing the expected background and signal contributions using the output from FitDiagnostics - do this for both the pre-fit and post-fit. You will find a script postFitPlot.py in the longexercise directory that can help you get started. The bin errors on the TH1s in the fitDiagnostics file are determined from the systematic uncertainties. In the post-fit these take into account the additional constraints on the nuisance parameters as well as any correlations.

  • Why is the uncertainty on the post-fit so much smaller than on the pre-fit?

"},{"location":"part5/longexercise/#d-calculating-the-significance","title":"D: Calculating the significance","text":"

In the event that you observe a deviation from your null hypothesis, in this case the b-only hypothesis, Combine can be used to calculate the p-value or significance. To do this using the asymptotic approximation simply do:

combine -M Significance workspace_part3.root -m 200 --rMin -1 --rMax 2\n

To calculate the expected significance for a given signal strength we can just generate an Asimov dataset first:

combine -M Significance workspace_part3.root -m 200 --rMin -1 --rMax 5 -t -1 --expectSignal 1.5\n

Note that the Asimov dataset generated this way uses the nominal values of all model parameters to define the dataset. Another option is to add --toysFrequentist, which causes a fit to the data to be performed first (with r frozen to the --expectSignal value) and then any subsequent Asimov datasets or toys are generated using the post-fit values of the model parameters. In general this will result in a different value for the expected significance due to changes in the background normalisation and shape induced by the fit to data:

combine -M Significance workspace_part3.root -m 200 --rMin -1 --rMax 5 -t -1 --expectSignal 1.5 --toysFrequentist\n

Tasks and questions:

  • Note how much the expected significance changes with the --toysFrequentist option. Does the change make sense given the difference in the post-fit and pre-fit distributions you looked at in the previous section?
  • Advanced task It is also possible to calculate the significance using toys with HybridNew (details here) if we are in a situation where the asymptotic approximation is not reliable or if we just want to verify the result. Why might this be challenging for a high significance, say larger than \\(5\\sigma\\)?
"},{"location":"part5/longexercise/#e-signal-strength-measurement-and-uncertainty-breakdown","title":"E: Signal strength measurement and uncertainty breakdown","text":"

We have seen that with FitDiagnostics we can make a measurement of the best-fit signal strength and uncertainty. In the asymptotic approximation we find an interval at the \\(\\alpha\\) CL around the best fit by identifying the parameter values at which our test statistic \\(q=-2\\Delta \\ln L\\) equals a critical value. This value is the \\(\\alpha\\) quantile of the \\(\\chi^2\\) distribution with one degree of freedom. In the expression for \\(q\\) we calculate the difference in the profile likelihood between some fixed point and the best-fit.
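For a single parameter of interest the critical values are the familiar ones: \\(q = 1.00\\) for a 68.3% confidence interval and \\(q = 3.84\\) for a 95% confidence interval, corresponding to \\(\\Delta \\ln L = 0.5\\) and \\(\\Delta \\ln L \\approx 1.92\\) respectively.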

Depending on what we want to do with the measurement, e.g. whether it will be published in a journal, we may want to choose a more precise method for finding these intervals. There are a number of ways that parameter uncertainties are estimated in combine, and some are more precise than others:

  • Covariance matrix: calculated by the Minuit HESSE routine, this gives a symmetric uncertainty by definition and is only accurate when the profile likelihood for this parameter is symmetric and parabolic.
  • Minos error: calculated by the Minuit MINOS routine - performs a search for the upper and lower values of the parameter that give the critical value of \\(q\\) for the desired CL. Returns an asymmetric interval. This is what FitDiagnostics does by default, but only for the parameter of interest. Usually accurate, but prone to failure on more complex models, and the tolerance for terminating the search is not easy to control.
  • RobustFit error: a custom implementation in combine similar to Minos that returns an asymmetric interval, but with more control over the precision. Enabled by adding --robustFit 1 when running FitDiagnostics.
  • Explicit scan of the profile likelihood on a chosen grid of parameter values, with interpolation between the points to find the parameter values corresponding to the critical values of \\(q\\). It is a good idea to use this for important measurements, since we can check by eye that there are no unexpected features in the shape of the likelihood curve.

In this section we will look at the last approach, using the MultiDimFit mode of combine. By default this mode just performs a single fit to the data:

combine -M MultiDimFit workspace_part3.root -n .part3E -m 200 --rMin -1 --rMax 2\n

You should see the best-fit value of the signal strength reported and nothing else. By adding the --algo X option combine will run an additional algorithm after this best fit. Here we will use --algo grid, which performs a scan of the likelihood with r fixed to a set of different values. The set of points will be equally spaced between the --rMin and --rMax values, and the number of points is controlled with --points N:

combine -M MultiDimFit workspace_part3.root -n .part3E -m 200 --rMin -1 --rMax 2 --algo grid --points 30\n

The results of the scan are written into the output file; if you open it interactively you should see:

Show output
root [1] limit->Scan(\"r:deltaNLL\")\n************************************\n*    Row   *         r *  deltaNLL *\n************************************\n*        0 * 0.5399457 *         0 *\n*        1 * -0.949999 * 5.6350698 *\n*        2 * -0.850000 * 4.9482779 *\n*        3 *     -0.75 * 4.2942519 *\n*        4 * -0.649999 * 3.6765284 *\n*        5 * -0.550000 * 3.0985388 *\n*        6 * -0.449999 * 2.5635135 *\n*        7 * -0.349999 * 2.0743820 *\n*        8 *     -0.25 * 1.6337506 *\n*        9 * -0.150000 * 1.2438088 *\n*       10 * -0.050000 * 0.9059833 *\n*       11 * 0.0500000 * 0.6215767 *\n*       12 * 0.1500000 * 0.3910581 *\n*       13 *      0.25 * 0.2144184 *\n*       14 * 0.3499999 * 0.0911308 *\n*       15 * 0.4499999 * 0.0201983 *\n*       16 * 0.5500000 * 0.0002447 *\n*       17 * 0.6499999 * 0.0294311 *\n*       18 *      0.75 * 0.1058298 *\n*       19 * 0.8500000 * 0.2272539 *\n*       20 * 0.9499999 * 0.3912534 *\n*       21 * 1.0499999 * 0.5952836 *\n*       22 * 1.1499999 * 0.8371513 *\n*       23 *      1.25 * 1.1142146 *\n*       24 * 1.3500000 * 1.4240909 *\n*       25 * 1.4500000 * 1.7644306 *\n*       26 * 1.5499999 * 2.1329684 *\n*       27 * 1.6499999 * 2.5273966 *\n*       28 *      1.75 * 2.9458723 *\n*       29 * 1.8500000 * 3.3863399 *\n*       30 * 1.9500000 * 3.8469560 *\n************************************\n

To turn this into a plot run:

python plot1DScan.py higgsCombine.part3E.MultiDimFit.mH200.root -o single_scan\n

This script will also perform a spline interpolation of the points to give accurate values for the uncertainties.

In the next step we will split this total uncertainty into two components. It is typical to separate the contribution from statistics and systematics, and sometimes even split the systematic part into different components. This gives us an idea of which aspects of the uncertainty dominate. The statistical component is usually defined as the uncertainty we would have if all the systematic uncertainties went to zero. We can emulate this effect by freezing all the nuisance parameters when we do the scan in r, such that they do not vary in the fit. This is achieved by adding the --freezeParameters allConstrainedNuisances option. It would also work if the parameters are specified explicitly, e.g. --freezeParameters CMS_eff_t,lumi_13TeV,..., but the allConstrainedNuisances option is more concise. Run the scan again with the systematics frozen, and use the plotting script to overlay this curve with the previous one:

combine -M MultiDimFit workspace_part3.root -n .part3E.freezeAll -m 200 --rMin -1 --rMax 2 --algo grid --points 30 --freezeParameters allConstrainedNuisances\npython plot1DScan.py higgsCombine.part3E.MultiDimFit.mH200.root --others 'higgsCombine.part3E.freezeAll.MultiDimFit.mH200.root:FreezeAll:2' -o freeze_first_attempt\n

This doesn't look quite right - the best-fit has been shifted because unfortunately the --freezeParameters option acts before the initial fit, whereas we only want to add it for the scan after this fit. To remedy this we can use a feature of Combine that lets us save a \"snapshot\" of the best-fit parameter values, and reuse this snapshot in subsequent fits. First we perform a single fit, adding the --saveWorkspace option:

combine -M MultiDimFit workspace_part3.root -n .part3E.snapshot -m 200 --rMin -1 --rMax 2 --saveWorkspace\n

The output file will now contain a copy of our workspace from the input, and this copy will contain a snapshot of the best-fit parameter values. We can now run the frozen scan again, but instead using this copy of the workspace as input, and restoring the snapshot that was saved:

combine -M MultiDimFit higgsCombine.part3E.snapshot.MultiDimFit.mH200.root -n .part3E.freezeAll -m 200 --rMin -1 --rMax 2 --algo grid --points 30 --freezeParameters allConstrainedNuisances --snapshotName MultiDimFit\npython plot1DScan.py higgsCombine.part3E.MultiDimFit.mH200.root --others 'higgsCombine.part3E.freezeAll.MultiDimFit.mH200.root:FreezeAll:2' -o freeze_second_attempt --breakdown Syst,Stat\n

Now the plot should look correct:

We added the --breakdown Syst,Stat option to the plotting script to make it calculate the systematic component, which is defined simply as \\(\\sigma_{\\text{syst}} = \\sqrt{\\sigma^2_{\\text{tot}} - \\sigma^2_{\\text{stat}}}\\). To split the systematic uncertainty into different components we just need to run another scan with a subset of the systematics frozen. For example, if we want to split this into experimental and theoretical uncertainties, we would calculate the uncertainties as:

\\(\\sigma_{\\text{theory}} = \\sqrt{\\sigma^2_{\\text{tot}} - \\sigma^2_{\\text{fr.theory}}}\\)

\\(\\sigma_{\\text{expt}} = \\sqrt{\\sigma^2_{\\text{fr.theory}} - \\sigma^2_{\\text{fr.theory+expt}}}\\)

\\(\\sigma_{\\text{stat}} = \\sigma_{\\text{fr.theory+expt}}\\)

where fr.=freeze.
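As a numerical illustration of this quadrature subtraction (with made-up numbers): if the total uncertainty were \\(\\sigma_{\\text{tot}} = 0.45\\) and the statistical-only uncertainty \\(\\sigma_{\\text{stat}} = 0.40\\), the systematic component would be \\(\\sigma_{\\text{syst}} = \\sqrt{0.45^2 - 0.40^2} \\approx 0.21\\).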

While it is perfectly fine to just list the relevant nuisance parameters in the --freezeParameters argument for the \\(\\sigma_{\\text{fr.theory}}\\) scan, a more convenient way is to define a named group of parameters in the text datacard and then freeze all parameters in this group with --freezeNuisanceGroups. The syntax for defining a group is:

[group name] group = uncertainty_1 uncertainty_2 ... uncertainty_N\n
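For instance, a hypothetical group named experiment, collecting a few of the experimental nuisance parameters used in this exercise (take the exact names from your own datacard), could be defined as:

experiment group = CMS_eff_b CMS_eff_t lumi_13TeV\n

and all of its members could then be frozen at once with --freezeNuisanceGroups experiment.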

Tasks and questions:

  • Take our stat+syst split one step further and separate the systematic part into two: one part for hadronic tau uncertainties and one for all others.
  • Do this by defining a tauID group in the datacard including the following parameters: CMS_eff_t, CMS_eff_t_highpt, and the three CMS_scale_t_X uncertainties.
  • To plot this and calculate the split via the relations above you can just add further arguments to the --others option in the plot1DScan.py script. Each is of the form: '[file]:[label]:[color]'. The --breakdown argument should also be extended to three terms.
  • How important are these tau-related uncertainties compared to the others?
"},{"location":"part5/longexercise/#f-use-of-channel-masking","title":"F: Use of channel masking","text":"

We will now return briefly to the topic of blinding. We've seen that we can compute expected results by performing any Combine method on an Asimov dataset generated using -t -1. This is useful, because we can optimise our analysis without introducing any accidental bias that might come from looking at the data in the signal region. However our control regions have been chosen specifically to be signal-free, and it would be useful to use the data here to set the normalisation of our backgrounds even while the signal region remains blinded. Unfortunately there's no easy way to generate a partial Asimov dataset just for the signal region, but instead we can use a feature called \"channel masking\" to remove specific channels from the likelihood evaluation. One useful application of this feature is to make post-fit plots of the signal region from a control-region-only fit.

To use the masking we first need to rerun text2workspace.py with an extra option that will create variables named like mask_[channel] in the workspace:

text2workspace.py part3_combined.txt -m 200 -o workspace_part3_with_masks.root --channel-masks\n

These parameters have a default value of 0 which means the channel is not masked. By setting it to 1 the channel is masked from the likelihood evaluation. Task: Run the same FitDiagnostics command as before to save the post-fit shapes, but add an option --setParameters mask_signal_region=1. Note that the s+b fit will probably fail in this case, since we are no longer fitting a channel that contains signal, however the b-only fit should work fine. Task: Compare the expected background distribution and uncertainty to the pre-fit, and to the background distribution from the full fit you made before.
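Putting this together, the masked FitDiagnostics command might look like the following (a sketch; the -n label is arbitrary):

combine -M FitDiagnostics workspace_part3_with_masks.root -m 200 --rMin -1 --rMax 2 --saveShapes --saveWithUncertainties --setParameters mask_signal_region=1 -n .part3F\n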

"},{"location":"part5/longexercise/#part-4-physics-models","title":"Part 4: Physics models","text":"

Topics covered in this section:

  • A: Writing a simple physics model
  • B: Performing and plotting 2D likelihood scans

With Combine we are not limited to parametrising the signal with a single scaling parameter r. In fact we can define any arbitrary scaling using whatever functions and parameters we would like. For example, when measuring the couplings of the Higgs boson to the different SM particles we would introduce a POI for each coupling parameter, for example \\(\\kappa_{\\text{W}}\\), \\(\\kappa_{\\text{Z}}\\), \\(\\kappa_{\\tau}\\) etc. We would then generate scaling terms for each \\(i\\rightarrow \\text{H}\\rightarrow j\\) process in terms of how the cross section (\\(\\sigma_i(\\kappa)\\)) and branching ratio (\\(\\frac{\\Gamma_i(\\kappa)}{\\Gamma_{\\text{tot}}(\\kappa)}\\)) scale relative to the SM prediction.

This parametrisation of the signal (and possibly of the backgrounds too) is specified in a physics model. This is a Python class that is used by text2workspace.py to construct the model in terms of RooFit objects. There is documentation on using physics models here.

"},{"location":"part5/longexercise/#a-writing-a-simple-physics-model","title":"A: Writing a simple physics model","text":"

An example physics model that just implements a single parameter r is given in DASModel.py:

Show DASModel.py
from HiggsAnalysis.CombinedLimit.PhysicsModel import PhysicsModel\n\n\nclass DASModel(PhysicsModel):\n    def doParametersOfInterest(self):\n        \"\"\"Create POI and other parameters, and define the POI set.\"\"\"\n        self.modelBuilder.doVar(\"r[0,0,10]\")\n        self.modelBuilder.doSet(\"POI\", \",\".join([\"r\"]))\n\n    def getYieldScale(self, bin, process):\n        \"Return the name of a RooAbsReal to scale this yield by or the two special values 1 and 0 (don't scale, and set to zero)\"\n        if self.DC.isSignal[process]:\n            print(\"Scaling %s/%s by r\" % (bin, process))\n            return \"r\"\n        return 1\n\n\ndasModel = DASModel()\n

In this we override two methods of the basic PhysicsModel class: doParametersOfInterest and getYieldScale. In the first we define our POI variables, using the doVar function which accepts the RooWorkspace factory syntax for creating variables, and then define all our POIs in a set via the doSet function. The second function will be called for every process in every channel (bin), and using the corresponding strings we have to specify how that process should be scaled. Here we check if the process was declared as signal in the datacard, and if so scale it by r, otherwise if it is a background no scaling is applied (1). To use the physics model with text2workspace.py first copy it to the python directory in the Combine package:

cp DASModel.py $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/python/\n

In this section we will use the full datacards from the MSSM analysis. Have a look in part4/200/combined.txt. You will notice that there are now two signal processes declared: ggH and bbH. In the MSSM these cross sections can vary independently depending on the exact parameters of the model, so it is useful to be able to measure them independently too. First run text2workspace.py as follows, adding the -P option to specify the physics model, then verify the result of the fit:

text2workspace.py part4/200/combined.txt -P HiggsAnalysis.CombinedLimit.DASModel:dasModel -m 200 -o workspace_part4.root\ncombine -M MultiDimFit workspace_part4.root -n .part4A -m 200 --rMin 0 --rMax 2\n

Tasks and questions:

  • Modify the physics model to scale the ggH and bbH processes by r_ggH and r_bbH separately.
  • Then rerun the MultiDimFit command - you should see the result for both signal strengths printed.
"},{"location":"part5/longexercise/#b-performing-and-plotting-2d-likelihood-scans","title":"B: Performing and plotting 2D likelihood scans","text":"

For a model with two POIs it is often useful to look at how well we are able to measure both simultaneously. A natural extension of determining 1D confidence intervals on a single parameter, as we did in part 3E, is to determine confidence level regions in 2D. To do this we also use combine in a similar way, with -M MultiDimFit --algo grid. When two POIs are found, Combine will scan a 2D grid of points instead of a 1D array.

Tasks and questions:

  • Run a 2D likelihood scan in r_ggH and r_bbH. You can start with around 100 points but may need to increase this later to see more detail in the resulting plot.
  • Have a look at the output limit tree; it should have branches for each POI as well as the usual deltaNLL value. You can use TTree::Draw to plot a 2D histogram of deltaNLL with r_ggH and r_bbH on the axes. One way to do this interactively is sketched below.
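One possible way to draw this in an interactive ROOT session (a sketch: the input file name is assumed and depends on the -n and -m options you used, and the POI branch names follow the names defined in your physics model):

TFile *f2d = TFile::Open(\"higgsCombine.part4B.MultiDimFit.mH200.root\"); // file name assumed\nTTree *lim = (TTree*) f2d->Get(\"limit\");\nfloat rg, rb, dnll;\nlim->SetBranchAddress(\"r_ggH\", &rg);\nlim->SetBranchAddress(\"r_bbH\", &rb);\nlim->SetBranchAddress(\"deltaNLL\", &dnll);\nTGraph2D *gr = new TGraph2D();\nfor (int i = 0; i < lim->GetEntries(); ++i) {\n    lim->GetEntry(i);\n    gr->SetPoint(gr->GetN(), rg, rb, 2 * dnll); // 2*deltaNLL, so contours at 2.30 and 5.99 give the 68% and 95% CL regions\n}\ngr->Draw(\"colz\");\n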
"},{"location":"part5/longexerciseanswers/","title":"Answers to tasks and questions","text":""},{"location":"part5/longexerciseanswers/#part-1-a-one-bin-counting-experiment","title":"Part 1: A one-bin counting experiment","text":""},{"location":"part5/longexerciseanswers/#a-computing-limits-using-the-asymptotic-approximation","title":"A: Computing limits using the asymptotic approximation","text":"

Tasks and questions:

  • There are some important uncertainties missing from the datacard above. Add the uncertainty on the luminosity (name: lumi_13TeV) which has a 2.5% effect on all processes (except the jetFakes, which are taken from data), and uncertainties on the inclusive cross sections of the Ztautau and ttbar processes (with names xsec_Ztautau and xsec_ttbar) which are 4% and 6% respectively.
  • Try changing the values of some uncertainties (up or down, or removing them altogether) - how do the expected and observed limits change?
Show answer Larger uncertainties make the limits worse (i.e. higher values of the limit); smaller uncertainties improve the limit (lower values of the limit).
  • Now try changing the number of observed events. The observed limit will naturally change, but the expected does too - why might this be?
Show answer This is because the expected limit relies on a background-only Asimov dataset that is created after a background-only fit to the data. By changing the observed number of events, the pulls on the NPs in this fit also change, and therefore so does the expected sensitivity."},{"location":"part5/longexerciseanswers/#advanced-section-b-computing-limits-with-toys","title":"Advanced section: B: Computing limits with toys","text":"

Tasks and questions:

  • In contrast to AsymptoticLimits, with HybridNew each limit comes with an uncertainty. What is the origin of this uncertainty?
Show answer The uncertainty is caused by the limited number of toys: the values of Pmu and Pb come from counting the number of toys in the tails of the test statistic distributions. The number of toys used can be adjusted with the option --toysH
  • How good is the agreement between the asymptotic and toy-based methods?
Show answer The agreement should be pretty good in this example, but will generally break down once we get to the level of 0-5 events.
  • Why does it take longer to calculate the lower expected quantiles (e.g. 0.025, 0.16)? Think about how the statistical uncertainty on the CLs value depends on Pmu and Pb.
Show answer For this we need the definition of CLs = Pmu / (1-Pb). The 0.025 expected quantile is by definition where 1-Pb = 0.025, so for a 95% CL limit we have CLs = 0.05, implying we are looking for the value of r where Pmu = 0.00125. With 1000 s+b toys we would then only expect `1000 * 0.00125 = 1.25` toys in the tail region we have to integrate over. Contrast this to the median limit where 25 toys would be in this region. This means we have to generate a much larger number of toys to get the same statistical power."},{"location":"part5/longexerciseanswers/#advanced-section-b-asymptotic-approximation-limitations","title":"Advanced section: B: Asymptotic approximation limitations","text":"

Tasks and questions:

  • Is the asymptotic limit still a good approximation?
Show answer A \"good\" approximation is not well defined, but the difference is clearly larger here.
  • You might notice that the test statistic distributions are not smooth but rather have several \"bump\" structures. Where might this come from? Try reducing the size of the systematic uncertainties to make them more pronounced.
Show answer This bump structure comes from the discreteness of the Poisson sampling of the toy datasets. Systematic uncertainties then smear these bumps out, but without systematics we would see delta functions corresponding to the possible integer numbers of events that could be observed. Once we go to more typical multi-bin analyses with more events and systematic uncertainties, this discreteness washes out very quickly."},{"location":"part5/longexerciseanswers/#part-2-a-shape-based-analysis","title":"Part 2: A shape-based analysis","text":""},{"location":"part5/longexerciseanswers/#a-setting-up-the-datacard","title":"A: Setting up the datacard","text":"

Only tasks, no questions in this section

"},{"location":"part5/longexerciseanswers/#b-running-combine-for-a-blind-analysis","title":"B: Running combine for a blind analysis","text":"

Tasks and questions:

  • Compare the expected limits calculated with --run expected and --run blind. Why are they different?
Show answer When using --run blind combine will create a background-only Asimov dataset without performing a fit to data first. With --run expected, the observed limit isn't shown, but the background-only Asimov dataset used for the limit calculation is still created after a background-only fit to the data.
  • Calculate a blind limit by generating a background-only Asimov with the -t option instead of using the AsymptoticLimits specific options. You should find the observed limit is the same as the expected. Then see what happens if you inject a signal into the Asimov dataset using the --expectSignal [X] option.
Show answer You should see that with a signal injected the observed limit is worse (has a higher value) than the expected limit: for the expected limit the b-only Asimov dataset is still used, but the observed limit is now calculated on the signal + background Asimov dataset, with a signal at the specified cross section [X]."},{"location":"part5/longexerciseanswers/#c-using-fitdiagnostics","title":"C: Using FitDiagnostics","text":"

Tasks and questions:

  • Which parameter has the largest shift from the nominal value? Which has the tightest constraint?
Show answer CMS_eff_t_highpt should have the largest shift from the nominal value (around 0.47), norm_jetFakes has the tightest constraint (to 25% of the input uncertainty).
  • Should we be concerned when a parameter is more strongly constrained than the input uncertainty (i.e. \\(\\frac{\\sigma}{\\sigma_I}<1.0\\))?
Show answer This is still a hot topic in CMS analyses today, and there isn't a right or wrong answer. Essentially we have to judge if our analysis should really be able to provide more information about this parameter than the external measurement that gave us the input uncertainty. So we would not expect to be able to constrain the luminosity uncertainty for example, but uncertainties specific to the analysis might legitimately be constrained."},{"location":"part5/longexerciseanswers/#d-mc-statistical-uncertainties","title":"D: MC statistical uncertainties","text":"

Tasks and questions:

  • Check how much the cross section measurement and uncertainties change using FitDiagnostics.
Show answer Without autoMCStats we find: Best fit r: -2.73273 -2.13428/+3.38185, with autoMCStats: Best fit r: -3.07825 -3.17742/+3.7087
  • It is also useful to check how the expected uncertainty changes using an Asimov dataset, say with r=10 injected.
Show answer Without autoMCStats we find: Best fit r: 9.99978 -4.85341/+6.56233 , with autoMCStats: Best fit r: 9.99985 -5.24634/+6.98266
  • Advanced task: See what happens if the Poisson threshold is increased. Based on your results, what threshold would you recommend for this analysis?
Show answer At first the uncertainties increase, as the threshold increases, and at some point they stabilise. A Poisson threshold at 10 is probably reasonable for this analysis."},{"location":"part5/longexerciseanswers/#part-3-adding-control-regions","title":"Part 3: Adding control regions","text":""},{"location":"part5/longexerciseanswers/#a-use-of-rateparams","title":"A: Use of rateParams","text":"

Tasks and questions:

  • Run text2workspace.py on this combined card and then use FitDiagnostics on an Asimov dataset with r=1 to get the expected uncertainty. Suggested command line options: --rMin 0 --rMax 2
Show answer As expected uncertainty you should get -0.417238/+0.450593
  • Using the RooFitResult in the fitDiagnosticsTest.root file, check the post-fit value of the rateParams. To what level are the normalisations of the DY and ttbar processes constrained?
Show answer They are constrained to around 1-2%
  • To compare to the previous approach of fitting the SR only, with cross section and acceptance uncertainties restored, an additional card is provided: datacard_part3_nocrs.txt. Run the same fit on this card to verify the improvement of the SR+CR approach
Show answer The expected uncertainty is larger with only the SR: -0.465799/+0.502088 compared with -0.417238/+0.450593 in the SR+CR approach."},{"location":"part5/longexerciseanswers/#b-nuisance-parameter-impacts","title":"B: Nuisance parameter impacts","text":"

Tasks and questions:

  • Identify the most important uncertainties using the impacts tool.
Show answer The most important uncertainty is norm_jetFakes, followed by two MC statistical uncertainties (prop_binsignal_region_bin8 and prop_binsignal_region_bin9).
  • In the plot, some parameters do not show a plotted point for the fitted value, but rather just a numerical value - why?
Show answer These are freely floating parameters ( rate_ttbar and rate_Zll ). They have no prior constraint (and so no shift from the nominal value relative to the input uncertainty) - we show the best-fit value + uncertainty directly."},{"location":"part5/longexerciseanswers/#c-post-fit-distributions","title":"C: Post-fit distributions","text":"

Tasks and questions:

The bin errors on the TH1s in the fitDiagnostics file are determined from the systematic uncertainties. In the post-fit these take into account the additional constraints on the nuisance parameters as well as any correlations.

  • Why is the uncertainty on the post-fit so much smaller than on the pre-fit?
Show answer There are two effects at play here: the nuisance parameters get constrained, and there are anti-correlations between the parameters which also have the effect of reducing the total uncertainty. Note: the post-fit uncertainty could become larger when rateParams are present as they are not taken into account in the pre-fit uncertainty but do enter in the post-fit uncertainty."},{"location":"part5/longexerciseanswers/#d-calculating-the-significance","title":"D: Calculating the significance","text":"

Tasks and questions:

  • Advanced task It is also possible to calculate the significance using toys with HybridNew (details here) if we are in a situation where the asymptotic approximation is not reliable or if we just want to verify the result. Why might this be challenging for a high significance, say larger than \\(5\\sigma\\)?
Show answer A significance of $5\\sigma$ corresponds to a p-value of around $3\\cdot 10^{-7}$ - so we need to populate the very tail of the test statistic distribution and this requires generating a large number of toys."},{"location":"part5/longexerciseanswers/#e-signal-strength-measurement-and-uncertainty-breakdown","title":"E: Signal strength measurement and uncertainty breakdown","text":"

Tasks and questions:

  • Take our stat+syst split one step further and separate the systematic part into two: one part for hadronic tau uncertainties and one for all others. Do this by defining a tauID group in the datacard including the following parameters: CMS_eff_t, CMS_eff_t_highpt, and the three CMS_scale_t_X uncertainties.
Show datacard line You should add this line to the end of the datacard:
tauID group = CMS_eff_t CMS_eff_t_highpt CMS_scale_t_1prong0pi0_13TeV CMS_scale_t_1prong1pi0_13TeV CMS_scale_t_3prong0pi0_13TeV\n
  • To plot this and calculate the split via the relations above you can just add further arguments to the --others option in the plot1DScan.py script. Each is of the form: '[file]:[label]:[color]'. The --breakdown argument should also be extended to three terms.
Show code This can be done as:
python plot1DScan.py higgsCombine.part3E.MultiDimFit.mH200.root --others 'higgsCombine.part3E.freezeTauID.MultiDimFit.mH200.root:FreezeTauID:4' 'higgsCombine.part3E.freezeAll.MultiDimFit.mH200.root:FreezeAll:2' -o freeze_third_attempt --breakdown TauID,OtherSyst,Stat\n\n
  • How important are these tau-related uncertainties compared to the others?
Show answer They are smaller than both the statistical uncertainty and the remaining systematic uncertainties"},{"location":"part5/roofit/","title":"RooFit Basics","text":"

RooFit is an object-oriented analysis environment built on ROOT. It provides a collection of classes designed to augment ROOT for data modeling.

This section covers a few of the basics of RooFit. There are many more tutorials available at this link: https://root.cern.ch/root/html600/tutorials/roofit/index.html

"},{"location":"part5/roofit/#objects","title":"Objects","text":"

In RooFit, any variable, data point, function, PDF, etc. is represented by a C++ object. The most basic of these is the RooRealVar. We will create one to represent the mass of some hypothetical particle; we name it and give it an initial starting value and range.

RooRealVar MH(\"MH\",\"mass of the Hypothetical Boson (H-boson) in GeV\",125,120,130);\nMH.Print();\n
RooRealVar::MH = 125  L(120 - 130)\n

Ok, great. This variable is now an object we can play around with. We can access this object and modify its properties, such as its value.

MH.setVal(130);\nMH.getVal();\n

In particle detectors we typically do not observe this particle mass, but usually define some observable which is sensitive to this mass. We will assume we can detect and reconstruct the decay products of the H-boson and measure the invariant mass of those particles. We need to make another variable that represents that invariant mass.

RooRealVar mass(\"m\",\"m (GeV)\",100,80,200);\n

In a perfect world we would measure the exact mass of the particle in every single event. However, our detectors are usually far from perfect, so there will be some resolution effect. We will assume the resolution of our measurement of the invariant mass is 10 GeV and call it \"sigma\".

RooRealVar sigma(\"resolution\",\"#sigma\",10,0,20);\n

More exotic variables can be constructed out of these RooRealVars using RooFormulaVars. For example, suppose we wanted to make a function out of the variables that represented the relative resolution as a function of the hypothetical mass MH.

RooFormulaVar func(\"R\",\"@0/@1\",RooArgList(sigma,mass));\nfunc.Print(\"v\");\n
Show
--- RooAbsArg ---\n  Value State: DIRTY\n  Shape State: DIRTY\n  Attributes: \n  Address: 0x10e878068\n  Clients: \n  Servers: \n    (0x10dcd47b0,V-) RooRealVar::resolution \"#sigma\"\n    (0x10dcd4278,V-) RooRealVar::m \"m (GeV)\"\n  Proxies: \n    actualVars -> \n      1)  resolution\n      2)           m\n--- RooAbsReal ---\n\n  Plot label is \"R\"\n    --- RooFormula ---\n    Formula: \"@0/@1\"\n    (resolution,m)\n

Notice how there is a list of the variables we passed (the servers or \"actual vars\"). We can now plot the function. RooFit has a special plotting object RooPlot which keeps track of the objects (and their normalisations) that we want to draw. Since RooFit does not know which of the variables we want to treat as the dependent one (the x-axis of the plot), we need to tell it.

Right now, we have the relative resolution as \\(R(m,\\sigma)\\), whereas we want to plot \\(R(m,\\sigma(m))\\)!

TCanvas *can = new TCanvas();\n\n//make the x-axis the \"mass\"\nRooPlot *plot = mass.frame(); \nfunc.plotOn(plot);\n\nplot->Draw();\ncan->Draw();\n

The main objects we are interested in using from RooFit are probability density functions (PDFs). We can construct the PDF,

\\[ f(m|M_{H},\\sigma) \\]

as a simple Gaussian shape for example or a RooGaussian in RooFit language (think McDonald's logic, everything is a RooSomethingOrOther)

RooGaussian gauss(\"gauss\",\"f(m|M_{H},#sigma)\",mass,MH,sigma);\ngauss.Print(\"V\");\n
Show
--- RooAbsArg ---\n  Value State: DIRTY\n  Shape State: DIRTY\n  Attributes: \n  Address: 0x10ecf4188\n  Clients: \n  Servers: \n    (0x10dcd4278,V-) RooRealVar::m \"m (GeV)\"\n    (0x10a08a9d8,V-) RooRealVar::MH \"mass of the Hypothetical Boson (H-boson) in GeV\"\n    (0x10dcd47b0,V-) RooRealVar::resolution \"#sigma\"\n  Proxies: \n    x -> m\n    mean -> MH\n    sigma -> resolution\n--- RooAbsReal ---\n\n  Plot label is \"gauss\"\n--- RooAbsPdf ---\nCached value = 0\n

Notice how the Gaussian PDF, like the RooFormulaVar, depends on our RooRealVar objects; these are its servers. Its evaluation will depend on their values.

The main difference between PDFs and functions in RooFit is that PDFs are automatically normalised to unity, hence they represent a probability density; you don't need to normalise them yourself. Let's plot it as a function of \\(m\\) for a few different values of MH.

plot = mass.frame();\n\ngauss.plotOn(plot);\n\nMH.setVal(120);\ngauss.plotOn(plot,RooFit::LineColor(kBlue));\n\nMH.setVal(125);\ngauss.plotOn(plot,RooFit::LineColor(kRed));\n\nMH.setVal(135);\ngauss.plotOn(plot,RooFit::LineColor(kGreen));\n\nplot->Draw();\n\ncan->Update();\ncan->Draw();\n

Note that as we change the value of MH, the PDF gets updated at the same time.

PDFs can be used to generate Monte Carlo data. One of the benefits of RooFit is that this takes only a single line of code! As before, we have to tell RooFit which variables to generate in (i.e. which are the observables for an experiment). In this case, each of our events will be a single value of \"mass\" \\(m\\). The arguments for the function are the set of observables, followed by the number of events,

RooDataSet *gen_data = (RooDataSet*) gauss.generate(RooArgSet(mass),500); \n

Now we can plot the data as with other RooFit objects.

plot = mass.frame();\n\ngen_data->plotOn(plot);\ngauss.plotOn(plot);\ngauss.paramOn(plot);\n\nplot->Draw();\ncan->Update();\ncan->Draw();\n

Of course we are not in the business of generating MC events, but of collecting real data! Next we will look at using real data in RooFit.

"},{"location":"part5/roofit/#datasets","title":"Datasets","text":"

A dataset is essentially just a collection of points in N-dimensional (N-observables) space. There are two basic implementations in RooFit,

1) an \"unbinned\" dataset - RooDataSet

2) a \"binned\" dataset - RooDataHist

both of these use the same basic structure as below

We will create an empty dataset where the only observable is the mass. Points can be added to the dataset one by one ...

RooDataSet mydata(\"dummy\",\"My dummy dataset\",RooArgSet(mass)); \n// We've made a dataset with one observable (mass)\n\nmass.setVal(123.4);\nmydata.add(RooArgSet(mass));\nmass.setVal(145.2);\nmydata.add(RooArgSet(mass));\nmass.setVal(170.8);\nmydata.add(RooArgSet(mass));\n\nmydata.Print();\n
RooDataSet::dummy[m] = 3 entries\n

There are also other ways to manipulate datasets in this way as shown in the diagram below

Luckily there are also constructors for a RooDataSet from a TTree and for a RooDataHist from a TH1, so it's simple to convert from your usual ROOT objects.
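For example, importing a binned ROOT histogram of the observable takes a single line (a sketch; hist here is a hypothetical TH1F* that you already have, binned in the same quantity as our mass variable):

// \"hist\" is assumed to exist already (hypothetical TH1F*), binned in the observable represented by \"mass\"\nRooDataHist binned_data(\"binned_data\",\"dataset from a TH1\",RooArgList(mass),hist);\n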

We will take an example dataset put together already. The file tutorial.root can be downloaded here.

TFile *file = TFile::Open(\"tutorial.root\");\nfile->ls();\n
Show file contents
TFile**     tutorial.root\n TFile*     tutorial.root\n  KEY: RooWorkspace workspace;1 Tutorial Workspace\n  KEY: TProcessID   ProcessID0;1    48737500-e7e5-11e6-be6f-0d0011acbeef\n

Inside the file, there is something called a RooWorkspace. This is just the RooFit way of keeping a persistent link between the objects for a model. It is a very useful way to share data and PDFs/functions etc among CMS collaborators.

We will now take a look at it. It contains a RooDataSet and one variable. This time we called our variable (or observable) CMS_hgg_mass; we will take this to be the invariant mass of photon pairs, assuming our H-boson decays to photons.

RooWorkspace *wspace = (RooWorkspace*) file->Get(\"workspace\");\nwspace->Print(\"v\");\n
Show
RooWorkspace(workspace) Tutorial Workspace contents\n\nvariables\n---------\n(CMS_hgg_mass)\n\ndatasets\n--------\nRooDataSet::dataset(CMS_hgg_mass)\n

Now we will have a look at the data. The RooWorkspace has several accessor functions; we will use RooWorkspace::data. There are also RooWorkspace::var, RooWorkspace::function and RooWorkspace::pdf with (hopefully) obvious purposes.

RooDataSet *hgg_data = (RooDataSet*) wspace->data(\"dataset\");\nRooRealVar *hgg_mass = (RooRealVar*) wspace->var(\"CMS_hgg_mass\");\n\nplot = hgg_mass->frame();\n\nhgg_data->plotOn(plot,RooFit::Binning(160)); \n// Here we've picked a certain number of bins just for plotting purposes \n\nTCanvas *hggcan = new TCanvas();\nplot->Draw();\nhggcan->Update();\nhggcan->Draw();\n

"},{"location":"part5/roofit/#likelihoods-and-fitting-to-data","title":"Likelihoods and Fitting to data","text":"

The data we have in our file does not look like a Gaussian distribution. Instead, we could probably use something like an exponential to describe it.

There is an exponential PDF already in RooFit (yes, you guessed it) RooExponential. For a PDF, we only need one parameter which is the exponential slope \\(\\alpha\\) so our pdf is,

\\[ f(m|\\alpha) = \\dfrac{1}{N} e^{-\\alpha m}\\]

Where of course, \\(N = \\int_{110}^{150} e^{-\\alpha m} dm\\) is the normalisation constant.

You can find several available RooFit functions here: https://root.cern.ch/root/html/ROOFIT_ROOFIT_Index.html

There is also support for a generic PDF in the form of a RooGenericPdf, check this link: https://root.cern.ch/doc/v608/classRooGenericPdf.html

Now we will create an exponential PDF for our background,

RooRealVar alpha(\"alpha\",\"#alpha\",-0.05,-0.2,0.01);\nRooExponential expo(\"exp\",\"exponential function\",*hgg_mass,alpha);\n

We can use RooFit to estimate the value of \\(\\alpha\\) from this dataset. You will learn more about parameter estimation, but for now we will just assume you know about maximizing likelihoods. The maximum likelihood estimator is common in HEP and is known to give unbiased estimates for things like distribution means.

This also introduces the other main use of PDFs in RooFit. They can be used to construct likelihoods easily.

The likelihood \\(\\mathcal{L}\\) is defined for a particular dataset (and model) as being proportional to the probability to observe the data assuming some pdf. For our data, the probability to observe an event with a value in an interval bounded by a and b is given by,

\\[ P\\left(m \\in [a,b] \\right) = \\int_{a}^{b} f(m|\\alpha)dm \\]

As that interval shrinks we can say this probability just becomes equal to \\(f(m|\\alpha)dm\\).

The probability to observe the dataset we have is given by the product of such probabilities for each of our data points, so that

\\[\\mathcal{L}(\\alpha) \\propto \\prod_{i} f(m_{i}|\\alpha)\\]

Note that for a specific dataset, the \\(dm\\) factors which should be there are constant. They can therefore be absorbed into the constant of proportionality!

The maximum likelihood estimator for \\(\\alpha\\), usually written as \\(\\hat{\\alpha}\\), is found by maximising \\(\\mathcal{L}(\\alpha)\\).

Note that this will not depend on the value of the constant of proportionality, so we can ignore it. This is true in most scenarios because usually only the ratio of likelihoods is needed, in which case the constant factors out.

Multiplying many probability densities together can lead to very large (or very small) numbers, which can cause numerical instabilities. To avoid this, we can take the logarithm of the likelihood. It's also common to multiply this by -1 and minimize the resulting negative log-likelihood: \\(-\\log\\mathcal{L}(\\alpha)\\).
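
Written out explicitly, the quantity we will minimize for our exponential model is (up to the additive constant coming from the proportionality factor):

\\[ -\\log\\mathcal{L}(\\alpha) = -\\sum_{i=1}^{n} \\log f(m_{i}|\\alpha) \\]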

RooFit can construct the NLL for us.

RooNLLVar *nll = (RooNLLVar*) expo.createNLL(*hgg_data);\nnll->Print(\"v\");\n
Show
--- RooAbsArg ---\n  Value State: DIRTY\n  Shape State: DIRTY\n  Attributes:\n  Address: 0x7fdddbe46200\n  Clients:\n  Servers:\n    (0x11eab5638,V-) RooRealVar::alpha \"#alpha\"\n  Proxies:\n    paramSet ->\n      1)  alpha\n--- RooAbsReal ---\n\n  Plot label is \"nll_exp_dataset\"\n

Notice that the NLL object knows which RooRealVar is the parameter because it does not find that one in the dataset. This is how RooFit distinguishes between observables and parameters.

RooFit has an interface to Minuit via the RooMinimizer class, which takes the NLL as an argument. To minimize, we just call the RooMinimizer::minimize() function. Here Minuit2 is the minimization program and migrad is the minimization algorithm, which uses gradient information.

RooMinimizer minim(*nll);\nminim.minimize(\"Minuit2\",\"migrad\");  \n
Show
 **********\n **    1 **SET PRINT           1\n **********\n **********\n **    2 **SET NOGRAD\n **********\n PARAMETER DEFINITIONS:\n    NO.   NAME         VALUE      STEP SIZE      LIMITS\n     1 alpha       -5.00000e-02  2.10000e-02   -2.00000e-01  1.00000e-02\n **********\n **    3 **SET ERR         0.5\n **********\n **********\n **    4 **SET PRINT           1\n **********\n **********\n **    5 **SET STR           1\n **********\n NOW USING STRATEGY  1: TRY TO BALANCE SPEED AGAINST RELIABILITY\n **********\n **    6 **MIGRAD         500           1\n **********\n FIRST CALL TO USER FUNCTION AT NEW START POINT, WITH IFLAG=4.\n START MIGRAD MINIMIZATION.  STRATEGY  1.  CONVERGENCE WHEN EDM .LT. 1.00e-03\n FCN=3589.52 FROM MIGRAD    STATUS=INITIATE        4 CALLS           5 TOTAL\n                     EDM= unknown      STRATEGY= 1      NO ERROR MATRIX\n  EXT PARAMETER               CURRENT GUESS       STEP         FIRST\n  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE\n   1  alpha       -5.00000e-02   2.10000e-02   2.24553e-01  -9.91191e+01\n                               ERR DEF= 0.5\n MIGRAD MINIMIZATION HAS CONVERGED.\n MIGRAD WILL VERIFY CONVERGENCE AND ERROR MATRIX.\n COVARIANCE MATRIX CALCULATED SUCCESSFULLY\n FCN=3584.68 FROM MIGRAD    STATUS=CONVERGED      18 CALLS          19 TOTAL\n                     EDM=1.4449e-08    STRATEGY= 1      ERROR MATRIX ACCURATE\n  EXT PARAMETER                                   STEP         FIRST\n  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE\n   1  alpha       -4.08262e-02   2.91959e-03   1.33905e-03  -3.70254e-03\n                               ERR DEF= 0.5\n EXTERNAL ERROR MATRIX.    NDIM=  25    NPAR=  1    ERR DEF=0.5\n  8.527e-06\n

RooFit has found the best fit value of alpha for this dataset. It also estimates an uncertainty on alpha using the Hessian matrix from the fit.

alpha.Print(\"v\");\n
--- RooAbsArg ---\n  Value State: clean\n  Shape State: clean\n  Attributes:\n  Address: 0x11eab5638\n  Clients:\n    (0x11eab5978,V-) RooExponential::exp \"exponential function\"\n    (0x7fdddbe46200,V-) RooNLLVar::nll_exp_dataset \"-log(likelihood)\"\n    (0x7fdddbe95600,V-) RooExponential::exp \"exponential function\"\n    (0x7fdddbe5a400,V-) RooRealIntegral::exp_Int[CMS_hgg_mass] \"Integral of exponential function\"\n  Servers:\n  Proxies:\n--- RooAbsReal ---\n\n  Plot label is \"alpha\"\n--- RooAbsRealLValue ---\n  Fit range is [ -0.2 , 0.01 ]\n--- RooRealVar ---\n  Error = 0.00291959\n

We will plot the resulting exponential on top of the data. Notice that the value of \\(\\hat{\\alpha}\\) is used for the exponential.

expo.plotOn(plot);\nexpo.paramOn(plot);\nplot->Draw();\nhggcan->Update();\nhggcan->Draw();\n

It looks like there could be a small region near 125 GeV for which our fit does not quite go through the points. Maybe our hypothetical H-boson is not so hypothetical after all!

We will now see what happens if we include some resonant signal into the fit. We can take our Gaussian function again and use that as a signal model. A reasonable value for the resolution of a resonant signal with a mass around 125 GeV decaying to a pair of photons is around a GeV.

sigma.setVal(1.);\nsigma.setConstant();\n\nMH.setVal(125);\nMH.setConstant();\n\nRooGaussian hgg_signal(\"signal\",\"Gaussian PDF\",*hgg_mass,MH,sigma);\n

By setting these parameters constant, RooFit knows (either when creating the NLL by hand or when using fitTo) that there is no need to fit for these parameters.

We need to add this to our exponential model and fit a \"Signal+Background model\" by creating a RooAddPdf. In RooFit there are two ways to add PDFs: recursively, where the fraction of yields for the signal and background is a parameter, or absolutely, where each PDF has its own normalization. We are going to use the second one.

RooRealVar norm_s(\"norm_s\",\"N_{s}\",10,100);\nRooRealVar norm_b(\"norm_b\",\"N_{b}\",0,1000);\n\nconst RooArgList components(hgg_signal,expo);\nconst RooArgList coeffs(norm_s,norm_b);\n\nRooAddPdf model(\"model\",\"f_{s+b}\",components,coeffs);\nmodel.Print(\"v\");\n
Show
--- RooAbsArg ---\n  Value State: DIRTY\n  Shape State: DIRTY\n  Attributes: \n  Address: 0x11ed5d7a8\n  Clients: \n  Servers: \n    (0x11ed5a0f0,V-) RooGaussian::signal \"Gaussian PDF\"\n    (0x11ed5d058,V-) RooRealVar::norm_s \"N_{s}\"\n    (0x11eab5978,V-) RooExponential::exp \"exponential function\"\n    (0x11ed5d398,V-) RooRealVar::norm_b \"N_{b}\"\n  Proxies: \n    !refCoefNorm -> \n    !pdfs -> \n      1)  signal\n      2)     exp\n    !coefficients -> \n      1)  norm_s\n      2)  norm_b\n--- RooAbsReal ---\n\n  Plot label is \"model\"\n--- RooAbsPdf ---\nCached value = 0\n

Ok, now we will fit the model. Note this time we add the option Extended(), which tells RooFit that we care about the overall number of observed events in the data \\(n\\) too. It will add an additional Poisson term in the likelihood to account for this so our likelihood this time looks like,

\\[L_{s+b}(N_{s},N_{b},\\alpha) = \\dfrac{ (N_{s}+N_{b})^{n} e^{-(N_{s}+N_{b})} }{n!} \\cdot \\prod_{i}^{n} \\left[ c f_{s}(m_{i}|M_{H},\\sigma)+ (1-c)f_{b}(m_{i}|\\alpha) \\right] \\]

where \\(c = \\dfrac{ N_{s} }{ N_{s} + N_{b} }\\), \\(f_{s}(m|M_{H},\\sigma)\\) is the Gaussian signal pdf and \\(f_{b}(m|\\alpha)\\) is the exponential pdf. Remember that \\(M_{H}\\) and \\(\\sigma\\) are fixed so that they are no longer parameters of the likelihood.

There is a simpler interface for maximum-likelihood fits: the RooAbsPdf::fitTo method. With this method, RooFit will construct the negative log-likelihood function from the PDF and minimize it with respect to all of the free parameters in one step.

model.fitTo(*hgg_data,RooFit::Extended());\n\nmodel.plotOn(plot,RooFit::Components(\"exp\"),RooFit::LineColor(kGreen));\nmodel.plotOn(plot,RooFit::LineColor(kRed));\nmodel.paramOn(plot);\n\nhggcan->Clear();\nplot->Draw();\nhggcan->Update();\nhggcan->Draw();\n

What if we also fit for the mass (\\(M_{H}\\))? We can easily do this by removing the constant setting on MH.

MH.setConstant(false);\nmodel.fitTo(*hgg_data,RooFit::Extended());\n
Show output
[#1] INFO:Minization -- RooMinimizer::optimizeConst: activating const optimization\n[#1] INFO:Minization --  The following expressions will be evaluated in cache-and-track mode: (signal,exp)\n **********\n **    1 **SET PRINT           1\n **********\n **********\n **    2 **SET NOGRAD\n **********\n PARAMETER DEFINITIONS:\n    NO.   NAME         VALUE      STEP SIZE      LIMITS\n     1 MH           1.25000e+02  1.00000e+00    1.20000e+02  1.30000e+02\n     2 alpha       -4.08793e-02  2.96856e-03   -2.00000e-01  1.00000e-02\n     3 norm_b       9.67647e+02  3.25747e+01    0.00000e+00  1.00000e+03\n MINUIT WARNING IN PARAMETR\n ============== VARIABLE3 BROUGHT BACK INSIDE LIMITS.\n     4 norm_s       3.22534e+01  1.16433e+01    1.00000e+01  1.00000e+02\n **********\n **    3 **SET ERR         0.5\n **********\n **********\n **    4 **SET PRINT           1\n **********\n **********\n **    5 **SET STR           1\n **********\n NOW USING STRATEGY  1: TRY TO BALANCE SPEED AGAINST RELIABILITY\n **********\n **    6 **MIGRAD        2000           1\n **********\n FIRST CALL TO USER FUNCTION AT NEW START POINT, WITH IFLAG=4.\n START MIGRAD MINIMIZATION.  STRATEGY  1.  CONVERGENCE WHEN EDM .LT. 1.00e-03\n FCN=-2327.53 FROM MIGRAD    STATUS=INITIATE       10 CALLS          11 TOTAL\n                     EDM= unknown      STRATEGY= 1      NO ERROR MATRIX       \n  EXT PARAMETER               CURRENT GUESS       STEP         FIRST   \n  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE \n   1  MH           1.25000e+02   1.00000e+00   2.01358e-01   1.12769e+01\n   2  alpha       -4.08793e-02   2.96856e-03   3.30048e-02  -1.22651e-01\n   3  norm_b       9.67647e+02   3.25747e+01   2.56674e-01  -1.96463e-02\n   4  norm_s       3.22534e+01   1.16433e+01   3.10258e-01  -8.97036e-04\n                               ERR DEF= 0.5\n MIGRAD MINIMIZATION HAS CONVERGED.\n MIGRAD WILL VERIFY CONVERGENCE AND ERROR MATRIX.\n COVARIANCE MATRIX CALCULATED SUCCESSFULLY\n FCN=-2327.96 FROM MIGRAD    STATUS=CONVERGED      65 CALLS          66 TOTAL\n                     EDM=1.19174e-05    STRATEGY= 1      ERROR MATRIX ACCURATE \n  EXT PARAMETER                                   STEP         FIRST   \n  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE \n   1  MH           1.24628e+02   3.98153e-01   2.66539e-03   2.46327e-02\n   2  alpha       -4.07708e-02   2.97195e-03   1.10093e-03   8.33780e-02\n   3  norm_b       9.66105e+02   3.25772e+01   5.96627e-03   1.83523e-03\n   4  norm_s       3.39026e+01   1.17380e+01   9.60816e-03  -2.32681e-03\n                               ERR DEF= 0.5\n EXTERNAL ERROR MATRIX.    NDIM=  25    NPAR=  4    ERR DEF=0.5\n  1.589e-01 -3.890e-05  1.462e-01 -1.477e-01 \n -3.890e-05  8.836e-06 -2.020e-04  2.038e-04 \n  1.462e-01 -2.020e-04  1.073e+03 -1.072e+02 \n -1.477e-01  2.038e-04 -1.072e+02  1.420e+02 \n PARAMETER  CORRELATION COEFFICIENTS  \n       NO.  
GLOBAL      1      2      3      4\n        1  0.04518   1.000 -0.033  0.011 -0.031\n        2  0.03317  -0.033  1.000 -0.002  0.006\n        3  0.27465   0.011 -0.002  1.000 -0.275\n        4  0.27610  -0.031  0.006 -0.275  1.000\n **********\n **    7 **SET ERR         0.5\n **********\n **********\n **    8 **SET PRINT           1\n **********\n **********\n **    9 **HESSE        2000\n **********\n COVARIANCE MATRIX CALCULATED SUCCESSFULLY\n FCN=-2327.96 FROM HESSE     STATUS=OK             23 CALLS          89 TOTAL\n                     EDM=1.19078e-05    STRATEGY= 1      ERROR MATRIX ACCURATE \n  EXT PARAMETER                                INTERNAL      INTERNAL  \n  NO.   NAME      VALUE            ERROR       STEP SIZE       VALUE   \n   1  MH           1.24628e+02   3.98106e-01   5.33077e-04  -7.45154e-02\n   2  alpha       -4.07708e-02   2.97195e-03   2.20186e-04   5.42722e-01\n   3  norm_b       9.66105e+02   3.26003e+01   2.38651e-04   1.20047e+00\n   4  norm_s       3.39026e+01   1.17445e+01   3.84326e-04  -4.87967e-01\n                               ERR DEF= 0.5\n EXTERNAL ERROR MATRIX.    NDIM=  25    NPAR=  4    ERR DEF=0.5\n  1.588e-01 -3.888e-05  1.304e-01 -1.304e-01 \n -3.888e-05  8.836e-06 -1.954e-04  1.954e-04 \n  1.304e-01 -1.954e-04  1.074e+03 -1.082e+02 \n -1.304e-01  1.954e-04 -1.082e+02  1.421e+02 \n PARAMETER  CORRELATION COEFFICIENTS  \n       NO.  GLOBAL      1      2      3      4\n        1  0.04274   1.000 -0.033  0.010 -0.027\n        2  0.03314  -0.033  1.000 -0.002  0.006\n        3  0.27694   0.010 -0.002  1.000 -0.277\n        4  0.27806  -0.027  0.006 -0.277  1.000\n[#1] INFO:Minization -- RooMinimizer::optimizeConst: deactivating const optimization\n

Notice that the fitted value of MH is not exactly 125, and that MH is now included in the list of fitted parameters. We can get more information about the fit, via the RooFitResult, using the option Save().

RooFitResult *fit_res = (RooFitResult*) model.fitTo(*hgg_data,RooFit::Extended(),RooFit::Save());\n

For example, we can get the correlation matrix from the fit result. Note that the order of the parameters is the same as in the \"Floating Parameter\" list above.

TMatrixDSym cormat = fit_res->correlationMatrix();\ncormat.Print();\n
4x4 matrix is as follows\n\n     |      0    |      1    |      2    |      3    |\n---------------------------------------------------------\n   0 |          1    -0.03282    0.009538    -0.02623 \n   1 |   -0.03282           1   -0.001978    0.005439 \n   2 |   0.009538   -0.001978           1     -0.2769 \n   3 |   -0.02623    0.005439     -0.2769           1 \n

A nice feature of RooFit is that once we have a PDF, data and results like this, we can import this new model into our RooWorkspace and show off our new discovery to our LHC friends (if we weren't already too late!). We can also save the \"state\" of our parameters for later, by creating a snapshot of the current values.

wspace->import(model);  \nRooArgSet *params = model.getParameters(*hgg_data);\nwspace->saveSnapshot(\"nominal_values\",*params);\n\nwspace->Print(\"V\");\n
Show output
RooWorkspace(workspace) Tutorial Workspace contents\n\nvariables\n---------\n(CMS_hgg_mass,MH,alpha,norm_b,norm_s,resolution)\n\np.d.f.s\n-------\nRooExponential::exp[ x=CMS_hgg_mass c=alpha ] = 0.00248636\nRooAddPdf::model[ norm_s * signal + norm_b * exp ] = 0.00240205\nRooGaussian::signal[ x=CMS_hgg_mass mean=MH sigma=resolution ] = 5.34013e-110\n\ndatasets\n--------\nRooDataSet::dataset(CMS_hgg_mass)\n\nparameter snapshots\n-------------------\nnominal_values = (MH=124.627 +/- 0.398094,resolution=1[C],norm_s=33.9097 +/- 11.7445,alpha=-0.040779 +/- 0.00297195,norm_b=966.109 +/- 32.6025)\n

This is exactly what needs to be done when you want to use shape based datacards in Combine with parametric models.

"},{"location":"part5/roofit/#a-likelihood-for-a-counting-experiment","title":"A likelihood for a counting experiment","text":"

An introductory presentation about likelihoods and interval estimation is available here.

**Note: We will use python syntax in this section; you should use a .py script. Make sure to do import ROOT at the top of your script**

We have seen how to create variables and PDFs, and how to fit a PDF to data. But what if we have a counting experiment, or a histogram template shape? And what about systematic uncertainties? We are going to build a likelihood for this:

\\(\\mathcal{L} \\propto p(\\text{data}|\\text{parameters})\\)

where our parameters are parameters of interest, \\(\\mu\\), and nuisance parameters, \\(\\theta\\). The nuisance parameters are constrained by external measurements, so we add constraint terms \\(\\pi(\\vec{\\theta}_0|\\vec{\\theta})\\).

So we have \\(\\mathcal{L} \\propto p(\\text{data}|\\mu,\\vec{\\theta})\\cdot \\pi(\\vec{\\theta}_0|\\vec{\\theta})\\)

Now we will try to build the likelihood by hand for a 1-bin counting experiment. The data is the number of observed events \\(N\\), and the probability is just a Poisson probability \\(p(N|\\lambda) = \\frac{\\lambda^N e^{-\\lambda}}{N!}\\), where \\(\\lambda\\) is the number of events expected in our signal+background model: \\(\\lambda = \\mu\\cdot s(\\vec{\\theta}) + b(\\vec{\\theta})\\).

In this expression, s and b are the numbers of expected signal and background events, which both depend on the nuisance parameters. We will start by building a simple likelihood function with one signal process and one background process. We will assume there are no nuisance parameters for now. The number of observed events in data is 15, the expected number of signal events is 5 and the expected number of background events is 8.1.
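
Putting in these numbers, the likelihood we are about to build is just

\\[ \\mathcal{L}(\\mu) = \\dfrac{(\\mu s + b)^{N} e^{-(\\mu s + b)}}{N!} = \\dfrac{(5\\mu + 8.1)^{15}\\, e^{-(5\\mu + 8.1)}}{15!} \\]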

It is easiest to use the RooFit workspace factory to build our model (this tutorial has more information on the factory syntax).

import ROOT\nw = ROOT.RooWorkspace(\"w\")\n

We need to create an expression for the number of events in our model, \\(\\mu s +b\\):

w.factory('expr::n(\"mu*s +b\", mu[1.0,0,4], s[5],b[8.1])')\n

Now we can build the likelihood, which is just our Poisson PDF:

w.factory('Poisson::poisN(N[15],n)')\n

To find the best fit value for our parameter of interest \\(\\mu\\) we need to maximize the likelihood. In practice it is actually easier to minimize the Negative log of the likelihood, or NLL:

w.factory('expr::NLL(\"-log(@0)\",poisN)')\n

We can now use the RooMinimizer to find the minimum of the NLL

nll = w.function(\"NLL\")\nminim = ROOT.RooMinimizer(nll)\nminim.setErrorLevel(0.5)\nminim.minimize(\"Minuit2\",\"migrad\")\nbestfitnll = nll.getVal()\n

Notice that we need to set the error level to 0.5 to get the uncertainties (relying on Wilks' theorem!). There is a more reliable way of extracting the confidence interval (explicitly, rather than relying on migrad), which we will discuss a bit later in this section.
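
As a quick check, the best-fit value and the symmetric (parabolic) uncertainty can be read back from the workspace variable after minimizing; here is a minimal sketch continuing from the code above:

# After minim.minimize(...), the best-fit value and the parabolic error\n# (corresponding to the error level of 0.5) are propagated back to the variable\nmu = w.var(\"mu\")\nprint(\"Best-fit mu = %.3f +/- %.3f\"%(mu.getVal(), mu.getError()))\n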

Now we will add a nuisance parameter, lumi, which represents the luminosity uncertainty. It has a 2.5% effect on both the signal and the background. The parameter will be log-normally distributed: when it's 0, the normalization of the signal and background are not modified; at \\(+1\\sigma\\) the signal and background normalizations will be multiplied by 1.025 and at \\(-1\\sigma\\) they will be divided by 1.025. We should modify the expression for the number of events in our model:

w.factory('expr::n(\"mu*s*pow(1.025,lumi) +b*pow(1.025,lumi)\", mu[1.0,0,4], s[5],b[8.1],lumi[0,-4,4])')\n

And we add a unit Gaussian constraint:

w.factory('Gaussian::lumiconstr(lumi,0,1)')\n

Our full likelihood will now be

w.factory('PROD::likelihood(poisN,lumiconstr)')\n

and the NLL

w.factory('expr::NLL(\"-log(@0)\",likelihood)')\n

Which we can minimize in the same way as before.

Now we will extend our model a bit.

  • Expanding on what was demonstrated above, build the likelihood for \\(N=15\\), a signal process s with expectation 5 events, a background ztt with expectation 3.7 events and a background tt with expectation 4.4 events. The luminosity uncertainty applies to all three processes. The signal process is further subject to a 5% log-normally distributed uncertainty sigth, tt is subject to a 6% log-normally distributed uncertainty ttxs, and ztt is subject to a 4% log-normally distributed uncertainty zttxs. Find the best-fit value and the associated uncertainty. (One possible set of factory lines for the extended model is sketched after the example code below.)
  • Also perform an explicit scan of the \\(\\Delta\\) NLL ( = log of profile likelihood ratio) and make a graph of the scan. Some example code can be found below to get you started. Hint: you'll need to perform fits for different values of mu, where mu is fixed. In RooFit you can set a variable to be constant as var(\"VARNAME\").setConstant(True)
  • From the curve that you have created by performing an explicit scan, we can extract the 68% CL interval. You can do so by eye or by writing some code to find the relevant intersections of the curve.
gr = ROOT.TGraph()\n\nnpoints = 0\nfor i in range(0,60):\n  mu=0.05*i\n  ...\n  [perform fits for different values of mu with mu fixed]\n  ...\n  deltanll = ...\n  gr.SetPoint(npoints,mu,deltanll)\n  npoints+=1\n\n\ncanv = ROOT.TCanvas()\ngr.Draw(\"ALP\")\ncanv.SaveAs(\"likelihoodscan.pdf\")\n
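
If you get stuck setting up the extended model itself, here is one possible sketch of the factory lines, directly extending the single-nuisance example above. The constraint object names (e.g. sigthconstr) are just illustrative choices:

# One possible starting point for the extended counting model\n# (constraint object names are illustrative choices)\nimport ROOT\nw = ROOT.RooWorkspace(\"w\")\nw.factory('expr::n(\"mu*s*pow(1.025,lumi)*pow(1.05,sigth) + ztt*pow(1.025,lumi)*pow(1.04,zttxs) + tt*pow(1.025,lumi)*pow(1.06,ttxs)\", mu[1.0,0,4], s[5], ztt[3.7], tt[4.4], lumi[0,-4,4], sigth[0,-4,4], zttxs[0,-4,4], ttxs[0,-4,4])')\nw.factory('Poisson::poisN(N[15],n)')\nw.factory('Gaussian::lumiconstr(lumi,0,1)')\nw.factory('Gaussian::sigthconstr(sigth,0,1)')\nw.factory('Gaussian::zttxsconstr(zttxs,0,1)')\nw.factory('Gaussian::ttxsconstr(ttxs,0,1)')\nw.factory('PROD::likelihood(poisN,lumiconstr,sigthconstr,zttxsconstr,ttxsconstr)')\nw.factory('expr::NLL(\"-log(@0)\",likelihood)')\n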

Well, this is doable - but we were only looking at a simple one-bin counting experiment. This might become rather cumbersome for large models... \\([*]\\)

For the next set of tutorials, we will switch to working with Combine, which will help with building the statistical model and performing the statistical analysis, instead of building the likelihood with RooFit ourselves.

Info

RooFit does have additional functionality to help with statistical model building, but we will not go into detail in these tutorials.

"},{"location":"tutorial2023/parametric_exercise/","title":"Parametric Models in Combine","text":""},{"location":"tutorial2023/parametric_exercise/#getting-started","title":"Getting started","text":"

By now you should have a working setup of Combine v9 from the pre-tutorial exercise. If so, move on to cloning the parametric fitting exercise gitlab repo below. If not, you need to set up a CMSSW area and check out the combine package:

cmssw-el7\ncmsrel CMSSW_11_3_4\ncd CMSSW_11_3_4/src\ncmsenv\ngit clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit\ncd HiggsAnalysis/CombinedLimit\n\ncd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit\ngit fetch origin\ngit checkout v9.1.0\n

We will also make use of another package, CombineHarvester, which contains some high-level tools for working with combine. The following command will download the repository and checkout just the parts of it we need for this exercise:

cd $CMSSW_BASE/src/\nbash <(curl -s https://raw.githubusercontent.com/cms-analysis/CombineHarvester/main/CombineTools/scripts/sparse-checkout-https.sh)\n

Now let's compile the CMSSW area:

scramv1 b clean; scramv1 b\ncmsenv\n

Finally, let's move to the working directory for this tutorial which contains all of the inputs and scripts needed to run the parametric fitting exercise:

cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/tutorials/parametric_exercise\n
"},{"location":"tutorial2023/parametric_exercise/#session-structure","title":"Session structure","text":"

The exercise is split into six parts which cover:

1) Parametric model building

2) Simple fits

3) Systematic uncertainties

4) Toy generation

5) Discrete profiling

6) Multi-signal hypothesis

Throughout the tutorial there are a number of questions and exercises for you to complete. These are shown by the bullet points in this markdown file.

All the code required to run the different parts is available in python scripts. We have purposely commented out the code to encourage you to open the scripts and take a look at what is inside. Each block is separated by a divider and a blank line. When you are happy that you understand the code, you can uncomment it (block by block) and then run the scripts (using python3) e.g.:

python3 construct_models_part1.py\n

A number of scripts will produce plots (as .png files). By default these plots are stored in the current working directory. You can change this (e.g. to point to an EOS web area) by changing the plot_dir variable in the config.py script.
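
For example, to write the plots somewhere web-accessible you could edit the variable in config.py along the following lines (the path shown is only a placeholder):

# In config.py: output directory for plots (placeholder path - use your own area)\nplot_dir = \"/eos/user/u/username/www/parametric_exercise\"\n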

There's also a set of combine (.txt) datacards which will help you get through the various parts of the exercise. The exercises should help you become familiar with the structure of parametric fitting datacards.

Finally, this exercise is heavily based on the RooFit package. So if you find yourself using the python interpreter for any checks, don't forget to...

import ROOT\n
"},{"location":"tutorial2023/parametric_exercise/#jupyter-notebooks","title":"Jupyter notebooks","text":"

Alternatively, we have provided Jupyter notebooks to run the different parts of the exercise e.g. part1.ipynb. You will have already downloaded these notebooks when cloning the tutorial gitlab repo. To open Jupyter notebooks on lxplus within a CMSSW environment, you can add the following option when you ssh into lxplus:

ssh -X -Y username@lxplus.cern.ch -L8xxx:localhost:8xxx\n

where you should replace xxx with some three digit number. Then cd into the combinetutorial-2023-parametric directory and set up the CMSSW environment with:

cmsenv\n

You can then open the Jupyter notebook inside the environment with:

jupyter notebook --no-browser --port 8xxx\n

replacing xxx with the same three digit number. You should now be able to copy the url it provides into a browser and access the various exercise notebooks.

"},{"location":"tutorial2023/parametric_exercise/#analysis-overview","title":"Analysis overview","text":"

In this exercise we will look at one of the most famous parametric fitting analyses at the LHC: the Higgs boson decaying to two photons (H \\(\\rightarrow \\gamma\\gamma\\)). This decay channel is key in understanding the properties of the Higgs boson due to its clean final state topology. The excellent energy resolution of the CMS electromagnetic calorimeter leads to a narrow signal peak in the diphoton invariant mass spectrum, \\(m_{\\gamma\\gamma}\\), above a smoothly falling background continuum. The mass spectrum for the legacy Run 2 analysis is shown below.

In the analysis, we construct parametric models (analytic functions) of both signal and background events to fit the \\(m_{\\gamma\\gamma}\\) spectrum in data. From the fit we can extract measurements of Higgs boson properties including its rate of production, its mass (\\(m_H\\)), its coupling behaviour, to name a few. This exercise will show how to construct parametric models using RooFit, and subsequently how to use combine to extract the results.

"},{"location":"tutorial2023/parametric_exercise/#part-1-parametric-model-building","title":"Part 1: Parametric model building","text":"

As with any fitting exercise, the first step is to understand the format of the input data, explore its contents and construct a model. The python script which performs the model construction is construct_models_part1.py. This section will explain what the various lines of code are doing.

"},{"location":"tutorial2023/parametric_exercise/#signal-modelling","title":"Signal modelling","text":"

Firstly, we will construct a model to fit the signal (H \\(\\rightarrow\\gamma\\gamma\\)) mass peak using a Monte Carlo simulation sample of gluon-gluon fusion production (ggH) events with \\(m_H=125\\) GeV. This production mode has the largest cross section in the SM, and the LO Feynman diagram is shown below.

There has already been a dedicated selection performed on the events to increase the signal-to-background ratio (e.g. using some ML event classifier). Events passing this selection enter the analysis category, Tag0. Events entering Tag0 are used for the parametric fitting of the \\(m_{\\gamma\\gamma}\\) spectrum.

The events are stored in a ROOT TTree, where the diphoton mass, CMS_hgg_mass, and the event weight, weight, are saved. Let's begin by loading the MC and converting the TTree data into a RooDataSet:

import ROOT\nROOT.gROOT.SetBatch(True)\n\nf = ROOT.TFile(\"mc_part1.root\",\"r\")\n# Load TTree\nt = f.Get(\"ggH_Tag0\")\n\n# Define mass and weight variables\nmass = ROOT.RooRealVar(\"CMS_hgg_mass\", \"CMS_hgg_mass\", 125, 100, 180)\nweight = ROOT.RooRealVar(\"weight\",\"weight\",0,0,1)\n\n# Convert to RooDataSet\nmc = ROOT.RooDataSet(\"ggH_Tag0\",\"ggH_Tag0\", t, ROOT.RooArgSet(mass,weight), \"\", \"weight\" )\n\n# Lets plot the signal mass distribution\ncan = ROOT.TCanvas()\nplot = mass.frame()\nmc.plotOn(plot)\nplot.Draw()\ncan.Update()\ncan.SaveAs(\"part1_signal_mass.png\")\n

The plot shows a peak centred on the Higgs mass at 125 GeV. Let's use a simple Gaussian to model the peak.

# Introduce a RooRealVar into the workspace for the Higgs mass\nMH = ROOT.RooRealVar(\"MH\", \"MH\", 125, 120, 130 )\nMH.setConstant(True)\n\n# Signal peak width\nsigma = ROOT.RooRealVar(\"sigma_ggH_Tag0\", \"sigma_ggH_Tag0\", 2, 1, 5)\n\n# Define the Gaussian with mean=MH and width=sigma\nmodel = ROOT.RooGaussian( \"model_ggH_Tag0\", \"model_ggH_Tag0\", mass, MH, sigma ) \n\n# Fit Gaussian to MC events and plot\nmodel.fitTo(mc,ROOT.RooFit.SumW2Error(True))\n\ncan = ROOT.TCanvas()\nplot = mass.frame()\nmc.plotOn(plot)\nmodel.plotOn( plot, ROOT.RooFit.LineColor(2) )\nplot.Draw()\ncan.Update()\ncan.Draw()\ncan.SaveAs(\"part1_signal_model_v0.png\")\n

It looks like a good fit!

  • Do you understand the output from the fitTo command (i.e. the minimization)? From now on we will add the option ROOT.RooFit.PrintLevel(-1) when fitting the models to suppress the minimizer output.

But what if the mean of the model does not correspond directly to the Higgs boson mass, i.e. there are some reconstruction effects? Let's instead define the mean of the model as:

\\[\\mu = m_H + \\delta\\]

and we can fit for \\(\\delta\\) in the model construction. For this we introduce a RooFormulaVar.

dMH = ROOT.RooRealVar(\"dMH_ggH_Tag0\", \"dMH_ggH_Tag0\", 0, -1, 1 )\nmean = ROOT.RooFormulaVar(\"mean_ggH_Tag0\", \"mean_ggH_Tag0\", \"(@0+@1)\", ROOT.RooArgList(MH,dMH))\nmodel = ROOT.RooGaussian( \"model_ggH_Tag0\", \"model_ggH_Tag0\", mass, mean, sigma )\n\n# Fit the new model with a variable mean\nmodel.fitTo(mc,ROOT.RooFit.SumW2Error(True),ROOT.RooFit.PrintLevel(-1))\n\n# Model is parametric in MH. Let's show this by plotting for different values of MH\ncan = ROOT.TCanvas()\nplot = mass.frame()\nMH.setVal(120)\nmodel.plotOn( plot, ROOT.RooFit.LineColor(2) )\nMH.setVal(125)\nmodel.plotOn( plot, ROOT.RooFit.LineColor(3) )\nMH.setVal(130)\nmodel.plotOn( plot, ROOT.RooFit.LineColor(4) )\nplot.Draw()\ncan.Update()\ncan.SaveAs(\"part1_signal_model_v1.png\")\n

Let's now save the model inside a RooWorkspace. Combine will load this model when performing the fits. Crucially, we need to freeze the fit parameters of the signal model, otherwise they will be freely floating in the final results extraction.

  • This choice of setting the shape parameters to constant means we believe our MC will perfectly model the Higgs boson events in data. Is this the case? How could we account for the MC mis-modelling in the fit? (See part 3).
MH.setVal(125)\ndMH.setConstant(True)\nsigma.setConstant(True)\n\nf_out = ROOT.TFile(\"workspace_sig.root\", \"RECREATE\")\nw_sig = ROOT.RooWorkspace(\"workspace_sig\",\"workspace_sig\")\ngetattr(w_sig, \"import\")(model)\nw_sig.Print()\nw_sig.Write()\nf_out.Close()\n

We have successfully constructed a parametric model to fit the shape of the signal peak. But we also need to know the yield/normalisation of the ggH signal process. In the SM, the ggH event yield in Tag0 is equal to:

\\[ N = \\sigma_{ggH} \\cdot \\mathcal{B}^{\\gamma\\gamma} \\cdot \\epsilon \\cdot \\mathcal{L}\\]

Where \\(\\sigma_{ggH}\\) is the SM ggH cross section, \\(\\mathcal{B}^{\\gamma\\gamma}\\) is the SM branching fraction of the Higgs boson to two photons, \\(\\epsilon\\) is the efficiency factor and corresponds to the fraction of the total ggH events landing in the Tag0 analysis category. Finally \\(\\mathcal{L}\\) is the integrated luminosity.

In this example, the ggH MC events are normalised before any selection is performed to \\(\\sigma_{ggH} \\cdot \\mathcal{B}^{\\gamma\\gamma}\\), taking the values from the LHCHWG twiki. Note this does not include the lumi scaling, which may be different to what you have in your own analyses! We can then calculate the efficiency factor, \\(\\epsilon\\), by taking the sum of weights in the MC dataset and dividing through by \\(\\sigma_{ggH} \\cdot \\mathcal{B}^{\\gamma\\gamma}\\). This will tell us what fraction of ggH events land in Tag0.

# Define SM cross section and branching fraction values\nxs_ggH = 48.58 #in [pb]\nbr_gamgam = 2.7e-3\n\n# Calculate the efficiency and print output\nsumw = mc.sumEntries()\neff = sumw/(xs_ggH*br_gamgam)\nprint(\"Efficiency of ggH events landing in Tag0 is: %.2f%%\"%(eff*100))\n\n# Calculate the total yield (assuming full Run 2 lumi) and print output\nlumi = 138000\nN = xs_ggH*br_gamgam*eff*lumi\nprint(\"For 138fb^-1, total normalisation of signal is: N = xs * br * eff * lumi = %.2f events\"%N)\n

Gives the output:

Efficiency of ggH events landing in Tag0 is: 1.00%\nFor 138fb^-1, total normalisation of signal is: N = xs * br * eff * lumi = 181.01 events\n

So we find that 1% of all ggH events enter Tag0, and the total expected yield of ggH events in Tag0 (with lumi scaling) is 181.01. Let's make a note of this for later!

"},{"location":"tutorial2023/parametric_exercise/#background-modelling","title":"Background modelling","text":"

In the H \\(\\rightarrow\\gamma\\gamma\\) analysis we construct the background model directly from data. To avoid biasing our background estimate, we remove the signal region from the model construction and fit the mass sidebands. Let's begin by loading the data TTree and converting to a RooDataSet. We will then plot the mass sidebands.

f = ROOT.TFile(\"data_part1.root\",\"r\")\nt = f.Get(\"data_Tag0\")\n\n# Convert TTree to a RooDataSet\ndata = ROOT.RooDataSet(\"data_Tag0\", \"data_Tag0\", t, ROOT.RooArgSet(mass), \"\", \"weight\")\n\n# Define mass sideband ranges on the mass variable: 100-115 and 135-180\nn_bins = 80\nbinning = ROOT.RooFit.Binning(n_bins,100,180)\nmass.setRange(\"loSB\", 100, 115 )\nmass.setRange(\"hiSB\", 135, 180 )\nmass.setRange(\"full\", 100, 180 )\nfit_range = \"loSB,hiSB\"\n\n# Plot the data in the mass sidebands\ncan = ROOT.TCanvas()\nplot = mass.frame()\ndata.plotOn( plot, ROOT.RooFit.CutRange(fit_range), binning )\nplot.Draw()\ncan.Update()\ncan.Draw()\ncan.SaveAs(\"part1_data_sidebands.png\")\n

By eye, it looks like an exponential function would fit the data sidebands well. Let's construct the background model using a RooExponential and fit the data sidebands:

alpha = ROOT.RooRealVar(\"alpha\", \"alpha\", -0.05, -0.2, 0 )\nmodel_bkg = ROOT.RooExponential(\"model_bkg_Tag0\", \"model_bkg_Tag0\", mass, alpha )\n\n# Fit model to data sidebands\nmodel_bkg.fitTo( data, ROOT.RooFit.Range(fit_range),  ROOT.RooFit.PrintLevel(-1))\n\n# Let's plot the model fit to the data\ncan = ROOT.TCanvas()\nplot = mass.frame()\n# We have to be careful with the normalisation as we only fit over sidebands\n# First do an invisible plot of the full data set\ndata.plotOn( plot, binning, ROOT.RooFit.MarkerColor(0), ROOT.RooFit.LineColor(0) )\nmodel_bkg.plotOn( plot, ROOT.RooFit.NormRange(fit_range), ROOT.RooFit.Range(\"full\"), ROOT.RooFit.LineColor(2))\ndata.plotOn( plot, ROOT.RooFit.CutRange(fit_range), binning )\nplot.Draw()\ncan.Update()\ncan.Draw()\ncan.SaveAs(\"part1_bkg_model.png\")\n

As the background model is extracted from data, we want to introduce a freely floating normalisation term. We use the total number of data events (including in the signal region) as the initial prefit value of this normalisation object, i.e. assuming no signal in the data. The syntax to name this normalisation object is {model}_norm, which will then be picked up automatically by combine. Note we also allow the shape parameter to float in the final fit to data (by not setting it constant).

norm = ROOT.RooRealVar(\"model_bkg_Tag0_norm\", \"Number of background events in Tag0\", data.numEntries(), 0, 3*data.numEntries() )\nalpha.setConstant(False)\n

Let's then save the background model, the normalisation object, and the data distribution to a new RooWorkspace:

f_out = ROOT.TFile(\"workspace_bkg.root\", \"RECREATE\")\nw_bkg = ROOT.RooWorkspace(\"workspace_bkg\",\"workspace_bkg\")\ngetattr(w_bkg, \"import\")(data)\ngetattr(w_bkg, \"import\")(norm)\ngetattr(w_bkg, \"import\")(model_bkg)\nw_bkg.Print()\nw_bkg.Write()\nf_out.Close()\n
"},{"location":"tutorial2023/parametric_exercise/#datacard","title":"Datacard","text":"

The model workspaces have now been constructed. But before we can run any fits in combine we need to build the so-called datacard. This is a text file which defines the different processes entering the fit and their expected yields, and maps these processes to the corresponding (parametric) models. We also store information on the systematic uncertainties in the datacard (see part 3). Given the low complexity of this example, the datacard is reasonably short. The datacard for this section is titled datacard_part1.txt. Take some time to understand the different lines. In particular, the values for the process normalisations:

  • Where does the signal (ggH) normalisation come from?
  • Why do we use a value of 1.0 for the background model normalisation in this analysis?
# Datacard example for combine tutorial 2023 (part 1)\n---------------------------------------------\nimax 1\njmax 1\nkmax *\n---------------------------------------------\n\nshapes      ggH          Tag0      workspace_sig.root      workspace_sig:model_ggH_Tag0\nshapes      bkg_mass     Tag0      workspace_bkg.root      workspace_bkg:model_bkg_Tag0\nshapes      data_obs     Tag0      workspace_bkg.root      workspace_bkg:data_Tag0\n\n---------------------------------------------\nbin             Tag0\nobservation     -1\n---------------------------------------------\nbin             Tag0         Tag0\nprocess         ggH          bkg_mass\nprocess         0            1\nrate            181.01       1.0\n---------------------------------------------\n

To compile the datacard we run the following command, using a value of the Higgs mass of 125.0:

text2workspace.py datacard_part1.txt -m 125\n
  • This compiles the datacard into a RooWorkspace, effectively building the likelihood function. Try opening the compiled workspace (root datacard_part1.root) and printing the contents.
w->Print()\n
  • Do you understand what all the different objects are? What does the variable r correspond to? Try (verbose) printing with:
w->var(\"r\")->Print(\"v\")\n
"},{"location":"tutorial2023/parametric_exercise/#extension-signal-normalisation-object","title":"Extension: signal normalisation object","text":"

In the example above, the signal model normalisation is input by hand in the datacard. We can instead define the signal normalisation components in the model in a similar fashion to the background model normalisation object. Let's build the cross section (ggH), branching fraction (H->gamgam), and efficiency variables. It's important to set these terms to be constant for the final fit to data:

xs_ggH = ROOT.RooRealVar(\"xs_ggH\", \"Cross section of ggH in [pb]\", 48.58 )\nbr_gamgam = ROOT.RooRealVar(\"BR_gamgam\", \"Branching ratio of Higgs to gamma gamma\", 0.0027 )\neff_ggH_Tag0 = ROOT.RooRealVar(\"eff_ggH_Tag0\", \"Efficiency for ggH events to land in Tag0\", eff )\n\nxs_ggH.setConstant(True)\nbr_gamgam.setConstant(True)\neff_ggH_Tag0.setConstant(True)\n

The normalisation component is then defined as the product of these three variables:

norm_sig = ROOT.RooProduct(\"model_ggH_Tag0_norm\", \"Normalisation term for ggH in Tag 0\", ROOT.RooArgList(xs_ggH,br_gamgam,eff_ggH_Tag0))\n

Again the syntax {model}_norm has been used so that combine will automatically assign this object as the normalisation for the model (model_ggH_Tag0). Firstly we need to save a new version of the signal model workspace with the normalisation term included.

f_out = ROOT.TFile(\"workspace_sig_with_norm.root\", \"RECREATE\")\nw_sig = ROOT.RooWorkspace(\"workspace_sig\",\"workspace_sig\")\ngetattr(w_sig, \"import\")(model)\ngetattr(w_sig, \"import\")(norm_sig)\nw_sig.Print()\nw_sig.Write()\nf_out.Close()\n

We then need to modify the datacard to account for this normalisation term. Importantly, the {model}_norm term in our updated signal model workspace does not contain the integrated luminosity. Therefore, the rate term in the datacard must be set equal to the integrated luminosity in [pb^-1] (as the cross section was defined in [pb]). The total normalisation for the signal model is then the product of the {model}_norm and the rate value.

  • You can find the example datacard here: datacard_part1_with_norm.txt, with the signal normalisation object included. Does it compile successfully using text2workspace.py? If so, try printing out the contents of the workspace. Can you see the normalisation component?
"},{"location":"tutorial2023/parametric_exercise/#extension-unbinned-vs-binned","title":"Extension: unbinned vs binned","text":"

In a parametric analysis, the fit can be performed using a binned or unbinned likelihood function. The consequences of binned vs unbinned likelihoods were discussed in the morning session. In combine, we can simply toggle between binned and unbinned fits by changing how the data set is stored in the workspace. In the example above, the data was saved as a RooDataSet. This means that an unbinned maximum likelihood function would be used.

To switch to a binned maximum likelihood fit, we need to store the data set in the workspace as a RooDataHist. Let's first load the data as a RooDataSet as before:

f = ROOT.TFile(\"data_part1.root\",\"r\")\nt = f.Get(\"data_Tag0\")\n\n# Convert TTree to a RooDataSet\ndata = ROOT.RooDataSet(\"data_Tag0\", \"data_Tag0\", t, ROOT.RooArgSet(mass, weight), \"\", \"weight\")\n

We then need to set the number of bins in the observable and convert the data to a RooDataHist. In this example we will use 320 bins over the full mass range (0.25 GeV per bin). It is important that the binning is sufficiently granular so that we do not lose information in the data by switching to a binned likelihood fit. When fitting a signal peak over a background we want the bin width to be sufficiently smaller than the signal model mass resolution.

# Set bin number for mass variables\nmass.setBins(320)\ndata_hist = ROOT.RooDataHist(\"data_hist_Tag0\", \"data_hist_Tag0\", mass, data)\n\n# Save the background model with the RooDataHist instead\nf_out = ROOT.TFile(\"workspace_bkg_binned.root\", \"RECREATE\")\nw_bkg = ROOT.RooWorkspace(\"workspace_bkg\",\"workspace_bkg\")\ngetattr(w_bkg, \"import\")(data_hist)\ngetattr(w_bkg, \"import\")(norm)\ngetattr(w_bkg, \"import\")(model_bkg)\nw_bkg.Print()\nw_bkg.Write()\nf_out.Close()\n
"},{"location":"tutorial2023/parametric_exercise/#part-2-simple-fits","title":"Part 2: Simple fits","text":"

Now that the parametric models have been constructed and the datacard has been compiled, we are ready to start using combine to run fits. In CMS analyses we begin by blinding ourselves to the data in the signal region, and looking only at the expected results based on toy datasets (Asimov or pseudo-experiments). In this exercise, we will look straight away at the observed results. Note that the python commands in this section are taken from simple_fits.py.

To run a simple best-fit for the signal strength, r, fixing the Higgs mass to 125 GeV, you can run the command in the terminal:

combine -M MultiDimFit datacard_part1_with_norm.root -m 125 --freezeParameters MH --saveWorkspace -n .bestfit\n

We obtain a best-fit signal strength of r = 1.548 i.e. the observed signal yield is 1.548 times the SM prediction.

The option --saveWorkspace stores a snapshot of the postfit workspace in the output file (higgsCombine.bestfit.MultiDimFit.mH125.root). We can load the postfit workspace and look at how the values of all the fit parameters change (compare the clean and MultiDimFit parameter snapshots):

import ROOT\n\nf = ROOT.TFile(\"higgsCombine.bestfit.MultiDimFit.mH125.root\")\nw = f.Get(\"w\")\nw.Print(\"v\")\n

We can even plot the postfit signal-plus-background model using the workspace snapshot:

n_bins = 80\nbinning = ROOT.RooFit.Binning(n_bins,100,180)\n\ncan = ROOT.TCanvas()\nplot = w.var(\"CMS_hgg_mass\").frame()\nw.data(\"data_obs\").plotOn( plot, binning )\n\n# Load the S+B model\nsb_model = w.pdf(\"model_s\").getPdf(\"Tag0\")\n\n# Prefit\nsb_model.plotOn( plot, ROOT.RooFit.LineColor(2), ROOT.RooFit.Name(\"prefit\") )\n\n# Postfit\nw.loadSnapshot(\"MultiDimFit\")\nsb_model.plotOn( plot, ROOT.RooFit.LineColor(4), ROOT.RooFit.Name(\"postfit\") )\nr_bestfit = w.var(\"r\").getVal()\n\nplot.Draw()\n\nleg = ROOT.TLegend(0.55,0.6,0.85,0.85)\nleg.AddEntry(\"prefit\", \"Prefit S+B model (r=1.00)\", \"L\")\nleg.AddEntry(\"postfit\", \"Postfit S+B model (r=%.2f)\"%r_bestfit, \"L\")\nleg.Draw(\"Same\")\n\ncan.Update()\ncan.SaveAs(\"part2_sb_model.png\")\n

"},{"location":"tutorial2023/parametric_exercise/#confidence-intervals","title":"Confidence intervals","text":"

We not only want to find the best-fit value of the signal strength, r, but also the confidence intervals. The singles algorithm will find the 68% CL intervals:

combine -M MultiDimFit datacard_part1_with_norm.root -m 125 --freezeParameters MH -n .singles --algo singles\n

To perform a likelihood scan (i.e. calculate 2NLL at fixed values of the signal strength, profiling the other parameters), we use the grid algorithm. We can control the number of points in the scan using the --points option. Also, it is important to set a suitable range for the signal strength parameter. The singles algorithm has shown us that the 1 stdev interval on r is around +/-0.2.

  • Use these intervals to define a suitable range for the scan, and change lo,hi in the following options accordingly: --setParameterRanges r=lo,hi.
combine -M MultiDimFit datacard_part1_with_norm.root -m 125 --freezeParameters MH -n .scan --algo grid --points 20 --setParameterRanges r=lo,hi\n

We can use the plot1DScan.py function from combineTools to plot the likelihood scan:

plot1DScan.py higgsCombine.scan.MultiDimFit.mH125.root -o part2_scan\n

  • Do you understand what the plot is showing? What information about the signal strength parameter can be inferred from the plot?
"},{"location":"tutorial2023/parametric_exercise/#extension-expected-fits","title":"Extension: expected fits","text":"

To run expected fits we simply add -t N to the combine command. For N>0, this will generate N random toys from the model and fit each one independently. For N=-1, this will generate an Asimov toy in which all statistical fluctuations from the model are suppressed.

You can use the --expectSignal 1 option to set the signal strength parameter to 1 when generating the toy. Alternatively, --expectSignal 0 will generate a toy from the background-only model. For multiple parameter models you can set the initial values when generating the toy(s) using the --setParameters option of combine. For example, if you want to throw a toy where the Higgs mass is at 124 GeV and the background slope parameter alpha is equal to -0.05, you would add --setParameters MH=124.0,alpha=-0.05.

  • Try running the asimov likelihood scan for r=1 and r=0, and plotting them using the plot1DScan.py script.
"},{"location":"tutorial2023/parametric_exercise/#extension-goodness-of-fit-tests","title":"Extension: goodness-of-fit tests","text":"

The goodness-of-fit tests available in combine are only well-defined for binned maximum likelihood fits. Therefore, to perform a goodness-of-fit test with a parametric datacard, make sure to save the data object as a RooDataHist, as in workspace_bkg_binned.root.

  • Try editing the datacard_part1_with_norm.txt file to pick up the correct binned workspace file, and the RooDataHist. The goodness-of-fit method requires at least one nuisance parameter in the model to run successfully. Append the following line to the end of the datacard:
lumi_13TeV      lnN          1.01         -\n
  • Does the datacard compile with the text2workspace.py command?

We use the GoodnessOfFit method in combine to evaluate how compatible the observed data are with the model pdf. There are three types of GoF algorithm within combine; this example will use the saturated algorithm. You can find more information about the other algorithms here.

Firstly, we want to calculate the value of the test statistic for the observed data:

combine -M GoodnessOfFit datacard_part1_binned.root --algo saturated -m 125 --freezeParameters MH -n .goodnessOfFit_data\n

Now lets calculate the test statistic value for many toys thrown from the model:

combine -M GoodnessOfFit datacard_part1_binned.root --algo saturated -m 125 --freezeParameters MH -n .goodnessOfFit_toys -t 1000\n

To make a plot of the GoF test-statistic distribution you can run the following commands, which first collect the values of the test-statistic into a json file, and then plots from the json file:

combineTool.py -M CollectGoodnessOfFit --input higgsCombine.goodnessOfFit_data.GoodnessOfFit.mH125.root higgsCombine.goodnessOfFit_toys.GoodnessOfFit.mH125.123456.root -m 125.0 -o gof.json\n\nplotGof.py gof.json --statistic saturated --mass 125.0 -o part2_gof\n

  • What does the plot tell us? Does the model fit the data well? You can refer back to the discussion here
"},{"location":"tutorial2023/parametric_exercise/#part-3-systematic-uncertainties","title":"Part 3: Systematic uncertainties","text":"

In this section, we will learn how to add systematic uncertainties to a parametric fit analysis. The python commands are taken from the systematics.py script.

For uncertainties which only affect the process normalisation, we can simply implement these as lnN uncertainties in the datacard. The file mc_part3.root contains the systematic-varied trees i.e. Monte-Carlo events where some systematic uncertainty source {photonID,JEC,scale,smear} has been varied up and down by \\(1\\sigma\\).

import ROOT\n\nf = ROOT.TFile(\"mc_part3.root\")\nf.ls()\n

Gives the output:

TFile**     mc_part3.root   \n TFile*     mc_part3.root   \n  KEY: TTree    ggH_Tag0;1  ggH_Tag0\n  KEY: TTree    ggH_Tag0_photonIDUp01Sigma;1    ggH_Tag0_photonIDUp01Sigma\n  KEY: TTree    ggH_Tag0_photonIDDown01Sigma;1  ggH_Tag0_photonIDDown01Sigma\n  KEY: TTree    ggH_Tag0_scaleUp01Sigma;1   ggH_Tag0_scaleUp01Sigma\n  KEY: TTree    ggH_Tag0_scaleDown01Sigma;1 ggH_Tag0_scaleDown01Sigma\n  KEY: TTree    ggH_Tag0_smearUp01Sigma;1   ggH_Tag0_smearUp01Sigma\n  KEY: TTree    ggH_Tag0_smearDown01Sigma;1 ggH_Tag0_smearDown01Sigma\n  KEY: TTree    ggH_Tag0_JECUp01Sigma;1 ggH_Tag0_JECUp01Sigma\n  KEY: TTree    ggH_Tag0_JECDown01Sigma;1   ggH_Tag0_JECDown01Sigma\n

Let's first load the systematic-varied trees as RooDataSets and store them in a python dictionary, mc:

# Define mass and weight variables\nmass = ROOT.RooRealVar(\"CMS_hgg_mass\", \"CMS_hgg_mass\", 125, 100, 180)\nweight = ROOT.RooRealVar(\"weight\",\"weight\",0,0,1)\n\nmc = {}\n\n# Load the nominal dataset\nt = f.Get(\"ggH_Tag0\")\nmc['nominal'] = ROOT.RooDataSet(\"ggH_Tag0\",\"ggH_Tag0\", t, ROOT.RooArgSet(mass,weight), \"\", \"weight\" )\n\n# Load the systematic-varied datasets\nfor syst in ['JEC','photonID','scale','smear']:\n    for direction in ['Up','Down']:\n        key = \"%s%s01Sigma\"%(syst,direction)\n        name = \"ggH_Tag0_%s\"%(key)\n        t = f.Get(name)\n        mc[key] = ROOT.RooDataSet(name, name, t, ROOT.RooArgSet(mass,weight), \"\", \"weight\" )\n

The jet energy scale (JEC) and photon identification (photonID) uncertainties do not affect the shape of the \\(m_{\\gamma\\gamma}\\) distribution, i.e. they only affect the signal yield estimate. We can calculate their impact by comparing the sum of weights to the nominal dataset. Note that the photonID uncertainty changes the weight of the events in the tree, whereas the JEC varied trees contain a different set of events, generated by shifting the jet energy scale in the simulation. In either case, the method for calculating the yield variations is the same:

for syst in ['JEC','photonID']:\n    for direction in ['Up','Down']:\n        yield_variation = mc['%s%s01Sigma'%(syst,direction)].sumEntries()/mc['nominal'].sumEntries()\n        print(\"Systematic varied yield (%s,%s): %.3f\"%(syst,direction,yield_variation))\n
Systematic varied yield (JEC,Up): 1.056\nSystematic varied yield (JEC,Down): 0.951\nSystematic varied yield (photonID,Up): 1.050\nSystematic varied yield (photonID,Down): 0.950\n

We can write these yield variations in the datacard with the lines:

CMS_scale_j           lnN      0.951/1.056      -\nCMS_hgg_phoIdMva      lnN      1.05             -   \n
  • Why is the photonID uncertainty expressed as one number, whereas the JEC uncertainty is defined by two?

Note in this analysis there are no systematic uncertainties affecting the background estimate (- in the datacard), as the background model has been derived directly from data.

"},{"location":"tutorial2023/parametric_exercise/#parametric-shape-uncertainties","title":"Parametric shape uncertainties","text":"

What about systematic uncertainties which affect the shape of the mass distribution?

In a parametric analysis, we need to build the dependence directly into the model parameters. The example uncertainty sources in this tutorial are the photon energy scale and smearing uncertainties. From the names alone we can expect that the scale uncertainty will affect the mean of the signal Gaussian, and the smear uncertainty will impact the resolution (sigma). Let's first take a look at the scaleUp01Sigma dataset:

# Build the model to fit the systematic-varied datasets\nmean = ROOT.RooRealVar(\"mean\", \"mean\", 125, 124, 126)\nsigma = ROOT.RooRealVar(\"sigma\", \"sigma\", 2, 1.5, 2.5)\ngaus = ROOT.RooGaussian(\"model\", \"model\", mass, mean, sigma)\n\n# Run the fits twice (second time from the best-fit of first run) to obtain more reliable results\ngaus.fitTo(mc['scaleUp01Sigma'], ROOT.RooFit.SumW2Error(True),ROOT.RooFit.PrintLevel(-1))\ngaus.fitTo(mc['scaleUp01Sigma'], ROOT.RooFit.SumW2Error(True),ROOT.RooFit.PrintLevel(-1))\nprint(\"Mean = %.3f +- %.3f GeV, Sigma = %.3f +- %.3f GeV\"%(mean.getVal(),mean.getError(),sigma.getVal(),sigma.getError()) )\n

Gives the output:

Mean = 125.370 +- 0.009 GeV, Sigma = 2.011 +- 0.006 GeV\n

Now let's compare the values to the nominal fit for all systematic-varied trees. We observe a significant variation in the mean for the scale uncertainty, and a significant variation in sigma for the smear uncertainty.

# First fit the nominal dataset\ngaus.fitTo(mc['nominal'], ROOT.RooFit.SumW2Error(True), ROOT.RooFit.PrintLevel(-1) )\ngaus.fitTo(mc['nominal'], ROOT.RooFit.SumW2Error(True), ROOT.RooFit.PrintLevel(-1) )\n# Save the mean and sigma values and errors to python dicts\nmean_values, sigma_values = {}, {}\nmean_values['nominal'] = [mean.getVal(),mean.getError()]\nsigma_values['nominal'] = [sigma.getVal(),sigma.getError()]\n\n# Next for the systematic varied datasets\nfor syst in ['scale','smear']:\n    for direction in ['Up','Down']:\n        key = \"%s%s01Sigma\"%(syst,direction)\n        gaus.fitTo(mc[key] , ROOT.RooFit.SumW2Error(True),  ROOT.RooFit.PrintLevel(-1))\n        gaus.fitTo(mc[key], ROOT.RooFit.SumW2Error(True), ROOT.RooFit.PrintLevel(-1))\n        mean_values[key] = [mean.getVal(), mean.getError()]\n        sigma_values[key] = [sigma.getVal(), sigma.getError()]\n\n# Print the variations in mean and sigma\nfor key in mean_values.keys():\n    print(\"%s: mean = %.3f +- %.3f GeV, sigma = %.3f +- %.3f GeV\"%(key,mean_values[key][0],mean_values[key][1],sigma_values[key][0],sigma_values[key][1]))\n

Prints the output:

nominal: mean = 125.001 +- 0.009 GeV, sigma = 1.996 +- 0.006 GeV\nscaleUp01Sigma: mean = 125.370 +- 0.009 GeV, sigma = 2.011 +- 0.006 GeV\nscaleDown01Sigma: mean = 124.609 +- 0.009 GeV, sigma = 2.005 +- 0.006 GeV\nsmearUp01Sigma: mean = 125.005 +- 0.009 GeV, sigma = 2.097 +- 0.007 GeV\nsmearDown01Sigma: mean = 125.007 +- 0.009 GeV, sigma = 1.912 +- 0.006 GeV\n

The values tell us that the scale uncertainty (at \\(\\pm 1 \\sigma\\)) varies the signal peak mean by around 0.3%, and the smear uncertainty (at \\(\\pm 1 \\sigma\\)) varies the signal width (sigma) by around 4.5% (average of up and down variations).
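
These percentages can be cross-checked directly from the fitted values stored in the dictionaries above; here is a minimal sketch, averaging the up and down variations:

# Average relative shift of the mean from the scale variations\nscale_shift = 0.5*( abs(mean_values['scaleUp01Sigma'][0]/mean_values['nominal'][0]-1) + abs(mean_values['scaleDown01Sigma'][0]/mean_values['nominal'][0]-1) )\n# Average relative change of the width from the smear variations\nsmear_shift = 0.5*( abs(sigma_values['smearUp01Sigma'][0]/sigma_values['nominal'][0]-1) + abs(sigma_values['smearDown01Sigma'][0]/sigma_values['nominal'][0]-1) )\nprint(\"Scale shift on mean: %.2f%%, smear shift on sigma: %.2f%%\"%(scale_shift*100,smear_shift*100))\n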

Now we need to bake these effects into the parametric signal model. The mean of the Gaussian was previously defined as:

\\[ \\mu = m_H + \\delta\\]

We introduce the nuisance parameter nuisance_scale = \\(\\eta\\) to account for a shift in the signal peak mean using:

\\[ \\mu = (m_H + \\delta) \\cdot (1+0.003\\eta)\\]

At \\(\\eta = +1 (-1)\\) the signal peak mean will shift up (down) by 0.3%. To build this into the RooFit signal model we simply define a new parameter, \\(\\eta\\), and update the definition of the mean formula variable:

# Building the workspace with systematic variations\nMH = ROOT.RooRealVar(\"MH\", \"MH\", 125, 120, 130 )\nMH.setConstant(True)\n\n# Define formula for mean of Gaussian\ndMH = ROOT.RooRealVar(\"dMH_ggH_Tag0\", \"dMH_ggH_Tag0\", 0, -5, 5 )\neta = ROOT.RooRealVar(\"nuisance_scale\", \"nuisance_scale\", 0, -5, 5)\neta.setConstant(True)\nmean_formula = ROOT.RooFormulaVar(\"mean_ggH_Tag0\", \"mean_ggH_Tag0\", \"(@0+@1)*(1+0.003*@2)\", ROOT.RooArgList(MH,dMH,eta))\n
  • Why do we set the nuisance parameter to constant at this stage?

Similarly for the width, we introduce a nuisance parameter, \\(\\chi\\):

\\[ \\sigma = \\sigma_{\\mathrm{nominal}} \\cdot (1+0.045\\chi)\\]
sigma = ROOT.RooRealVar(\"sigma_ggH_Tag0_nominal\", \"sigma_ggH_Tag0_nominal\", 2, 1, 5)\nchi = ROOT.RooRealVar(\"nuisance_smear\", \"nuisance_smear\", 0, -5, 5)\nchi.setConstant(True)\nsigma_formula = ROOT.RooFormulaVar(\"sigma_ggH_Tag0\", \"sigma_ggH_Tag0\", \"@0*(1+0.045*@1)\", ROOT.RooArgList(sigma,chi))\n

Let's now fit the new model to the signal Monte-Carlo dataset, build the normalisation object and save the workspace.

# Define Gaussian\nmodel = ROOT.RooGaussian( \"model_ggH_Tag0\", \"model_ggH_Tag0\", mass, mean_formula, sigma_formula )\n\n# Fit model to MC\nmodel.fitTo( mc['nominal'], ROOT.RooFit.SumW2Error(True), ROOT.RooFit.PrintLevel(-1) )\n\n# Build signal model normalisation object\nxs_ggH = ROOT.RooRealVar(\"xs_ggH\", \"Cross section of ggH in [pb]\", 48.58 )\nbr_gamgam = ROOT.RooRealVar(\"BR_gamgam\", \"Branching ratio of Higgs to gamma gamma\", 0.0027 )\neff = mc['nominal'].sumEntries()/(xs_ggH.getVal()*br_gamgam.getVal())\neff_ggH_Tag0 = ROOT.RooRealVar(\"eff_ggH_Tag0\", \"Efficiency for ggH events to land in Tag0\", eff )\n# Set values to be constant\nxs_ggH.setConstant(True)\nbr_gamgam.setConstant(True)\neff_ggH_Tag0.setConstant(True)\n# Define normalisation component as product of these three variables\nnorm_sig = ROOT.RooProduct(\"model_ggH_Tag0_norm\", \"Normalisation term for ggH in Tag 0\", ROOT.RooArgList(xs_ggH,br_gamgam,eff_ggH_Tag0))\n\n# Set shape parameters of model to be constant (i.e. fixed in fit to data)\ndMH.setConstant(True)\nsigma.setConstant(True)\n\n# Build new signal model workspace with signal normalisation term. \nf_out = ROOT.TFile(\"workspace_sig_with_syst.root\", \"RECREATE\")\nw_sig = ROOT.RooWorkspace(\"workspace_sig\",\"workspace_sig\")\ngetattr(w_sig, \"import\")(model)\ngetattr(w_sig, \"import\")(norm_sig)\nw_sig.Print()\nw_sig.Write()\nf_out.Close()\n

The final step is to add the parametric uncertainties as Gaussian-constrained nuisance parameters into the datacard. The syntax means the Gaussian constraint term in the likelihood function will have a mean of 0 and a width of 1.

nuisance_scale        param    0.0    1.0\nnuisance_smear        param    0.0    1.0\n
  • Try adding these lines to datacard_part1_with_norm.txt, along with the lines for the JEC and photonID yield uncertainties above, and compiling with the text2workspace command. Open the workspace and look at its contents. You will need to change the signal process workspace file name in the datacard to point to the new workspace (workspace_sig_with_syst.root).
  • Can you see the new objects in the compiled datacard that have been created for the systematic uncertainties? What do they correspond to?

We can now run a fit with the systematic uncertainties included. The option --saveSpecifiedNuis can be called to save the postfit nuisance parameter values in the combine output limit tree.

combine -M MultiDimFit datacard_part1_with_norm.root -m 125 --freezeParameters MH --saveWorkspace -n .bestfit.with_syst --saveSpecifiedNuis CMS_scale_j,CMS_hgg_phoIdMva,nuisance_scale,nuisance_smear\n
  • What do the postfit values of the nuisances tell us here? You can check them by opening the output file (root higgsCombine.bestfit.with_syst.MultiDimFit.mH125.root) and running limit->Show(0).
  • Try plotting the postfit mass distribution (as detailed in part 2). Do you notice any difference?
"},{"location":"tutorial2023/parametric_exercise/#uncertainty-breakdown","title":"Uncertainty breakdown","text":"

A more complete datacard with additional nuisance parameters is stored in datacard_part3.txt. We will use this datacard for the rest of part 3. Open the text file and have a look at the contents.

The following line has been appended to the end of the datacard to define the set of theory nuisance parameters. This will come in handy when calculating the uncertainty breakdown.

theory group = BR_hgg QCDscale_ggH pdf_Higgs_ggH alphaS_ggH UnderlyingEvent PartonShower\n

Compile the datacard and run an observed MultiDimFit likelihood scan over the signal strength, r:

text2workspace.py datacard_part3.txt -m 125\n\ncombine -M MultiDimFit datacard_part3.root -m 125 --freezeParameters MH -n .scan.with_syst --algo grid --points 20 --setParameterRanges r=0.5,2.5\n

Our aim is to break down the total uncertainty into the systematic and statistical components. To get the statistical-uncertainty-only scan it should be as simple as freezing the nuisance parameters in the fit... right?

Try it by adding ,allConstrainedNuisances to the --freezeParameters option. This will freeze all (constrained) nuisance parameters in the fit. You can also feed in regular expressions with wildcards using rgx{.*}. For instance to freeze only the nuisance_scale and nuisance_smear you could run with --freezeParameters MH,rgx{nuisance_.*}.

combine -M MultiDimFit datacard_part3.root -m 125 --freezeParameters MH,allConstrainedNuisances -n .scan.with_syst.statonly --algo grid --points 20 --setParameterRanges r=0.5,2.5\n

You can plot the two likelihood scans on the same axis with the command:

plot1DScan.py higgsCombine.scan.with_syst.MultiDimFit.mH125.root --main-label \"With systematics\" --main-color 1 --others higgsCombine.scan.with_syst.statonly.MultiDimFit.mH125.root:\"Stat-only\":2 -o part3_scan_v0\n

  • Can you spot the problem?

The nuisance parameters introduced into the model have pulled the best-fit signal strength point! Therefore we cannot simply subtract the uncertainties in quadrature to get an estimate for the systematic/statistical uncertainty breakdown.

The correct approach is to freeze the nuisance parameters to their respective best-fit values in the stat-only scan. We can do this by first saving a postfit workspace with all nuisance parameters profiled in the fit. Then we load the postfit snapshot values of the nuisance parameters (with the option --snapshotName MultiDimFit) from the combine output of the previous step, and then freeze the nuisance parameters for the stat-only scan.

combine -M MultiDimFit datacard_part3.root -m 125 --freezeParameters MH -n .bestfit.with_syst --setParameterRanges r=0.5,2.5 --saveWorkspace\n\ncombine -M MultiDimFit higgsCombine.bestfit.with_syst.MultiDimFit.mH125.root -m 125 --freezeParameters MH,allConstrainedNuisances -n .scan.with_syst.statonly_correct --algo grid --points 20 --setParameterRanges r=0.5,2.5 --snapshotName MultiDimFit\n

Adding the option --breakdown syst,stat to the plot1DScan.py command will automatically calculate the uncertainty breakdown for you.

plot1DScan.py higgsCombine.scan.with_syst.MultiDimFit.mH125.root --main-label \"With systematics\" --main-color 1 --others higgsCombine.scan.with_syst.statonly_correct.MultiDimFit.mH125.root:\"Stat-only\":2 -o part3_scan_v1 --breakdown syst,stat\n

We can also freeze groups of nuisance parameters defined in the datacard with the option --freezeNuisanceGroups. Let's run a scan freezing only the theory uncertainties (using the nuisance group we defined in the datacard):

combine -M MultiDimFit higgsCombine.bestfit.with_syst.MultiDimFit.mH125.root -m 125 --freezeParameters MH --freezeNuisanceGroups theory -n .scan.with_syst.freezeTheory --algo grid --points 20 --setParameterRanges r=0.5,2.5 --snapshotName MultiDimFit\n

To break down the total uncertainty into the theory, experimental and statistical components, we can then use:

plot1DScan.py higgsCombine.scan.with_syst.MultiDimFit.mH125.root --main-label Total --main-color 1 --others higgsCombine.scan.with_syst.freezeTheory.MultiDimFit.mH125.root:\"Freeze theory\":4 higgsCombine.scan.with_syst.statonly_correct.MultiDimFit.mH125.root:\"Stat-only\":2 -o part3_scan_v2 --breakdown theory,exp,stat\n

These methods are not limited to this particular grouping of systematics. We can use the above procedure to assess the impact of any nuisance parameter(s) on the signal strength confidence interval.

  • Try and calculate the contribution to the total uncertainty from the luminosity estimate using this approach.
"},{"location":"tutorial2023/parametric_exercise/#impacts","title":"Impacts","text":"

It is often useful/required to check the impacts of the nuisance parameters (NP) on the parameter of interest, r. The impact of a NP is defined as the shift \\(\\Delta r\\) induced as the NP, \\(\\theta\\), is fixed to its \\(\\pm1\\sigma\\) values, with all other parameters profiled as normal. More information can be found in the combine documentation via this link.
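
Schematically, writing \\(\\hat{r}(\\theta = \\hat{\\theta} \\pm \\sigma_{\\theta})\\) for the best-fit signal strength obtained with \\(\\theta\\) fixed to its post-fit value shifted up or down by its post-fit uncertainty (all other parameters profiled), the impacts are:

\\[ \\Delta r^{\\pm} = \\hat{r}(\\theta = \\hat{\\theta} \\pm \\sigma_{\\theta}) - \\hat{r} \\]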

Let's calculate the impacts for our analysis. We can use the combineTool.py from the CombineHarvester package to automate the scripts. The impacts are calculated in a few stages:

1) Do an initial fit for the parameter of interest, adding the --robustFit 1 option:

combineTool.py -M Impacts -d datacard_part3.root -m 125 --freezeParameters MH -n .impacts --setParameterRanges r=0.5,2.5 --doInitialFit --robustFit 1\n
  • What does the option --robustFit 1 do?

2) Next perform a similar scan for each NP with the --doFits option. This may take a few minutes:

combineTool.py -M Impacts -d datacard_part3.root -m 125 --freezeParameters MH -n .impacts --setParameterRanges r=0.5,2.5 --doFits --robustFit 1\n

3) Collect the outputs from the previous step and write the results to a json file:

combineTool.py -M Impacts -d datacard_part3.root -m 125 --freezeParameters MH -n .impacts --setParameterRanges r=0.5,2.5 -o impacts_part3.json\n

4) Produce a plot summarising the nuisance parameter values and impacts:

plotImpacts.py -i impacts_part3.json -o impacts_part3\n

There is a lot of information in these plots, which can be of invaluable use to analysers in understanding the fit. Do you understand everything that the plot is showing?

  • Which NP has the highest impact on the signal strength measurement?
  • Which NP is pulled the most in the fit to data? What does this information imply about the signal model mean in relation to the data?
  • Which NP is the most constrained in the fit to the data? What does it mean for a nuisance parameter to be constrained?
  • Try adding the option --summary to the impacts plotting command. This is a nice new feature in combine!
"},{"location":"tutorial2023/parametric_exercise/#part-4-toy-generation-and-bias-studies","title":"Part 4: Toy generation and bias studies","text":"

With combine we can generate toy datasets from the compiled datacard workspace. Please read this section in the combine manual before proceeding.

An interesting use case of toy generation is when performing bias studies. In the Higgs to two photon (Hgg) analysis, the background is fit with some functional form. However (due to the complexities of QCD) the exact form of this function is unknown. Therefore, we need to understand how our choice of background function may impact the fitted signal strength. This is performed using a bias study, which will indicate how much potential bias is present given a certain choice of functional form.

In the classical bias studies we begin by building a set of workspaces which correspond to different background function choices. In addition to the RooExponential constructed in Section 1, let's also try a (4th order) RooChebychev polynomial and a simple power law function to fit the background \\(m_{\\gamma\\gamma}\\) distribution.

The script used to fit the different functions and build the workspaces is construct_models_bias_study_part4.py. Take some time to look at the script and understand what the code is doing. In particular notice how we have saved the data as a RooDataHist in the workspace. This means we are now performing binned maximum likelihood fits (this is useful for part 4 to speed up fitting the many toys). If the binning is sufficiently granular, then there will be no noticeable difference in the results to the unbinned likelihood fits. Run the script with:

python3  construct_models_bias_study_part4.py\n

The outputs are a set of workspaces which correspond to different choices of background model functions, and a plot showing fits of the different functions to the data mass sidebands.

The datacards for the different background model functions are saved as datacard_part4_{pdf}.txt where pdf = {exp,poly,pow}. Have a look inside the .txt files and understand what changes have been made to pick up the different functions. Compile the datacards with:

for pdf in {exp,poly,pow}; do text2workspace.py datacard_part4_${pdf}.txt -m 125; done\n
"},{"location":"tutorial2023/parametric_exercise/#bias-studies","title":"Bias studies","text":"

For the bias studies we want to generate (\"throw\") toy datasets with some choice of background function and fit back with another. The toys are thrown with a known value of the signal strength (r=1 in this example), which we will call \\(r_{truth}\\). The fitted value of r is defined as \\(r_{fit}\\), with some uncertainty \\(\\sigma_{fit}\\). A pull value, \\(P\\), is calculated for each toy dataset according to,

\\[ P = (r_{truth}-r_{fit})/\\sigma_{fit}\\]

By repeating the process for many toys we can build up a pull distribution. If there is no bias present then we would expect to obtain a normal distribution centred at 0, with a standard deviation of 1. Let's calculate the bias for our analysis.
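
To make the procedure concrete, here is a minimal sketch of how a pull distribution could be built and fitted (this is not the tutorial's plot_bias_pull.py; r_fit_values and sigma_fit_values are hypothetical lists of fitted signal strengths and their uncertainties collected from the toy fits):

import ROOT\n\nr_truth = 1.0\nh_pull = ROOT.TH1F(\"h_pull\", \"h_pull\", 40, -4, 4)\nfor r_fit, sigma_fit in zip(r_fit_values, sigma_fit_values):\n    h_pull.Fill((r_truth-r_fit)/sigma_fit)\n# Fit a Gaussian: the fitted mean (parameter 1) is the potential bias\nh_pull.Fit(\"gaus\")\nprint(\"Potential bias = %.3f\"%h_pull.GetFunction(\"gaus\").GetParameter(1))\n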

Firstly, we generate N=1000 toys from each of the background function choices and save them in a ROOT file. For this we use the GenerateOnly method of combine. We will inject signal in the toys by setting r=1 using the --expectSignal 1 option.

  • If time allows, repeat the bias studies with --expectSignal 0. This will inform us of the potential bias in the signal strength measurement given that there is no true signal.

The following commands show the example of throwing 1000 toys from the exponential function, and then fitting back with the 4th-order Chebychev polynomial. We use the singles algorithm to obtain a value for \\(r_{fit}\\) and \\(\\sigma_{fit}\\) simultaneously.

combine -M GenerateOnly datacard_part4_exp.root -m 125 --freezeParameters MH -t 1000 -n .generate_exp --expectSignal 1 --saveToys\n\ncombine -M MultiDimFit datacard_part4_poly.root -m 125 --freezeParameters MH -t 1000 -n .bias_truth_exp_fit_poly --expectSignal 1 --toysFile higgsCombine.generate_exp.GenerateOnly.mH125.123456.root --algo singles\n

The script plot_bias_pull.py will plot the pull distribution and fit a Gaussian to it:

python3 plot_bias_pull.py\n

The potential bias is defined as the (fitted) mean of the pull distribution.

  • What is our bias value? Have we generated enough toys to be confident of the bias value? You could try generating more toys if not.
  • What threshold do we use to define \"acceptable\" bias?

From the pull definition, we see that the bias value is defined relative to the total uncertainty in the signal strength (the denominator, \\(\\sigma_{fit}\\)). Some analyses use 0.14 as the threshold, because a bias below this value would change the total uncertainty (when added in quadrature) by less than 1% (see the equation below). Other analyses use 0.2, as this changes the total uncertainty by less than 2%. We should define the threshold before performing the bias study.

\\[ \\sqrt{ 1^2 + 0.14^2} = 1.0098 \\]
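
A quick numerical check of this quadrature argument, for both commonly used thresholds:

# Increase of the total uncertainty when a bias b is added in quadrature\nfor b in (0.14, 0.2):\n    print(\"bias %.2f -> total uncertainty x %.4f\"%(b, (1+b**2)**0.5))\n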
  • How does our bias value compare to the thresholds? If the bias is outside the acceptable region, we should account for it using a spurious signal method (see advanced exercises, TBA).
  • Repeat the bias study for each possible truth and fitted background function combinations. Do the bias values induced by the choice of background function merit adding a spurious signal component into the fit?
  • What would you expect the bias value to be for a background function that does not fit the data well? Should we be worried about such functions? What test could we use to reject such functions from the study beforehand?
"},{"location":"tutorial2023/parametric_exercise/#part-5-discrete-profiling","title":"Part 5: Discrete-profiling","text":"

If multiple pdfs exist to fit some distribution, we can store all pdfs in a single workspace by using a RooMultiPdf object. The script construct_models_multipdf_part5.py shows how to store the exponential, (4th order) Chebychev polynomial and the power law function from the previous section in a RooMultiPdf object. This requires a RooCategory index, which controls the pdf which is active at any one time. Look at the contents of the script and then run with:

python3 construct_models_multipdf_part5.py\n
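
The key construction in that script follows the usual RooMultiPdf pattern. The snippet below is a minimal sketch, assuming the three background pdfs (called model_exp, model_poly and model_pow here) have already been built on the mass observable:

# Index category controlling which pdf is active, and the RooMultiPdf holding all three functions\ncat = ROOT.RooCategory(\"pdfindex_Tag0\", \"Index of the active background pdf in Tag0\")\npdfs = ROOT.RooArgList(model_exp, model_poly, model_pow)\nmultipdf = ROOT.RooMultiPdf(\"multipdf_Tag0\", \"MultiPdf for Tag0\", cat, pdfs)\nmultipdf.setCorrectionFactor(0.5)  # penalty term added to the NLL per pdf parameter (discussed below)\n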

The file datacard_part5.txt will load the multipdf as the background model. Notice the line at the end of the datacard (see below). This tells combine about the RooCategory index.

pdfindex_Tag0         discrete\n

Compile the datacard with:

text2workspace.py datacard_part5.txt -m 125\n

The RooMultiPdf is a handy object for performing bias studies as all functions can be stored in a single workspace. You can then set which function is used for generating the toys with the --setParameters pdfindex_Tag0=i option, and which function is used for fitting with --setParameters pdfindex_Tag0=j --freezeParameters pdfindex_Tag0 options.

  • It would be a useful exercise to repeat the bias studies from part 4 but using the RooMultiPdf workspace. What happens when you do not freeze the index in the fitting step?

But simpler bias studies are not the only benefit of using the RooMultiPdf! It also allows us to apply the discrete profiling method in our analysis. In this method, the index labelling which pdf is active (a discrete nuisance parameter) is left floating in the fit, and will be profiled by looping through all the possible index values and finding the pdf which gives the best fit. In this manner, we are able to account for the uncertainty in the choice of the background function.

Note, by default, the multipdf will tell combine to add 0.5 to the NLL for each parameter in the pdf. This is known as the penalty term (or correction factor) for the discrete profiling method. You can toggle this term when building the workspace with the command multipdf.setCorrectionFactor(0.5). You may need to change the value of this term to obtain an acceptable bias in your fit!

Let's run a likelihood scan using the compiled datacard with the RooMultiPdf:

combine -M MultiDimFit datacard_part5.root -m 125 --freezeParameters MH -n .scan.multidimfit --algo grid --points 20 --cminDefaultMinimizerStrategy 0 --saveSpecifiedIndex pdfindex_Tag0 --setParameterRanges r=0.5,2.5\n

The option --cminDefaultMinimizerStrategy 0 is required to prevent HESSE being called as this cannot handle discrete nuisance parameters. HESSE is the full calculation of the second derivative matrix (Hessian) of the likelihood using finite difference methods.

The option --saveSpecifiedIndex pdfindex_Tag0 saves the value of the index at each point in the likelihood scan. Let's have a look at how the index value changes as a function of the signal strength. You can make the following plot by running:

python3 plot_pdfindex.py\n

By floating the discrete nuisance parameter pdfindex_Tag0, at each point in the likelihood scan the pdfs will be iterated over and the one which gives the max likelihood (lowest 2NLL) including the correction factor will be used. The plot above shows that the pdfindex_Tag0=0 (exponential) is chosen for the majority of r values, but this switches to pdfindex_Tag0=1 (Chebychev polynomial) at the lower edge of the r range. We can see the impact on the likelihood scan by fixing the pdf to the exponential:

combine -M MultiDimFit datacard_part5.root -m 125 --freezeParameters MH,pdfindex_Tag0 --setParameters pdfindex_Tag0=0 -n .scan.multidimfit.fix_exp --algo grid --points 20 --cminDefaultMinimizerStrategy 0 --saveSpecifiedIndex pdfindex_Tag0 --setParameterRanges r=0.5,2.5\n

Plotting the two scans on the same axis:

plot1DScan.py higgsCombine.scan.multidimfit.MultiDimFit.mH125.root --main-label \"Pdf choice floating\" --main-color 1 --others higgsCombine.scan.multidimfit.fix_exp.MultiDimFit.mH125.root:\"Pdf fixed to exponential\":2 -o part5_scan --y-cut 35 --y-max 35\n

The impact on the likelihood scan is evident at the lower edge, where the scan in which the index is floating flattens out. In this example, neither the \\(1\\sigma\\) nor the \\(2\\sigma\\) interval is affected. But this is not always the case! Ultimately, this method allows us to account for the uncertainty in the choice of background function in the signal strength measurement.

Coming back to the bias studies: do you now understand what you are testing if you do not freeze the index in the fitting stage? In this case you are fitting the toys back with the discrete profiling method. This is the standard approach for the bias studies when we use the discrete-profiling method in an analysis.

There are a number of options which can be added to the combine command to improve the performance when using discrete nuisance parameters. These are detailed at the end of this section in the combine manual.

"},{"location":"tutorial2023/parametric_exercise/#part-6-multi-signal-model","title":"Part 6: Multi-signal model","text":"

In reality, there are multiple Higgs boson processes which contribute to the total signal model, not only ggH. This section will explain how we can add an additional signal process (VBF) into the fit. Following this, we will add a second analysis category (Tag1), which has a higher purity of VBF events. To put this in context, the selection for Tag1 may require two jets with a large pseudorapidity separation and high invariant mass, which are typical properties of the VBF topology. By including this additional category with a different relative yield of VBF to ggH production, we are able to simultaneously constrain the rate of the two production modes.

In the SM, the VBF process has a cross section which is roughly 10 times smaller than the ggH cross section. This explains why we need to use certain features of the event to boost the purity of VBF events. The LO Feynman diagram for VBF production is shown below.

"},{"location":"tutorial2023/parametric_exercise/#building-the-models","title":"Building the models","text":"

Firstly, lets build the necessary inputs for this section using construct_models_part6.py. This script uses everything we have learnt in the previous sections: 1) Signal models (Gaussians) are built separately for each process (ggH and VBF) in each analysis category (Tag0 and Tag1). This uses separate TTrees for each contribution in the mc_part6.root file. The mean and width of the Gaussians include the effect of the parametric shape uncertainties, nuisance_scale and nuisance_smear. Each signal model is normalised according to the following equation, where \\(\\epsilon_{ij}\\) labels the fraction of process, \\(i\\) (=ggH,VBF), landing in analysis category, \\(j\\) (=Tag0,Tag1), and \\(\\mathcal{L}\\) is the integrated luminosity (defined in the datacard).

\\[ N_{ij} = \\sigma_i \\cdot \\mathcal{B}^{\\gamma\\gamma} \\cdot \\epsilon_{ij} \\cdot \\mathcal{L}\\]

2) A background model is constructed for each analysis category by fitting the mass sidebands in data. The input data is stored in the data_part6.root file. The models are RooMultiPdfs which contain an exponential, a 4th-order Chebychev polynomial and a power law function. The shape parameters and normalisation terms of the background models are freely floating in the final fit.

  • Have a look through the construct_models_part6.py script and try to understand all parts of the model construction. When you are happy, go ahead and construct the models with:
python3 construct_models_part6.py\n

The datacards for the two analysis categories are saved separately as datacard_part6_Tag0.txt and datacard_part6_Tag1.txt.

  • Do you understand the changes made to include multiple signal processes in the datacard? What value in the process line is used to label VBF as a signal?
  • Try compiling the individual datacards. What are the prefit ggH and VBF yields in each analysis category? You can find these by opening the workspace and printing the contents.
  • Run the best fits and plot the prefit and postfit S+B models along with the data (see code in part 2). How does the absolute number of data events in Tag1 compare to Tag0? What about the signal-to-background ratio, S/B?

In order to combine the two categories into a single datacard, we make use of the combineCards.py script:

combineCards.py datacard_part6_Tag0.txt datacard_part6_Tag1.txt > datacard_part6_combined.txt\n
"},{"location":"tutorial2023/parametric_exercise/#running-the-fits","title":"Running the fits","text":"

If we use the default text2workspace command on the combined datacard, then this will introduce a single signal strength modifier, which modifies the rate of all signal processes (ggH and VBF) by the same factor.

  • Try compiling the combined datacard and running a likelihood scan. Does the sensitivity to the global signal strength improve by adding the additional analysis category \"Tag1\"?

If we want to measure the independent rates of both processes simultaneously, then we need to introduce a separate signal strength for ggH and VBF. To do this we use the multiSignalModel physics model in combine by adding the following options to the text2workspace command:

text2workspace.py datacard_part6_combined.txt -m 125 -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel --PO \"map=.*/ggH:r_ggH[1,0,2]\" --PO \"map=.*/VBF:r_VBF[1,0,3]\" -o datacard_part6_combined_multiSignalModel.root\n

The syntax for the parameter-to-process mapping is map=category/process:POI[default,min,max]. We have used the wildcard .* to tell combine that the POI (parameter of interest) should scale all cases of that process, regardless of the analysis category. The output of this command tells us what is scaled by the two signal strengths:

Will scale  ch1/ggH  by  r_ggH\nWill scale  ch1/VBF  by  r_VBF\nWill scale  ch1/bkg_mass  by  1\nWill scale  ch2/ggH  by  r_ggH\nWill scale  ch2/VBF  by  r_VBF\nWill scale  ch2/bkg_mass  by  1\nWill scale  ch1/ggH  by  r_ggH\nWill scale  ch1/VBF  by  r_VBF\nWill scale  ch1/bkg_mass  by  1\nWill scale  ch2/ggH  by  r_ggH\nWill scale  ch2/VBF  by  r_VBF\nWill scale  ch2/bkg_mass  by  1\n

Exactly what we require!

To run a 1D \"profiled\" likelihood scan for ggH we use the following command:

combine -M MultiDimFit datacard_part6_combined_multiSignalModel.root -m 125 --freezeParameters MH -n .scan.part6_multiSignalModel_ggH --algo grid --points 20 --cminDefaultMinimizerStrategy 0 --saveInactivePOI 1 -P r_ggH --floatOtherPOIs 1\n
  • \"Profiled\" here means we are profiling over the other parameter of interest, r_VBF in the fit. In other words, we are treating r_VBF as an additional nuisance parameter. The option --saveInactivePOI 1 stores the value of r_VBF in the combine output. Take a look at the fit output. Does the value of r_VBF depend on r_ggH? Are the two parameters of interest correlated? Remember, to look at the contents of the TTree you can use limit->Show(i), where i is an integer labelling the point in the likelihood scan.
  • Run the profiled scan for the VBF signal strength. Plot the r_ggH and r_VBF likelihood scans using the plot1DScan.py script. You will need to change some of the input options, in particular the --POI option. You can list the full set of options by running:
plot1DScan.py --help\n
"},{"location":"tutorial2023/parametric_exercise/#two-dimensional-likelihood-scan","title":"Two-dimensional likelihood scan","text":"

We can also run the fit at fixed points in the (r_ggH,r_VBF) parameter space. By using a sufficient number of points, we are able to map out the 2D likelihood surface. Let's change the ranges of the parameters of interest to match what we have found in the profiled scans:

combine -M MultiDimFit datacard_part6_combined_multiSignalModel.root -m 125 --freezeParameters MH -n .scan2D.part6_multiSignalModel --algo grid --points 800 --cminDefaultMinimizerStrategy 0 -P r_ggH -P r_VBF --setParameterRanges r_ggH=0.5,2.5:r_VBF=-1,2\n

To plot the output you can use the plot_2D_scan.py script:

python3 plot_2D_scan.py\n

This script interpolates the 2NLL values between the points scanned so that the plot shows a smooth likelihood surface. You may find that in some cases the number of scanned points and the interpolation parameters need to be tuned to get a sensible-looking surface. This mostly depends on how complicated the likelihood surface is.

  • The plot shows that the data is in agreement with the SM within the \\(2\\sigma\\) CL. Here, the \\(1\\sigma\\) and \\(2\\sigma\\) confidence interval contours correspond to 2NLL values of 2.3 and 5.99, respectively. Do you understand why this is the case? Think about Wilks' theorem.
  • Does the plot show any correlation between the ggH and VBF signal strengths? Are the two positively or negatively correlated? Does this make sense for this pair of parameters given the analysis setup? Try repeating the 2D likelihood scan using the \"Tag0\" only datacard. How does the correlation behaviour change?
  • How can we read off the \"profiled\" 1D likelihood scan constraints from this plot?
"},{"location":"tutorial2023/parametric_exercise/#correlations-between-parameters","title":"Correlations between parameters","text":"

For template-based analyses we can use the FitDiagnostics method in combine to extract the covariance matrix for the fit parameters. Unfortunately, this method is not compatible when using discrete nuisance parameters (RooMultiPdf). Instead, we can use the robustHesse method to find the Hessian matrix by finite difference methods. The matrix is then inverted to get the covariance. Subsequently, we can use the covariance to extract the correlations between fit parameters.

combine -M MultiDimFit datacard_part6_combined_multiSignalModel.root -m 125 --freezeParameters MH -n .robustHesse.part6_multiSignalModel --cminDefaultMinimizerStrategy 0 -P r_ggH -P r_VBF --setParameterRanges r_ggH=0.5,2.5:r_VBF=-1,2 --robustHesse 1 --robustHesseSave 1 --saveFitResult\n

The output file robustHesse.robustHesse.part6_multiSignalModel.root stores the correlation matrix (h_correlation). This contains the correlations between all parameters including the nuisances. So if we are interested in the correlation between r_ggH and r_VBF, we first need to find which bin corresponds to these parameters:

root robustHesse.robustHesse.part6_multiSignalModel.root\n\nroot [1] h_correlation->GetXaxis()->GetBinLabel(19)\n(const char *) \"r_VBF\"\nroot [2] h_correlation->GetYaxis()->GetBinLabel(20)\n(const char *) \"r_ggH\"\nroot [3] h_correlation->GetBinContent(19,20)\n(double) -0.19822058\n
  • The two parameters of interest have a correlation coefficient of -0.198. This means the two parameters are somewhat anti-correlated. Does this match what we see in the 2D likelihood scan?
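
If you prefer not to look up the bin indices by hand, a short sketch like the following (hypothetical, not one of the tutorial scripts) finds them from the axis labels:

import ROOT\n\nf = ROOT.TFile.Open(\"robustHesse.robustHesse.part6_multiSignalModel.root\")\nh = f.Get(\"h_correlation\")\n# Map each parameter name to its bin index, then read off the (r_VBF, r_ggH) correlation\nlabels = {h.GetXaxis().GetBinLabel(i): i for i in range(1, h.GetNbinsX()+1)}\nprint(h.GetBinContent(labels[\"r_VBF\"], labels[\"r_ggH\"]))\n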
"},{"location":"tutorial2023/parametric_exercise/#impacts_1","title":"Impacts","text":"

We extract the impacts for each parameter of interest using the following commands:

combineTool.py -M Impacts -d datacard_part6_combined_multiSignalModel.root -m 125 --freezeParameters MH -n .impacts_part6_multiSignal --robustFit 1 --cminDefaultMinimizerStrategy 0 -P r_ggH -P r_VBF --doInitialFit\n\ncombineTool.py -M Impacts -d datacard_part6_combined_multiSignalModel.root -m 125 --freezeParameters MH -n .impacts_part6_multiSignal --robustFit 1 --cminDefaultMinimizerStrategy 0 -P r_ggH -P r_VBF --doFits\n\ncombineTool.py -M Impacts -d datacard_part6_combined_multiSignalModel.root -m 125 --freezeParameters MH -n .impacts_part6_multiSignal --robustFit 1 --cminDefaultMinimizerStrategy 0 -P r_ggH -P r_VBF -o impacts_part6.json\n\nplotImpacts.py -i impacts_part6.json -o impacts_part6_r_ggH --POI r_ggH\nplotImpacts.py -i impacts_part6.json -o impacts_part6_r_VBF --POI r_VBF\n
  • Look at the output PDF files. How does the ranking of the nuisance parameters change for the different signal strengths?
"},{"location":"tutorial2023/parametric_exercise/#advanced-exercises-to-be-added","title":"Advanced exercises (to be added)","text":"

The combine experts will include additional exercises here in due course. These will include:

  • Convolution of model pdfs: RooAddPdf
  • Application of the spurious signal method
  • Advanced physics models including parametrised signal strengths e.g. SMEFT
  • Mass fits
  • Two-dimensional parametric models
"},{"location":"tutorial2023_unfolding/unfolding_exercise/","title":"Likelihood Based Unfolding Exercise in Combine","text":""},{"location":"tutorial2023_unfolding/unfolding_exercise/#getting-started","title":"Getting started","text":"

To get started, you should have a working setup of Combine and CombineHarvester. This setup can be done following any of the installation instructions.

After setting up CMSSW, you can access the working directory for this tutorial which contains all of the inputs and scripts needed to run the unfolding fitting exercise:

cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/tutorials/tutorial_unfolding_2023/\n
"},{"location":"tutorial2023_unfolding/unfolding_exercise/#exercise-outline","title":"Exercise outline","text":"

The hands-on exercise is split into seven parts:

1) \"Simple\" Unfolding Experiment

2) Producing the Migration matrix from the datacards

3) Advanced Unfolding with more detector-level information and control regions

4) Extracting the expected intervals

5) Producing Impacts for multiple POIs

6) Unfold to the generator-level quantities

7) Extracting POI correlations from the FitDiagnostics output

Throughout the tutorial there are a number of questions and exercises for you to complete. These are shown in the boxes like this one.

Note that some additional information on unfolding in Combine is available here, including some details on regularization, which is not discussed in this tutorial.

"},{"location":"tutorial2023_unfolding/unfolding_exercise/#analysis-overview","title":"Analysis overview","text":"

In this tutorial we will look at the cross section measurement of one of the SM Higgs boson processes, VH, in the \\(H\\to b\\bar{b}\\) (VHbb) final state.

The measurement is performed within the Simplified Template Cross Section (STXS) framework, which provides the predictions in bins of the generator-level quantities \\(p_{T}(V)\\) and the number of additional jets. A maximum likelihood based unfolding is performed to measure the cross section in the generator-level bins defined by the STXS scheme. At the detector level we define appropriate categories to match the STXS bins as closely as possible, so that there is a good correspondence between the detector-level observable and the underlying generator-level quantity we are interested in.

Note that for this STXS measurement, as well as measuring the cross section as a function of the \\(p_{T}\\) of the vector boson, the measurement includes some information on the number of additional jets and is performed over multiple different production modes. However, it is common to focus on a single distribution (e.g. \\(p_{T}\\)) for a single process (e.g. \\(t\\bar{t}\\)).

In this tutorial we will focus on ZH production, with the Z boson decaying to charged leptons and the Higgs boson reconstructed as a resolved \\(b\\bar{b}\\) pair. We will also use only a subset of the Run 2 categories, so we will not achieve the same sensitivity as the full analysis. Note that the ggZH and ZH production modes are combined in the fit, since it is not possible to resolve them at this stage of the analysis. The STXS categories are defined independently of the Higgs decay channel, to streamline the combination of the cross section measurements.

In the first part of the tutorial, we will setup a relatively simple unfolding, where there is a single detector-level bin for every generator-level bin we are trying to measure. We will then perform a blind analysis using this setup to see the expected sensitivity.

In this simple version of the analysis, we use a series of datacards, one for each detector-level bin, implemented as a counting experiment. We then combine the datacards for the full measurement. It is also possible to implement the same analysis as a single datacard, passing a histogram with each of the detector-level bins. Either method can be used, depending on which is more practical for the analysis being considered.

In the second part of the tutorial we will perform the same measurement with a more advanced setup, making use of differential distributions per generator-level bin we are trying to measure, as well as control regions. By providing this additional information to the fit, we are able to achieve a better and more robust unfolding result. After checking the expected sensitivity, we will take a look at the impacts and pulls of the nuisance parameters. Then we will unblind and look at the results of the measurement, produce generator-level plots and provide the correlation matrix for our measured observables.

"},{"location":"tutorial2023_unfolding/unfolding_exercise/#simplified-unfolding","title":"Simplified unfolding","text":"

When determining the detector-level binning for any differential analysis, the main goal is to choose a binning that distinguishes the contributions from the various generator-level bins well. In the simplest case this can be done with a cut-based approach, i.e. applying the same binning for the detector-level observables as is applied to the generator-level quantities being measured. In this case, that means binning in \\(p_{T}(Z)\\) and \\(n_{\\text{add. jets}}\\). Due to the good lepton \\(p_{T}\\) resolution we can follow the original STXS scheme quite closely with the detector-level selection, with one exception: it is not possible to access the very low transverse momentum bin \\(p_{T}(Z)<75\\) GeV.

In the counting/regions directory you can find the datacards with five detector-level categories, each targeting a corresponding generator-level bin. Below you can find an example of the datacard for the detector-level bin with \\(p_{T}(Z)>400\\) GeV.

imax    1 number of bins\njmax    9 number of processes minus 1\nkmax    * number of nuisance parameters\n--------------------------------------------------------------------------------\n--------------------------------------------------------------------------------\nbin          vhbb_Zmm_gt400_13TeV\nobservation  12.0\n--------------------------------------------------------------------------------\nbin                                   vhbb_Zmm_gt400_13TeV   vhbb_Zmm_gt400_13TeV vhbb_Zmm_gt400_13TeV   vhbb_Zmm_gt400_13TeV     vhbb_Zmm_gt400_13TeV vhbb_Zmm_gt400_13TeV vhbb_Zmm_gt400_13TeV vhbb_Zmm_gt400_13TeV vhbb_Zmm_gt400_13TeV vhbb_Zmm_gt400_13TeV\nprocess                               ggZH_lep_PTV_GT400_hbb ZH_lep_PTV_GT400_hbb ZH_lep_PTV_250_400_hbb ggZH_lep_PTV_250_400_hbb Zj1b            Zj0b_c          Zj0b_udsg       VVLF            Zj2b            VVHF\nprocess                               -3                     -2                   -1                     0                        1               2               3               4               5               6\nrate                                  0.0907733              0.668303             0.026293               0.00434588               3.78735         2.58885         4.09457         0.413716        7.02731         0.642605\n--------------------------------------------------------------------------------\n\n

You can see the contributions from various background processes, namely Z+jets, \\(t\\bar{t}\\) and the single top, as well as the signal processes (ggZH and ZH) corresponding to the STXS scheme discussed above. Note that for each generator-level bin being measured, we assign a different process in combine. This is so that the signal strengths for each of their contributions can float independently in the measurement. Also note, that due to migrations, each detector-level bin will receive contributions from multiple generator-level bins.

One of the most important stages in the analysis design, is to make sure that the detector-level categories are well-chosen to target the corresponding generator-level processes.

To explicitly check the correspondence between the detector and generator level, one can plot the contributions of each of the generator-level bins in all of the detector-level bins. You can use the script provided on the tutorial git-lab page. This script uses CombineHarvester to loop over the detector-level bins and get the rate at which each of the signal processes (generator-level bins) contributes to that detector-level bin; this is then used to plot the migration matrix.

python scripts/get_migration_matrix.py counting/combined_ratesOnly.txt\n\n

The migration matrix shows the generator-level bins on the x-axis and the corresponding detector-level bins on the y-axis. The entries are normalized such that the sum of all contributions for a given generator-level bin sum up to 1. With this convention, the numbers in each bin represent the probability that an event from a given generator-level bin is reconstructed in a given detector-level bin if it is reconstructed at all within the considered bins.
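
For orientation, the normalization step amounts to scaling each column of the rate matrix so that it sums to one. A minimal sketch with placeholder numbers (not the tutorial rates) is:

import numpy as np\n\n# rates[reco_bin][gen_bin]: contribution of each generator-level bin to each detector-level bin\nrates = np.array([[4.0, 0.5, 0.0], [0.6, 3.5, 0.4], [0.0, 0.7, 2.8]])\nmigration = rates/rates.sum(axis=0, keepdims=True)  # each generator-level column now sums to 1\nprint(migration)\n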

Now that we checked the response matrix we can attempt the maximum likelihood unfolding. We can use the multiSignalModel physics model available in Combine, which assigns a parameter of interest poi to a process p within a bin b using the syntax --PO 'map=b/p:poi[init, min, max]' to linearly scale the normalisation of this process under the parameter of interest (POI) variations. To create the workspace we can run the following command:

text2workspace.py -m 125  counting/combined_ratesOnly.txt -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel  --PO verbose --PO 'map=.*/.*ZH_lep_PTV_75_150_hbb:r_zh_75_150[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_150_250_0J_hbb:r_zh_150_250noj[1,-5,5]'  --PO 'map=.*/.*ZH_lep_PTV_150_250_GE1J_hbb:r_zh_150_250wj[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_250_400_hbb:r_zh_250_400[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_GT400_hbb:r_zh_gt400[1,-5,5]' -o ws_counting.root\n

In the example given above, a signal POI is assigned to each generator-level bin independently of the detector-level bin. This allows the measurement to take migrations into account.

To extract the measurement let's run the initial fit first using the MultiDimFit method implemented in Combine to extract the best-fit values and uncertainties on all floating parameters:

combineTool.py -M MultiDimFit --datacard ws_counting.root --setParameters r_zh_250_400=1,r_zh_150_250noj=1,r_zh_75_150=1,r_zh_150_250wj=1,r_zh_gt400=1 --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400 -t -1 \n

With the option -t -1 we tell Combine to fit the Asimov dataset instead of the actual data. The option --setParameters <param>=<value> sets the initial value of the parameter named <param>. The option --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400 sets the POIs to the given comma-separated list, instead of the default one, r.

While the uncertainties on the parameters of interest (POIs) can be extracted in multiple ways, the most robust way is to run a likelihood scan for the POI corresponding to each generator-level bin; this allows you to spot discontinuities in the likelihood shape in case of problems with the fit or the model.

combineTool.py -M MultiDimFit --datacard ws_counting.root -t -1 --setParameters r_zh_250_400=1,r_zh_150_250noj=1,r_zh_75_150=1,r_zh_150_250wj=1,r_zh_gt400=1 --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400 --algo=grid --points=100 -P r_zh_75_150 --floatOtherPOIs=1 -n scan_r_zh_75_150\n\n

Now we can plot the likelihood scan and extract the expected intervals.

python scripts/plot1DScan.py higgsCombinescan_r_zh_75_150.MultiDimFit.mH120.root -o r_zh_75_150 --POI r_zh_75_150\n
  • Repeat for all POIs
"},{"location":"tutorial2023_unfolding/unfolding_exercise/#shape-analysis-with-control-regions","title":"Shape analysis with control regions","text":"

One of the advantages of the maximum likelihood unfolding is the flexibility to choose the analysis observable and include more information on the event kinematics, consequently improving the analysis sensitivity. This analysis benefits from the shape information of the DNN output trained to differentiate the VH(bb) signal from the SM backgrounds.

The datacards for this part of the exercise are located in full_model_datacards/, where you can find a separate datacard for each region within the full_model_datacards/regions directory, as well as a combined datacard full_model_datacards/comb_full_model.txt. In this case, each of the detector-level bins used in the unfolding above is now split into multiple bins according to the DNN output score. This provides extra discrimination power to separate the signal from the background and improve the measurement.

As you will find, the datacards also contain several background processes. To control them properly we will also add regions enriched in the respective backgrounds. Then we can define a common set of rate parameters for signal and control regions to scale the rates or other parameters affecting their shape.

For the shape datacards one has to specify the mapping of histograms to channels/processes, as described below:

shapes [process] [channel] [file] [nominal] [systematics_templates]\n

Then the shape nuisance parameters can be defined in the systematics block in the datacard. More details can be found in Combine documentation pages.

In many CMS analyses there are hundreds of nuisance parameters corresponding to various sources of systematic uncertainty.

When we unfold to the generator-level quantities we should remove the nuisances affecting the rate of the generator-level bins, i.e. when measuring a given cross-section such as \\(\\sigma_{\\textrm{gen1}}\\), the nuisance parameters should not change the value of that parameter itself; they should only change the relationship between that parameter and the observations. This means that, for example, effects of renormalization and factorization scales on the generator-level cross section within each bin need to be removed. Only their effects on the detector-level distribution through changes of shape within each bin as well as acceptances and efficiencies should be considered.

For this analysis, that means removing the lnN nuisance parameters THU_ZH_mig* and THU_ZH_inc, keeping only the acceptance shape uncertainties THU_ZH_acc and THU_ggZH_acc, which do not scale the inclusive cross sections by construction. In this analysis the normalisation effects in the THU_ZH_acc and THU_ggZH_acc templates were already removed from the shape histograms. Removing the normalization effects can be achieved by removing the parameters from the datacard. Alternatively, you can freeze the respective nuisance parameters with the option --freezeParameters par_name1,par_name2, or you can create a group, following the syntax given below, at the end of the combined datacard and freeze the parameters with the --freezeNuisanceGroups group_name option.

[group_name] group = uncertainty_1 uncertainty_2 ... uncertainty_N\n

Now we can create the workspace using the same multiSignalModel as before:

text2workspace.py -m 125  full_model_datacards/comb_full_model.txt -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel  --PO verbose --PO 'map=.*/.*ZH_lep_PTV_75_150_hbb:r_zh_75_150[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_150_250_0J_hbb:r_zh_150_250noj[1,-5,5]'  --PO 'map=.*/.*ZH_lep_PTV_150_250_GE1J_hbb:r_zh_150_250wj[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_250_400_hbb:r_zh_250_400[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_GT400_hbb:r_zh_gt400[1,-5,5]' --for-fits --no-wrappers --X-pack-asympows --optimize-simpdf-constraints=cms --use-histsum -o ws_full.root\n

As you might have noticed, we are using a few extra options (--for-fits --no-wrappers --X-pack-asympows --optimize-simpdf-constraints=cms --use-histsum) to create the workspace. They are needed to construct a more optimised pdf using the CMSHistSum class implemented in Combine, which significantly lowers the memory consumption.

  • Following the instructions given earlier, create the workspace and run the initial fit with -t -1.

Since this time the datacards include shape uncertainties, as well as additional categories to improve the background description, the fit might take much longer. However, we can submit jobs to a batch system using combineTool.py and have results ready to look at in a few minutes.

combineTool.py -M MultiDimFit -d ws_full.root --setParameters r_zh_250_400=1,r_zh_150_250noj=1,r_zh_75_150=1,r_zh_150_250wj=1,r_zh_gt400=1 --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400  -t -1 --X-rtd FAST_VERTICAL_MORPH --algo=grid --points=50 --floatOtherPOIs=1 -n .scans_blinded --job-mode condor --task-name scans_zh  --split-points 1 --generate P:n::r_zh_gt400,r_zh_gt400:r_zh_250_400,r_zh_250_400:r_zh_150_250wj,r_zh_150_250wj:r_zh_150_250noj,r_zh_150_250noj:r_zh_75_150,r_zh_75_150\n

The option --X-rtd FAST_VERTICAL_MORPH is added here and for all combineTool.py -M MultiDimFit ... to speed up the minimisation.

The job submission is handled by CombineHarvester; the combination of options --job-mode condor --task-name scans_zh --split-points 1 --generate P:n::r_zh_gt400,r_zh_gt400:r_zh_250_400,r_zh_250_400:r_zh_150_250wj,r_zh_150_250wj:r_zh_150_250noj,r_zh_150_250noj:r_zh_75_150,r_zh_75_150 will submit one job to HTCondor for each POI. The --generate option is being used to automatically generate jobs, attaching the options -P <POI> -n <name> with each of the pairs of values <POI>,<name> specified between the colons. You can add the --dry-run option to create the submission files first and check them, and then submit the jobs with condor_submit condor_scans_zh.sub.

If you are running the tutorial from a cluster where HTCondor is not available, you can also submit the jobs to a slurm system by changing --job-mode condor to --job-mode slurm.

After all jobs are completed we can combine the files for each POI:

for p in r_zh_75_150 r_zh_150_250noj r_zh_150_250wj r_zh_250_400 r_zh_gt400\ndo\n    hadd -k -f scan_${p}_blinded.root higgsCombine.scans_blinded.${p}.POINTS.*.MultiDimFit.mH120.root\ndone\n

And finally plot the likelihood scans

python scripts/plot1DScan.py scan_r_zh_75_150_blinded.root  -o scan_r_zh_75_150_blinded --POI r_zh_75_150 --json summary_zh_stxs_blinded.json\n

"},{"location":"tutorial2023_unfolding/unfolding_exercise/#impacts","title":"Impacts","text":"

One of the important tests before we move to the unblinding stage is to check the impacts of the nuisance parameters on each POI. For this we can run combineTool.py with the -M Impacts method. We start with the initial fit, which should take about 20 minutes (a good time to have a coffee break!):

combineTool.py -M Impacts -d ws_full.root -m 125 --robustFit 1 --doInitialFit --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400 --X-rtd FAST_VERTICAL_MORPH\n

Note that it is important to add the option --redefineSignalPOIs [list of parameters], to produce the impacts for all POIs we defined when the workspace was created with the multiSignalModel.

After the initial fit is completed we can perform the likelihood scans for each nuisance parameter. We will submit the jobs to the HTCondor to speed up the process.

combineTool.py -M Impacts -d ws_full.root -m 125 --robustFit 1 --doFits --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400 --job-mode condor --task-name impacts_zh --X-rtd FAST_VERTICAL_MORPH \n

Now we can combine the results into the .json format and use it to produce the impact plots.

combineTool.py -M Impacts -d ws_full.root -m 125 --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400 --output impacts.json \n\nplotImpacts.py -i impacts.json -o impacts_r_zh_75_150 --POI r_zh_75_150\n

  • Do you observe differences in the impact plots for the different POIs? Do these differences make sense to you?

"},{"location":"tutorial2023_unfolding/unfolding_exercise/#unfolded-measurements","title":"Unfolded measurements","text":"

Now that we have studied the nuisance parameter impacts for each POI, we can finally perform the measurement. Note that for the purposes of the tutorial, we are skipping further checks and validation that you should do in your analysis, namely the goodness-of-fit test and the post-fit plots of the folded observables. Both of these checks were detailed in the previous exercises, which you can find under the following link.

At this stage we'll run the MultiDimFit again scanning each POI to calculate the intervals, but this time we'll remove the -t -1 option to extract the unblinded results.

Also, since we want to unfold the measurements to the generator-level observables, i.e. extract the cross sections, we remove the theoretical uncertainties affecting the rates of the signal processes. We can do this by freezing them with --freezeNuisanceGroups <group_name>, using the group_name you assigned earlier in the tutorial.

Now plot the scans and collect the measurements in the json file summary_zh_stxs.json.

python scripts/plot1DScan.py scan_r_zh_75_150.root -o r_zh_75_150 --POI r_zh_75_150 --json summary_zh_stxs.json  \n

Repeat the same command for other POIs to fill the summary_zh_stxs.json, which can then be used to make the cross section plot by multiplying the standard model cross sections by the signal strengths' best-fit values as shown below.

python scripts/make_XSplot.py summary_zh_stxs.json\n
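
For orientation, the underlying arithmetic is simply a bin-by-bin product of the SM prediction and the best-fit signal strength. The values below are placeholders, not the tutorial results:

# Measured cross section = SM prediction x best-fit signal strength, per STXS bin\nsm_xs = {\"r_zh_75_150\": 1.0, \"r_zh_150_250noj\": 1.0}     # placeholder SM cross sections\nmu_hat = {\"r_zh_75_150\": 1.05, \"r_zh_150_250noj\": 0.92}  # placeholder best-fit values from the scans\nfor poi in sm_xs:\n    print(poi, sm_xs[poi]*mu_hat[poi])\n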

"},{"location":"tutorial2023_unfolding/unfolding_exercise/#poi-correlations","title":"POI correlations","text":"

In addition to the cross section measurements it is very important to publish the covariance or correlation information of the measured cross sections. This allows the measurement to be properly interpreted or reused in combined fits.

The correlation matrix or covariance matrix can be extracted from the results after the fit. Here we can use the FitDiagnostics or MultiDimFit method.

combineTool.py -M FitDiagnostics --datacard ws_full.root --setParameters r_zh_250_400=1,r_zh_150_250noj=1,r_zh_75_150=1,r_zh_150_250wj=1,r_zh_gt400=1 --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400  --robustHesse 1 -n .full_model --X-rtd FAST_VERTICAL_MORPH\n

Then the RooFitResult, containing the correlation matrix, can be found in the fitDiagnostics.full_model.root file under the name fit_s. The script plotCorrelations_pois.py from the exercise git-lab repository can help to plot the correlation matrix.

python scripts/plotCorrelations_pois.py -i fitDiagnostics.full_model.root:fit_s -p r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400\n\n

"},{"location":"what_combine_does/fitting_concepts/","title":"Likelihood based fitting","text":"

\"Fitting\" simply means estimating some parameters of a model (or really a set of models) based on data. Likelihood-based fitting does this through the likelihood function.

In frequentist frameworks, this typically means doing maximum likelihood estimation. In bayesian frameworks, usually posterior distributions of the parameters are calculated from the likelihood.

"},{"location":"what_combine_does/fitting_concepts/#fitting-frameworks","title":"Fitting Frameworks","text":"

Likelihood fits typically either follow a frequentist framework of maximum likelihood estimation, or a bayesian framework of updating estimates to find posterior distributions given the data.

"},{"location":"what_combine_does/fitting_concepts/#maximum-likelihood-fits","title":"Maximum Likelihood fits","text":"

A maximum likelihood fit means finding the values of the model parameters \\((\\vec{\\mu}, \\vec{\\nu})\\) which maximize the likelihood, \\(\\mathcal{L}(\\vec{\\mu},\\vec{\\nu})\\). The values which maximize the likelihood are the parameter estimates, denoted with a \"hat\" (\\(\\hat{}\\)):

\\[(\\vec{\\hat{\\mu}}, \\vec{\\hat{\\nu}}) \\equiv \\underset{\\vec{\\mu},\\vec{\\nu}}{\\operatorname{argmax}} \\mathcal{L}(\\vec{\\mu}, \\vec{\\nu})\\]

These values provide point estimates for the parameter values.

Because the likelihood is equal to the probability of observing the data given the model, the maximum likelihood estimate finds the parameter values for which the data is most probable.
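
As a toy numerical illustration of this idea (not combine code), the maximum likelihood estimate of a Gaussian mean with unit width is the value that minimizes the negative log-likelihood:

import numpy as np\nfrom scipy.optimize import minimize_scalar\n\ndata = np.array([124.8, 125.3, 125.1, 124.9, 125.2])  # toy observations\nnll = lambda mu: 0.5*np.sum((data-mu)**2)  # -log L up to a constant, for unit width\nprint(minimize_scalar(nll).x)  # coincides with the sample mean here\n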

"},{"location":"what_combine_does/fitting_concepts/#bayesian-posterior-calculation","title":"Bayesian Posterior Calculation","text":"

In a bayesian framework, the likelihood represents the probability of observing the data given the model and some prior probability distribution over the model parameters.

The prior probability of the parameters, \\(\\pi(\\vec{\\Phi})\\), is updated based on the data to provide the posterior distribution

\\[ p(\\vec{\\Phi};\\mathrm{data}) = \\frac{ p(\\mathrm{data};\\vec{\\Phi}) \\pi(\\vec{\\Phi}) }{\\int p(\\mathrm{data};\\vec{\\Phi}') \\pi(\\vec{\\Phi}') \\mathrm{d}\\vec{\\Phi}' } = \\frac{ \\mathcal{L}(\\vec{\\Phi}) \\pi(\\vec{\\Phi}) }{ \\int \\mathcal{L}(\\vec{\\Phi}') \\pi(\\vec{\\Phi}') \\mathrm{d}\\vec{\\Phi}' }\\]

The posterior distribution \\(p(\\vec{\\Phi};\\mathrm{data})\\) defines the updated belief about the parameters \\(\\vec{\\Phi}\\).
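
Continuing the same invented one-bin counting example, the sketch below evaluates the posterior numerically on a grid, using a flat prior on the signal strength and normalizing by the integral in the denominator.

import numpy as np
from scipy.stats import poisson

n_obs, s, b = 25, 10.0, 15.0          # invented observed count and expected yields
mu = np.linspace(0, 5, 1001)          # grid of signal-strength values
dmu = mu[1] - mu[0]

likelihood = poisson.pmf(n_obs, mu * s + b)
prior = np.ones_like(mu)              # flat prior pi(mu)

posterior = likelihood * prior
posterior /= posterior.sum() * dmu    # normalize by the integral over mu

print('posterior mean =', (mu * posterior).sum() * dmu)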

"},{"location":"what_combine_does/fitting_concepts/#methods-for-considering-subsets-of-models","title":"Methods for considering subsets of models","text":"

Often, one is interested in some particular aspect of a model. This may be for example information related to the parameters of interest, but not the nuisance parameters. In this case, one needs a method for specifying precisely what is meant by a model considering only those parameters of interest.

There are several methods for considering sub-models, each of which has its own interpretation and use cases.

"},{"location":"what_combine_does/fitting_concepts/#conditioning","title":"Conditioning","text":"

Conditional Sub-models can be made by simply restricting the values of some parameters. The conditional likelihood of the parameters \\(\\vec{\\mu}\\) conditioned on particular values of the parameters \\(\\vec{\\nu}\\) is:

\\[ \\mathcal{L}(\\vec{\\mu},\\vec{\\nu}) \\xrightarrow{\\mathrm{conditioned\\ on\\ } \\vec{\\nu} = \\vec{\\nu}_0} \\mathcal{L}(\\vec{\\mu}) = \\mathcal{L}(\\vec{\\mu},\\vec{\\nu}_0) \\]"},{"location":"what_combine_does/fitting_concepts/#profiling","title":"Profiling","text":"

The profiled likelihood \\(\\mathcal{L}(\\vec{\\mu})\\) is defined from the full likelihood, \\(\\mathcal{L}(\\vec{\\mu},\\vec{\\nu})\\), such that for every point \\(\\vec{\\mu}\\) it is equal to the full likelihood at \\(\\vec{\\mu}\\) maximized over \\(\\vec{\\nu}\\).

\\[ \\mathcal{L}(\\vec{\\mu},\\vec{\\nu}) \\xrightarrow{\\mathrm{profiling\\ } \\vec{\\nu}} \\mathcal{L}({\\vec{\\mu}}) = \\max_{\\vec{\\nu}} \\mathcal{L}(\\vec{\\mu},\\vec{\\nu})\\]

In some sense, the profiled likelihood is the best estimate of the likelihood at every point \\(\\vec{\\mu}\\); it is sometimes also denoted with the double-hat notation \\(\\mathcal{L}(\\vec{\\mu},\\vec{\\hat{\\hat{\\nu}}}(\\vec{\\mu}))\\).
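
As a numerical illustration (not using combine), the sketch below profiles a nuisance parameter in an invented one-bin counting model with a signal strength mu and a constrained background-scaling nuisance nu: for each value of mu, the likelihood is maximized over nu.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson, norm

n_obs, s, b = 25, 10.0, 15.0   # invented observed count and nominal yields

def nll(mu, nu):
    # Poisson term plus a unit-gaussian constraint on the nuisance parameter nu
    return -poisson.logpmf(n_obs, mu * s + b * (1 + 0.1 * nu)) - norm.logpdf(nu)

def profiled_nll(mu):
    # minimize over nu at fixed mu: -log L(mu, nu_hat_hat(mu))
    return minimize_scalar(lambda nu: nll(mu, nu), bounds=(-5, 5), method='bounded').fun

for mu in (0.5, 1.0, 1.5):
    print(mu, profiled_nll(mu))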

"},{"location":"what_combine_does/fitting_concepts/#marginalization","title":"Marginalization","text":"

Marginalization is a procedure for producing a probability distribution \\(p(\\vec{\\mu};\\mathrm{data})\\) for a set of parameters \\(\\vec{\\mu}\\), which are only a subset of the parameters in the full distribution \\(p(\\vec{\\mu},\\vec{\\nu};\\mathrm{data})\\). The marginal probability density \\(p(\\vec{\\mu})\\) is defined such that for every point \\(\\vec{\\mu}\\) it is equal to the probability at \\(\\vec{\\mu}\\) integrated over \\(\\vec{\\nu}\\).

\\[ p(\\vec{\\mu},\\vec{\\nu}) \\xrightarrow{\\mathrm{marginalizing\\ } \\vec{\\nu}} p({\\vec{\\mu}}) = \\int_{\\vec{\\nu}} p(\\vec{\\mu},\\vec{\\nu})\\]

The marginalized probability \\(p(\\vec{\\mu})\\) is the probability for the parameter values \\(\\vec{\\mu}\\) taking into account all possible values of \\(\\vec{\\nu}\\).

Marginalized likelihoods can also be defined, by their relationship to the probability distributions.
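
For comparison with profiling, the sketch below marginalizes the same invented two-parameter counting model numerically, integrating the joint (prior times likelihood) density over the nuisance parameter on a grid.

import numpy as np
from scipy.stats import poisson, norm

n_obs, s, b = 25, 10.0, 15.0   # invented observed count and nominal yields
mu = np.linspace(0, 3, 301)
nu = np.linspace(-5, 5, 201)
MU, NU = np.meshgrid(mu, nu, indexing='ij')

# joint (unnormalized) posterior density: flat prior on mu, gaussian prior on nu
joint = poisson.pmf(n_obs, MU * s + b * (1 + 0.1 * NU)) * norm.pdf(NU)

# integrate out nu, then normalize over mu
p_mu = joint.sum(axis=1) * (nu[1] - nu[0])
p_mu /= p_mu.sum() * (mu[1] - mu[0])
print('marginalized p(mu) peaks at mu =', mu[np.argmax(p_mu)])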

"},{"location":"what_combine_does/fitting_concepts/#parameter-uncertainties","title":"Parameter Uncertainties","text":"

Parameter uncertainties describe regions of parameter values which are considered reasonable, rather than a single point estimate. These can be defined either in terms of frequentist confidence regions or bayesian credibility regions.

In both cases the region is defined by a confidence or credibility level \\(CL\\), which quantifies the meaning of the region. For frequentist confidence regions, the confidence level \\(CL\\) describes how often the confidence region will contain the true parameter values if the model is a sufficiently accurate approximation of the truth. For bayesian credibility regions, the credibility level \\(CL\\) describes the bayesian probability that the true parameter value is in that region under the given model.

The confidence or credibility regions are described by a set of points \\(\\{ \\vec{\\mu} \\}_{\\mathrm{CL}}\\) which meet some criteria. In most situations of interest, the credibility region or confidence region for a single parameter, \\(\\mu\\), is effectively described by an interval:

\\[ \\{ \\mu \\}_{\\mathrm{CL}} = [ \\mu^{-}_{\\mathrm{CL}}, \\mu^{+}_{\\mathrm{CL}} ] \\]

Typically indicated as:

\\[ \\mu = X^{+\\mathrm{up}}_{-\\mathrm{down}} \\]

or, if symmetric intervals are used:

\\[ \\mu = X \\pm \\mathrm{unc.} \\]"},{"location":"what_combine_does/fitting_concepts/#frequentist-confidence-regions","title":"Frequentist Confidence Regions","text":"

Frequentist confidence regions are random variables of the observed data. These are very often the construction used to define the uncertainties reported on a parameter.

If the same experiment is repeated multiple times, different data will be observed each time and a different confidence set \\(\\{ \\vec{\\mu}\\}_{\\mathrm{CL}}^{i}\\) will be found for each experiment. If the data are generated by the model with some set of values \\(\\vec{\\mu}_{\\mathrm{gen}}\\), then the fraction of the regions \\(\\{ \\vec{\\mu}\\}_{\\mathrm{CL}}^{i}\\) which contain the values \\(\\vec{\\mu}_{\\mathrm{gen}}\\) will be equal to the confidence level \\({\\mathrm{CL}}\\). The fraction of intervals which contain the generating parameter value is referred to as the \"coverage\".

From first principles, the intervals can be constructed using the Neyman construction.

In practice, the likelihood can be used to construct confidence regions for a set of parameters \\(\\vec{\\mu}\\) by using the profile likelihood ratio:

\\[ \\Lambda \\equiv \\frac{\\mathcal{L}(\\vec{\\mu},\\vec{\\hat{\\nu}}(\\vec{\\mu}))}{\\mathcal{L}(\\vec{\\hat{\\mu}},\\vec{\\hat{\\nu}})} \\]

i.e. the ratio of the profile likelihood at point \\(\\vec{\\mu}\\) to the maximum likelihood. For technical reasons, the negative logarithm of this quantity is typically used in practice.

Each point \\(\\vec{\\mu}\\) can be tested to see if it is in the confidence region, by checking the value of the likelihood ratio at that point and comparing it to the expected distribution if that point were the true generating value of the data.

\\[ \\{ \\vec{\\mu} \\}_{\\mathrm{CL}} = \\{ \\vec{\\mu} : -\\log(\\Lambda) \\lt \\gamma_{\\mathrm{CL}}(\\vec{\\mu}) \\} \\]

The cutoff value \\(\\gamma_{\\mathrm{CL}}\\) must be chosen to achieve the desired coverage of the confidence set.

Under some conditions, the value of \\(\\gamma_{\\mathrm{CL}}\\) is known analytically for any desired confidence level, and is independent of \\(\\vec{\\mu}\\), which greatly simplifies estimating confidence regions.

Constructing Frequentist Confidence Regions in Practice

When a single fit is performed by some numerical minimization program and parameter values are reported along with some uncertainty values, they are usually reported as frequentist intervals. The MINUIT minimizer which evaluates likelihood functions has two methods for estimating parameter uncertainties.

These two methods are the most commonly used methods for estimating confidence regions in a fit: the minos method and the hessian method. In both cases, Wilks' theorem is assumed to hold at all points in parameter space, such that \\(\\gamma_{\\mathrm{CL}}\\) is independent of \\(\\vec{\\mu}\\).

When \\(\\gamma_{\\mathrm{CL}}\\) is independent of \\(\\vec{\\mu}\\) the problem simplifies to finding the boundaries where \\(-\\log(\\Lambda) = \\gamma_{\\mathrm{CL}}\\). This boundary point is referred to as the \"crossing\", i.e. where \\(-\\log(\\Lambda)\\) crosses the threshold value.

"},{"location":"what_combine_does/fitting_concepts/#the-minos-method-for-estimating-confidence-regions","title":"The Minos method for estimating confidence regions","text":"

In the minos method, once the best fit point \\(\\vec{\\hat{\\mu}}\\) is determined, the confidence region for any parameter \\(\\mu_i\\) can be found by moving away from its best fit value \\(\\hat{\\mu}_i\\). At each value of \\(\\mu_i\\), the other parameters are profiled, and \\(-\\log{\\Lambda}\\) is calculated.

Following this procedure, the value of \\(\\mu_i\\) is scanned to find the boundaries of the confidence region, where \\(-\\log{\\Lambda} = \\gamma_{\\mathrm{CL}}\\).

The search is performed in both directions, away from the best fit value of the parameter and the two crossings are taken as the borders of the confidence region.

This procedure has to be followed separately for each parameter \\(\\mu_i\\) for which a confidence interval is calculated.
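
The sketch below illustrates a minos-style scan for an invented one-bin counting model (which has no nuisance parameters, so nothing needs to be profiled): the negative log-likelihood ratio is scanned away from the best-fit value, and the crossings with \\(\\gamma_{\\mathrm{CL}} = 0.5\\) (the 68% threshold for a single parameter under Wilks' theorem) are located with a root finder.

import numpy as np
from scipy.optimize import minimize_scalar, brentq
from scipy.stats import poisson

n_obs, s, b = 25, 10.0, 15.0   # invented one-bin counting model

def nll(mu):
    return -poisson.logpmf(n_obs, mu * s + b)

best = minimize_scalar(nll, bounds=(0, 10), method='bounded')
mu_hat, nll_min = best.x, best.fun

def dnll(mu):
    # -log(Lambda): negative log-likelihood ratio with respect to the best fit
    return nll(mu) - nll_min

gamma = 0.5   # 68% CL crossing for one parameter
lo = brentq(lambda m: dnll(m) - gamma, 0.0, mu_hat)
hi = brentq(lambda m: dnll(m) - gamma, mu_hat, 10.0)
print('68% CL interval:', lo, hi)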

"},{"location":"what_combine_does/fitting_concepts/#the-hessian-method-for-estimating-confidence-regions","title":"The Hessian method for estimating confidence regions","text":"

The Hessian method relies on the second derivatives (i.e. the hessian) of the likelihood at the best fit point.

By assuming that the shape of the likelihood function is well described by its second-order approximation, the values at which \\(-\\log(\\Lambda) = \\gamma_{\\mathrm{CL}}\\) can be calculated analytically, without the need for a search:

\\[ \\mu_i^{\\mathrm{crossing}} - \\hat{\\mu}_i \\propto \\left(\\frac{\\partial^2{(-\\log\\mathcal{L}(\\vec{\\hat{\\mu}}))}}{\\partial\\mu_i^2}\\right)^{-1/2} \\]

By computing and then inverting the full hessian matrix, all individual confidence regions and the full covariance matrix are determined. By construction, this method always reports symmetric confidence intervals, as it assumes that the likelihood is well described by a second order expansion.
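
For the same invented one-bin counting model, the hessian approach reduces to estimating the second derivative of the negative log-likelihood at the minimum; its inverse square root then gives the symmetric uncertainty. A minimal sketch:

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

n_obs, s, b = 25, 10.0, 15.0   # invented one-bin counting model

def nll(mu):
    return -poisson.logpmf(n_obs, mu * s + b)

mu_hat = minimize_scalar(nll, bounds=(0, 10), method='bounded').x

# second derivative of -log L at the minimum, by central finite differences
h = 1e-3
d2 = (nll(mu_hat + h) - 2 * nll(mu_hat) + nll(mu_hat - h)) / h**2

sigma = 1.0 / np.sqrt(d2)   # symmetric 68% uncertainty in the gaussian approximation
print('mu =', mu_hat, '+/-', sigma)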

"},{"location":"what_combine_does/fitting_concepts/#bayesian-credibility-regions","title":"Bayesian Credibility Regions","text":"

Often the full posterior probability distribution is summarized in terms of some credible region which contains some specified portion of the posterior probability of the parameter.

\\[ \\{ \\vec{\\mu} \\}_{\\mathrm{CL}} = \\{ \\vec{\\mu} : \\vec{\\mu} \\in \\Omega, \\int_{\\Omega} p(\\vec{\\mu}';\\mathrm{data})\\ \\mathrm{d}\\vec{\\mu}' = \\mathrm{CL} \\}\\]

The credible region represents a region in which the bayesian probability of the parameter being in that region is equal to the chosen Credibility Level.

"},{"location":"what_combine_does/introduction/","title":"Introduction And Capabilities","text":"

Combine is a tool for making statistical analyses based on a model of expected observations and a dataset. Example statistical analyses are claiming discovery of a new particle or process, setting limits on the existence of new physics, and measuring cross sections.

The package has no physics-specific knowledge: it is completely agnostic to the interpretation of the analysis being performed, but its usage and development are based around common cases in High Energy Physics. This documentation is a description of what combine does and how you can use it to run your analyses.

Roughly, combine does three things:

  1. Helps you to build a statistical model of expected observations;
  2. Runs statistical tests on the model and observed data;
  3. Provides tools for validating, inspecting, and understanding the model and the statistical tests.

Combine can be used for analyses in HEP ranging from simple counting experiments to unfolded measurements, new physics searches, combinations of measurements, and EFT fits.

"},{"location":"what_combine_does/introduction/#model-building","title":"Model Building","text":"

Combine provides a powerful, human-readable, and lightweight interface for building likelihood models for both binned and unbinned data. The likelihood definition allows the user to define many processes which contribute to the observation, as well as multiple channels which may be fit simultaneously.

Furthermore, combine provides a powerful and intuitive interface for combining models, as it was originally developed for combinations of Higgs boson analyses at the CMS experiment.

The interface simplifies many common tasks, while providing many options for customization. Common nuisance parameter types are defined for easy use, while user-defined functions can also be provided. Input histograms defining the model can be provided in ROOT format, or in other tabular formats compatible with pandas.

Custom physics models can be defined in python which determine how the parameters of interest alter the model, and a number of predefined models are provided by default.

A number of tools are also provided for run-time alterations of the model, allowing for straightforward comparisons of alternative models.

"},{"location":"what_combine_does/introduction/#statistical-tests","title":"Statistical Tests","text":"

Combine can be used for statistical tests in frequentist or bayesian frameworks, as well as to perform some hybrid frequentist-bayesian analysis tasks.

Combine implements various methods for commonly used statistical tests in high energy physics, including for discovery, limit setting, and parameter estimation. Statistical tests can be customized to use various test statistics and confidence levels, as well as providing different output formats.

A number of asymptotic methods, relying on Wilks' theorem and valid under appropriate conditions, are implemented for fast evaluation. Generation of pseudo-data from the model can also be performed, and tests are implemented to run automatically over empirical distributions without relying on asymptotic approximations. Pseudo-data generation and fitting over the pseudo-data can be customized in a number of ways.

"},{"location":"what_combine_does/introduction/#validation-and-inspection","title":"Validation and Inspection","text":"

Combine provides tools for inspecting the model for things like potentially problematic input templates.

Various methods are provided for inspecting the likelihood function and the performance of the fits.

Methods are provided for comparing pre-fit and post-fit results of all values, including nuisance parameters, and summaries of the results can be produced.

Plotting utilities allow the pre- and post-fit model expectations and their uncertainties to be plotted, as well as plotted summaries of debugging steps such as the nuisance parameter values and likelihood scans.

"},{"location":"what_combine_does/model_and_likelihood/","title":"Observation Models and Likelihoods","text":""},{"location":"what_combine_does/model_and_likelihood/#the-observation-model","title":"The Observation Model","text":"

The observation model, \\(\\mathcal{M}( \\vec{\\Phi})\\) defines the probability for any set of observations given specific values of the input parameters of the model \\(\\vec{\\Phi}\\). The probability for any observed data is denoted:

\\[ p_{\\mathcal{M}}(\\mathrm{data}; \\vec{\\Phi} ) \\]

where the subscript \\(\\mathcal{M}\\) is given here to remind us that these are the probabilities according to this particular model (though usually we will omit it for brevity).

Combine is designed for counting experiments, where the number of events with particular features are counted. The events can either be binned, as in histograms, or unbinned, where continuous values are stored for each event. The event counts are assumed to be of independent events, such as individual proton-proton collisions, which are not correlated with each other.

The event-count portion of the model consists of a sum over different processes. The expected observations, \\(\\vec{\\lambda}\\), are then the sum of the expected observations for each of the processes, \\(\\vec{\\lambda} =\\sum_{p} \\vec{\\lambda}_{p}\\).

The model can also be composed of multiple channels, in which case the expected observation is the set of all expected observations from the various channels \\(\\vec{\\lambda} = \\{ \\vec{\\lambda}_{c1}, \\vec{\\lambda}_{c2}, .... \\vec{\\lambda}_{cN}\\}\\).

The model can also include data and parameters related to non-count values, such as the observed luminosity or detector calibration constant. These non-count data are usually considered as auxiliary information which are used to constrain our expectations about the observed event counts.

The full model therefore defines the probability of any given observations over all the channels, given all the processes and model parameters.

Combining full models is possible by combining their channels, assuming that the channels are mutually independent.

A Simple Example

Consider performing an analysis searching for a Higgs boson by looking for events where the Higgs decays into two photons.

The event count data may be binned histograms of the number of two-photon events in different bins of the photons' invariant mass. The expected counts would include signal contributions from processes where a Higgs boson is produced, as well as background contributions from processes where two photons are produced through other mechanisms, like radiation off a quark. The expected counts may also depend on parameters such as the energy resolution of the measured photons and the total luminosity of collisions being considered in the dataset; these can be parameterized in the model as auxiliary information.

The analysis itself might be split into multiple channels, targeting different Higgs production modes with different event selection criteria. Furthermore, the analysis may eventually be combined with other analyses, such as a measurement targeting Higgs production where the Higgs boson decays into four leptons, rather than two photons.

Combine provides the functionality for building the statistical models and combining all the channels or analyses together into one common analysis.

"},{"location":"what_combine_does/model_and_likelihood/#sets-of-observation-models","title":"Sets of Observation Models","text":"

We are typically not interested in a single model, but in a set of models, parameterized by a set of real numbers representing possible versions of the model.

Model parameters include the parameters of interest ( \\(\\vec{\\mu}\\), those being measured such as a cross section) as well as nuisance parameters (\\(\\vec{\\nu}\\)), which may not be of interest but still affect the model expectation.

Combine provides tools and interfaces for defining the model as pre-defined or user-defined functions of the input parameters. In practice, however, there are a number of most commonly used functional forms which define how the expected events depend on the model parameters. These are discussed in detail in the context of the full likelihood below.

"},{"location":"what_combine_does/model_and_likelihood/#the-likelihood","title":"The Likelihood","text":"

For any given model, \\(\\mathcal{M}(\\vec{\\Phi})\\), the likelihood defines the probability of observing a given dataset. It is numerically equal to the probability of observing the data, given the model.

\\[ \\mathcal{L}_\\mathcal{M}(\\vec{\\Phi}) = p_{\\mathcal{M}}(\\mathrm{data};\\vec{\\Phi}) \\]

Note, however, that the likelihood is a function of the model parameters, not the data, which is why we distinguish it from the probability itself.

The likelihood in combine takes the general form:

\\[ \\mathcal{L} = \\mathcal{L}_{\\textrm{primary}} \\cdot \\mathcal{L}_{\\textrm{auxiliary}} \\]

Where \\(\\mathcal{L}_{\\mathrm{primary}}\\) is equal to the probability of observing the event count data for a given set of model parameters, and \\(\\mathcal{L}_{\\mathrm{auxiliary}}\\) represents some external constraints on the parameters. The constraint terms may encode previous measurements (such as of the jet energy scale) or prior beliefs about the value some parameter in the model should have.

Both \\(\\mathcal{L}_{\\mathrm{primary}}\\) and \\(\\mathcal{L}_{\\mathrm{auxiliary}}\\) can be composed of many sublikelihoods, for example for observations of different bins and constraints on different nuisance parameters.

This form is entirely general. However, as with the model itself, there are typical forms that the likelihood takes which will cover most use cases, and for which combine is primarily designed.

"},{"location":"what_combine_does/model_and_likelihood/#primary-likelihoods-for-binned-data","title":"Primary Likelihoods for binned data","text":"

For a binned likelihood, the probability of observing a certain number of counts, given a model, takes on a simple form. For each bin:

\\[ \\mathcal{L}_{\\mathrm{bin}}(\\vec{\\Phi}) = \\mathrm{Poiss}(n_{\\mathrm{obs}}; n_{\\mathrm{exp}}(\\vec{\\Phi})) \\]

i.e. it is a poisson distribution with the mean given by the expected number of events in that bin. The full primary likelihood for binned data is simply the product of each of the bins' likelihoods:

\\[ \\mathcal{L}_\\mathrm{primary} = \\prod_\\mathrm{bins} \\mathcal{L}_\\mathrm{bin}. \\]

This is the underlying likelihood model used for every binned analysis. The freedom in the analysis comes in how \\(n_\\mathrm{exp}\\) depends on the model parameters, and the constraints that are placed on those parameters.
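
As a small numerical illustration of this form (outside of combine), the sketch below evaluates the binned log-likelihood for an invented three-bin model, summing the Poisson log-probabilities over bins.

import numpy as np
from scipy.stats import poisson

# invented observed counts and model expectations for three bins
n_obs = np.array([12, 20, 7])

def n_exp(mu):
    signal = np.array([2.0, 5.0, 1.0])
    background = np.array([10.0, 14.0, 6.0])
    return mu * signal + background

def log_likelihood(mu):
    # product over bins becomes a sum of log Poisson probabilities
    return poisson.logpmf(n_obs, n_exp(mu)).sum()

print(log_likelihood(1.0))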

"},{"location":"what_combine_does/model_and_likelihood/#primary-likelihoods-for-unbinned-data","title":"Primary Likelihoods for unbinned data","text":"

For unbinned likelihood models, a likelihood can be given to each data point. It is proportional to the probability density function at that point, \\(\\vec{x}\\). For the full set of observed data points, information about the total number of data points is also included:

\\[ \\mathcal{L}_\\mathrm{data} = \\mathrm{Poiss}(n_{\\mathrm{obs}} ; n_{\\mathrm{exp}}(\\vec{\\Phi})) \\prod_{i}^{N_{\\mathrm{obs}}} \\mathrm{pdf}(\\vec{x}_i ; \\vec{\\Phi} ) \\]

Where \\(n_{\\mathrm{obs}}\\) and \\(n_{\\mathrm{exp}}\\) are the total number of observed and expected events, respectively. This is sometimes referred to as an 'extended' likelihood, as the probability density has been 'extended' to include information about the total number of observations.

"},{"location":"what_combine_does/model_and_likelihood/#auxiliary-likelihoods","title":"Auxiliary Likelihoods","text":"

The auxiliary likelihood terms encode the probability of model nuisance parameters taking on a certain value, without regard to the primary data. In frequentist frameworks, this usually represents the result of a previous measurement (such as of the jet energy scale). We will write in a mostly frequentist framework, though combine can be used for either frequentist or bayesian analyses[^1].

[^1]: see: the first paragraphs of the PDGs statistics review for more information on these two frameworks

In this framework, each auxiliary term represents the likelihood of some parameter, \\(\\nu\\), given some previous observation \\(y\\); the quantity \\(y\\) is sometimes referred to as a \"global observable\".

\\[ \\mathcal{L}_{\\mathrm{auxiliary}}( \\nu ) = p( y ; \\nu ) \\]

In principle the form of the likelihood can be any function where the corresponding \\(p\\) is a valid probability distribution. In practice, most of the auxiliary terms are gaussian, and the definition of \\(\\nu\\) is chosen such that the central observation is \\(y = 0\\) and the width of the gaussian is one.

Note that on its own, the form of the auxiliary term is not meaningful; what is meaningful is the relationship between the auxiliary term and how the model expectation is altered by the parameter. Any coordinate transformation of the parameter values can be absorbed into the definition of the parameter. A reparameterization would change the mathematical form of the auxiliary term, but would also simultaneously change how the model depends on the parameter in such a way that the total likelihood is unchanged. e.g. if you define \\(\\nu = \\sigma(tt)\\) or \\(\\nu = \\sigma(tt) - \\sigma_0\\) you will change the form of the constraint term, but you will not change the overall likelihood.

"},{"location":"what_combine_does/model_and_likelihood/#likelihoods-implemented-in-combine","title":"Likelihoods implemented in Combine","text":"

Combine builds on the generic forms of the likelihood for counting experiments given above to provide specific functional forms which are commonly most useful in high energy physics, such as separating contributions between different processes.

"},{"location":"what_combine_does/model_and_likelihood/#binned-likelihoods-using-templates","title":"Binned Likelihoods using Templates","text":"

Binned likelihood models can be defined by the user by providing simple inputs such as a set of histograms and systematic uncertainties. These likelihood models are referred to as template-based because they rely heavily on histograms as templates for building the full likelihood function.

Here, we describe the details of the mathematical form of these likelihoods. As already mentioned, the likelihood can be written as a product of two parts:

\\[ \\mathcal{L} = \\mathcal{L}_\\mathrm{primary} \\cdot \\mathcal{L}_\\mathrm{auxiliary} = \\prod_{c=1}^{N_c} \\prod_{b=1}^{N_b^c} \\mathrm{Poiss}(n_{cb}; n^\\mathrm{exp}_{cb}(\\vec{\\mu},\\vec{\\nu})) \\cdot \\prod_{e=1}^{N_E} p_e(y_e ; \\nu_e) \\]

Where \\(c\\) indexes the channel, \\(b\\) indexes the histogram bin, and \\(e\\) indexes the nuisance parameter.

"},{"location":"what_combine_does/model_and_likelihood/#model-of-expected-event-counts-per-bin","title":"Model of expected event counts per bin","text":"

The generic model of the expected event count in a given bin, \\(n^\\mathrm{exp}_{cb}\\), implemented in combine for template based analyses is given by:

\\[n^\\mathrm{exp}_{cb} = \\mathrm{max}(0, \\sum_{p} M_{cp}(\\vec{\\mu})N_{cp}(\\nu_G, \\vec{\\nu}_L,\\vec{\\nu}_S,\\vec{\\nu}_{\\rho})\\omega_{cbp}(\\vec{\\nu}_S) + E_{cb}(\\vec{\\nu}_B) ) \\]

where here:

  • \\(p\\) indexes the processes contributing to the channel;
  • \\(\\nu_{G}, \\vec{\\nu}_L, \\vec{\\nu}_S, \\vec{\\nu}_{\\rho}\\) and \\(\\vec{\\nu}_B\\) are different types of nuisance parameters which modify the processes with different functional forms;
    • \\(\\nu_{G}\\) is a gamma nuisance,
    • \\(\\vec{\\nu}_{L}\\) are log-normal nuisances,
    • \\(\\vec{\\nu}_{S}\\) are \"shape\" nuisances,
    • \\(\\vec{\\nu}_{\\rho}\\) are user defined rate parameters, and
    • \\(\\vec{\\nu}_{B}\\) are nuisance parameters related to the statistical uncertainties in the simulation used to build the model.
  • \\(M\\) defines the effect of the parameters of interest on the signal process;
  • \\(N\\) defines the overall normalization effect of the nuisance parameters;
  • \\(\\omega\\) defines the shape effects (i.e. bin-dependent effects) of the nuisance parameters; and
  • \\(E\\) defines the impact of statistical uncertainties from the samples used to derive the histogram templates used to build the model.
"},{"location":"what_combine_does/model_and_likelihood/#parameter-of-interest-model","title":"Parameter of Interest Model","text":"

The function \\(M\\) can take on custom functional forms, as defined by the user, but in the most common case, the parameter of interest \\(\\mu\\) simply scales the contributions from signal processes:

\\[\\label{eq:sig_param} M_{cp}(\\mu) = \\begin{cases} \\mu &\\mathrm{if\\ } p \\in \\mathrm{signal} \\\\ 1 &\\mathrm{otherwise} \\end{cases} \\]

However, combine supports many more models beyond this. As well as built-in support for models with multiple parameters of interest, combine comes with many pre-defined models which go beyond simple process normalization, targeted at various types of searches and measurements.

"},{"location":"what_combine_does/model_and_likelihood/#normalization-effects","title":"Normalization Effects","text":"

The overall normalization \\(N\\) is affected differently by the different types of nuisance parameters, and takes the general form

\\[N = \\prod_X \\prod_i f_X(\\vec{\\nu}_{X}^{i})\\mathrm{,}\\]

With \\(X\\) identifying a given nuisance parameter type; i.e. \\(N\\) multiplies together the morphings from each of the individual nuisance parameters from each of the nuisance types.

Normalization Parameterization Details

The full functional form of the normalization term is given by:

\\[ N_{cp} = N_{\\mathrm{0}}(\\nu_{G})\\prod_{n} {\\kappa_{n}}^{\\nu_{L,n}}\\prod_{a} {\\kappa^{\\mathrm{A}}_{a}(\\nu_{L(S)}^{a},\\kappa^{+}_{a}, \\kappa^{-}_{a})}^{\\nu_{L(S)}^{a}} \\prod_{r}F_{r}(\\nu_\\rho) \\]

where:

  • \\(N_{\\mathrm{0}}(\\nu_{G}) \\equiv \\frac{\\nu_{G}}{y_{G}}\\) is the normalization effect of a gamma uncertainty. \\(y_{G}\\) is taken as the observed number of events in some external control region and \\(\\nu_{G}\\) has a constraint pdf \\(\\mathrm{Poiss}(\\nu; y)\\);
  • \\(\\kappa_{n}^{\\nu_{L,n}}\\), are log-normal uncertainties specified by a fixed value \\(\\kappa\\);
  • \\(\\kappa^{\\mathrm{A}}_{a}(\\nu_{L(S)}^{a},\\kappa^{+}_{a}, \\kappa^{-}_{a})^{\\nu_{L(S)}^{a}}\\) are asymmetric log-normal uncertainties, in which the value of \\(\\kappa^{\\mathrm{A}}\\) depends on the nuisance parameter and two fixed values \\(\\kappa^{+}_{a}\\) and \\(\\kappa^{-}_{a}\\). The functions, \\(\\kappa^A\\), define a smooth interpolation for the asymmetric uncertainty; and
  • \\(F_{r}(\\vec{\\nu}_\\rho)\\) are user-defined functions of the user defined nuisance parameters which may have uniform or gaussian constraint terms.

The function for the asymmetric normalization modifier, \\(\\kappa^A\\) is

\\[ \\kappa^{\\mathrm{A}}(\\nu,\\kappa^{+}, \\kappa^{-}) = \\begin{cases} \\kappa^{+}, &\\mathrm{for\\,} \\nu \\geq 0.5 \\\\ \\frac{1}{\\kappa^{-}}, &\\mathrm{for\\,} \\nu \\leq -0.5 \\\\ \\exp\\left(\\frac{1}{2} \\left( (\\ln{\\kappa^{+}}-\\ln{\\kappa^{-}}) + \\frac{1}{4}(\\ln{\\kappa^{+}}+\\ln{\\kappa^{-}})I(\\nu)\\right)\\right), &\\mathrm{otherwise}\\end{cases} \\]

where \\(I(\\nu) = 48\\nu^5 - 40\\nu^3 + 15\\nu\\), which ensures \\(\\kappa^{\\mathrm{A}}\\) and its first and second derivatives are continuous for all values of \\(\\nu\\).

and the \\(\\kappa^{+}\\) and \\(\\kappa^{-}\\) are the relative normalizations of the two systematic variations; i.e.:

\\[ \\kappa^{\\pm}_{s} = \\frac{\\sum_{b}\\omega_{b}^{s,\\pm}}{\\sum_{b}\\omega_{b}^{0}}. \\]

where \\(\\omega_{b}^{s,\\pm}\\) is the bin yield as defined by the two shifted values \\(\\nu_{S} = \\nu_{S}^{\\pm}\\), and \\(\\omega_{b}^{0}\\) is the nominal bin yield, i.e. the yield at the nominal value of the shape parameter.
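
For illustration only, the asymmetric interpolation \\(\\kappa^{\\mathrm{A}}\\) defined above can be transcribed directly into code (combine evaluates this internally; the sketch just reproduces the formula for invented \\(\\kappa^{+}\\) and \\(\\kappa^{-}\\) values):

import numpy as np

def kappa_asym(nu, kappa_plus, kappa_minus):
    # direct transcription of the smooth asymmetric interpolation given above
    if nu >= 0.5:
        return kappa_plus
    if nu <= -0.5:
        return 1.0 / kappa_minus
    lkp, lkm = np.log(kappa_plus), np.log(kappa_minus)
    I = 48 * nu**5 - 40 * nu**3 + 15 * nu
    return np.exp(0.5 * ((lkp - lkm) + 0.25 * (lkp + lkm) * I))

# normalization effect of a nuisance at nu, for invented kappa+ = 1.2, kappa- = 0.9
for nu in (-1.0, -0.25, 0.0, 0.25, 1.0):
    print(nu, kappa_asym(nu, 1.2, 0.9) ** nu)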

"},{"location":"what_combine_does/model_and_likelihood/#shape-morphing-effects","title":"Shape Morphing Effects","text":"

The number of events in a given bin \\(b\\), \\(\\omega_{cbp}\\), is a function of the shape parameters \\(\\vec{\\nu}_{S}\\). The shape interpolation works with the fractional yields in each bin, where the interpolation can be performed either directly on the fractional yield, or on the logarithm of the fractional yield, which is then exponentiated again.

Shape parameterization Details

In the following, the channel and process labels \\(c\\) and \\(p\\) apply to every term, and so are omitted.

The fixed nominal number of events is denoted \\(\\omega_{b}^{0}\\). For each applicable shape uncertainty \\(s\\), two additional predictions are specified, \\(\\omega_{b}^{s,+}\\) and \\(\\omega_{b}^{s,-}\\), typically corresponding to the \\(+1\\sigma\\) and \\(-1\\sigma\\) variations, respectively. These may change both the shape and normalization of the process. The two effects are separated; the shape transformation is constructed in terms of the fractional event counts in the templates via a smooth vertical interpolation, and the normalization is treated as an asymmetric log-normal uncertainty, as described above in the description of the \\(N\\) term in the likelihood.

For a given process, the shape may be interpolated either directly in terms of the fractional bin yields, \\(f_b = \\omega_b / \\sum \\omega_{b}\\) or their logarithms, \\(\\ln(f_b)\\). The transformed yield is then given as, respectively,

\\[ \\omega_{b}(\\vec{\\nu}) = \\begin{cases} \\max\\left(0, y^{0}\\left(f^{0}_{b} + \\sum_{s} F(\\nu_{s}, \\delta^{s,+}_{b}, \\delta^{s,-}_{b}, \\epsilon_{s})\\right)\\right) & \\text{(direct),}\\\\ \\max\\left(0, y^{0}\\exp\\left(\\ln(f^{0}_{b}) + \\sum_{s} F(\\nu_{s}, \\Delta^{s,+}_{b}, \\Delta^{s,-}_{b}, \\epsilon_{s})\\right) \\right) & \\text{(logarithmic)}, \\end{cases} \\]

where \\(y^{0} = \\sum \\omega_{b}^{0}\\), \\(\\delta^{\\pm}_{b} = f^{\\pm}_{b} - f^{0}_{b}\\), and \\(\\Delta^{\\pm}_{b} = \\ln\\left(\\frac{f^{\\pm}_{b}}{f^{0}_{b}}\\right)\\).

The smooth interpolating function \\(F\\), defined below, depends on a set of coefficients, \\(\\epsilon_{s}\\). These are assumed to be unity by default, but may be set to different values, for example if the \\(\\omega_{b}^{s,\\pm}\\) correspond to the \\(\\pm X\\sigma\\) variations, then \\(\\epsilon_{s} = 1/X\\) is typically set. The minimum value of \\(\\epsilon\\) over the shape uncertainties for a given process is \\(q = \\min({{\\epsilon_{s}}})\\). The function \\({F}\\) is then defined as

\\[ F(\\nu, \\delta^{+}, \\delta^{-}, \\epsilon) = \\begin{cases} \\frac{1}{2}\\nu^{'} \\left( (\\delta^{+}-\\delta^{-}) + \\frac{1}{8}(\\delta^{+}+\\delta^{-})(3\\bar{\\nu}^5 - 10\\bar{\\nu}^3 + 15\\bar{\\nu}) \\right), & \\text{for } -q < \\nu' < q; \\\\ \\nu^{'}\\delta^{+}, & \\text{for } \\nu' \\ge q;\\\\ -\\nu^{'}\\delta^{-}, & \\text{for } \\nu' \\le -q;\\\\ \\end{cases} \\]

where \\(\\nu^{'} = \\nu\\epsilon\\), \\(\\bar{\\nu} = \\nu^{'} / q\\), and the label \\(s\\) has been omitted. This function ensures the yield and its first and second derivatives are continuous for all values of \\(\\nu\\).
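
Again for illustration only, the smooth interpolating function \\(F\\) can be transcribed directly from the definition above; the template shifts used below are invented.

import numpy as np

def F(nu, delta_plus, delta_minus, eps, q):
    # direct transcription of the smooth interpolating function defined above
    nu_p = nu * eps
    if nu_p >= q:
        return nu_p * delta_plus
    if nu_p <= -q:
        return -nu_p * delta_minus
    nu_bar = nu_p / q
    poly = 3 * nu_bar**5 - 10 * nu_bar**3 + 15 * nu_bar
    return 0.5 * nu_p * ((delta_plus - delta_minus) + 0.125 * (delta_plus + delta_minus) * poly)

# fractional-yield shift in one bin for invented +/-1 sigma template variations
delta_plus, delta_minus = 0.02, -0.015   # invented f+ - f0 and f- - f0
for nu in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(nu, F(nu, delta_plus, delta_minus, 1.0, 1.0))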

"},{"location":"what_combine_does/model_and_likelihood/#statistical-uncertainties-in-the-simulation-used-to-build-the-model","title":"Statistical Uncertainties in the Simulation used to build the Model","text":"

Since the histograms used in a binned shape analysis are typically created from simulated samples, the yields in each bin are also subject to statistical uncertainties on the bin yields. These are taken into account by either assigning one nuisance parameter per bin, or as many parameters as contributing processes per bin.

Model Statistical Uncertainty Details

If the uncertainty in each bin is modelled as a single nuisance parameter it takes the form:

\\[ E_{cb}(\\vec{\\mu},\\vec{\\nu},\\nu) = \\nu\\left(\\sum_{p} (e_{cpb}N_{cp}M_{cp}(\\vec{\\mu},\\vec{\\nu}))^{2}\\right)^{\\frac{1}{2}}. \\]

where \\(e_{cpb}\\) is the uncertainty in the bin content for the histogram defining process \\(p\\) in the channel \\(c\\).

Alternatively, one parameter is assigned per process, which may be modelled with either a Poisson or Gaussian constraint pdf:

\\[ E_{cb}(\\vec{\\mu},\\vec{\\nu},\\vec{\\nu}_{\\alpha},\\vec{\\nu}_{\\beta}) = \\sum_{\\alpha}^{\\text{Poisson}} \\left(\\frac{\\nu_{\\alpha}}{\\omega_{\\alpha}} - 1\\right)\\omega_{c\\alpha b}N_{c\\alpha}(\\vec{\\nu})M_{c\\alpha}(\\vec{\\mu},\\vec{\\nu}) + \\sum_{\\beta}^{\\text{Gaussian}} \\nu_{\\beta}e_{c\\beta b}N_{c\\beta}(\\vec{\\nu})M_{c\\beta}(\\vec{\\mu},\\vec{\\nu}), \\]

where the indices \\(\\alpha\\) and \\(\\beta\\) run over the Poisson- and Gaussian-constrained processes, respectively. The parameters \\(\\omega_{\\alpha}\\) represent the nominal unweighted numbers of events and are treated as the external measurements; \\(N_{cp}\\) and \\(\\omega_{c\\alpha b}\\) are defined as above.

"},{"location":"what_combine_does/model_and_likelihood/#customizing-the-form-of-the-expected-event-counts","title":"Customizing the form of the expected event counts","text":"

Although the above likelihood defines some specific functional forms, users are also able to implement custom functional forms for \\(M\\), \\(N\\), and \\(\\omega_{cbp}\\). In practice, this makes the functional form much more general than the default forms used above.

However, some constraints do exist, such as the requirement that bin contents be positive, and that the function \\(M\\) only depends on \\(\\vec{\\mu}\\), whereas \\(N\\), and \\(\\omega_{cbp}\\) only depend on \\(\\vec{\\nu}\\).

"},{"location":"what_combine_does/model_and_likelihood/#auxiliary-likelihood-terms","title":"Auxiliary Likelihood terms","text":"

The auxiliary constraint terms implemented in combine are Gaussian, Poisson or Uniform:

\\[ p_{e} \\propto \\exp{\\left(-0.5 \\left(\\frac{(\\nu_{e} - y_{e})}{\\sigma}\\right)^2 \\right)}\\mathrm{;~} \\\\ p_{e} = \\mathrm{Poiss}( \\nu_{e}; y_{e} ) \\mathrm{;\\ or~} \\\\ p_{e} \\propto \\mathrm{constant\\ (on\\ some\\ interval\\ [a,b])}. \\]

Which form they have depends on the type of nuisance parameter:

  • The shape (\\(\\vec{\\nu}_{S}\\)) and log-normal (\\(\\vec{\\nu}_{L}\\)) nuisance parameters always use gaussian constraint terms;
  • The gamma (\\(\\vec{\\nu}_{G}\\)) nuisance parameters always use Poisson constraints;
  • The rate parameters (\\(\\vec{\\nu}_{\\rho}\\)) may have either Gaussian or Uniform constraints; and
  • The model statistical uncertainties (\\(\\vec{\\nu}_{B}\\)) may use Gaussian or Poisson constraints.

While combine does not provide functionality for user-defined auxiliary pdfs, the effect of nuisance parameters is highly customizable through the form of the dependence of \\(n^\\mathrm{exp}_{cb}\\) on the parameter.

"},{"location":"what_combine_does/model_and_likelihood/#overview-of-the-template-based-likelihood-model-in-combine","title":"Overview of the template-based likelihood model in Combine","text":"

An overview of the binned likelihood model built by combine is given below. Note that \\(M_{cp}\\) can be chosen by the user from a set of predefined models, or defined by the user themselves.

"},{"location":"what_combine_does/model_and_likelihood/#parametric-likelihoods-in-combine","title":"Parametric Likelihoods in Combine","text":"

As with the template likelihood, the parametric likelihood implemented in combine supports multiple processes and multiple channels. Unlike the template likelihoods, the parametric likelihoods are defined using custom probability density functions, which are functions of continuous observables rather than discrete, binned counts. Because the pdfs are functions of a continuous variable, the likelihood can be evaluated over unbinned data. They can also still be used for analyses of binned data.

The unbinned model implemented in combine is given by:

\\[ \\mathcal{L} = \\mathcal{L}_\\mathrm{primary} \\cdot \\mathcal{L}_\\mathrm{auxiliary} = \\\\ \\left(\\prod_c \\mathrm{Poiss}(n_{c,\\mathrm{tot}}^{\\mathrm{obs}} ; n_{c,\\mathrm{tot}}^{\\mathrm{exp}}(\\vec{\\mu},\\vec{\\nu})) \\prod_{i}^{n_c^{\\mathrm{obs}}} \\sum_p f_{cp}^{\\mathrm{exp}} \\mathrm{pdf}_{cp}(\\vec{x}_i ; \\vec{\\mu}, \\vec{\\nu} ) \\right) \\cdot \\prod_e p_e( y_e ; \\nu_e) \\]

where \\(c\\) indexes the channel, \\(p\\) indexes the process, and \\(e\\) indexes the nuisance parameter.

  • \\(n_{c,\\mathrm{tot}}\\) is the total number of expected events in channel \\(c\\);
  • \\(\\mathrm{pdf}_{cp}\\) are user defined probability density functions, which may take on the form of any valid probability density; and
  • \\(f_{cp}^{\\mathrm{exp}}\\) is the fraction of the total events in channel \\(c\\) from process \\(p\\), \\(f_{cp} = \\frac{n_{cp}}{\\sum_p n_{cp}}\\).

For parametric likelihoods on binned data, the data likelihood is first converted into the binned data likelihood format before evaluation, i.e.

\\[ \\mathcal{L} = \\prod_c \\prod_b \\mathrm{Poiss}(n_{cb}^{\\mathrm{obs}}; n_{cb}^{\\mathrm{exp}}) \\prod_e p_e( y_e ; \\nu_e) \\]

where \\(n^\\mathrm{exp}\\) is calculated from the input pdf and normalization, based on the model parameters.

"},{"location":"what_combine_does/model_and_likelihood/#model-of-expected-event-counts","title":"Model of expected event counts","text":"

The total number of expected events is modelled as:

\\[n_{c,\\mathrm{tot}}^\\mathrm{exp} = \\mathrm{max}(0, \\sum_{p} n^{cp}_0 M_{cp}(\\vec{\\mu})N_{cp}(\\nu_{G},\\vec{\\nu}_L,\\vec{\\nu}_{\\rho})) \\]

where \\(n^{cp}_0\\) is a default normalization for the process, and, as for the binned likelihoods, \\(\\nu_G, \\vec{\\nu}_L\\), and \\(\\vec{\\nu}_{\\rho}\\) are different types of nuisance parameters which modify the process normalizations with different functional forms.

Details of Process Normalization

As in the template-based case, the different types of nuisance parameters affecting the process normalizations are:

  • \\(\\nu_{G}\\) is a gamma nuisance, with linear normalization effects and a poisson constraint term.
  • \\(\\vec{\\nu}_{L}\\) are log-normal nuisances, with log-normal normalization effects and gaussian constraint terms.
  • \\(\\vec{\\nu}_{\\rho}\\) are user defined rate parameters, with user-defined normalization effects and gaussian or uniform constraint terms.
  • \\(N\\) defines the overall normalization effect of the nuisance parameters;

and \\(N\\) is defined as in the template-based case, except that there are no \\(\\vec{\\nu}_S\\) uncertainties.

\\[ N_{cp} = N_{\\mathrm{0}}(\\nu_{G})\\prod_{n} {\\kappa_{n}}^{\\nu_{L,n}}\\prod_{a} {\\kappa^{\\mathrm{A}}_{a}(\\nu_{L}^{a},\\kappa^{+}_{a}, \\kappa^{-}_{a})}^{\\nu_{L}^{a}} \\prod_{r}F_{r}(\\nu_\\rho) \\]

The function \\(F_{r}\\) is any user-defined mathematical expression. The functions \\(\\kappa(\\nu,\\kappa^+,\\kappa^-)\\) are defined to create smooth asymmetric log-normal uncertainties. The details of the interpolations which are used are found in the section on normalization effects in the binned model.

"},{"location":"what_combine_does/model_and_likelihood/#parameter-of-interest-model_1","title":"Parameter of Interest Model","text":"

As in the template-based case, the parameter of interest model, \\(M_{cp}(\\vec{\\mu})\\), can take on different forms defined by the user. The default model is one where \\(\\vec{\\mu}\\) simply scales the signal processes' normalizations.

"},{"location":"what_combine_does/model_and_likelihood/#shape-morphing-effects_1","title":"Shape Morphing Effects","text":"

The user may define any number of nuisance parameters which morph the shape of the pdf according to functional forms defined by the user. These nuisance parameters are included as \\(\\vec{\\nu}_\\rho\\) uncertainties, which may have gaussian or uniform constraints, and include user-defined process normalization effects.

"},{"location":"what_combine_does/model_and_likelihood/#combining-template-based-and-parametric-likelihoods","title":"Combining template-based and parametric Likelihoods","text":"

While we presented the likelihoods for the template and parametric models separately, they can also be combined into a single likelihood by treating them each as separate channels. When combining the models, the data likelihoods of the binned and unbinned channels are multiplied.

\\[ \\mathcal{L}_{\\mathrm{combined}} = \\mathcal{L}_{\\mathrm{primary}} \\cdot \\mathcal{L}_\\mathrm{auxiliary} = \\left(\\prod_{c_\\mathrm{template}} \\mathcal{L}_{\\mathrm{primary}}^{c_\\mathrm{template}}\\right) \\left(\\prod_{c_\\mathrm{parametric}} \\mathcal{L}_{\\mathrm{primary}}^{c_\\mathrm{parametric}}\\right)\\cdot \\mathcal{L}_{\\mathrm{auxiliary}} \\]"},{"location":"what_combine_does/model_and_likelihood/#references-and-external-literature","title":"References and External Literature","text":"
  • See the Particle Data Group's Review of Statistics for various fundamental concepts used here.
  • The Particle Data Group's Review of Probability also has definitions of commonly used distributions, some of which are used here.
"},{"location":"what_combine_does/statistical_tests/","title":"Statistical Tests","text":"

Combine is a likelihood based statistical tool. That means that it uses the likelihood function to define statistical tests.

Combine provides a number of customization options for each test; as always, it is up to the user to choose an appropriate test and options.

"},{"location":"what_combine_does/statistical_tests/#general-framework","title":"General Framework","text":""},{"location":"what_combine_does/statistical_tests/#statistical-tests_1","title":"Statistical tests","text":"

Combine implements a number of different customizable statistical tests. These tests can be used for purposes such as determining the significance of some new physics model over the standard model, setting limits, estimating parameters, and checking goodness of fit.

These tests are all performed on a given model (null hypothesis), and often require additional specification of an alternative model. The statistical test then typically requires defining some \"test statistic\", \\(t\\), which is simply any real-valued function of the observed data:

\\[ t(\\mathrm{data}) \\in \\mathbb{R} \\]

For example, in a simple coin-flipping experiment, the number of heads could be used as the test statistic.

The distribution of the test statistic should be estimated under the null hypothesis (and the alternative hypothesis, if applicable). Then the value of the test statistic on the actual observed data, \\(t^{\\mathrm{obs}}\\) is compared with its expected value under the relevant hypotheses.

This comparison, which depends on the test in question, defines the results of the test, which may be simple binary results (e.g. this model point is rejected at a given confidence level), or continuous (e.g. defining the degree to which the data are considered surprising, given the model). Often, as either a final result or as an intermediate step, the p-value of the observed test statistic under a given hypothesis is calculated.

How p-values are calculated

The distribution of the test statistic, \\(t\\) under some model hypothesis \\(\\mathcal{M}\\) is:

\\[t \\stackrel{\\mathcal{M}}{\\sim} D_{\\mathcal{M}}\\]

And the observed value of the test statistic is \\(t_{\\mathrm{obs}}\\). The p-value of the observed result gives the probability of having observed a test statistic at least as extreme as the actual observation. For example, this may be:

\\[p = \\int_{t_{\\mathrm{min}}}^{t_\\mathrm{obs}} D_{\\mathcal{M}} \\mathrm{d}t\\]

In some cases, the bounds of the integral may be modified, such as \\(( t_{\\mathrm{obs}}, t_{\\mathrm{max}} )\\) or \\((-t_{\\mathrm{obs}}, t_{\\mathrm{obs}} )\\), depending on the details of the test being performed. And specifically, for the distribution in question, whether an observed value in the right tail, left tail, or either tail of the distribution is considered as unexpected.

The p-values using the left-tail and right tail are related to each other via \\(p_{\\mathrm{left}} = 1 - p_{\\mathrm{right}}\\).
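
When the distribution of the test statistic is estimated from pseudo-data rather than known analytically, the p-value is simply the fraction of toys that are at least as extreme as the observation. A minimal sketch with a right-tail definition and invented toy values:

import numpy as np

rng = np.random.default_rng(42)
toys = rng.chisquare(df=1, size=10000)   # invented stand-in for toy test-statistic values
t_obs = 3.2                              # invented observed test statistic

# right-tail p-value: fraction of toys with t >= t_obs
p_right = np.mean(toys >= t_obs)
p_left = 1.0 - p_right
print('p (right tail) =', p_right, ' p (left tail) =', p_left)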

"},{"location":"what_combine_does/statistical_tests/#test-statistics","title":"Test Statistics","text":"

The test statistic can be any real-valued function of the data. While in principle many valid test statistics can be used, the choice of test statistic is very important as it influences the power of the statistical test.

By associating a single real value with every observation, the test statistic allows us to recast the question \"how likely was this observation?\" in the form of a quantitative question about the value of the test statistic. Ideally a good test statistic should return different values for likely outcomes as compared to unlikely outcomes, and the expected distributions under the null and alternative hypotheses should be well separated.

In many situations, extremely useful test statistics, sometimes optimal ones for particular tasks, can be constructed from the likelihood function itself:

\\[ t(\\mathrm{data}) = f(\\mathcal{L}) \\]

Even for a given statistical test, several likelihood-based test-statistics may be suitable, and for some tests combine implements multiple test-statistics from which the user can choose.

"},{"location":"what_combine_does/statistical_tests/#tests-with-likelihood-ratio-test-statistics","title":"Tests with Likelihood Ratio Test Statistics","text":"

The likelihood function itself often forms a good basis for building test statistics.

Typically the absolute value of the likelihood itself is not very meaningful as it depends on many fixed aspects we are usually not interested in on their own, like the size of the parameter space and the number of observations. However, quantities such as the ratio of the likelihood at two different points in parameter space are very informative about the relative merits of those two models.

"},{"location":"what_combine_does/statistical_tests/#the-likelihood-ratio-and-likelihood-ratio-based-test-statistics","title":"The likelihood ratio and likelihood ratio based test statistics","text":"

A very useful test statistic is the likelihood ratio of two models:

\\[ \\Lambda \\equiv \\frac{\\mathcal{L}_{\\mathcal{M}}}{\\mathcal{L}_{\\mathcal{M}'}} \\]

For technical and convenience reasons, often the negative logarithm of the likelihood ratio is used:

\\[t \\propto -\\log(\\Lambda) = \\log(\\mathcal{L}_{\\mathcal{M}'}) - \\log(\\mathcal{L}_{\\mathcal{M}})\\]

With different proportionality constants being most convenient in different circumstances. The negative sign is used by convention since usually the ratios are constructed so that the larger likelihood value must be in the denominator. This way, \\(t\\) is positive, and larger values of \\(t\\) represent larger differences between the likelihoods of the two models.

"},{"location":"what_combine_does/statistical_tests/#sets-of-test-statistics","title":"Sets of test statistics","text":"

If the parameters of both likelihoods in the ratio are fixed to a single value, then that defines a single test statistic. Often, however, we are interested in testing \"sets\" of models, parameterized by some set of values \\((\\vec{\\mu}, \\vec{\\nu})\\).

This is important in limit setting for example, where we perform statistical tests to exclude entire ranges of the parameter space.

In these cases, the likelihood ratio (or a function of it) can be used to define a set of test statistics parameterized by the model parameters. For example, a very useful set of test statistics is:

\\[ t_{\\vec{\\mu}} \\propto -\\log\\left(\\frac{\\mathcal{L}(\\vec{\\mu})}{\\mathcal{L}(\\vec{\\hat{\\mu}})}\\right) \\]

Where the likelihood parameters in the denominator are fixed to their maximum likelihood values, while the parameter \\(\\vec{\\mu}\\) indexing the test statistic appears in the numerator of the likelihood ratio.

When calculating the p-values for these statistical tests, the p-values are calculated at each point in parameter space using the test statistic for that point. In other words, the observed and expected distributions of the test statistics are computed separately at each parameter point \\(\\vec{\\mu}\\) being considered.

"},{"location":"what_combine_does/statistical_tests/#expected-distributions-of-likelihood-ratio-test-statistics","title":"Expected distributions of likelihood ratio test statistics","text":"

Under appropriate conditions, the distribution of \\(t_\\vec{\\mu}\\) can be approximated analytically, via Wilks' Theorem or other extensions of that work. Then, the p-value of the observed test statistic can be calculated from the known form of the expected distribution. This is also true for a number of the other test statistics derived from the likelihood ratio, where asymptotic approximations have been derived.

Combine provides asymptotic methods for limit setting, significance tests, and computing confidence intervals, which make use of these approximations for fast calculations.

In the general case, however, the distribution of the test statistic is not known, and it must be estimated. Typically it is estimated by generating many sets of pseudo-data from the model and using the empirical distribution of the test statistic.

Combine also provides methods for limit setting, significance tests, and computing confidence intervals which use pseudodata generation to estimate the expected test-statistic distributions, and therefore don't depend on the asymptotic approximation. Methods are also provided for generating pseudodata without running a particular test, which can be saved and used for estimating expected distributions.

"},{"location":"what_combine_does/statistical_tests/#parameter-estimation-using-the-likelihood-ratio","title":"Parameter Estimation using the likelihood ratio","text":"

A common use case for likelihood ratios is estimating the values of some parameters, such as the parameters of interest, \\(\\vec{\\mu}\\). The point estimate for the parameters is simply the maximum likelihood estimate, but the likelihood ratio can be used for estimating the uncertainty as a confidence region.

A confidence region for the parameters \\(\\vec{\\mu}\\) can be defined by using an appropriate test statistic. Typically, we use the profile likelihood ratio:

\\[ t_{\\vec{\\mu}} \\propto -\\log\\left(\\frac{\\mathcal{L}(\\vec{\\mu},\\vec{\\hat{\\nu}}(\\vec{\\mu}))}{\\mathcal{L}(\\vec{\\hat{\\mu}},\\vec{\\hat{\\nu}})}\\right) \\]

Where the likelihood in the top is the value of the likelihood at a point \\(\\vec{\\mu}\\) profiled over \\(\\vec{\\nu}\\); and the likelihood on the bottom is at the best fit point.

Then the confidence region can be defined as the region where the p-value of the observed test-statistic is less than the confidence level:

\\[ \\{ \\vec{\\mu}_{\\mathrm{CL}} \\} = \\{ \\vec{\\mu} : p_{\\vec{\\mu}} \\lt \\mathrm{CL} \\}.\\]

This construction will satisfy the frequentist coverage property that the confidence region contains the parameter values used to generate the data in \\(\\mathrm{CL}\\) fraction of cases.

In many cases, Wilks' theorem can be used to calculate the p-value and the criteria on \\(p_{\\vec{\\mu}}\\) can be converted directly into a criterion on \\(t_{\\vec{\\mu}}\\) itself, \\(t_{\\vec{\\mu}} \\lt \\gamma_{\\mathrm{CL}}\\). Where \\(\\gamma_{\\mathrm{CL}}\\) is a known function of the confidence level which depends on the parameter space being considered.
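
As a quick reference, when Wilks' theorem applies and the normalization \\(t_{\\vec{\\mu}} = -\\log(\\Lambda)\\) is used (so that \\(2t_{\\vec{\\mu}}\\) follows a \\(\\chi^{2}\\) distribution), the threshold \\(\\gamma_{\\mathrm{CL}}\\) is half of the corresponding \\(\\chi^{2}\\) quantile, with as many degrees of freedom as there are parameters of interest. A short sketch using scipy:

from scipy.stats import chi2

def gamma_cl(cl, n_poi):
    # threshold on t = -log(Lambda): half of the chi2 quantile, assuming Wilks' theorem
    return 0.5 * chi2.ppf(cl, df=n_poi)

for cl in (0.6827, 0.95):
    print('CL = %.4f : gamma = %.3f (1 POI), %.3f (2 POIs)' % (cl, gamma_cl(cl, 1), gamma_cl(cl, 2)))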

"},{"location":"what_combine_does/statistical_tests/#discoveries-using-the-likelihood-ratio","title":"Discoveries using the likelihood ratio","text":"

A common method for claiming discovery is based on a likelihood ratio test by showing that the new physics model has a \"significantly\" larger likelihood than the standard model.

This could be done by using the standard profile likelihood ratio test statistic:

\\[ t_{\\mathrm{NP}} = -2\\log\\left(\\frac{\\mathcal{L}(\\mu_{\\mathrm{NP}} = 0, \\vec{\\hat{\\nu}}(\\mu_{\\mathrm{NP}} = 0))}{\\mathcal{L}(\\hat{\\mu}_{\\mathrm{NP}},\\vec{\\hat{\\nu}})}\\right) \\]

Where \\(\\mu_{\\mathrm{NP}}\\) represents the strength of some new physics quantity, such as the cross section for creation of a new particle. However, this would also allow for claiming \"discovery\" in cases where the best fit value is negative, i.e. \\(\\hat{\\mu} \\lt 0\\), which in particle physics is often an unphysical model, such as a negative cross section. In order to avoid such a situation, we typically use a modified test statistic:

\\[ q_{0} = \\begin{cases} 0 & \\hat{\\mu} \\lt 0 \\\\ -2\\log\\left(\\frac{\\mathcal{L}(\\mathrm{\\mu}_{\\mathrm{NP}} = 0)}{\\mathcal{L}(\\hat{\\mu}_{\\mathrm{NP}})}\\right) & \\hat{\\mu} \\geq 0 \\end{cases} \\]

which excludes the possibility of claiming discovery when the best fit value of \\(\\mu\\) is negative.

As with the likelihood ratio test statistic, \\(t\\), defined above, under suitable conditions, analytic expressions for the distribution of \\(q_0\\) are known.

Once the value \\(q_{0}(\\mathrm{data})\\) is calculated, it can be compared to the expected distribution of \\(q_{0}\\) under the standard model hypothesis to calculate the p-value. If the p-value is below some threshold, discovery is often claimed. In high-energy physics the standard threshold is \\(\\sim 5\\times10^{-7}\\).
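
In the asymptotic approximation the p-value of an observed \\(q_{0}\\) is \\(1 - \\Phi(\\sqrt{q_{0}})\\), so the significance is simply \\(Z = \\sqrt{q_{0}}\\). A short sketch (the observed value is invented):

import numpy as np
from scipy.stats import norm

q0_obs = 29.5   # invented observed value of the discovery test statistic

Z = np.sqrt(q0_obs)   # significance in the asymptotic approximation
p = norm.sf(Z)        # p-value under the background-only hypothesis
print('Z = %.2f sigma, p-value = %.2e' % (Z, p))
print('5 sigma threshold corresponds to p =', norm.sf(5.0))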

"},{"location":"what_combine_does/statistical_tests/#limit-setting-using-the-likelihood-ratio","title":"Limit Setting using the likelihood ratio","text":"

Various test statistics built from likelihood ratios can be used for limit setting, i.e. excluding some parameter values.

One could set limits on a parameter \\(\\mu\\) by finding the values of \\(\\mu\\) that are outside the confidence regions defined above by using the likelihood ratio test statistic:

\\[ t_{\\mu} = -2\\log\\left(\\frac{\\mathcal{L}(\\mu)}{\\mathcal{L}(\\hat{\\mu})}\\right) \\]

However, this could \"exclude\" \\(\\mu = 0\\) or small values of \\(\\mu\\) at a typical limit setting confidence level, such as 95%, while still not claiming a discovery. This is considered undesirable, and often we only want to set upper limits on the value of \\(\\mu\\), rather than excluding any possible set of parameters outside our chosen confidence interval.

This can be done using a modified test statistic:

\\[ \\tilde{t}_{\\mu} = -2\\log\\left(\\frac{\\mathcal{L}(\\mu)}{\\mathcal{L}(\\min(\\mu,\\hat{\\mu}))}\\right) = \\begin{cases} -2\\log\\left(\\frac{\\mathcal{L}(\\mu)}{\\mathcal{L}(\\hat{\\mu})}\\right)& \\hat{\\mu} \\lt \\mu \\\\ 0 & \\mu \\leq \\hat{\\mu} \\end{cases} \\]

However, this can also have undesirable properties when the best fit value, \\(\\hat{\\mu}\\), is less than 0. In that case, we may set limits below 0. In order to avoid these situations, another modified test statistic can be used:

\\[ \\tilde{q}_{\\mu} = \\begin{cases} -2\\log\\left(\\frac{\\mathcal{L}(\\mu)}{\\mathcal{L}(\\mu = 0)}\\right)& \\hat{\\mu} \\lt 0 \\\\ -2\\log\\left(\\frac{\\mathcal{L}(\\mu)}{\\mathcal{L}(\\hat{\\mu})}\\right)& 0 \\lt \\hat{\\mu} \\lt \\mu \\\\ 0& \\mu \\lt \\hat{\\mu} \\end{cases} \\]

This test statistic also has a known distribution under appropriate conditions, or it can be estimated from pseudo-experiments. One can then set a limit at a given confidence level, \\(\\mathrm{CL}\\), by finding the value of \\(\\mu\\) for which \\(p_{\\mu} \\equiv p(t_{\\mu}(\\mathrm{data});\\mathcal{M}_{\\mu}) = 1 - \\mathrm{CL}\\). Larger values of \\(\\mu\\) will have smaller p-values and are considered excluded at the given confidence level.

However, this procedure is rarely used; in almost every case we use a modified test procedure based on the \\(\\mathrm{CL}_{s}\\) criterion, explained below.

"},{"location":"what_combine_does/statistical_tests/#the-cls-criterion","title":"The CLs criterion","text":"

Regardless of which of these test statistics is used, the standard test-methodology has some undesirable properties for limit setting.

Even for an experiment with almost no sensitivity to new physics, in 5% of experiments we expect to find \\(p_{\\mu} \\lt 0.05\\) for small values of \\(\\mu\\), and therefore to set limits on parameter values to which the experiment is not sensitive!

In order to avoid such situations the \\(\\mathrm{CL}_{s}\\) criterion was developed, as explained in these two papers. Rather than requiring \\(p_{\\mu} \\lt (1-\\mathrm{CL})\\) to exclude \\(\\mu\\), as would be done in the general framework described above, the \\(\\mathrm{CL}_{s}\\) criterion requires:

\\[ \\frac{p_{\\mu}}{1-p_{b}} \\lt (1-\\mathrm{CL}) \\]

where \\(p_{\\mu}\\) is the usual p-value, i.e. the probability of obtaining a value of the test statistic at least as extreme as the one observed, under the signal + background model with signal strength \\(\\mu\\), and \\(p_{b}\\) is the p-value for the background-only hypothesis, with the p-value defined using the opposite tail from the definition of \\(p_{\\mu}\\).
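
As a sketch of how the \\(\\mathrm{CL}_{s}\\) value can be estimated from pseudo-experiments (all numbers below are invented purely for illustration; in a real analysis the test-statistic values would come from toys generated by Combine):

import numpy as np\n\nrng = np.random.default_rng(42)\nq_sb_toys = rng.normal(2.0, 1.0, 10000)    # toy test-statistic values under signal+background (illustrative)\nq_b_toys = rng.normal(0.5, 1.0, 10000)     # toy test-statistic values under background only (illustrative)\nq_obs = 2.5                                # hypothetical observed value\n\np_mu = np.mean(q_sb_toys >= q_obs)         # p-value under the signal+background hypothesis\none_minus_pb = np.mean(q_b_toys >= q_obs)  # 1 - p_b, using the opposite-tail convention for p_b\n\nCLs = p_mu / one_minus_pb\nprint(CLs)                                 # mu would be excluded at 95% CL if CLs < 0.05\n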

Using the \\(\\mathrm{CL}_{s}\\) criterion fixes the issue of setting limits much stricter than the experimental sensitivity, because for values of \\(\\mu\\) to which the experiment is not sensitive the distribution of the test statistic under the signal hypothesis is nearly the same as under the background hypothesis. Therefore, given the use of opposite tails in the p-value definition, \\(p_{\\mu} \\approx 1-p_{b}\\), and the ratio approaches 1.

Note that this means that a limit set using the \\(\\mathrm{CL}_{s}\\) criterion at a given \\(\\mathrm{CL}\\) will exclude the true parameter value \\(\\mu\\) with a frequency less than the nominal rate of \\(1-\\mathrm{CL}\\). The actual frequency at which it is excluded depends on the sensitivity of the experiment to that parameter value.

"},{"location":"what_combine_does/statistical_tests/#goodness-of-fit-tests-using-the-likelihood-ratio","title":"Goodness of fit tests using the likelihood ratio","text":"

The likelihood ratio can also be used as a measure of goodness of fit, i.e. of how well the binned data match the model.

A standard likelihood-based measure of the goodness of fit is obtained from the log-likelihood ratio in which the likelihood in the denominator comes from the saturated model:

\\[ t_{\\mathrm{saturated}} \\propto -\\log\\left(\\frac{\\mathcal{L}_{\\mathcal{M}}}{\\mathcal{L}_{\\mathcal{M}_\\mathrm{saturated}}}\\right) \\]

Here \\(\\mathcal{M}\\) is whatever model one is testing the goodness of fit for, and the saturated model is a model for which the prediction matches the observed value in every bin. Typically, the saturated model would be one in which there are as many free parameters as bins.

This ratio therefore compares how well the actual data are fit with respect to a hypothetical optimal fit.

Unfortunately, the distribution of \\(t_{\\mathrm{saturated}}\\) is usually not known a priori and has to be estimated by generating pseudodata from the model \\(\\mathcal{M}\\) and calculating the empirical distribution of the statistic.

Once the distribution is determined, a p-value for the statistic can be derived which indicates the probability of observing data with that quality of fit given the model, and therefore serves as a measure of the goodness of fit.
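
In practice, this p-value is simply the fraction of pseudo-experiments whose test-statistic value is at least as large as the one obtained on the data. A minimal sketch with invented numbers (not part of Combine):

import numpy as np\n\nt_toys = np.random.chisquare(20, size=5000)  # stand-in for the toy distribution of the saturated statistic\nt_obs = 31.4                                 # stand-in for the value observed in data\n\np_value = np.mean(t_toys >= t_obs)           # fraction of toys at least as poorly fitting as the data\nprint(p_value)\n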

"},{"location":"what_combine_does/statistical_tests/#channel-compatibility-test-using-the-likelihood-ratio","title":"Channel Compatibility test using the likelihood ratio","text":"

When performing an analysis across many different channels (for example, different Higgs decay modes), it is often interesting to check the level of compatibility of the various channels.

Combine implements a channel compatibility test by considering a model, \\(\\mathcal{M}_{\\mathrm{c-independent}}\\), in which the signal is independent in every channel. As a test statistic, this test uses the likelihood ratio between the best fit value of the nominal model and the model with an independent signal strength for each channel:

\\[ t = -\\log\\left(\\frac{\\mathcal{L}_{\\mathcal{M}}(\\vec{\\hat{\\mu}},\\vec{\\hat{\\nu}})}{\\mathcal{L}_{\\mathcal{M}_{\\mathrm{c-indep}}}(\\vec{\\hat{\\mu}}_{c1}, \\vec{\\hat{\\mu}}_{c2}, ..., \\vec{\\hat{\\nu}})}\\right) \\]

The distribution of the test statistic is not known a priori, and needs to be calculated by generating pseudo-data samples.

"},{"location":"what_combine_does/statistical_tests/#other-statistical-tests","title":"Other Statistical Tests","text":"

While Combine is a likelihood-based statistical framework, it does not require that all statistical tests use the likelihood ratio.

"},{"location":"what_combine_does/statistical_tests/#other-goodness-of-fit-tests","title":"Other Goodness of Fit Tests","text":"

As well as the saturated goodness of fit test, defined above, Combine implements the Kolmogorov-Smirnov and Anderson-Darling goodness of fit tests.

For the Kolmogorov-Smirnov (KS) test, the test statistic is the maximum absolute difference between the cumulative distribution functions of the data and the model:

\\[ D = \\max_{x} | F_{\\mathcal{M}}(x) - F_{\\mathrm{data}}(x) | \\]

where \\(F(x)\\) is the cumulative distribution function (i.e. cumulative sum) of the model or data at point \\(x\\).

For the Anderson-Darling (AD) test, the test statistic is based on the integral of the square of the difference between the two cumulative distribution functions. The square difference is modified by a weighting function which gives more importance to differences in the tails:

\\[ A^2 = \\int_{x_{\\mathrm{min}}}^{x_{\\mathrm{max}}} \\frac{ (F_{\\mathcal{M}}(x) - F_{\\mathrm{data}}(x))^2}{ F_\\mathcal{M}(x) (1 - F_{\\mathcal{M}}(x)) } \\mathrm{d}F_\\mathcal{M}(x) \\]
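
The sketch below illustrates both definitions for a single binned distribution, using invented model and data yields and a simple discretisation of the integral; it is only meant to illustrate the formulas above and is not Combine's internal implementation:

import numpy as np\n\nmodel = np.array([4.0, 6.5, 9.0, 7.5, 3.0])   # hypothetical binned model prediction\ndata = np.array([3.0, 8.0, 10.0, 6.0, 3.0])   # hypothetical observed bin contents\n\nF_model = np.cumsum(model) / model.sum()      # cumulative distributions, normalised to unity\nF_data = np.cumsum(data) / data.sum()\n\nD = np.max(np.abs(F_model - F_data))          # Kolmogorov-Smirnov statistic\n\ndF = np.diff(np.concatenate(([0.0], F_model)))\n# drop the last bin, where F_model = 1 and the weight in the denominator vanishes\nFm, Fd, w = F_model[:-1], F_data[:-1], dF[:-1]\nA2 = np.sum((Fm - Fd) ** 2 / (Fm * (1.0 - Fm)) * w)   # discretised Anderson-Darling statistic\n\nprint(D, A2)\n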

Notably, both the Anderson-Darling and Kolmogorov-Smirnov tests rely on the cumulative distribution. Because the ordering of different channels of a model is not well defined, the tests themselves are not unambiguously defined over multiple channels.

"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Introduction","text":"

These pages document the RooStats / RooFit - based software tool used for statistical analysis within the CMS experiment - Combine. Note that while this tool was originally developed in the Higgs Physics Analysis Group (PAG), its usage is now widespread within CMS.

Combine provides a command-line interface to many different statistical techniques, available inside RooFit/RooStats, that are used widely inside CMS.

The package exists on GitHub under https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit

For more information about Git, GitHub and its usage in CMS, see http://cms-sw.github.io/cmssw/faq.html

The code can be checked out from GitHub and compiled on top of a CMSSW release that includes a recent RooFit/RooStats, or via standalone compilation without CMSSW dependencies. See the instructions for installation of Combine below.

"},{"location":"#installation-instructions","title":"Installation instructions","text":"

Installation instructions and recommended versions can be found below. Since v9.0.0, the versioning follows the semantic versioning 2.0.0 standard. Earlier versions are not guaranteed to follow the standard.

"},{"location":"#within-cmssw-recommended-for-cms-users","title":"Within CMSSW (recommended for CMS users)","text":"

The instructions below are for installation within a CMSSW environment. For end users who do not need to commit or do any development, the following recipes should be sufficient. To choose a release version, you can find the latest releases on GitHub under https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/releases

"},{"location":"#combine-v9-recommended-version","title":"Combine v9 - recommended version","text":"

The nominal installation method is inside CMSSW. The current release targets the CMSSW 11_3_X series because this release has both python2 and python3 ROOT bindings, allowing a more gradual migration of user code to python3. Combine is fully python3-compatible and, with some adaptations, can also work in 12_X releases.

CMSSW 11_3_X runs on slc7, which can be set up using apptainer (see detailed instructions):

cmssw-el7\ncmsrel CMSSW_11_3_4\ncd CMSSW_11_3_4/src\ncmsenv\ngit clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit\ncd HiggsAnalysis/CombinedLimit\n

Update to a recommended tag - currently the recommended tag is v9.2.0: see release notes

cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit\ngit fetch origin\ngit checkout v9.2.0\nscramv1 b clean; scramv1 b # always make a clean build\n
"},{"location":"#combine-v8-cmssw_10_2_x-release-series","title":"Combine v8: CMSSW_10_2_X release series","text":"

Setting up the environment (once):

cmssw-el7\ncmsrel CMSSW_10_2_13\ncd CMSSW_10_2_13/src\ncmsenv\ngit clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit\ncd HiggsAnalysis/CombinedLimit\n

Update to a recommended tag - currently the recommended tag is v8.2.0: see release notes

cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit\ngit fetch origin\ngit checkout v8.2.0\nscramv1 b clean; scramv1 b # always make a clean build\n
"},{"location":"#slc6cc7-release-cmssw_8_1_x","title":"SLC6/CC7 release CMSSW_8_1_X","text":"

Setting up OS using apptainer (see detailed instructions):

# For CC7:\ncmssw-el7\n# For SLC6:\ncmssw-el6\n\ncmsrel CMSSW_8_1_0\ncd CMSSW_8_1_0/src\ncmsenv\ngit clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit\ncd HiggsAnalysis/CombinedLimit\n

Update to a recommended tag - currently the recommended tag for CMSSW_8_1_X is v7.0.13:

cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit\ngit fetch origin\ngit checkout v7.0.13\nscramv1 b clean; scramv1 b # always make a clean build\n
"},{"location":"#oustide-of-cmssw-recommended-for-non-cms-users","title":"Oustide of CMSSW (recommended for non-CMS users)","text":"

Pre-compiled versions of the tool are available as container images from the CMS cloud. These containers can be downloaded and run using Docker. If you have docker running you can pull and run the latest image using,

docker run --name combine -it gitlab-registry.cern.ch/cms-cloud/combine-standalone:latest\n

You will now have the compiled Combine binary available, as well as the complete package of tools. The container can be restarted using docker start -i combine.

"},{"location":"#standalone-compilation","title":"Standalone compilation","text":"

The standalone version can be easily compiled using cvmfs as it relies on dependencies that are already installed at /cvmfs/cms.cern.ch/. Access to /cvmfs/cms.cern.ch/ can be obtained from lxplus machines or via CernVM. See CernVM for further details on the latter. In case you do not want to use the cvmfs area, you will need to adapt the locations of the dependencies listed in both the Makefile and env_standalone.sh files.

git clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit\ncd HiggsAnalysis/CombinedLimit/ \n# git checkout <some release>\n. env_standalone.sh\nmake -j 4\n

You will need to source env_standalone.sh each time you want to use the package, or add it to your login environment.

"},{"location":"#standalone-compilation-with-lcg","title":"Standalone compilation with LCG","text":"

For compilation outside of CMSSW, for example to use ROOT versions not yet available in CMSSW, one can compile against LCG releases. The current default is to compile with LCG_102, which contains ROOT 6.26:

git clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit\ncd HiggsAnalysis/CombinedLimit\nsource env_lcg.sh \nmake LCG=1 -j 8\n

To change the LCG version, edit env_lcg.sh.

The resulting binaries can be moved for use in a batch job if the following files are included in the job tarball:

tar -zcf Combine_LCG_env.tar.gz build interface src/classes.h --exclude=obj\n
"},{"location":"#standalone-compilation-with-conda","title":"Standalone compilation with conda","text":"

This recipe will work for both Linux and macOS.

git clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit\ncd HiggsAnalysis/CombinedLimit\n\nconda install --name base mamba # faster conda\nmamba env create -f conda_env.yml\n\nconda activate combine\nsource set_conda_env_vars.sh\n# Need to reactivate\nconda deactivate\nconda activate combine\n\nmake CONDA=1 -j 8\n

Using Combine from then on should only require sourcing the conda environment

conda activate combine\n

Note: on OS X, Combine can only accept workspaces, so run text2workspace.py first. This is due to an issue with child processes and LD_LIBRARY_PATH (see note in Makefile)

"},{"location":"#standalone-compilation-with-cernvm","title":"Standalone compilation with CernVM","text":"

Combine, either standalone or not, can be compiled via CVMFS using access to /cvmfs/cms.cern.ch/ obtained using a virtual machine - CernVM. To use CernVM you should have access to CERN IT resources. If you are a CERN user you can use your account, otherwise you can request a lightweight account. If you have a CERN user account, we strongly suggest you simply run one of the other standalone installations, which are simpler and faster than using a VM.

You should have a working VM on your local machine, compatible with CernVM, such as VirtualBox. All the required software can be downloaded here. At least 2GB of disk space should be reserved on the virtual machine for Combine to work properly and the machine must be contextualized to add the CMS group to CVMFS. A minimal working setup is described below.

  1. Download the CernVM launcher for your operating system, following the instructions available at https://cernvm.readthedocs.io/en/stable/cpt-launch.html#installation

  2. Prepare a CMS context. You can use the CMS open data one already available on GitHub: wget https://raw.githubusercontent.com/cernvm/public-contexts/master/cms-opendata-2011.context

  3. Launch the virtual machine cernvm-launch create --name combine --cpus 2 cms-opendata-2011.context

  4. In the VM, proceed with an installation of Combine

Installation through CernVM is maintained on a best-effort basis and these instructions may not be up to date.

"},{"location":"#what-has-changed-between-tags","title":"What has changed between tags?","text":"

You can generate a diff of any two tags (e.g. for v9.1.0 and v9.0.0) by using the following URL:

https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/compare/v9.0.0...v9.1.0

Replace the tag names in the URL with any tags you would like to compare.

"},{"location":"#for-developers","title":"For developers","text":"

We use the Fork and Pull model for development: each user creates a copy of the repository on GitHub, commits their requests there, and then sends pull requests for the administrators to merge.

Prerequisites

  1. Register on GitHub, as needed anyway for CMSSW development: http://cms-sw.github.io/cmssw/faq.html

  2. Register your SSH key on GitHub: https://help.github.com/articles/generating-ssh-keys

  3. Fork the repository to create your copy of it: https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit/fork (more documentation at https://help.github.com/articles/fork-a-repo )

You will now be able to browse your fork of the repository from https://github.com/your-github-user-name/HiggsAnalysis-CombinedLimit

We strongly encourage you to contribute any developments you make back to the main repository. See contributing.md for details about contributing.

"},{"location":"#combineharvestercombinetools","title":"CombineHarvester/CombineTools","text":"

CombineTools is an additional tool for submitting Combine jobs to batch systems or crab, which was originally developed in the context of Higgs to tau tau analyses. Since the repository contains a certain amount of analysis-specific code, the following scripts can be used to clone it with a sparse checkout for just the core CombineHarvester/CombineTools subpackage, speeding up the checkout and compile times:

git clone via ssh:

bash <(curl -s https://raw.githubusercontent.com/cms-analysis/CombineHarvester/main/CombineTools/scripts/sparse-checkout-ssh.sh)\n

git clone via https:

bash <(curl -s https://raw.githubusercontent.com/cms-analysis/CombineHarvester/main/CombineTools/scripts/sparse-checkout-https.sh)\n

Make sure to run scram to compile the CombineTools package.

See the CombineHarvester documentation pages for more details on using this tool and additional features available in the full package.

"},{"location":"CernVM/","title":"CernVM","text":""},{"location":"CernVM/#standalone-use-inside-cernvm","title":"Standalone use inside CernVM","text":"

Combine can be used standalone by adding the CMS group to the CVMFS configuration. A minimal CernVM working context setup can be found in the CernVM Marketplace under Experimental/HiggsCombine or at https://cernvm-online.cern.ch/context/view/9ee5960ce4b143f5829e72bbbb26d382. At least 2GB of disk space should be reserved on the virtual machine for Combine to work properly.

"},{"location":"CernVM/#available-machines-for-standalone-combine","title":"Available machines for standalone combine","text":"

The standalone version can be easily compiled via CVMFS as it relies on dependencies which are already installed at /cvmfs/cms.cern.ch/. Access to /cvmfs/cms.cern.ch/ can be obtained from lxplus machines or via CernVM. The only requirement will be to add the CMS group to the CVMFS configuration as shown in the picture

At least 2GB of disk space should be reserved on the virtual machine for combine to work properly. A minimal CernVM working context setup can be found in the CernVM Marketplace under Experimental/HiggsCombine.

To use this predefined context, first locally launch the CernVM (e.g. you can use the .ova with VirtualBox, by downloading it from here and launching the downloaded file). You can click on \"pair an instance of CernVM\" from the cernvm-online dashboard, which displays a PIN. In the VirtualBox terminal, pair the virtual machine with this PIN code (enter it in the terminal using #PIN, e.g. #123456). After this, you will be asked again for a username (use user) and then a password (use hcomb).

In case you do not want to use the cvmfs area, you will need to adapt the location of the dependencies listed in both the Makefile and env_standalone.sh files.

"},{"location":"releaseNotes/","title":"Release notes","text":""},{"location":"releaseNotes/#cmssw-10_2_x-v800","title":"CMSSW 10_2_X - v8.0.0","text":"

This release contains all of the changes listed for v7.0.13 below. In addition:

  • New documentation pages, using the mkdocs framework. The documentation source is included in the repository as simple markdown files. Users are welcome to make additions and corrections as pull requests to this repo.
  • It is now possible to include additional constraint terms for regularisation when unfolding using combine. Detailed documentation for this is given here.
  • The option -S 0 to remove all systematic uncertainties has been removed. Instead, to freeze all constrained nuisance parameters the option --freezeParameters allConstrainedNuisances should be used, which replaces the previous shortcut of --freezeParameters all.
  • The possibility to use some old method names has now been fully removed. When setting the -M option, FitDiagnostics, AsymptoticLimits and Significance must be used instead of, respectively, MaxLikelihoodFit, Asymptotic and ProfileLikelihood.
"},{"location":"releaseNotes/#cmssw-8_1_x-v7013","title":"CMSSW 8_1_X - v7.0.13","text":"
  • Nuisance edit selections for bins, processes or systematic names now require a complete string match. For example, nuisance edit add procA binA [...] will no longer match procAB and binAB. Note that regex selections can still be used to match multiple labels, but again are now required to match the full strings.
  • Nuisance parameters can now be frozen using attributes that have been assigned to the corresponding RooRealVars. Syntax is --freezeWithAttributes attr1,attr2,...,attrN.
  • For Higgs analyses: added YR4 cross sections, branching ratios and partial width uncertainties in data/lhc-hxswg/sm/, as used in HIG-17-031
  • [EXPERIMENTAL] For binned analyses using autoMCStats a faster implementation of the vertical template morphing for shape uncertainties can be enabled at runtime with the option --X-rtd FAST_VERTICAL_MORPH. Any results using this flag should be validated carefully against the default.
"},{"location":"part2/bin-wise-stats/","title":"Automatic statistical uncertainties","text":""},{"location":"part2/bin-wise-stats/#introduction","title":"Introduction","text":"

The text2workspace.py script is able to produce a type of workspace, using a set of new histogram classes, in which bin-wise statistical uncertainties are added automatically. This can be built for shape-based datacards where the inputs are in TH1 format. Datacards that use RooDataHists are not supported. The bin errors (i.e. values returned by TH1::GetBinError) are used to model the uncertainties.

By default the script will attempt to assign a single nuisance parameter to scale the sum of the process yields in each bin, constrained by the total uncertainty, instead of requiring separate parameters, one per process. This is sometimes referred to as the Barlow-Beeston-lite approach, and is useful as it minimises the number of parameters required in the maximum likelihood fit. A useful description of this approach may be found in section 5 of this report.

"},{"location":"part2/bin-wise-stats/#usage-instructions","title":"Usage instructions","text":"

The following line should be added at the bottom of the datacard, underneath the systematics, to produce a new-style workspace and optionally enable the automatic bin-wise uncertainties:

[channel] autoMCStats [threshold] [include-signal = 0] [hist-mode = 1]\n

The first string channel should give the name of the channels (bins) in the datacard for which the new histogram classes should be used. The wildcard * is supported for selecting multiple channels in one go. The value of threshold should be set to a value greater than or equal to zero to enable the creation of automatic bin-wise uncertainties, or -1 to use the new histogram classes without these uncertainties. A positive value sets the threshold on the effective number of unweighted events above which the uncertainty will be modeled with the Barlow-Beeston-lite approach described above. Below the threshold an individual uncertainty per-process will be created. The algorithm is described in more detail below.

The last two settings are optional. The first of these, include-signal, has a default value of 0 but can be set to 1 as an alternative. By default, the total nominal yield and uncertainty used to test the threshold excludes signal processes. The reason for this is that typically the initial signal normalization is arbitrary, and could unduly lead to a bin being considered well-populated despite poorly populated background templates. Setting this flag will include the signal processes in the uncertainty analysis. Note that this option only affects the logic for creating a single Barlow-Beeston-lite parameter vs. separate per-process parameters - the uncertainties on all signal processes are always included in the actual model! The second flag changes the way the normalization effect of shape-altering uncertainties is handled. In the default mode (1) the normalization is handled separately from the shape morphing via an asymmetric log-normal term. This is identical to how Combine has always handled shape morphing. When set to 2, the normalization will be adjusted in the shape morphing directly. Unless there is a strong motivation we encourage users to leave this on the default setting.
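
Putting this together, a concrete example of such a line, enabling the automatic uncertainties for all channels with a threshold of 10 effective unweighted events and leaving the optional settings at their defaults, would be (the threshold value here is purely illustrative):

* autoMCStats 10\n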

"},{"location":"part2/bin-wise-stats/#description-of-the-algorithm","title":"Description of the algorithm","text":"

When threshold is set to a number of effective unweighted events greater than or equal to zero, denoted \\(n^{\\text{threshold}}\\), the following algorithm is applied to each bin (a short sketch of this logic is given after the list):

  1. Sum the yields \\(n_{i}\\) and uncertainties \\(e_{i}\\) of each background process \\(i\\) in the bin. Note that the \\(n_i\\) and \\(e_i\\) include the nominal effect of any scaling parameters that have been set in the datacard, for example rateParams. \\(n_{\\text{tot}} = \\sum_{i\\,\\in\\,\\text{bkg}}n_i\\), \\(e_{\\text{tot}} = \\sqrt{\\sum_{i\\,\\in\\,\\text{bkg}}e_i^{2}}\\)
  2. If \\(e_{\\text{tot}} = 0\\), the bin is skipped and no parameters are created. If this is the case, it is a good idea to check why there is no uncertainty in the background prediction in this bin!
  3. The effective number of unweighted events is defined as \\(n_{\\text{tot}}^{\\text{eff}} = n_{\\text{tot}}^{2} / e_{\\text{tot}}^{2}\\), rounded to the nearest integer.
  4. If \\(n_{\\text{tot}}^{\\text{eff}} \\leq n^{\\text{threshold}}\\): separate uncertainties will be created for each process. Processes where \\(e_{i} = 0\\) are skipped. If the number of effective events for a given process is lower than \\(n^{\\text{threshold}}\\) a Poisson-constrained parameter will be created. Otherwise a Gaussian-constrained parameter is used.
  5. If \\(n_{\\text{tot}}^{\\text{eff}} \\gt n^{\\text{threshold}}\\): A single Gaussian-constrained Barlow-Beeston-lite parameter is created that will scale the total yield in the bin.
  6. Note that the values of \\(e_{i}\\), and therefore \\(e_{tot}\\), will be updated automatically in the model whenever the process normalizations change.
  7. A Gaussian-constrained parameter \\(x\\) has a nominal value of zero and scales the yield as \\(n_{\\text{tot}} + x \\cdot e_{\\text{tot}}\\). The Poisson-constrained parameters are expressed as a yield multiplier with nominal value one: \\(n_{\\text{tot}} \\cdot x\\).
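
The following python sketch summarises this per-bin decision logic; it is a simplified illustration of the steps above, not Combine's actual implementation, and the example yields are invented:

def classify_bin(yields, errors, threshold):\n    # yields, errors: nominal yields and uncertainties of the background processes in one bin\n    n_tot = sum(yields)\n    e_tot = sum(e ** 2 for e in errors) ** 0.5\n    if e_tot == 0:\n        return \"skipped: no uncertainty in this bin\"\n    n_eff = round(n_tot ** 2 / e_tot ** 2)\n    if n_eff > threshold:\n        return \"single Gaussian-constrained Barlow-Beeston-lite parameter\"\n    decisions = []\n    for n_i, e_i in zip(yields, errors):\n        if e_i == 0:\n            continue  # processes with zero uncertainty are skipped\n        n_eff_i = n_i ** 2 / e_i ** 2\n        decisions.append(\"Poisson\" if n_eff_i < threshold else \"Gaussian\")\n    return decisions\n\nprint(classify_bin([0.05, 0.14], [0.03, 0.14], threshold=10))\n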

The output from text2workspace.py will give details on how each bin has been treated by this algorithm, for example:

Show example output
============================================================\nAnalysing bin errors for: prop_binhtt_et_6_7TeV\nPoisson cut-off: 10\nProcesses excluded for sums: ZH qqH WH ggH\n============================================================\nBin        Contents        Error           Notes\n0          0.000000        0.000000        total sum\n0          0.000000        0.000000        excluding marked processes\n  => Error is zero, ignore\n------------------------------------------------------------\n1          0.120983        0.035333        total sum\n1          0.120983        0.035333        excluding marked processes\n1          12.000000       3.464102        Unweighted events, alpha=0.010082\n  => Total parameter prop_binhtt_et_6_7TeV_bin1[0.00,-7.00,7.00] to be gaussian constrained\n------------------------------------------------------------\n2          0.472198        0.232096        total sum\n2          0.472198        0.232096        excluding marked processes\n2          4.000000        2.000000        Unweighted events, alpha=0.118049\n  => Number of weighted events is below poisson threshold\n    ZH                   0.000000        0.000000\n      => Error is zero, ignore\n  ----------------------------------------------------------\n    W                    0.050606        0.029220\n                         3.000000        1.732051        Unweighted events, alpha=0.016869\n      => Product of prop_binhtt_et_6_7TeV_bin2_W[1.00,0.00,12.15] and const [3] to be poisson constrained\n  ----------------------------------------------------------\n    ZJ                   0.142444        0.140865\n                         1.000000        1.000000        Unweighted events, alpha=0.142444\n      => Product of prop_binhtt_et_6_7TeV_bin2_ZJ[1.00,0.00,30.85] and const [1] to be poisson constrained\n  ----------------------------------------------------------\n"},{"location":"part2/bin-wise-stats/#analytic-minimisation","title":"Analytic minimisation","text":"

One significant advantage of the Barlow-Beeston-lite approach is that the maximum likelihood estimate of each nuisance parameter has a simple analytic form that depends only on \\(n_{\\text{tot}}\\), \\(e_{\\text{tot}}\\) and the observed number of data events in the relevant bin. Therefore, when minimising the negative log-likelihood of the whole model, it is possible to remove these parameters from the fit and set them to their best-fit values automatically. For models with large numbers of bins this can reduce the fit time and increase the fit stability. The analytic minimisation is enabled by default starting in Combine v8.2.0; you can disable it by adding the option --X-rtd MINIMIZER_no_analytic when running Combine.


The figure below shows the performance of the analytic minimisation as a function of the number of bins in the likelihood function. The real time (in seconds) for a typical minimisation of a binned likelihood is shown as a function of the number of bins, comparing the analytic minimisation of the nuisance parameters with the default numerical approach.


"},{"location":"part2/bin-wise-stats/#technical-details","title":"Technical details","text":"

Up until recently text2workspace.py would only construct the PDF for each channel using a RooAddPdf, i.e. each component process is represented by a separate PDF and normalization coefficient. However, in order to model bin-wise statistical uncertainties, the alternative RooRealSumPdf can be more useful, as each process is represented by a RooFit function object instead of a PDF, and we can vary the bin yields directly. As such, a new RooFit histogram class CMSHistFunc is introduced, which offers the same vertical template morphing algorithms offered by the current default histogram PDF, FastVerticalInterpHistPdf2. Accompanying this is the CMSHistErrorPropagator class. This evaluates a sum of CMSHistFunc objects, each multiplied by a coefficient. It is also able to scale the summed yield of each bin to account for bin-wise statistical uncertainty nuisance parameters.


Warning


One disadvantage of this new approach comes when evaluating the expectation for individual processes, for example when using the --saveShapes option in the FitDiagnostics mode of Combine. The Barlow-Beeston-lite parameters scale the sum of the process yields directly, so extra work is needed to distribute this total scaling back to each individual process. To achieve this, an additional class CMSHistFuncWrapper has been created: given a particular CMSHistFunc, the CMSHistErrorPropagator will distribute an appropriate fraction of the total yield shift to each of its bins. As a consequence of the extra computation needed to distribute the yield shifts in this way, the evaluation of individual process shapes in --saveShapes can take longer than previously.

"},{"location":"part2/physicsmodels/","title":"Physics Models","text":"

Combine can be run directly on the text-based datacard. However, for more advanced physics models, the internal step to convert the datacard to a binary workspace should be performed by the user. To create a binary workspace starting from a datacard.txt, you can run

text2workspace.py datacard.txt -o workspace.root\n

By default (without the -o option), the binary workspace will be named datacard.root - i.e. the .txt suffix will be replaced by .root.

A full set of options for text2workspace can be found by running text2workspace.py --help.

The default model that will be produced when running text2workspace is one in which all processes identified as signal are multiplied by a common multiplier r. This is all that is needed for simply setting limits or calculating significances.

text2workspace will convert the datacard into a PDF that summarizes the analysis. For example, let's take a look at the data/tutorials/counting/simple-counting-experiment.txt datacard.

# Simple counting experiment, with one signal and one background process\n# Extremely simplified version of the 35/pb H->WW analysis for mH = 200 GeV,\n# for 4th generation exclusion (EWK-10-009, arxiv:1102.5429v1)\nimax 1  number of channels\njmax 1  number of backgrounds\nkmax 2  number of nuisance parameters (sources of systematical uncertainties)\n------------\n# we have just one channel, in which we observe 0 events\nbin         1\nobservation 0\n------------\n# now we list the expected events for signal and all backgrounds in that bin\n# the second 'process' line must have a positive number for backgrounds, and 0 for signal\n# then we list the independent sources of uncertainties, and give their effect (syst. error)\n# on each process and bin\nbin             1      1\nprocess       ggh4G  Bckg\nprocess         0      1\nrate           4.76  1.47\n------------\ndeltaS  lnN    1.20    -    20% uncertainty on signal\ndeltaB  lnN      -   1.50   50% uncertainty on background\n

If we run text2workspace.py on this datacard and take a look at the workspace (w) inside the .root file produced, we will find a number of different objects representing the signal, background, and observed event rates, as well as the nuisance parameters and signal strength \\(r\\). Note that often in the statistics literature, this parameter is referred to as \\(\\mu\\).

From these objects, the necessary PDF has been constructed (named model_s). For this counting experiment we will expect a simple PDF of the form

\\[ p(n_{\\mathrm{obs}}| r,\\nu_{S},\\nu_{B})\\propto \\dfrac{[r\\cdot n_{S}(\\nu_{S})+n_{B}(\\nu_{B})]^{n_{\\mathrm{obs}}} } {n_{\\mathrm{obs}}!}e^{-[r\\cdot n_{S}(\\nu_{S})+n_{B}(\\nu_{B})]} \\cdot e^{-\\frac{1}{2}(\\nu_{S}- y_{S})^{2}} \\cdot e^{-\\frac{1}{2}(\\nu_{B}- y_{B})^{2}} \\]

where the expected signal and background rates are expressed as functions of the nuisance parameters, \\(n_{S}(\\nu_{S}) = 4.76(1+0.2)^{\\nu_{S}}~\\) and \\(~n_{B}(\\nu_{B}) = 1.47(1+0.5)^{\\nu_{B}}\\). The \\(y_{S},~y_{B}\\) are the auxiliary observables. In the code, these will have the same name as the corresponding nuisance parameter, with the extension _In.

The first term represents the usual Poisson expression for observing \\(n_{\\mathrm{obs}}\\) events, while the second two are the Gaussian constraint terms for the nuisance parameters. In this case \\({y_S}={y_B}=0\\), and the widths of both Gaussians are 1.
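
As a concrete illustration, the (unnormalised) likelihood above can be evaluated numerically for given parameter values. This is only a minimal sketch of the expression in this section, assuming scipy is available; it is not how Combine evaluates the model:

from scipy.stats import norm, poisson\n\ndef likelihood(r, nu_S, nu_B, n_obs=0, y_S=0.0, y_B=0.0):\n    n_S = 4.76 * (1.0 + 0.2) ** nu_S   # signal rate with its 20% log-normal uncertainty\n    n_B = 1.47 * (1.0 + 0.5) ** nu_B   # background rate with its 50% log-normal uncertainty\n    expected = r * n_S + n_B\n    # Poisson term times the two Gaussian constraint terms (up to constant factors)\n    return poisson.pmf(n_obs, expected) * norm.pdf(nu_S, y_S) * norm.pdf(nu_B, y_B)\n\nprint(likelihood(r=1.0, nu_S=0.0, nu_B=0.0))\nprint(likelihood(r=0.0, nu_S=0.0, nu_B=0.0))\n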

A combination of counting experiments (or a binned shape datacard) will look like a product of PDFs of this kind. For parametric/unbinned analyses, the PDF for each process in each channel is provided instead of using the Poisson terms, and a product runs over the bin counts/events.

"},{"location":"part2/physicsmodels/#model-building","title":"Model building","text":"

For more complex models, PhysicsModels can be produced. To use a different physics model instead of the default one, use the option -P as in

text2workspace.py datacard -P HiggsAnalysis.CombinedLimit.PythonFile:modelName\n

Generic models can be implemented by writing a python class (a minimal sketch is given after this list) that:

  • defines the model parameters (by default it is just the signal strength modifier r)
  • defines how signal and background yields depend on the parameters (by default, the signal scales linearly with r, backgrounds are constant)
  • potentially also modifies the systematic uncertainties (e.g. switch off theory uncertainties on cross section when measuring the cross section itself)
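
A minimal sketch of such a class is shown below. The class name, process name and parameter ranges are made up for illustration; the structure (doParametersOfInterest, getYieldScale and a module-level instance that can be passed to -P) follows the pattern used by the models in PhysicsModel.py.

from HiggsAnalysis.CombinedLimit.PhysicsModel import PhysicsModel\n\nclass ExampleScaleOneProcess(PhysicsModel):\n    # scale any process named \"myproc\" by a single POI r_my, leave everything else unchanged\n    def doParametersOfInterest(self):\n        self.modelBuilder.doVar(\"r_my[1,0,10]\")\n        self.modelBuilder.doSet(\"POI\", \"r_my\")\n\n    def getYieldScale(self, bin, process):\n        return \"r_my\" if process == \"myproc\" else 1\n\nexampleScaleOneProcess = ExampleScaleOneProcess()\n

If this were saved in a file mymodels.py visible in PYTHONPATH, it could be used with text2workspace.py datacard.txt -P mymodels:exampleScaleOneProcess (all of these names are hypothetical).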

In the case of SM-like Higgs boson measurements, the class should inherit from SMLikeHiggsModel (redefining getHiggsSignalYieldScale), while beyond that one can inherit from PhysicsModel. You can find some examples in PhysicsModel.py.

In the 4-process model (PhysicsModel:floatingXSHiggs), you will see that each of the 4 dominant Higgs boson production modes gets a separate scaling parameter, r_ggH, r_qqH, r_ttH and r_VH (or r_ZH and r_WH), as defined in,

def doParametersOfInterest(self):\n  \"\"\"Create POI and other parameters, and define the POI set.\"\"\"\n  # --- Signal Strength as only POI ---\n  if \"ggH\" in self.modes: self.modelBuilder.doVar(\"r_ggH[1,%s,%s]\" % (self.ggHRange[0], self.ggHRange[1]))\n  if \"qqH\" in self.modes: self.modelBuilder.doVar(\"r_qqH[1,%s,%s]\" % (self.qqHRange[0], self.qqHRange[1]))\n  if \"VH\"  in self.modes: self.modelBuilder.doVar(\"r_VH[1,%s,%s]\"  % (self.VHRange [0], self.VHRange [1]))\n  if \"WH\"  in self.modes: self.modelBuilder.doVar(\"r_WH[1,%s,%s]\"  % (self.WHRange [0], self.WHRange [1]))\n  if \"ZH\"  in self.modes: self.modelBuilder.doVar(\"r_ZH[1,%s,%s]\"  % (self.ZHRange [0], self.ZHRange [1]))\n  if \"ttH\" in self.modes: self.modelBuilder.doVar(\"r_ttH[1,%s,%s]\" % (self.ttHRange[0], self.ttHRange[1]))\n  poi = \",\".join([\"r_\"+m for m in self.modes])\n  if self.pois: poi = self.pois\n  ...\n

The mapping of which POI scales which process is handled via the following function,

def getHiggsSignalYieldScale(self,production,decay, energy):\n  if production == \"ggH\": return (\"r_ggH\" if \"ggH\" in self.modes else 1)\n  if production == \"qqH\": return (\"r_qqH\" if \"qqH\" in self.modes else 1)\n  if production == \"ttH\": return (\"r_ttH\" if \"ttH\" in self.modes else (\"r_ggH\" if self.ttHasggH else 1))\n  if production in [ \"WH\", \"ZH\", \"VH\" ]: return (\"r_VH\" if \"VH\" in self.modes else 1)\n  raise RuntimeError(\"Unknown production mode '%s'\" % production)\n

You should note that text2workspace will look for the python module in PYTHONPATH. If you want to keep your model local, you'll need to add the location of the python file to PYTHONPATH.

A number of models used in the LHC Higgs combination paper can be found in LHCHCGModels.py.

The models can be applied to the datacard by using the -P option, for example -P HiggsAnalysis.CombinedLimit.HiggsCouplings:c7, and others that are defined in HiggsCouplings.py.

Below are some (more generic) example models that also exist in GitHub.

"},{"location":"part2/physicsmodels/#multisignalmodel-ready-made-model-for-multiple-signal-processes","title":"MultiSignalModel ready made model for multiple signal processes","text":"

Combine already contains a model HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel that can be used to assign different signal strengths to multiple processes in a datacard, configurable from the command line.

The model is configured by passing one or more mappings in the form --PO 'map=bin/process:parameter' to text2workspace:

  • bin and process can be arbitrary regular expressions matching the bin names and process names in the datacard. Note that mappings are applied both to signals and to background processes; if a line matches multiple mappings, precedence is given to the last one in the order they are in the command line. It is suggested to put quotes around the argument of --PO so that the shell does not try to expand any * signs in the patterns.
  • parameter is the POI to use to scale that process (name[starting_value,min,max] the first time a parameter is defined, then just name if used more than once). Special values are 1 and 0; 0 means \"drop the process completely from the model\", while 1 means \"keep the yield as is in the card with no scaling\" (as normally done for backgrounds); 1 is the default that is applied to processes that have no mappings. Therefore it is normally not needed, but it may be used to override a previous more generic match in the same command line (e.g. --PO 'map=.*/ggH:r[1,0,5]' --PO 'map=bin37/ggH:1' would treat ggH as signal in general, but count it as background in the channel bin37).

Passing the additional option --PO verbose will set the code to verbose mode, printing out the scaling factors for each process; we encourage the use of this option to make sure that the processes are being scaled correctly.

The MultiSignalModel will define all parameters as parameters of interest, but that can then be changed from the command line, as described in the following subsection.

Some examples, taking as reference the toy datacard test/multiDim/toy-hgg-125.txt:

  • Scale both ggH and qqH with the same signal strength r (that is what the default physics model of Combine does for all signals; if they all have the same systematic uncertainties, it is also equivalent to adding up their yields and writing them as a single column in the card)
  $ text2workspace.py -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel  --PO verbose --PO 'map=.*/ggH:r[1,0,10]' --PO 'map=.*/qqH:r' toy-hgg-125.txt -o toy-1d.root\n  [...]\n  Will create a POI  r  with factory  r[1,0,10]\n  Mapping  r  to  ['.*/ggH']  patterns\n  Mapping  r  to  ['.*/qqH']  patterns\n  [...]\n  Will scale  incl/bkg  by  1\n  Will scale  incl/ggH  by  r\n  Will scale  incl/qqH  by  r\n  Will scale  dijet/bkg  by  1\n  Will scale  dijet/ggH  by  r\n  Will scale  dijet/qqH  by  r\n
  • Define two independent parameters of interest r_ggH and r_qqH
  $ text2workspace.py -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel  --PO verbose --PO 'map=.*/ggH:r_ggH[1,0,10]' --PO 'map=.*/qqH:r_qqH[1,0,20]' toy-hgg-125.txt -o toy-2d.root\n  [...]\n  Will create a POI  r_ggH  with factory  r_ggH[1,0,10]\n  Mapping  r_ggH  to  ['.*/ggH']  patterns\n  Will create a POI  r_qqH  with factory  r_qqH[1,0,20]\n  Mapping  r_qqH  to  ['.*/qqH']  patterns\n  [...]\n  Will scale  incl/bkg  by  1\n  Will scale  incl/ggH  by  r_ggH\n  Will scale  incl/qqH  by  r_qqH\n  Will scale  dijet/bkg  by  1\n  Will scale  dijet/ggH  by  r_ggH\n  Will scale  dijet/qqH  by  r_qqH\n
  • Fix ggH to SM, define only qqH as parameter
  $ text2workspace.py -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel  --PO verbose --PO 'map=.*/ggH:1' --PO 'map=.*/qqH:r_qqH[1,0,20]' toy-hgg-125.txt -o toy-1d-qqH.root\n  [...]\n  Mapping  1  to  ['.*/ggH']  patterns\n  Will create a POI  r_qqH  with factory  r_qqH[1,0,20]\n  Mapping  r_qqH  to  ['.*/qqH']  patterns\n  [...]\n  Will scale  incl/bkg  by  1\n  Will scale  incl/ggH  by  1\n  Will scale  incl/qqH  by  r_qqH\n  Will scale  dijet/bkg  by  1\n  Will scale  dijet/ggH  by  1\n  Will scale  dijet/qqH  by  r_qqH\n
  • Drop ggH , and define only qqH as parameter
 $ text2workspace.py -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel  --PO verbose --PO 'map=.*/ggH:0' --PO 'map=.*/qqH:r_qqH[1,0,20]' toy-hgg-125.txt -o toy-1d-qqH0-only.root\n [...]\n Mapping  0  to  ['.*/ggH']  patterns\n Will create a POI  r_qqH  with factory  r_qqH[1,0,20]\n Mapping  r_qqH  to  ['.*/qqH']  patterns\n [...]\n Will scale  incl/bkg  by  1\n Will scale  incl/ggH  by  0\n Will scale  incl/qqH  by  r_qqH\n Will scale  dijet/bkg  by  1\n Will scale  dijet/ggH  by  0\n Will scale  dijet/qqH  by  r_qqH\n
"},{"location":"part2/physicsmodels/#two-hypothesis-testing","title":"Two Hypothesis testing","text":"

The PhysicsModel that encodes the signal model above is the twoHypothesisHiggs, which assumes signal processes with suffix _ALT will exist in the datacard. An example of such a datacard can be found under data/benchmarks/simple-counting/twoSignals-3bin-bigBSyst.txt

 $ text2workspace.py twoSignals-3bin-bigBSyst.txt -P HiggsAnalysis.CombinedLimit.HiggsJPC:twoHypothesisHiggs -m 125.7 --PO verbose -o jcp_hww.root\n\n MH (not there before) will be assumed to be 125.7\n Process  S  will get norm  not_x\n Process  S_ALT  will get norm  x\n Process  S  will get norm  not_x\n Process  S_ALT  will get norm  x\n Process  S  will get norm  not_x\n Process  S_ALT  will get norm  x\n

The two processes (S and S_ALT) will get different scaling parameters. The LEP-style likelihood for hypothesis testing can now be used by setting x or not_x to 1 and 0 and comparing the two likelihood evaluations.

"},{"location":"part2/physicsmodels/#signal-background-interference","title":"Signal-background interference","text":"

Since negative probability distribution functions do not exist, the recommended way to implement this is to start from the expression for the individual amplitudes \\(A\\) and the parameter of interest \\(k\\),

\\[ \\mathrm{Yield} = |k * A_{s} + A_{b}|^2 = k^2 * |A_{s}|^2 + k * 2 \\Re(A_{s}^* A_{b}) + |A_{b}|^2 = \\mu * S + \\sqrt{\\mu} * I + B \\]

where

\\(\\mu = k^2, ~S = |A_{s}|^2,~B = |A_b|^2\\) and \\(S+B+I = |A_s + A_b|^2\\).

With some algebra you can work out that,

\\(\\mathrm{Yield} = \\sqrt{\\mu} * \\left[S+B+I\\right] + (\\mu-\\sqrt{\\mu}) * \\left[S\\right] + (1-\\sqrt{\\mu}) * \\left[B\\right]\\)

where square brackets represent the input (histograms as TH1 or RooDataHists) that one needs to provide.

An example of this scheme is implemented in the HiggsWidth model and is completely general, since all three of the components above are strictly positive. In this example, the POI is CMS_zz4l_mu and the equations for the three components are scaled (separately for the qqH and ggH processes) as,

 self.modelBuilder.factory_( \"expr::ggH_s_func(\\\"@0-sqrt(@0)\\\", CMS_zz4l_mu)\")\n self.modelBuilder.factory_(  \"expr::ggH_b_func(\\\"1-sqrt(@0)\\\", CMS_zz4l_mu)\")\n self.modelBuilder.factory_(  \"expr::ggH_sbi_func(\\\"sqrt(@0)\\\", CMS_zz4l_mu)\")\n\n self.modelBuilder.factory_( \"expr::qqH_s_func(\\\"@0-sqrt(@0)\\\", CMS_zz4l_mu)\")\n self.modelBuilder.factory_(  \"expr::qqH_b_func(\\\"1-sqrt(@0)\\\", CMS_zz4l_mu)\")\n self.modelBuilder.factory_(  \"expr::qqH_sbi_func(\\\"sqrt(@0)\\\", CMS_zz4l_mu)\")\n
"},{"location":"part2/physicsmodels/#multi-process-interference","title":"Multi-process interference","text":"

The above formulation can be extended to multiple parameters of interest (POIs). See AnalyticAnomalousCoupling for an example. However, the computational performance scales quadratically with the number of POIs, and can get extremely expensive for 10 or more, as may be encountered often with EFT analyses. To alleviate this issue, an accelerated interference modeling technique is implemented for template-based analyses via the interferenceModel physics model. In this model, each bin yield \\(y\\) is parameterized

\\[ y(\\vec{\\mu}) = y_0 (\\vec{\\mu}^\\top M \\vec{\\mu}) \\]

as a function of the POI vector \\(\\vec{\\mu}\\), a nominal template \\(y_0\\), and a scaling matrix \\(M\\). To see how this parameterization relates to that of the previous section, we can define:

\\[ y_0 = A_b^2, \\qquad M = \\frac{1}{A_b^2} \\begin{bmatrix} |A_s|^2 & \\Re(A_s^* A_b) \\\\ \\Re(A_s A_b^*) & |A_b|^2 \\end{bmatrix}, \\qquad \\vec{\\mu} = \\begin{bmatrix} \\sqrt{\\mu} \\\\ 1 \\end{bmatrix} \\]

which leads to the same parameterization. At present, this technique only works with CMSHistFunc-based workspaces, as these are the most common workspace types encountered and the default when using autoMCStats. To use this model, for each bin find \\(y_0\\) and put it into the datacard as a signal process, then find \\(M\\) and save the lower triangular component as an array in a scaling.json file with a syntax as follows:

[\n  {\n    \"channel\": \"my_channel\",\n    \"process\": \"my_nominal_process\",\n    \"parameters\": [\"sqrt_mu[1,0,2]\", \"Bscaling[1]\"],\n    \"scaling\": [\n      [0.5, 0.1, 1.0],\n      [0.6, 0.2, 1.0],\n      [0.7, 0.3, 1.0]\n    ]\n  }\n]\n

where the parameters are declared using RooFit's factory syntax and each row of the scaling field represents the scaling information of a bin, e.g. if \\(y_0 = |A_b|^2\\) then each row would contain three entries:

\\[ |A_s|^2 / |A_b|^2,\\quad \\Re(A_s^* A_b)/|A_b|^2,\\quad 1 \\]
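
As a quick numerical cross-check of this parameterization, with arbitrary, purely illustrative real amplitudes, the matrix form reproduces the \\(\\mu S + \\sqrt{\\mu} I + B\\) expression from the previous section:

import numpy as np\n\nA_s, A_b, mu = 0.7, 1.3, 2.0                # illustrative real amplitudes and signal strength\nS, B, I = A_s**2, A_b**2, 2 * A_s * A_b     # for real amplitudes I = 2*Re(A_s* A_b) = 2*A_s*A_b\n\ny_direct = mu * S + np.sqrt(mu) * I + B     # mu*S + sqrt(mu)*I + B from the previous section\n\ny0 = B\nM = np.array([[S, A_s * A_b], [A_s * A_b, B]]) / B\nvec = np.array([np.sqrt(mu), 1.0])\ny_matrix = y0 * (vec @ M @ vec)             # y0 * (mu^T M mu)\n\nprint(y_direct, y_matrix)                   # the two expressions agree\n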

For several coefficients, one would enumerate as follows:

scaling = []\nfor ibin in range(nbins):\n    binscaling = []\n    for icoef in range(ncoef):\n        for jcoef in range(icoef + 1):\n            binscaling.append(amplitude_squared_for(ibin, icoef, jcoef))\n    scaling.append(binscaling)\n

Then, to construct the workspace, run

text2workspace.py card.txt -P HiggsAnalysis.CombinedLimit.InterferenceModels:interferenceModel \\\n    --PO verbose --PO scalingData=scaling.json\n

For large amounts of scaling data, you can optionally use gzipped json (.json.gz) or pickle (.pkl.gz) files with 2D numpy arrays for the scaling coefficients instead of lists. The function numpy.tril_indices(ncoef) is helpful for extracting the lower triangle of a square matrix.

You could pick any nominal template, and adjust the scaling as appropriate. Generally it is advisable to use a nominal template corresponding to near where you expect the best-fit values of the POIs to be so that the shape systematic effects are well-modeled in that region.

It may be the case that the relative contributions of the terms are themselves a function of the POIs. For example, in VBF di-Higgs production, BSM modifications to the production rate can be parameterized in the \"kappa\" framework via three diagrams, with scaling coefficients \\(\\kappa_V \\kappa_\\lambda\\), \\(\\kappa_V^2\\), and \\(\\kappa_{2V}\\), respectively, that interfere. In that case, you can declare formulas with the factory syntax to represent each amplitude as follows:

[\n  {\n    \"channel\": \"a_vbf_channel\",\n    \"process\": \"VBFHH\",\n    \"parameters\": [\"expr::a0('@0*@1', kv[1,0,2], kl[1,0,2])\", \"expr::a1('@0*@0', kv[1,0,2])\", \"k2v[1,0,2]\"],\n    \"scaling\": [\n      [3.30353674666415, -8.54170982038222, 22.96464188467882, 4.2353483207128, -11.07996258835088, 5.504469544697623],\n      [2.20644332142891, -7.076836641962523, 23.50989689214267, 4.053185685866683, -13.08569222837996, 7.502346155380032]\n    ]\n  }\n]\n

However, you will need to manually specify what the POIs should be when creating the workspace using the POIs= physics option, e.g.

text2workspace.py card.txt -P HiggsAnalysis.CombinedLimit.InterferenceModels:interferenceModel \\\n  --PO scalingData=scaling.json --PO 'POIs=kl[1,0,2]:kv[1,0,2]:k2v[1,0,2]'\n
"},{"location":"part2/settinguptheanalysis/","title":"Preparing the datacard","text":"

The input to Combine, which defines the details of the analysis, is a plain ASCII file we will refer to as datacard. This is true whether the analysis is a simple counting experiment or a shape analysis.

"},{"location":"part2/settinguptheanalysis/#a-simple-counting-experiment","title":"A simple counting experiment","text":"

The file data/tutorials/counting/realistic-counting-experiment.txt shows an example of a counting experiment.

The first lines can be used to add some descriptive information. Those lines must start with a \"#\", and they are not parsed by Combine:

# Simple counting experiment, with one signal and a few background processes\n# Simplified version of the 35/pb H->WW analysis for mH = 160 GeV\n

Following this, one declares the number of observables, imax, that are present in the model used to set limits / extract confidence intervals. The number of observables will typically be the number of channels in a counting experiment. The value * can be specified for imax, which tells Combine to determine the number of observables from the rest of the datacard. In order to better catch mistakes, it is recommended to explicitly specify the value.

imax 1  number of channels\n

This declaration is followed by a specification of the number of background sources to be considered, jmax, and the number of independent sources of systematic uncertainty, kmax:

jmax 3  number of backgrounds\nkmax 5  number of nuisance parameters (sources of systematic uncertainty)\n

In the example there is 1 channel, there are 3 background sources, and there are 5 independent sources of systematic uncertainty.

After providing this information, the following lines describe what is observed in data: the number of events observed in each channel. The first line, starting with bin, defines the label used for each channel. In the example we have 1 channel, labelled bin1, and in the following line, observation, the number of observed events is given: 0 in this example.

# we have just one channel, in which we observe 0 events\nbin bin1\nobservation 0\n

This is followed by information related to the expected number of events, for each bin and process, arranged in (#channels)*(#processes) columns.

bin          bin1     bin1     bin1     bin1\nprocess         ggH  qqWW  ggWW  others\nprocess          0     1     2     3\nrate           1.47  0.63  0.06  0.22\n
  • The bin line identifies the channel that the column refers to. It must match one of the channel labels declared above.
  • The first process line contains the names of the various process sources
  • The second process line is a numerical process identifier. Backgrounds are given a positive number, while 0 and negative numbers are used for signal processes. Different process identifiers must be used for different processes.
  • The last line, rate, gives the expected number of events for the given process in the specified bin

If a process does not contribute in a given bin, it can be removed from the datacard, or the rate can be set to 0.

The final section of the datacard describes the systematic uncertainties:

lumi    lnN    1.11    -   1.11    -    lumi affects both signal and gg->WW (mc-driven). lnN = lognormal\nxs_ggH  lnN    1.16    -     -     -    gg->H cross section + signal efficiency + other minor ones.\nWW_norm gmN 4    -   0.16    -     -    WW estimate of 0.64 comes from sidebands: 4 events in sideband times 0.16 (=> ~50% statistical uncertainty)\nxs_ggWW lnN      -     -   1.50    -    50% uncertainty on gg->WW cross section\nbg_others lnN    -     -     -   1.30   30% uncertainty on the rest of the backgrounds\n
  • The first column is the name of the nuisance parameter, a label that is used to identify the uncertainty
  • The second column identifies the type of distribution used to describe the nuisance parameter
    • lnN stands for Log-normal, which is the recommended choice for multiplicative corrections (efficiencies, cross sections, ...). If \u0394x/x is the relative uncertainty in the multiplicative correction, one should put 1+\u0394x/x in the column corresponding to the process and channel. Asymmetric log-normals are instead supported by providing \u03badown/\u03baup where \u03badown is the ratio of the yield to the nominal value for a -1\u03c3 deviation of the nuisance parameter and \u03baup is the ratio of the yield to the nominal value for a \\(+1\\sigma\\) deviation. Note that for a single-value log-normal with value \\(\\kappa=1+\\Delta x/x\\), the yield of the process it is associated with is multiplied by \\(\\kappa^{\\theta}\\). At \\(\\theta=0\\) the nominal yield is retained, at \\(\\theta=1\\sigma\\) the yield is multiplied by \\(\\kappa\\) and at \\(\\theta=-1\\sigma\\) the yield is multiplied by \\(1/\\kappa\\). This means that an uncertainty represented as 1.2 does not multiply the nominal yield by 0.8 for \\(\\theta=-1\\sigma\\), but by 0.8333. It may therefore be desirable to encode large uncertainties that have a symmetric effect on the yield as asymmetric log-normals instead.
    • gmN stands for Gamma, and is the recommended choice for the statistical uncertainty in a background determined from the number of events in a control region (or in an MC sample with limited sample size). If the control region or simulated sample contains N events, and the extrapolation factor from the control region to the signal region is \u03b1, one should put N just after the gmN keyword, and then the value of \u03b1 in the relevant (bin,process) column. The yield specified in the rate line for this (bin,process) combination should equal N\u03b1.
    • lnU stands for log-uniform distribution. A value of 1+\u03b5 in the column will imply that the yield of this background is allowed to float freely between x(1+\u03b5) and x/(1+\u03b5). In particular, if \u03b5 is small, this is approximately (x-\u0394x,x+\u0394x) with \u03b5=\u0394x/x. This distribution is typically useful when you want to set a large a-priori uncertainty on a given background process, and then rely on the correlation between channels to constrain it. Note that for this use case, we usually recommend using a rateParam instead. If you do use lnU, please be aware that while Gaussian-like uncertainties behave in a similar way under profiling and marginalization, uniform uncertainties do not. This means the impact of the uncertainty on the result will depend on how the nuisance parameters are treated.
  • The next (#channels)*(#processes) columns indicate the relative effect of the systematic uncertainty on the rate of each process in each channel. The columns are aligned with those in the previous lines declaring bins, processes, and rates.
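
To make the log-normal convention concrete, the short Python sketch below (an illustration with hypothetical numbers, not part of the datacard syntax) computes the yield multiplier \(\kappa^{\theta}\) for a symmetric log-normal with \(\kappa=1.2\).

# Illustration only: a symmetric log-normal scales the nominal yield by kappa**theta\nkappa = 1.2\nfor theta in (-1.0, 0.0, 1.0):\n    print(theta, kappa**theta)  # -1 -> 0.833..., 0 -> 1.0, +1 -> 1.2\n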

In the example, there are 5 uncertainties:

  • The first uncertainty has an 11% effect on the signal and on the ggWW process.
  • The second uncertainty affects the signal by 16%, but leaves the background processes unaffected
  • The third line specifies that the qqWW background comes from a sideband with 4 observed events and an extrapolation factor of 0.16; the resulting uncertainty in the expected yield is \\(1/\\sqrt{4+1}\\) = 45%
  • The fourth uncertainty does not affect the signal, has a 50% effect on the ggWW background, and leaves the other backgrounds unaffected
  • The fifth uncertainty does not affect the signal, has a 30% effect on the others background process, and does not affect the remaining backgrounds.
"},{"location":"part2/settinguptheanalysis/#shape-analyses","title":"Shape analyses","text":"

The datacard has to be supplemented with two extensions:

  • A new block of lines defining how channels and processes are mapped into shapes.
  • The block for systematics can now also contain rows with shape uncertainties.

The expected shape can be parametric, or not. In the first case the parametric PDFs have to be given as input to the tool. In the latter case, for each channel, histograms have to be provided for the expected shape of each process. The data have to be provided as input as a histogram to perform a binned shape analysis, and as a RooDataSet to perform an unbinned shape analysis.

Warning

If using RooFit-based inputs (RooDataHists/RooDataSets/RooAbsPdfs) then you need to ensure you are using different RooRealVars as the observable in each category entering the statistical analysis. It is possible to use the same RooRealVar if the observable has the same range (and binning if using binned data) in each category, although in most cases it is simpler to avoid doing this.

"},{"location":"part2/settinguptheanalysis/#rates-for-shape-analyses","title":"Rates for shape analyses","text":"

As with the counting experiment, the total nominal rate of a given process must be identified in the rate line of the datacard. However, there are special options for shape-based analyses, as follows:

  • A value of -1 in the rate line means Combine will calculate the rate from the input TH1 (via TH1::Integral) or RooDataSet/RooDataHist (via RooAbsData::sumEntries).
  • For parametric shapes (RooAbsPdf), if a parameter with the name pdfname_norm is found in the input workspace, the rate will be multiplied by the value of that parameter. Note that since this parameter can be freely floating, the normalization of a process can be left to float freely this way. This can also be achieved through the use of rateParams. A minimal sketch of such a workspace is shown below.
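
As a minimal sketch (hypothetical names, assuming PyROOT is available), a workspace containing a PDF called bkg and a freely floating normalization parameter bkg_norm, which Combine would then pick up as the normalization of that process, could be created as follows:

import ROOT\n\nw = ROOT.RooWorkspace(\"w\", \"w\")\n# Hypothetical background PDF named bkg; its freely floating normalization is bkg_norm\nw.factory(\"Exponential::bkg(x[0,100], tau[-0.02,-1,0])\")\nw.factory(\"bkg_norm[100, 0, 10000]\")\nw.writeToFile(\"parametric_input.root\")\n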
"},{"location":"part2/settinguptheanalysis/#binned-shape-analyses","title":"Binned shape analyses","text":"

For each channel, histograms have to be provided for the observed shape and for the expected shape of each process.

  • Within each channel, all histograms must have the same binning.
  • The normalization of the data histogram must correspond to the number of observed events.
  • The normalization of the expected histograms must match the expected event yields.

The Combine tool can take as input histograms saved as TH1, as RooDataHist in a RooFit workspace (an example of how to create a RooFit workspace and save histograms is available in github), or from a pandas dataframe (example).

The block of lines defining the mapping (first block in the datacard) contains one or more rows of the form

  • shapes process channel file histogram [histogram_with_systematics]

In this line,

  • process is any one of the process names, or * for all processes, or data_obs for the observed data;
  • channel is any one of the channel names, or * for all channels;
  • file, histogram and histogram_with_systematics identify the names of the files and of the histograms within the file, after making some replacements (if any are found):
    • $PROCESS is replaced with the process name (or \"data_obs\" for the observed data);
    • $CHANNEL is replaced with the channel name;
    • $SYSTEMATIC is replaced with the name of the systematic + (Up, Down);
    • $MASS is replaced with the chosen (Higgs boson) mass value that is passed as a command-line option when running the tool

In addition, user-defined keywords can be used. Any word in the datacard $WORD will be replaced by VALUE when including the option --keyword-value WORD=VALUE. This option can be repeated multiple times for multiple keywords.
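
For instance, with a hypothetical keyword ERA and input file shapes_$ERA.root, the mapping and the corresponding command could look like:

shapes * * shapes_$ERA.root $CHANNEL/$PROCESS $CHANNEL/$PROCESS_$SYSTEMATIC\n\ncombine -M AsymptoticLimits datacard.txt --keyword-value ERA=2018\n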

"},{"location":"part2/settinguptheanalysis/#template-shape-uncertainties","title":"Template shape uncertainties","text":"

Shape uncertainties can be taken into account by vertical interpolation of the histograms. The shapes (fraction of events \\(f\\) in each bin) are interpolated using a spline for shifts below +/- 1\u03c3 and linearly outside of that. Specifically, for nuisance parameter values \\(|\\nu|\\leq 1\\)

\\[ f(\\nu) = \\frac{1}{2} \\left( (\\delta^{+}-\\delta^{-})\\nu + \\frac{1}{8}(\\delta^{+}+\\delta^{-})(3\\nu^6 - 10\\nu^4 + 15\\nu^2) \\right) \\]

and for \\(|\\nu|> 1\\) (\\(|\\nu|<-1\\)), \\(f(\\nu)\\) is a straight line with gradient \\(\\delta^{+}\\) (\\(\\delta^{-}\\)), where \\(\\delta^{+}=f(\\nu=1)-f(\\nu=0)\\), and \\(\\delta^{-}=f(\\nu=-1)-f(\\nu=0)\\), derived using the nominal and up/down histograms. This interpolation is designed so that the values of \\(f(\\nu)\\) and its derivatives are continuous for all values of \\(\\nu\\).

The normalizations are interpolated linearly in log scale, just like we do for log-normal uncertainties. If the value in a given bin is negative for some value of \\(\\nu\\), the value will be truncated at 0.
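
The following Python sketch is only a numerical illustration of the formulas above (it is not Combine's actual implementation); it returns the interpolated bin content given the nominal, up, and down fractions:

def interp_fraction(nu, f_nom, f_up, f_down):\n    # delta+ = f(nu=1) - f(nu=0), delta- = f(nu=-1) - f(nu=0), per bin\n    d_up = f_up - f_nom\n    d_dn = f_down - f_nom\n    if abs(nu) <= 1.0:\n        poly = 3*nu**6 - 10*nu**4 + 15*nu**2\n        shift = 0.5*((d_up - d_dn)*nu + 0.125*(d_up + d_dn)*poly)\n    elif nu > 1.0:\n        shift = d_up*nu       # linear continuation above +1 sigma\n    else:\n        shift = d_dn*abs(nu)  # linear continuation below -1 sigma\n    return max(f_nom + shift, 0.0)  # negative values truncated at 0\n\nprint(interp_fraction(0.5, 0.10, 0.12, 0.09))\n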

For each shape uncertainty and process/channel affected by it, two additional input shapes have to be provided. These are obtained by shifting the parameter up and down by one standard deviation. When building the likelihood, each shape uncertainty is associated with a nuisance parameter taken from a unit Gaussian distribution, which is used to interpolate or extrapolate using the specified histograms.

For each given shape uncertainty, the part of the datacard describing shape uncertainties must contain a row

  • name shape effect_for_each_process_and_channel

The effect can be \"-\" or 0 for no effect, 1 for the normal effect, and something different from 1 to test larger or smaller effects (in that case, the unit gaussian is scaled by that factor before using it as parameter for the interpolation).

The datacard in data/tutorials/shapes/simple-shapes-TH1.txt provides an example of how to include shapes in the datacard. In the first block the following line specifies the shape mapping:

shapes * * simple-shapes-TH1.root $PROCESS $PROCESS_$SYSTEMATIC\n

The last block concerns the treatment of the systematic uncertainties that affect shapes. In this case there are two uncertainties with a shape-altering effect.

alpha  shape    -           1   uncertainty on background shape and normalization\nsigma  shape    0.5         -   uncertainty on signal resolution. Assume the histogram is a 2 sigma shift,\n#                                so divide the unit gaussian by 2 before doing the interpolation\n

There are two options for the interpolation algorithm in the \"shape\" uncertainty. Putting shape will result in an interpolation of the fraction of events in each bin. That is, the histograms are first normalized before interpolation. Putting shapeN will instead base the interpolation on the logs of the fraction in each bin. For both shape and shapeN, the total normalization is interpolated using an asymmetric log-normal, so that the effects of the systematic on both the shape and the normalization are accounted for. The following image shows a comparison of the two algorithms for the example datacard.

In this case there are two processes, signal and background, and two uncertainties affecting the background (alpha) and signal shapes (sigma). In the ROOT file, two histograms per systematic have to be provided. These are the shapes obtained, for the specific process, by shifting the parameter associated with the uncertainty up and down by one standard deviation: background_alphaUp and background_alphaDown, signal_sigmaUp and signal_sigmaDown.

The content of the ROOT file simple-shapes-TH1.root associated with the datacard data/tutorials/shapes/simple-shapes-TH1.txt is:

root [0]\nAttaching file simple-shapes-TH1.root as _file0...\nroot [1] _file0->ls()\nTFile**     simple-shapes-TH1.root\n TFile*     simple-shapes-TH1.root\n  KEY: TH1F signal;1    Histogram of signal__x\n  KEY: TH1F signal_sigmaUp;1    Histogram of signal__x\n  KEY: TH1F signal_sigmaDown;1  Histogram of signal__x\n  KEY: TH1F background;1    Histogram of background__x\n  KEY: TH1F background_alphaUp;1    Histogram of background__x\n  KEY: TH1F background_alphaDown;1  Histogram of background__x\n  KEY: TH1F data_obs;1  Histogram of data_obs__x\n  KEY: TH1F data_sig;1  Histogram of data_sig__x\n

For example, without shape uncertainties there would only be one row, shapes * * shapes.root $CHANNEL/$PROCESS. Then, to give a simple example for two channels (\"e\", \"mu\") with three processes (\"higgs\", \"zz\", \"top\"), the ROOT file contents should look like:

  • e/data_obs: observed data in the electron channel
  • e/higgs: expected shape for higgs in the electron channel
  • e/zz: expected shape for ZZ in the electron channel
  • e/top: expected shape for top in the electron channel
  • mu/data_obs: observed data in the muon channel
  • mu/higgs: expected shape for higgs in the muon channel
  • mu/zz: expected shape for ZZ in the muon channel
  • mu/top: expected shape for top in the muon channel

If there is also an uncertainty that affects the shape, e.g. the jet energy scale, shape histograms for the jet energy scale shifted up and down by one sigma need to be included. This could be done by creating a folder for each process and writing a line like

shapes * * shapes.root $CHANNEL/$PROCESS/nominal $CHANNEL/$PROCESS/$SYSTEMATIC

or a postfix can be added to the histogram name:

shapes * * shapes.root $CHANNEL/$PROCESS $CHANNEL/$PROCESS_$SYSTEMATIC

Warning

If you have a nuisance parameter that has shape effects on some processes (using shape) and rate effects on other processes (using lnN) you should use a single line for the systematic uncertainty with shape?. This will tell Combine to first look for Up/Down systematic templates for that process, and if it does not find them, it will interpret the number that you put for the process as a lnN instead.

For a detailed example of a template-based binned analysis, see the H\u2192\u03c4\u03c4 2014 DAS tutorial or our Tutorial pages.

"},{"location":"part2/settinguptheanalysis/#unbinned-or-parametric-shape-analyses","title":"Unbinned or parametric shape analyses","text":"

In some cases, it can be convenient to describe the expected signal and background shapes in terms of analytical functions, rather than templates. Typical examples are searches/measurements where the signal is apparent as a narrow peak over a smooth continuum background. In this context, uncertainties affecting the shapes of the signal and backgrounds can be implemented naturally as uncertainties in the parameters of those analytical functions. It is also possible to adopt an agnostic approach in which the parameters of the background model are left freely floating in the fit to the data, i.e. only requiring the background to be well described by a smooth function.

Technically, this is implemented by means of the RooFit package, which allows writing generic probability density functions, and saving them into ROOT files. The PDFs can be either taken from RooFit's standard library of functions (e.g. Gaussians, polynomials, ...) or hand-coded in C++, and combined together to form even more complex shapes.

In the datacard using templates, the column after the file name would have been the name of the histogram. For parametric analysis we need two names to identify the mapping, separated by a colon (:).

shapes process channel shapes.root workspace_name:pdf_name

The first part identifies the name of the input RooWorkspace containing the PDF, and the second part the name of the RooAbsPdf inside it (or, for the observed data, the RooAbsData). It is possible to have multiple input workspaces, just as there can be multiple input ROOT files. You can use any of the usual RooFit pre-defined PDFs for your signal and background models.

Warning

If in your model you are using RooAddPdfs, in which the coefficients are not defined recursively, Combine will not interpret them correctly. You can add the option --X-rtd ADDNLL_RECURSIVE=0 to any Combine command in order to recover the correct interpretation, however we recommend that you instead re-define your PDF so that the coefficients are recursive (as described in the RooAddPdf documentation) and keep the total normalization (i.e the extended term) as a separate object, as in the case of the tutorial datacard.

For example, take a look at the data/tutorials/shapes/simple-shapes-parametric.txt. We see the following line:

shapes * * simple-shapes-parametric_input.root w:$PROCESS\n[...]\nbin          1          1\nprocess      sig    bkg\n

which indicates that the input file simple-shapes-parametric_input.root should contain an input workspace (w) with PDFs named sig and bkg, since these are the names of the two processes in the datacard. Additionally, we expect there to be a data set named data_obs. If we look at the contents of the workspace in data/tutorials/shapes/simple-shapes-parametric_input.root, this is indeed what we see:

root [1] w->Print()\n\nRooWorkspace(w) w contents\n\nvariables\n---------\n(MH,bkg_norm,cc_a0,cc_a1,cc_a2,j,vogian_sigma,vogian_width)\n\np.d.f.s\n-------\nRooChebychev::bkg[ x=j coefList=(cc_a0,cc_a1,cc_a2) ] = 2.6243\nRooVoigtian::sig[ x=j mean=MH width=vogian_width sigma=vogian_sigma ] = 0.000639771\n\ndatasets\n--------\nRooDataSet::data_obs(j)\n

In this datacard, the signal is parameterized in terms of the hypothesized mass (MH). Combine will use this variable, instead of creating its own, and it will be set to the value passed with the -m option. For this reason, we should add the option -m 30 (or something else within the observable range) when running Combine. You will also see there is a variable named bkg_norm. This is used to normalize the background rate (see the section on Rate parameters below for details).

Warning

Combine will not accept RooExtendedPdfs as input. This is to alleviate a bug that led to improper treatment of the normalization when using multiple RooExtendedPdfs to describe a single process. You should instead use RooAbsPdfs and provide the rate as a separate object (see the Rate parameters section).

The part of the datacard related to the systematics can include lines with the syntax

  • name param X Y

These lines encode uncertainties in the parameters of the signal and background PDFs. The parameter is to be assigned a Gaussian uncertainty of Y around its mean value of X. One can change the mean value from 0 to 1 (or any value, if one so chooses) if the parameter in question is multiplicative instead of additive.

In the data/tutorials/shapes/simple-shapes-parametric.txt datacard, there are lines for one such parametric uncertainty,

sigma   param 1.0      0.1\n

meaning there is a parameter in the input workspace called sigma that should be constrained with a Gaussian centered at 1.0 with a width of 0.1. Note that the exact interpretation of these parameters is left to the user, since the signal PDF is constructed externally by you. All Combine knows is that 1.0 should be the most likely value and 0.1 is its 1\u03c3 uncertainty. Asymmetric uncertainties are written using the syntax -1\u03c3/+1\u03c3 in the datacard, as is the case for lnN uncertainties.
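
For example (with hypothetical values), an asymmetric -0.05/+0.10 constraint on the same parameter would read:

sigma   param 1.0      -0.05/+0.10\n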

If one wants to specify a parameter that is freely floating across its given range, and not Gaussian constrained, the following syntax is used:

  • name flatParam

Note that this is not strictly necessary in frequentist methods using profiled likelihoods, as Combine will still profile these nuisance parameters when performing fits (as is the case for the simple-shapes-parametric.txt datacard).

Warning

All parameters that are floating or constant in the user's input workspaces will remain floating or constant. Combine will not modify those for you!

A full example of a parametric analysis can be found in this H\u2192\u03b3\u03b3 2014 DAS tutorial or in our Tutorial pages.

"},{"location":"part2/settinguptheanalysis/#caveat-on-using-parametric-pdfs-with-binned-datasets","title":"Caveat on using parametric PDFs with binned datasets","text":"

Users should be aware of a feature that affects the use of parametric PDFs together with binned datasets.

RooFit uses the integral of the PDF, computed analytically (or numerically, but disregarding the binning), to normalize it, but computes the expected event yield in each bin by evaluating the PDF at the bin center. This means that if the variation of the PDF is sizeable within the bin, there is a mismatch between the sum of the event yields per bin and the PDF normalization, which can cause a bias in the fits. More specifically, the bias is present if the contribution of the second derivative of the PDF, integrated over the bin, is not negligible. For linear functions, an evaluation at the bin center is correct. There are two recommended ways to work around this issue:

1. Use narrow bins

It is recommended to use bins that are significantly finer than the characteristic scale of the PDFs. Even in the absence of this feature, this would be advisable. Note that this caveat does not apply to analyses using templates (they are constant across each bin, so there is no bias), or using unbinned datasets.

2. Use a RooParametricShapeBinPdf

Another solution (currently only implemented for 1-dimensional histograms) is to use a custom PDF that performs the correct integrals internally, as in RooParametricShapeBinPdf.

Note that this PDF class now allows parameters that are themselves RooAbsReal objects (i.e. functions of other variables). The integrals are handled internally by calling the underlying PDF's createIntegral() method with named ranges created for each of the bins. This means that if the analytical integrals for the underlying PDF are available, they will be used.

The constructor for this class requires a RooAbsReal (eg any RooAbsPdf) along with a list of RooRealVars (the parameters, excluding the observable \\(x\\)),

RooParametricShapeBinPdf(const char *name, const char *title,  RooAbsReal& _pdf, RooAbsReal& _x, RooArgList& _pars, const TH1 &_shape )\n

Below is a comparison of a fit to a binned dataset containing 1000 events with one observable \\(0 \\leq x \\leq 100\\). The fit function is a RooExponential of the form \\(e^{xp}\\).

In the upper plot, the data are binned in 100 evenly-spaced bins, while in the lower plot, there are three irregular bins. The blue lines show the result of the fit when using the RooExponential directly, while the red lines show the result when wrapping the PDF inside a RooParametricShapeBinPdf. In the narrow binned case, the two agree well, while for wide bins, accounting for the integral over the bin yields a better fit.

You should note that using this class will result in slower fits, so you should first decide whether the added accuracy is enough to justify the reduced efficiency.
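
To get a feeling for the size of the effect, the Python sketch below (a toy illustration, not part of Combine) compares the exact integral of a falling exponential over a single wide bin with the bin-center approximation:

import math\n\n# Toy example: density exp(-x/tau)/tau integrated over a wide bin [0, 20]\ntau = 10.0\nlo, hi = 0.0, 20.0\nexact = math.exp(-lo/tau) - math.exp(-hi/tau)            # exact integral of the density\napprox = (hi - lo) * math.exp(-0.5*(lo + hi)/tau) / tau  # bin-center value times width\nprint(exact, approx)  # ~0.865 vs ~0.736: a sizeable mismatch for a wide bin\n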

"},{"location":"part2/settinguptheanalysis/#beyond-simple-datacards","title":"Beyond simple datacards","text":"

Datacards can be extended in order to provide additional functionality and flexibility during runtime. These can also allow for the production of more complicated models and for producing more advanced results.

"},{"location":"part2/settinguptheanalysis/#rate-parameters","title":"Rate parameters","text":"

The overall rate \"expected\" of a particular process in a particular bin does not necessarily need to be a fixed quantity. Scale factors can be introduced to modify the rate directly in the datacards for ANY type of analysis. This can be achieved using the directive rateParam in the datacard with the following syntax,

name rateParam bin process initial_value [min,max]\n

The [min,max] argument is optional. If it is not included, Combine will remove the range of this parameter, so that it can take any value. The directive produces a new parameter in the model (unless a parameter with this name already exists), whose value multiplies the rate of that particular process in the given bin.

You can attach the same rateParam to multiple processes/bins by either using a wild card (eg * will match everything, QCD_* will match everything starting with QCD_, etc.) in the name of the bin and/or process, or by repeating the rateParam line in the datacard for different bins/processes with the same name.
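
For example (hypothetical bin and process names), a single scale factor attached to every process whose name starts with QCD_, in all bins, could be written as:

scale_qcd rateParam * QCD_* 1.0 [0.0,5.0]\n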

Warning

rateParam is not a shortcut to evaluate the post-fit yield of a process since other nuisance parameters can also change the normalization. E.g., finding that the rateParam best-fit value is 0.9 does not necessarily imply that the process yield is 0.9 times the initial yield. The best approach is to evaluate the yield taking into account the values of all nuisance parameters, using --saveNormalizations.

This parameter is, by default, freely floating. It is possible to include a Gaussian constraint on any rateParam that is floating (i.e not a formula or spline) by adding a param nuisance line in the datacard with the same name.
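
For example (hypothetical names), a rateParam for a background process can be given a 10% Gaussian constraint by adding a param line with the same name:

scale_bkg rateParam bin1 bkg 1.0\nscale_bkg param 1.0 0.1\n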

In addition to rate modifiers that are freely floating, modifiers that are functions of other parameters can be included using the following syntax,

name rateParam bin process formula args\n

where args is a comma-separated list of the arguments for the string formula. You can include other nuisance parameters in the formula, including ones that are Gaussian constrained (i.e. via the param directive).

Below is an example datacard that uses the rateParam directive to implement an ABCD-like method in Combine. For a more realistic description of its use for ABCD, see the single-lepton SUSY search implementation described here.

imax 4  number of channels\njmax 0  number of processes -1\nkmax *  number of nuisance parameters (sources of systematical uncertainties)\n-------\nbin                   B      C       D        A\nobservation           50    100      500      10\n-------\nbin                   B      C       D        A\nprocess               bkg    bkg     bkg      bkg\nprocess               1      1       1         1\nrate                  1      1       1         1\n-------\n\nalpha rateParam A bkg (@0*@1/@2) beta,gamma,delta\nbeta  rateParam B bkg 50\ngamma rateParam C bkg 100\ndelta rateParam D bkg 500\n

For more examples of using rateParam (eg for fitting process normalizations in control regions and signal regions simultaneously) see this 2016 CMS tutorial

Finally, any pre-existing RooAbsReal inside some ROOT file with a workspace can be imported using the following:

name rateParam bin process rootfile:workspacename\n

The name should correspond to the name of the object that is being picked up inside the RooWorkspace. A simple example using the SM XS and BR splines available in HiggsAnalysis/CombinedLimit can be found under data/tutorials/rate_params/simple_sm_datacard.txt

"},{"location":"part2/settinguptheanalysis/#extra-arguments","title":"Extra arguments","text":"

If a parameter is intended to be used, and it is not a user-defined param or rateParam, it can be picked up by first issuing an extArg directive before the line in which it is used in the datacard. The syntax for extArg is:

name extArg rootfile:workspacename\n

The string \":RecycleConflictNodes\" can be added at the end of the final argument (i.e. rootfile:workspacename:RecycleConflictNodes) to apply the corresponding RooFit option when the object is imported into the workspace. It is also possible to simply add a RooRealVar using extArg for use in function rateParams with the following

name extArg init [min,max]\n

Note that the [min,max] argument is optional and if not included, the code will remove the range of this parameter.
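
As a sketch with hypothetical names, a RooRealVar declared via extArg can then be used as an argument of a function rateParam:

lumi_scale extArg 1.0 [0.5,1.5]\nnorm_bkg rateParam bin1 bkg (@0*2.0) lumi_scale\n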

"},{"location":"part2/settinguptheanalysis/#manipulation-of-nuisance-parameters","title":"Manipulation of Nuisance parameters","text":"

It can often be useful to modify datacards, or the runtime behavior, without having to modify individual systematic lines. This can be achieved through nuisance parameter modifiers.

"},{"location":"part2/settinguptheanalysis/#nuisance-modifiers","title":"Nuisance modifiers","text":"

If a nuisance parameter needs to be renamed for certain processes/channels, it can be done using a single nuisance edit directive at the end of a datacard

nuisance edit rename process channel oldname newname [options]\n

Note that the wildcard (*) can be used for either a process, a channel, or both. This will have the effect that nuisance parameters affecting a given process/channel will be renamed, thereby decorrelating them between processes/channels. Use the option ifexists to skip/avoid an error if the nuisance parameter is not found. This kind of command will only affect nuisance parameters of type shape[N] and lnN. If you also want to change the names of param type nuisances, you can use a global version

nuisance edit rename oldname newname\n

which will rename all shape[N], lnN and param nuisances found in one go. You should make sure these commands come after any process/channel specific ones in the datacard. This version does not accept options.
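
For example (hypothetical parameter, process, and channel names), a process-specific rename followed by a global rename could look like:

nuisance edit rename ttbar * JES JES_ttbar ifexists\nnuisance edit rename lumi_13TeV lumi\n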

Other edits are also supported, as follows:

  • nuisance edit add process channel name pdf value [options] -> add a new nuisance parameter to a process
  • nuisance edit drop process channel name [options] -> remove this nuisance from the process/channel. Use the option ifexists to skip/avoid errors if the nuisance parameter is not found.
  • nuisance edit changepdf name newpdf -> change the PDF type of a given nuisance parameter to newpdf.
  • nuisance edit split process channel oldname newname1 newname2 value1 value2 -> split a nuisance parameter line into two separate nuisance parameters called newname1 and newname2 with values value1 and value2. This will produce two separate lines so that the original nuisance parameter oldname is split into two uncorrelated nuisances.
  • nuisance edit freeze name [options] -> set nuisance parameter frozen by default. Can be overridden on the command line using the --floatNuisances option. Use the option ifexists to skip/avoid errors if the nuisance parameter is not found.
  • nuisance edit merge process channel name1 name2 -> merge systematic name2 into name1 by adding their values in quadrature and removing name2. This only works if, for each process and channel included, the uncertainties both increase or both reduce the process yield. For example, you can add 1.1 to 1.2, but not to 0.9.

The above edits (excluding the renaming) support nuisance parameters of the types shape[N], lnN, lnU, gmN, param, flatParam, rateParam, or discrete.

"},{"location":"part2/settinguptheanalysis/#groups-of-nuisances","title":"Groups of nuisances","text":"

Often it is desirable to freeze one or more nuisance parameters to check the impact they have on limits, likelihood scans, significances etc.

However, for large groups of nuisance parameters (eg everything associated to theory) it is easier to define nuisance groups in the datacard. The following line in a datacard will, for example, produce a group of nuisance parameters with the group name theory that contains two parameters, QCDscale and pdf.

theory group = QCDscale pdf\n

Multiple groups can be defined in this way. It is also possible to extend nuisance parameters groups in datacards using += in place of =.

These groups can be manipulated at runtime (eg for freezing all nuisance parameters associated with a group at runtime, see Running the tool). You can find more info on groups of nuisances here.
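
For instance, the theory group defined above could be frozen in a fit using the --freezeNuisanceGroups option (the datacard name here is hypothetical):

combine -M Significance datacard.txt --freezeNuisanceGroups theory\n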

Note that when using the automatic addition of statistical uncertainties (autoMCStats), the corresponding nuisance parameters are created by text2workspace.py and so do not exist in the datacards. It is therefore not possible to add autoMCStats parameters to groups of nuisances in the way described above. However, text2workspace.py will automatically create a group labelled autoMCStats, which contains all autoMCStats parameters.

This group is useful for freezing all parameters created by autoMCStats. For freezing subsets of the parameters, for example if the datacard contains two categories, cat_label_1 and cat_label_2, to only freeze the autoMCStat parameters created for category cat_label_1, the regular expression features can be used. In this example this can be achieved by using --freezeParameters 'rgx{prop_bincat_label_1_bin.*}'.

"},{"location":"part2/settinguptheanalysis/#combination-of-multiple-datacards","title":"Combination of multiple datacards","text":"

If you have separate channels, each with their own datacard, it is possible to produce a combined datacard using the script combineCards.py

The syntax is simple: combineCards.py Name1=card1.txt Name2=card2.txt ... > card.txt. If the input datacards had just one bin each, the output channels will be called Name1, Name2, and so on. Otherwise, a prefix Name1_ ... Name2_ will be added to the bin labels in each datacard. The supplied bin names Name1, Name2, etc. must themselves conform to valid C++/python identifier syntax.
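
For example, two hypothetical single-bin cards could be combined as follows, producing output channels named ee and mumu:

combineCards.py ee=card_ee.txt mumu=card_mumu.txt > combined_card.txt\n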

Warning

When combining datacards, you should keep in mind that systematic uncertainties that have different names will be assumed to be uncorrelated, and those with the same name will be assumed 100% correlated. An uncertainty correlated across channels must have the same PDF in all cards (i.e. always lnN, or always gmN with the same N; note that shape and lnN can be interchanged via the shape? directive). Furthermore, when using parametric models, \"parameter\" objects such as RooRealVar, RooAbsReal, and RooAbsCategory (parameters, PDF indices etc) with the same name will be assumed to be the same object. If this is not intended, you may encounter unexpected behaviour, such as the order of combining cards having an impact on the results. Make sure that such objects are named differently in your inputs if they represent different things! In contrast, Combine will try to rename other \"shape\" objects (such as PDFs) automatically.

The combineCards.py script will fail if you are trying to combine a shape datacard with a counting datacard. You can however convert a counting datacard into an equivalent shape-based one by adding a line shapes * * FAKE in the datacard after the imax, jmax, and kmax section. Alternatively, you can add the option -S to combineCards.py, which will do this for you while creating the combined datacard.

"},{"location":"part2/settinguptheanalysis/#automatic-production-of-datacards-and-workspaces","title":"Automatic production of datacards and workspaces","text":"

For complicated analyses or cases in which multiple datacards are needed (e.g. optimization studies), you can avoid writing these by hand. The object Datacard defines the analysis and can be created as a python object. The template python script below will produce the same workspace as running text2workspace.py (see the section on Physics Models) on the realistic-counting-experiment.txt datacard.

from HiggsAnalysis.CombinedLimit.DatacardParser import *\nfrom HiggsAnalysis.CombinedLimit.ModelTools import *\nfrom HiggsAnalysis.CombinedLimit.ShapeTools import *\nfrom HiggsAnalysis.CombinedLimit.PhysicsModel import *\n\nfrom sys import exit\nfrom optparse import OptionParser\nparser = OptionParser()\naddDatacardParserOptions(parser)\noptions,args = parser.parse_args()\noptions.bin = True # make a binary workspace\n\nDC = Datacard()\nMB = None\n\n############## Setup the datacard (must be filled in) ###########################\n\nDC.bins =   ['bin1'] # <type 'list'>\nDC.obs =    {'bin1': 0.0} # <type 'dict'>\nDC.processes =  ['ggH', 'qqWW', 'ggWW', 'others'] # <type 'list'>\nDC.signals =    ['ggH'] # <type 'list'>\nDC.isSignal =   {'qqWW': False, 'ggWW': False, 'ggH': True, 'others': False} # <type 'dict'>\nDC.keyline =    [('bin1', 'ggH', True), ('bin1', 'qqWW', False), ('bin1', 'ggWW', False), ('bin1', 'others', False)] # <type 'list'>\nDC.exp =    {'bin1': {'qqWW': 0.63, 'ggWW': 0.06, 'ggH': 1.47, 'others': 0.22}} # <type 'dict'>\nDC.systs =  [('lumi', False, 'lnN', [], {'bin1': {'qqWW': 0.0, 'ggWW': 1.11, 'ggH': 1.11, 'others': 0.0}}), ('xs_ggH', False, 'lnN', [], {'bin1': {'qqWW': 0.0, 'ggWW': 0.0, 'ggH': 1.16, 'others': 0.0}}), ('WW_norm', False, 'gmN', [4], {'bin1': {'qqWW': 0.16, 'ggWW': 0.0, 'ggH': 0.0, 'others': 0.0}}), ('xs_ggWW', False, 'lnN', [], {'bin1': {'qqWW': 0.0, 'ggWW': 1.5, 'ggH': 0.0, 'others': 0.0}}), ('bg_others', False, 'lnN', [], {'bin1': {'qqWW': 0.0, 'ggWW': 0.0, 'ggH': 0.0, 'others': 1.3}})] # <type 'list'>\nDC.shapeMap =   {} # <type 'dict'>\nDC.hasShapes =  False # <type 'bool'>\nDC.flatParamNuisances =  {} # <type 'dict'>\nDC.rateParams =  {} # <type 'dict'>\nDC.extArgs =    {} # <type 'dict'>\nDC.rateParamsOrder  =  set([]) # <type 'set'>\nDC.frozenNuisances  =  set([]) # <type 'set'>\nDC.systematicsShapeMap =  {} # <type 'dict'>\nDC.nuisanceEditLines    =  [] # <type 'list'>\nDC.groups   =  {} # <type 'dict'>\nDC.discretes    =  [] # <type 'list'>\n\n\n###### User defined options #############################################\n\noptions.out      = \"combine_workspace.root\"     # Output workspace name\noptions.fileName = \"./\"             # Path to input ROOT files\noptions.verbose  = \"1\"              # Verbosity\n\n##########################################################################\n\nif DC.hasShapes:\n    MB = ShapeBuilder(DC, options)\nelse:\n    MB = CountingModelBuilder(DC, options)\n\n# Set physics models\nMB.setPhysics(defaultModel)\nMB.doModel()\n

Any existing datacard can be converted into such a template python script by using the --dump-datacard option in text2workspace.py, in case a more complicated template is needed.

Warning

The above is not advised for final results, as this script is not easily combined with other analyses, so it should only be used for internal studies.

For the automatic generation of datacards that are combinable, you should instead use the CombineHarvester package, which includes many features for producing complex datacards in a reliable, automated way.

"},{"location":"part2/settinguptheanalysis/#sanity-checking-the-datacard","title":"Sanity checking the datacard","text":"

For large combinations with multiple channels/processes etc, the .txt file can get unwieldy to read through. There are some simple tools to help check and summarize the contents of the cards.

In order to get a quick view of the systematic uncertainties included in the datacard, you can use the test/systematicsAnalyzer.py tool. This will produce a list of the systematic uncertainties (normalization and shape), indicating what type they are, which channels/processes they affect and the size of the effect on the normalization (for shape uncertainties, this will just be the overall uncertainty on the normalization).

The default output is a .html file that can be expanded to give more details about the effect of the systematic uncertainty for each channel/process. Add the option --format brief to obtain a simpler summary report direct to the terminal. An example output for the tutorial card data/tutorials/shapes/simple-shapes-TH1.txt is shown below.

$ python test/systematicsAnalyzer.py data/tutorials/shapes/simple-shapes-TH1.txt --all -f html > out.html\n

This will produce the following output in html format:

Nuisance Report
  • lumi (lnN): range 1.000 - 1.100, affects background, signal in bin1 [signal(1.1), background(1.0)]
  • alpha (shape): range 1.111 - 1.150, affects background in bin1 [background(0.900/1.150 (shape))]
  • bgnorm (lnN): range 1.000 - 1.300, affects background, signal in bin1 [signal(1.0), background(1.3)]
  • sigma (shape): range 1.000 - 1.000, affects signal in bin1 [signal(1.000/1.000 (shape))]

In case you only have a counting experiment datacard, include the option --noshape.

If you have a datacard that uses several rateParams or a Physics model that includes a complicated product of normalization terms in each process, you can check the values of the normalization (and which objects in the workspace comprise them) using the test/printWorkspaceNormalisations.py tool. As an example, the first few blocks of output for the tutorial card data/tutorials/counting/realistic-multi-channel.txt are given below:

Show example output
\n$ text2workspace.py data/tutorials/shapes/simple-shapes-parametric.txt -m 30\n$ python test/printWorkspaceNormalisations.py data/tutorials/counting/realistic-multi-channel.root                                                                                                           \n\n---------------------------------------------------------------------------\n---------------------------------------------------------------------------\nChannel - mu_tau\n---------------------------------------------------------------------------\n  Top-level normalisation for process ZTT -> n_exp_binmu_tau_proc_ZTT\n  -------------------------------------------------------------------------\nDumping ProcessNormalization n_exp_binmu_tau_proc_ZTT @ 0x6bbb610\n    nominal value: 329\n    log-normals (3):\n         kappa = 1.23, logKappa = 0.207014, theta = tauid = 0\n         kappa = 1.04, logKappa = 0.0392207, theta = ZtoLL = 0\n         kappa = 1.04, logKappa = 0.0392207, theta = effic = 0\n    asymm log-normals (0):\n    other terms (0):\n\n  -------------------------------------------------------------------------\n  default value =  329.0\n---------------------------------------------------------------------------\n  Top-level normalisation for process QCD -> n_exp_binmu_tau_proc_QCD\n  -------------------------------------------------------------------------\nDumping ProcessNormalization n_exp_binmu_tau_proc_QCD @ 0x6bbcaa0\n    nominal value: 259\n    log-normals (1):\n         kappa = 1.1, logKappa = 0.0953102, theta = QCDmu = 0\n    asymm log-normals (0):\n    other terms (0):\n\n  -------------------------------------------------------------------------\n  default value =  259.0\n---------------------------------------------------------------------------\n  Top-level normalisation for process higgs -> n_exp_binmu_tau_proc_higgs\n  -------------------------------------------------------------------------\nDumping ProcessNormalization n_exp_binmu_tau_proc_higgs @ 0x6bc6390\n    nominal value: 0.57\n    log-normals (3):\n         kappa = 1.11, logKappa = 0.10436, theta = lumi = 0\n         kappa = 1.23, logKappa = 0.207014, theta = tauid = 0\n         kappa = 1.04, logKappa = 0.0392207, theta = effic = 0\n    asymm log-normals (0):\n    other terms (1):\n         term r (class RooRealVar), value = 1\n\n  -------------------------------------------------------------------------\n  default value =  0.57\n---------------------------------------------------------------------------\n---------------------------------------------------------------------------\nChannel - e_mu\n---------------------------------------------------------------------------\n  Top-level normalisation for process ZTT -> n_exp_bine_mu_proc_ZTT\n  -------------------------------------------------------------------------\nDumping ProcessNormalization n_exp_bine_mu_proc_ZTT @ 0x6bc8910\n    nominal value: 88\n    log-normals (2):\n         kappa = 1.04, logKappa = 0.0392207, theta = ZtoLL = 0\n         kappa = 1.04, logKappa = 0.0392207, theta = effic = 0\n    asymm log-normals (0):\n    other terms (0):\n\n  -------------------------------------------------------------------------\n  default value =  88.0\n---------------------------------------------------------------------------\n

As you can see, for each channel, a report is given for the top-level rate object in the workspace, for each process contributing to that channel. You can also see the various terms that make up that rate. The default value corresponds to the default parameter values in the workspace (i.e. the values set when the workspace is created by text2workspace.py).

Another example is shown below for the workspace produced from the data/tutorials/shapes/simple-shapes-parametric.txt datacard.

Show example output
\n  text2workspace.py data/tutorials/shapes/simple-shapes-parametric.txt\n  python test/printWorkspaceNormalisations.py data/tutorials/shapes/simple-shapes-parametric.root\n  ...\n\n  ---------------------------------------------------------------------------\n  ---------------------------------------------------------------------------\n  Channel - bin1\n  ---------------------------------------------------------------------------\n    Top-level normalisation for process bkg -> n_exp_final_binbin1_proc_bkg\n    -------------------------------------------------------------------------\n  RooProduct::n_exp_final_binbin1_proc_bkg[ n_exp_binbin1_proc_bkg * shapeBkg_bkg_bin1__norm ] = 521.163\n   ... is a product, which contains  n_exp_binbin1_proc_bkg\n  RooRealVar::n_exp_binbin1_proc_bkg = 1 C  L(-INF - +INF)\n    -------------------------------------------------------------------------\n    default value =  521.163204829\n  ---------------------------------------------------------------------------\n    Top-level normalisation for process sig -> n_exp_binbin1_proc_sig\n    -------------------------------------------------------------------------\n  Dumping ProcessNormalization n_exp_binbin1_proc_sig @ 0x464f700\n      nominal value: 1\n      log-normals (1):\n           kappa = 1.1, logKappa = 0.0953102, theta = lumi = 0\n      asymm log-normals (0):\n      other terms (1):\n           term r (class RooRealVar), value = 1\n\n    -------------------------------------------------------------------------\n    default value =  1.0\n

This tells us that the normalization for the background process, named n_exp_final_binbin1_proc_bkg is a product of two objects n_exp_binbin1_proc_bkg * shapeBkg_bkg_bin1__norm. The first object is just from the rate line in the datacard (equal to 1) and the second is a floating parameter. For the signal, the normalisation is called n_exp_binbin1_proc_sig and is a ProcessNormalization object that contains the rate modifications due to the systematic uncertainties. You can see that it also has a \"nominal value\", which again is just from the value given in the rate line of the datacard (again=1).

"},{"location":"part3/commonstatsmethods/","title":"Common Statistical Methods","text":"

In this section, the most commonly used statistical methods from Combine will be covered, including specific instructions on how to obtain limits, significances, and likelihood scans. For all of these methods, the assumed parameter of interest (POI) is the overall signal strength \\(r\\) (i.e the default PhysicsModel). In general however, the first POI in the list of POIs (as defined by the PhysicsModel) will be taken instead of r. This may or may not make sense for any particular method, so care must be taken.

This section will assume that you are using the default physics model, unless otherwise specified.

"},{"location":"part3/commonstatsmethods/#asymptotic-frequentist-limits","title":"Asymptotic Frequentist Limits","text":"

The AsymptoticLimits method can be used to quickly compute an estimate of the observed and expected limits, which is accurate when the event yields are not too small and the systematic uncertainties do not play a major role in the result. The limit calculation relies on an asymptotic approximation of the distributions of the LHC test statistic, which is based on a profile likelihood ratio, under the signal and background hypotheses to compute two p-values \(p_{\mu}, p_{b}\) and therefore \(CL_s=p_{\mu}/(1-p_{b})\) (see the FAQ section for a description). In other words, it is the asymptotic approximation of the procedure for evaluating limits with frequentist toys using the LHC test statistic. In the definition below, the parameter \(\mu=r\).

  • The test statistic is defined using the ratio of likelihoods \\(q_{\\mu} = -2\\ln[\\mathcal{L}(\\mu,\\hat{\\hat{\\nu}}(\\mu))/\\mathcal{L}(\\hat{\\mu},\\hat{\\nu})]\\) , in which the nuisance parameters are profiled separately for \\(\\mu=\\hat{\\mu}\\) and \\(\\mu\\). The value of \\(q_{\\mu}\\) is set to 0 when \\(\\hat{\\mu}>\\mu\\), giving a one-sided limit. Furthermore, the constraint \\(\\mu>0\\) is enforced in the fit. This means that if the unconstrained value of \\(\\hat{\\mu}\\) would be negative, the test statistic \\(q_{\\mu}\\) is evaluated as \\(-2\\ln[\\mathcal{L}(\\mu,\\hat{\\hat{\\nu}}(\\mu))/\\mathcal{L}(0,\\hat{\\hat{\\nu}}(0))]\\)

This method is the default Combine method: if you call Combine without specifying -M, the AsymptoticLimits method will be run.

A realistic example of a datacard for a counting experiment can be found in the HiggsCombination package: data/tutorials/counting/realistic-counting-experiment.txt

The AsymptoticLimits method can be run using

combine -M AsymptoticLimits realistic-counting-experiment.txt\n

The program will print the limit on the signal strength r (number of signal events / number of expected signal events), e.g. Observed Limit: r < 1.6297 @ 95% CL, the median expected limit Expected 50.0%: r < 2.3111, and the edges of the 68% and 95% ranges for the expected limits.

 <<< Combine >>>\n>>> including systematics\n>>> method used to compute upper limit is AsymptoticLimits\n[...]\n -- AsymptoticLimits ( CLs ) --\nObserved Limit: r < 1.6281\nExpected  2.5%: r < 0.9640\nExpected 16.0%: r < 1.4329\nExpected 50.0%: r < 2.3281\nExpected 84.0%: r < 3.9800\nExpected 97.5%: r < 6.6194\n\nDone in 0.01 min (cpu), 0.01 min (real)\n

By default, the limits are calculated using the CLs prescription, as noted in the output, which takes the ratio of p-values under the signal plus background and background only hypotheses. This can be changed to the strict p-value by using the option --rule CLsplusb (note that CLsplusb is the jargon for calculating the p-value \(p_{\mu}\)). You can also change the confidence level (default is 95%) to 90%, or any other value, using the option --cl 0.9. You can find the full list of options for AsymptoticLimits using --help -M AsymptoticLimits.

Warning

You may find that Combine issues a warning that the best fit for the background-only Asimov dataset returns a nonzero value for the signal strength:

WARNING: Best fit of asimov dataset is at r = 0.220944 (0.011047 times rMax), while it should be at zero

If this happens, you should check to make sure that there are no issues with the datacard or the Asimov generation used for your setup. For details on debugging, it is recommended that you follow the simple checks used by the HIG PAG here.

The program will also create a ROOT file higgsCombineTest.AsymptoticLimits.mH120.root containing a ROOT tree limit that contains the limit values and other bookkeeping information. The important columns are limit (the limit value) and quantileExpected (-1 for observed limit, 0.5 for median expected limit, 0.16/0.84 for the edges of the 68% interval band of expected limits, 0.025/0.975 for 95%).

$ root -l higgsCombineTest.AsymptoticLimits.mH120.root\nroot [0] limit->Scan(\"*\")\n************************************************************************************************************************************\n*    Row   *     limit *  limitErr *        mh *      syst *      iToy *     iSeed *  iChannel *     t_cpu *    t_real * quantileE *\n************************************************************************************************************************************\n*        0 * 0.9639892 *         0 *       120 *         1 *         0 *    123456 *         0 *         0 *         0 * 0.0250000 *\n*        1 * 1.4329109 *         0 *       120 *         1 *         0 *    123456 *         0 *         0 *         0 * 0.1599999 *\n*        2 *  2.328125 *         0 *       120 *         1 *         0 *    123456 *         0 *         0 *         0 *       0.5 *\n*        3 * 3.9799661 *         0 *       120 *         1 *         0 *    123456 *         0 *         0 *         0 * 0.8399999 *\n*        4 * 6.6194028 *         0 *       120 *         1 *         0 *    123456 *         0 *         0 *         0 * 0.9750000 *\n*        5 * 1.6281188 * 0.0050568 *       120 *         1 *         0 *    123456 *         0 * 0.0035000 * 0.0055123 *        -1 *\n************************************************************************************************************************************\n
"},{"location":"part3/commonstatsmethods/#blind-limits","title":"Blind limits","text":"

The AsymptoticLimits calculation follows the frequentist paradigm for calculating expected limits. This means that the routine will first fit the observed data, conditionally for a fixed value of r, and set the nuisance parameters to the values obtained in the fit for generating the Asimov data set. This means it calculates the post-fit or a-posteriori expected limit. In order to use the pre-fit nuisance parameters (to calculate an a-priori limit), you must add the option --noFitAsimov or --bypassFrequentistFit.

For blinding the results completely (i.e not using the data) you can include the option --run blind.

Warning

While blind limits can also be obtained with -t -1 if the correct options are passed, we strongly recommend using --run blind.

"},{"location":"part3/commonstatsmethods/#splitting-points","title":"Splitting points","text":"

In case your model is particularly complex, you can perform the asymptotic calculation by determining the value of CLs for a set grid of points (in r) and merging the results. This is done by using the option --singlePoint X for multiple values of X, hadd'ing the output files and reading them back in,

combine -M AsymptoticLimits realistic-counting-experiment.txt --singlePoint 0.1 -n 0.1\ncombine -M AsymptoticLimits realistic-counting-experiment.txt --singlePoint 0.2 -n 0.2\ncombine -M AsymptoticLimits realistic-counting-experiment.txt --singlePoint 0.3 -n 0.3\n...\n\nhadd limits.root higgsCombine*.AsymptoticLimits.*\n\ncombine -M AsymptoticLimits realistic-counting-experiment.txt --getLimitFromGrid limits.root\n
"},{"location":"part3/commonstatsmethods/#asymptotic-significances","title":"Asymptotic Significances","text":"

The significance of a result is calculated using a ratio of profiled likelihoods, one in which the signal strength is set to 0 and the other in which it is free to float. The evaluated quantity is \\(-2\\ln[\\mathcal{L}(\\mu=0,\\hat{\\hat{\\nu}}(0))/\\mathcal{L}(\\hat{\\mu},\\hat{\\nu})]\\), in which the nuisance parameters are profiled separately for \\(\\mu=\\hat{\\mu}\\) and \\(\\mu=0\\).

The distribution of this test statistic can be determined using Wilks' theorem provided the number of events is large enough (i.e in the Asymptotic limit). The significance (or p-value) can therefore be calculated very quickly. The Significance method can be used for this.

It is also possible to calculate the ratio of likelihoods between the freely floating signal strength and a fixed signal strength other than 0, by specifying it with the option --signalForSignificance=X.
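
For example (datacard name hypothetical), to test against a fixed signal strength of 0.5 instead of 0:

combine -M Significance datacard.txt --signalForSignificance=0.5\n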

Info

This calculation assumes that the signal strength can only be positive (i.e we are not interested in negative signal strengths). This behaviour can be altered by including the option --uncapped.

"},{"location":"part3/commonstatsmethods/#compute-the-observed-significance","title":"Compute the observed significance","text":"

The observed significance is calculated using the Significance method, as

combine -M Significance datacard.txt

The printed output will report the significance and the p-value, for example, when using the realistic-counting-experiment.txt datacard, you will see

 <<< Combine >>>\n>>> including systematics\n>>> method used is Significance\n[...]\n -- Significance --\nSignificance: 0\n       (p-value = 0.5)\nDone in 0.00 min (cpu), 0.01 min (real)\n

which is not surprising since 0 events were observed in that datacard.

The output ROOT file will contain the significance value in the branch limit. To store the p-value instead, include the option --pvalue. The significance and p-value can be converted between one another using the RooStats functions RooStats::PValueToSignificance and RooStats::SignificanceToPValue.
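
As a short PyROOT sketch of this conversion (one-sided Gaussian tail convention):

import ROOT\n\nz = ROOT.RooStats.PValueToSignificance(0.0013499)  # ~3 sigma\np = ROOT.RooStats.SignificanceToPValue(z)\nprint(z, p)\n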

When calculating the significance, you may find it useful to resort to a brute-force fitting algorithm that scans the nll (repeating fits until a certain tolerance is reached), bypassing MINOS, which can be activated with the option bruteForce. This can be tuned using the options setBruteForceAlgo, setBruteForceTypeAndAlgo and setBruteForceTolerance.

"},{"location":"part3/commonstatsmethods/#computing-the-expected-significance","title":"Computing the expected significance","text":"

The expected significance can be computed from an Asimov data set of signal+background. There are two options for this:

  • a-posteriori expected: will depend on the observed dataset.
  • a-priori expected (the default behavior): does not depend on the observed dataset, and so is a good metric for optimizing an analysis when still blinded.

The a-priori expected significance from the Asimov dataset is calculated as

combine -M Significance datacard.txt -t -1 --expectSignal=1\n

In order to produce the a-posteriori expected significance, just generate a post-fit Asimov data set by adding the option --toysFreq in the command above.

The output format is the same as for observed significances: the variable limit in the tree will be filled with the significance (or with the p-value if you also include the option --pvalue)

"},{"location":"part3/commonstatsmethods/#bayesian-limits-and-credible-regions","title":"Bayesian Limits and Credible regions","text":"

Bayesian calculation of limits requires the user to assume a particular prior distribution for the parameter of interest (default r). You can specify the prior using the --prior option; the default is a flat prior in r.

"},{"location":"part3/commonstatsmethods/#computing-the-observed-bayesian-limit-for-simple-models","title":"Computing the observed bayesian limit (for simple models)","text":"

The BayesianSimple method computes a Bayesian limit performing classical numerical integration. This is very fast and accurate, but only works for simple models (a few channels and nuisance parameters).

combine -M BayesianSimple simple-counting-experiment.txt\n[...]\n\n -- BayesianSimple --\nLimit: r < 0.672292 @ 95% CL\nDone in 0.04 min (cpu), 0.05 min (real)\n

The output tree will contain a single entry corresponding to the observed 95% confidence level upper limit. The confidence level can be modified to 100*X% using --cl X.

"},{"location":"part3/commonstatsmethods/#computing-the-observed-bayesian-limit-for-arbitrary-models","title":"Computing the observed bayesian limit (for arbitrary models)","text":"

The MarkovChainMC method computes a Bayesian limit performing a Monte Carlo integration. From the statistical point of view it is identical to the BayesianSimple method, only the technical implementation is different. The method is slower, but can also handle complex models. For this method you can increase the accuracy of the result by increasing the number of Markov Chains, at the expense of a longer running time (option --tries, default is 10). Let's use the realistic counting experiment datacard to test the method.

To use the MarkovChainMC method, users need to specify this method in the command line, together with the options they want to use. For instance, to set the number of times the algorithm will run with different random seeds, use option --tries:

combine -M MarkovChainMC realistic-counting-experiment.txt --tries 100\n[...]\n\n -- MarkovChainMC --\nLimit: r < 2.20438 +/- 0.0144695 @ 95% CL (100 tries)\nAverage chain acceptance: 0.078118\nDone in 0.14 min (cpu), 0.15 min (real)\n

Again, the resulting limit tree will contain the result. You can also save the chains using the option --saveChain, which will then also be included in the output file.

Exclusion regions can be made from the posterior once an ordering principle is defined to decide how to grow the contour (there is an infinite number of possible regions that contain 68% of the posterior pdf). Below is a simple example script that can be used to plot the posterior distribution from these chains and calculate the smallest such region. Note that in this example we are ignoring the burn-in. This can be added by e.g. changing for i in range(mychain.numEntries()): to for i in range(200,mychain.numEntries()): for a burn-in of 200.

Show example script
\nimport ROOT\n\nrmin = 0\nrmax = 30\nnbins = 100\nCL = 0.95\nchains = \"higgsCombineTest.MarkovChainMC.blahblahblah.root\"\n\n# find the smallest interval containing at least a fraction CL of the posterior histogram\ndef findSmallestInterval(hist,CL):\n bins = hist.GetNbinsX()\n best_i = 1\n best_j = 1\n bd = bins+1\n val = 0\n for i in range(1,bins+1):\n   integral = hist.GetBinContent(i)\n   for j in range(i+1,bins+2):\n    integral += hist.GetBinContent(j)\n    if integral > CL :\n      val = integral\n      break\n   if integral > CL and  j-i < bd :\n     bd = j-i\n     best_j = j+1\n     best_i = i\n     val = integral\n return hist.GetBinLowEdge(best_i), hist.GetBinLowEdge(best_j), val\n\nfi_MCMC = ROOT.TFile.Open(chains)\n# Sum up all of the chains (or we could take the average limit)\nmychain = 0\nfor k in fi_MCMC.Get(\"toys\").GetListOfKeys():\n    if mychain == 0:\n        mychain = k.ReadObj().GetAsDataSet()\n    else:\n        mychain.append(k.ReadObj().GetAsDataSet())\n# fill the posterior histogram from the weighted chain entries\nhist = ROOT.TH1F(\"h_post\",\";r;posterior probability\",nbins,rmin,rmax)\nfor i in range(mychain.numEntries()):\n#for i in range(200,mychain.numEntries()): burn-in of 200\n  mychain.get(i)\n  hist.Fill(mychain.get(i).getRealValue(\"r\"), mychain.weight())\nhist.Scale(1./hist.Integral())\nhist.SetLineColor(1)\nvl,vu,trueCL = findSmallestInterval(hist,CL)\nhistCL = hist.Clone()\nfor b in range(nbins):\n  if histCL.GetBinLowEdge(b+1) < vl or histCL.GetBinLowEdge(b+2)>vu: histCL.SetBinContent(b+1,0)\nc6a = ROOT.TCanvas()\nhistCL.SetFillColor(ROOT.kAzure-3)\nhistCL.SetFillStyle(1001)\nhist.Draw()\nhistCL.Draw(\"histFsame\")\nhist.Draw(\"histsame\")\nll = ROOT.TLine(vl,0,vl,2*hist.GetBinContent(hist.FindBin(vl))); ll.SetLineColor(2); ll.SetLineWidth(2)\nlu = ROOT.TLine(vu,0,vu,2*hist.GetBinContent(hist.FindBin(vu))); lu.SetLineColor(2); lu.SetLineWidth(2)\nll.Draw()\nlu.Draw()\n\nprint(\" %g %% (%g %%) interval (target)  = %g < r < %g \" % (trueCL,CL,vl,vu))\n

Running the script on the output file produced for the same datacard (including the --saveChain option) will produce the following output

0.950975 % (0.95 %) interval (target)  = 0 < r < 2.2\n

along with a plot of the posterior distribution shown below. This is the same as the output from Combine, but the script can also be used to find lower limits (for example) or credible intervals.

An example to make contours when ordering by probability density can be found in bayesContours.cxx. Note that the implementation is simplistic, with no clever handling of bin sizes nor smoothing of statistical fluctuations.

The MarkovChainMC algorithm has many configurable parameters, and you are encouraged to experiment with those. The default configuration might not be the best for your analysis.

"},{"location":"part3/commonstatsmethods/#iterations-burn-in-tries","title":"Iterations, burn-in, tries","text":"

Three parameters control how the MCMC integration is performed:

  • the number of tries (option --tries): the algorithm will run multiple times with different random seeds. The truncated mean and RMS of the different results are reported. The default value is 10, which should be sufficient for a quick computation. For a more accurate result you might want to increase this number up to even ~200.
  • the number of iterations (option -i) determines how many points are proposed to fill a single Markov Chain. The default value is 10k, and a plausible range is between 5k (for quick checks) and 20-30k for lengthy calculations. Beyond 30k, it is more efficient to improve the accuracy by increasing the number of chains (option --tries) rather than the number of iterations.
  • the number of burn-in steps (option -b) is the number of points that are removed from the beginning of the chain before using it to compute the limit. The default is 200. If the chain is very long, we recommend increasing this value somewhat (e.g. to several hundred). Using fewer than 50 burn-in steps is likely to bias the result towards the early part of the chain, before it has reasonably converged.
"},{"location":"part3/commonstatsmethods/#proposals","title":"Proposals","text":"

The option --proposal controls the way new points are proposed to fill in the MC chain.

  • uniform: pick points at random. This works well if you have very few nuisance parameters (or none at all), but normally fails if you have many.
  • gaus: Use a product of independent Gaussians, one for each nuisance parameter. The sigma of the Gaussian for each variable is 1/5 of the range of the variable. This behaviour can be controlled using the parameter --propHelperWidthRangeDivisor. This proposal appears to work well for up to around 15 nuisance parameters, provided that the ranges of the nuisance parameters are of the order of \u00b15\u03c3. This method does not work when there are no nuisance parameters.
  • ortho (default): This proposal is similar to the multi-gaussian proposal. However, at every step only a single coordinate of the point is varied, so that the acceptance of the chain is high even for a large number of nuisance parameters (i.e. more than 20).
  • fit: Run a fit and use the uncertainty matrix from HESSE to construct a proposal (or the one from MINOS if the option --runMinos is specified). This can give biased results, so this method is not recommended in general.

If you believe there is something going wrong, e.g. if your chain remains stuck after accepting only a few events, the option --debugProposal can be used to obtain a printout of the first N proposed points. This can help you understand what is happening; for example, if you have a region of phase space with probability zero, the gaus and fit proposals can get stuck there forever.

"},{"location":"part3/commonstatsmethods/#computing-the-expected-bayesian-limit","title":"Computing the expected bayesian limit","text":"

The expected limit is computed by generating many toy MC data sets and computing the limit for each of them. This can be done by passing the option -t <number of toys>. For example, to run 100 toys with the BayesianSimple method, you can run

combine -M BayesianSimple datacard.txt -t 100\n

The program will print out the mean and median limit, as well as the 68% and 95% quantiles of the distributions of the limits. This time, the output ROOT tree will contain one entry per toy.

For heavier methods (e.g. MarkovChainMC) you will probably want to split this calculation into multiple jobs. To do this, just run Combine multiple times, specifying a smaller number of toys (as low as 1) and using a different seed to initialize the random number generator each time. The option -s can be used for this; if you set it to -1, the starting seed will be initialized randomly at the beginning of the job. Finally, you can merge the resulting trees with hadd and look at the distribution in the merged file.
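
A minimal sketch of such a splitting is shown below; the number of jobs, the number of toys per job and the -n labels are arbitrary choices, and the hadd wildcard assumes the default higgsCombine<name>.<method>... output file naming.

# run 10 jobs of 10 toys each, with random seeds and distinct names\nfor i in $(seq 1 10); do\n    combine -M MarkovChainMC realistic-counting-experiment.txt -t 10 -s -1 -n .toys.job${i}\ndone\n# merge the output trees and inspect the limit distribution in the merged file\nhadd merged_toys.root higgsCombine.toys.job*.MarkovChainMC.*.root\n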

"},{"location":"part3/commonstatsmethods/#multidimensional-bayesian-credible-regions","title":"Multidimensional bayesian credible regions","text":"

The MarkovChainMC method allows the user to produce the posterior PDF as a function of (in principle) any number of POIs. In order to do so, you first need to create a workspace with more than one parameter, as explained in the physics models section.

For example, let us use the toy datacard data/tutorials/multiDim/toy-hgg-125.txt (counting experiment that vaguely resembles an early H\u2192\u03b3\u03b3 analysis at 125 GeV) and convert the datacard into a workspace with 2 parameters, the ggH and qqH cross sections, using text2workspace.

text2workspace.py data/tutorials/multiDim/toy-hgg-125.txt -P HiggsAnalysis.CombinedLimit.PhysicsModel:floatingXSHiggs --PO modes=ggH,qqH -o workspace.root\n

Now we just run one (or more) MCMC chain(s) and save them in the output tree. By default, the nuisance parameters will be marginalized (integrated) over their PDFs. You can ignore the complaints about not being able to compute an upper limit (since for more than 1D, this is not well-defined),

combine -M MarkovChainMC workspace.root --tries 1 --saveChain -i 1000000 -m 125 -s 12345\n

The output of the Markov Chain is again a RooDataSet of weighted events distributed according to the posterior PDF (after you cut out the burn in part), so it can be used to make histograms or other distributions of the posterior PDF. See as an example bayesPosterior2D.cxx.

Below is an example of the output of the macro,

$ root -l higgsCombineTest.MarkovChainMC....\n.L bayesPosterior2D.cxx\nbayesPosterior2D(\"bayes2D\",\"Posterior PDF\")\n

"},{"location":"part3/commonstatsmethods/#computing-limits-with-toys","title":"Computing Limits with toys","text":"

The HybridNew method is used to compute either the hybrid bayesian-frequentist limits, popularly known as \"CLs of LEP or Tevatron type\", or the fully frequentist limits, which are currently the method recommended by the LHC Higgs Combination Group. Note that these methods can be resource intensive for complex models.

The criterion used for setting limits can be specified with --rule CLs (to use the CLs criterion) or --rule CLsplusb (to calculate the limit using \\(p_{\\mu}\\)); as always, the desired confidence level is set using --cl=X.

The choice of test statistic can be made via the option --testStat. Different methodologies for the treatment of the nuisance parameters are available. While it is possible to mix different test statistics with different nuisance parameter treatments, we strongly recommend against this. Instead, one should follow one of the three procedures below. Note that the signal strength \\(r\\) is denoted here by the more common notation \\(\\mu\\).

  • LEP-style: --testStat LEP --generateNuisances=1 --fitNuisances=0

    • The test statistic is defined using the ratio of likelihoods \\(q_{\\mathrm{LEP}}=-2\\ln[\\mathcal{L}(\\mu=0)/\\mathcal{L}(\\mu)]\\).
    • The nuisance parameters are fixed to their nominal values for the purpose of evaluating the likelihood, while for generating toys, the nuisance parameters are first randomized within their PDFs before generation of the toy.
  • TEV-style: --testStat TEV --generateNuisances=0 --generateExternalMeasurements=1 --fitNuisances=1

    • The test statistic is defined using the ratio of likelihoods \\(q_{\\mathrm{TEV}}=-2\\ln[\\mathcal{L}(\\mu=0,\\hat{\\hat{\\nu}}(0))/\\mathcal{L}(\\mu,\\hat{\\hat{\\nu}}(\\mu))]\\), in which the nuisance parameters are profiled separately for \\(\\mu=0\\) and \\(\\mu\\).
    • For the purposes of toy generation, the nuisance parameters are fixed to their post-fit values from the data (conditional on \\(\\mu\\)), while the constraint terms are randomized for the evaluation of the likelihood.
  • LHC-style: --LHCmode LHC-limits , which is the shortcut for --testStat LHC --generateNuisances=0 --generateExternalMeasurements=1 --fitNuisances=1

    • The test statistic is defined using the ratio of likelihoods \\(q_{\\mu} = -2\\ln[\\mathcal{L}(\\mu,\\hat{\\hat{\\nu}}(\\mu))/\\mathcal{L}(\\hat{\\mu},\\hat{\\nu})]\\), in which the nuisance parameters are profiled separately for \\(\\mu=\\hat{\\mu}\\) and \\(\\mu\\).
    • The value of \\(q_{\\mu}\\) is set to 0 when \\(\\hat{\\mu}>\\mu\\), giving a one-sided limit. Furthermore, the constraint \\(\\mu>0\\) is enforced in the fit. This means that if the unconstrained value of \\(\\hat{\\mu}\\) would be negative, the test statistic \\(q_{\\mu}\\) is evaluated as \\(-2\\ln[\\mathcal{L}(\\mu,\\hat{\\hat{\\nu}}(\\mu))/\\mathcal{L}(0,\\hat{\\hat{\\nu}}(0))]\\).
    • For the purposes of toy generation, the nuisance parameters are fixed to their post-fit values from the data (conditionally on the value of \\(\\mu\\)), while the constraint terms are randomized in the evaluation of the likelihood.

Warning

The recommended style is the LHC-style. Please note that this method is sensitive to the observation in data, since the post-fit (after a fit to the data) values of the nuisance parameters (assuming different values of r) are used when generating the toys. For completely blind limits you can first generate a pre-fit Asimov toy data set (described in the toy data generation section) and use that in place of the data. You can use this toy by passing the argument -D toysFileName.root:toys/toy_asimov

While the above shortcuts are the commonly used versions, variations can be tested. The treatment of the nuisances can be changed to the so-called \"Hybrid-Bayesian\" method, which effectively integrates over the nuisance parameters. This is especially relevant when you have very few expected events in your data, and you are using those events to constrain background processes. This can be achieved by setting --generateNuisances=1 --generateExternalMeasurements=0. In case you want to avoid first fitting to the data to choose the nominal values you can additionally pass --fitNuisances=0.

Warning

If you have unconstrained parameters in your model (rateParam, or if you are using a _norm variable for a PDF) and you want to use the \"Hybrid-Bayesian\" method, you must declare these as flatParam in your datacard. When running text2workspace you must add the option --X-assign-flatParam-prior in the command line. This will create uniform priors for these parameters. These are needed for this method and they would otherwise not get created.

Info

Note that (observed and expected) values of the test statistic stored in the instances of RooStats::HypoTestResult when the option --saveHybridResult is passed are defined without the factor 2. They are therefore twice as small as the values given by the formulas above. This factor is however included automatically by all plotting scripts supplied within the Combine package. If you use your own plotting scripts, you need to make sure to incorporate the factor 2.

"},{"location":"part3/commonstatsmethods/#simple-models","title":"Simple models","text":"

For relatively simple models, the observed and expected limits can be calculated interactively. Since the LHC-style is the recommended set of options for calculating limits using toys, we will use that in this section. However, the same procedure can be followed with the other sets of options.

combine realistic-counting-experiment.txt -M HybridNew --LHCmode LHC-limits\n
Show output
 <<< Combine >>>\n>>> including systematics\n>>> using the Profile Likelihood test statistics modified for upper limits (Q_LHC)\n>>> method used is HybridNew\n>>> random number generator seed is 123456\nComputing results starting from observation (a-posteriori)\nSearch for upper limit to the limit\n  r = 20 +/- 0\n    CLs = 0 +/- 0\n    CLs      = 0 +/- 0\n    CLb      = 0.264 +/- 0.0394263\n    CLsplusb = 0 +/- 0\n\nSearch for lower limit to the limit\nNow doing proper bracketing & bisection\n  r = 10 +/- 10\n    CLs = 0 +/- 0\n    CLs      = 0 +/- 0\n    CLb      = 0.288 +/- 0.0405024\n    CLsplusb = 0 +/- 0\n\n  r = 5 +/- 5\n    CLs = 0 +/- 0\n    CLs      = 0 +/- 0\n    CLb      = 0.152 +/- 0.0321118\n    CLsplusb = 0 +/- 0\n\n  r = 2.5 +/- 2.5\n    CLs = 0.0192308 +/- 0.0139799\n    CLs = 0.02008 +/- 0.0103371\n    CLs = 0.0271712 +/- 0.00999051\n    CLs = 0.0239524 +/- 0.00783634\n    CLs      = 0.0239524 +/- 0.00783634\n    CLb      = 0.208748 +/- 0.0181211\n    CLsplusb = 0.005 +/- 0.00157718\n\n  r = 2.00696 +/- 1.25\n    CLs = 0.0740741 +/- 0.0288829\n    CLs = 0.0730182 +/- 0.0200897\n    CLs = 0.0694474 +/- 0.0166468\n    CLs = 0.0640182 +/- 0.0131693\n    CLs = 0.0595 +/- 0.010864\n    CLs = 0.0650862 +/- 0.0105575\n    CLs = 0.0629286 +/- 0.00966301\n    CLs = 0.0634945 +/- 0.00914091\n    CLs = 0.060914 +/- 0.00852667\n    CLs = 0.06295 +/- 0.00830083\n    CLs = 0.0612758 +/- 0.00778181\n    CLs = 0.0608142 +/- 0.00747001\n    CLs = 0.0587169 +/- 0.00697039\n    CLs = 0.0591432 +/- 0.00678587\n    CLs = 0.0599683 +/- 0.00666966\n    CLs = 0.0574868 +/- 0.00630809\n    CLs = 0.0571451 +/- 0.00608177\n    CLs = 0.0553836 +/- 0.00585531\n    CLs = 0.0531612 +/- 0.0055234\n    CLs = 0.0516837 +/- 0.0052607\n    CLs = 0.0496776 +/- 0.00499783\n    CLs      = 0.0496776 +/- 0.00499783\n    CLb      = 0.216635 +/- 0.00801002\n    CLsplusb = 0.0107619 +/- 0.00100693\n\nTrying to move the interval edges closer\n  r = 1.00348 +/- 0\n    CLs = 0.191176 +/- 0.0459911\n    CLs      = 0.191176 +/- 0.0459911\n    CLb      = 0.272 +/- 0.0398011\n    CLsplusb = 0.052 +/- 0.00992935\n\n  r = 1.50522 +/- 0\n    CLs = 0.125 +/- 0.0444346\n    CLs = 0.09538 +/- 0.0248075\n    CLs = 0.107714 +/- 0.0226712\n    CLs = 0.103711 +/- 0.018789\n    CLs = 0.0845069 +/- 0.0142341\n    CLs = 0.0828468 +/- 0.0126789\n    CLs = 0.0879647 +/- 0.0122332\n    CLs      = 0.0879647 +/- 0.0122332\n    CLb      = 0.211124 +/- 0.0137494\n    CLsplusb = 0.0185714 +/- 0.00228201\n\n  r = 1.75609 +/- 0\n    CLs = 0.0703125 +/- 0.0255807\n    CLs = 0.0595593 +/- 0.0171995\n    CLs = 0.0555271 +/- 0.0137075\n    CLs = 0.0548727 +/- 0.0120557\n    CLs = 0.0527832 +/- 0.0103348\n    CLs = 0.0555828 +/- 0.00998248\n    CLs = 0.0567971 +/- 0.00923449\n    CLs = 0.0581822 +/- 0.00871417\n    CLs = 0.0588835 +/- 0.00836245\n    CLs = 0.0594035 +/- 0.00784761\n    CLs = 0.0590583 +/- 0.00752672\n    CLs = 0.0552067 +/- 0.00695542\n    CLs = 0.0560446 +/- 0.00679746\n    CLs = 0.0548083 +/- 0.0064351\n    CLs = 0.0566998 +/- 0.00627124\n    CLs = 0.0561576 +/- 0.00601888\n    CLs = 0.0551643 +/- 0.00576338\n    CLs = 0.0583584 +/- 0.00582854\n    CLs = 0.0585691 +/- 0.0057078\n    CLs = 0.0599114 +/- 0.00564585\n    CLs = 0.061987 +/- 0.00566905\n    CLs = 0.061836 +/- 0.00549856\n    CLs = 0.0616849 +/- 0.0053773\n    CLs = 0.0605352 +/- 0.00516844\n    CLs = 0.0602028 +/- 0.00502875\n    CLs = 0.058667 +/- 0.00486263\n    CLs      = 0.058667 +/- 0.00486263\n    CLb      = 0.222901 +/- 0.00727258\n    CLsplusb = 0.0130769 +/- 
0.000996375\n\n  r = 2.25348 +/- 0\n    CLs = 0.0192308 +/- 0.0139799\n    CLs = 0.0173103 +/- 0.00886481\n    CLs      = 0.0173103 +/- 0.00886481\n    CLb      = 0.231076 +/- 0.0266062\n    CLsplusb = 0.004 +/- 0.001996\n\n  r = 2.13022 +/- 0\n    CLs = 0.0441176 +/- 0.0190309\n    CLs = 0.0557778 +/- 0.01736\n    CLs = 0.0496461 +/- 0.0132776\n    CLs = 0.0479048 +/- 0.0114407\n    CLs = 0.0419333 +/- 0.00925719\n    CLs = 0.0367934 +/- 0.0077345\n    CLs = 0.0339814 +/- 0.00684844\n    CLs = 0.03438 +/- 0.0064704\n    CLs = 0.0337633 +/- 0.00597315\n    CLs = 0.0321262 +/- 0.00551608\n    CLs      = 0.0321262 +/- 0.00551608\n    CLb      = 0.230342 +/- 0.0118665\n    CLsplusb = 0.0074 +/- 0.00121204\n\n  r = 2.06859 +/- 0\n    CLs = 0.0357143 +/- 0.0217521\n    CLs = 0.0381957 +/- 0.0152597\n    CLs = 0.0368622 +/- 0.0117105\n    CLs = 0.0415097 +/- 0.0106676\n    CLs = 0.0442816 +/- 0.0100457\n    CLs = 0.0376644 +/- 0.00847235\n    CLs = 0.0395133 +/- 0.0080427\n    CLs = 0.0377625 +/- 0.00727262\n    CLs = 0.0364415 +/- 0.00667827\n    CLs = 0.0368015 +/- 0.00628517\n    CLs = 0.0357251 +/- 0.00586442\n    CLs = 0.0341604 +/- 0.00546373\n    CLs = 0.0361935 +/- 0.00549648\n    CLs = 0.0403254 +/- 0.00565172\n    CLs = 0.0408613 +/- 0.00554124\n    CLs = 0.0416682 +/- 0.00539651\n    CLs = 0.0432645 +/- 0.00538062\n    CLs = 0.0435229 +/- 0.00516945\n    CLs = 0.0427647 +/- 0.00501322\n    CLs = 0.0414894 +/- 0.00479711\n    CLs      = 0.0414894 +/- 0.00479711\n    CLb      = 0.202461 +/- 0.00800632\n    CLsplusb = 0.0084 +/- 0.000912658\n\n\n -- HybridNew, before fit --\nLimit: r < 2.00696 +/- 1.25 [1.50522, 2.13022]\nWarning in : Could not create the Migrad minimizer. Try using the minimizer Minuit\nFit to 5 points: 1.91034 +/- 0.0388334\n\n -- Hybrid New --\nLimit: r < 1.91034 +/- 0.0388334 @ 95% CL\nDone in 0.01 min (cpu), 4.09 min (real)\nFailed to delete temporary file roostats-Sprxsw.root: No such file or directory\n\n

\n\n

The result stored in the limit branch of the output tree will be the upper limit (and its error, stored in limitErr). The default behaviour will be, as above, to search for the upper limit on r. However, the values of \\(p_{\\mu}, p_{b}\\) and CLs can be calculated for a particular value r=X by specifying the option --singlePoint=X. In this case, the value stored in the branch limit will be the value of CLs (or \\(p_{\\mu}\\)) (see the FAQ section).

"},{"location":"part3/commonstatsmethods/#expected-limits","title":"Expected Limits","text":"

For simple models, we can run Combine interactively 5 times to compute the median expected limit and the 68% and 95% central interval boundaries. For this, we can use the HybridNew method with the same options as for the observed limit, but adding a --expectedFromGrid=<quantile>. Here, the quantile should be set to 0.5 for the median, 0.84 for the +ve side of the 68% band, 0.16 for the -ve side of the 68% band, 0.975 for the +ve side of the 95% band, and 0.025 for the -ve side of the 95% band.
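
For example, using the realistic-counting-experiment.txt datacard from above, the five quantiles could be computed in a simple loop such as the sketch below.

for q in 0.025 0.16 0.5 0.84 0.975; do\n    combine realistic-counting-experiment.txt -M HybridNew --LHCmode LHC-limits --expectedFromGrid=${q}\ndone\n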

\n

The output file will contain the value of the quantile in the branch quantileExpected. This branch can therefore be used to separate the points.

"},{"location":"part3/commonstatsmethods/#accuracy","title":"Accuracy","text":"

The search for the limit is performed using an adaptive algorithm, terminating when the uncertainty on the estimated limit value falls below the requested accuracy, or when the precision cannot be improved further with the specified options. The options controlling this behaviour are:

\n
    \n
  • rAbsAcc, rRelAcc: define the accuracy on the limit at which the search stops. The default values are 0.1 and 0.05 respectively, meaning that the search is stopped when \u0394r < 0.1 or \u0394r/r < 0.05.
  • \n
  • clsAcc: this determines the absolute accuracy to which the CLs values are computed when searching for the limit. The default is 0.5%. Requiring a better accuracy than this will significantly increase the time needed to run the algorithm, as you need N\u00b2 more toys to improve the accuracy by a factor of N. You can consider relaxing this value (making it larger) if you are computing limits at a lower CL (e.g. 90% or 68%). Note that if you are using the CLsplusb rule, this parameter will control the uncertainty on \\(p_{\\mu}\\) rather than CLs.
  • \n
  • T or toysH: controls the minimum number of toys that are generated for each point. The default value of 500 should be sufficient when computing the limit at 90-95% CL. You can decrease this number if you are computing limits at 68% CL, or increase it if you are using 99% CL.
  • \n
\n

Note, to further improve the accuracy when searching for the upper limit, Combine will also fit an exponential function to several of the points and interpolate to find the crossing.

"},{"location":"part3/commonstatsmethods/#complex-models","title":"Complex models","text":"

For complicated models, it is best to produce a grid of test statistic distributions at various values of the signal strength, and use it to compute the observed and expected limit and central intervals. This approach is convenient for complex models, since the grid of points can be distributed across any number of jobs. In this approach we will store the distributions of the test statistic at different values of the signal strength using the option --saveHybridResult. The distribution at a single value of r=X can be determined by

\n
combine datacard.txt -M HybridNew --LHCmode LHC-limits --singlePoint X --saveToys --saveHybridResult -T 500 --clsAcc 0\n
\n\n

Warning

\n

We have specified the accuracy here by including --clsAcc=0, which turns off adaptive sampling, and by fixing the number of toys to 500 with the -T N option. For complex models, it may be necessary to internally split the toys over a number of instances of HybridNew using the option --iterations I. The total number of toys will be the product I*N.

\n\n

The above can be repeated several times, in parallel, to build the distribution of the test statistic (passing the random seed option -s -1). Once all of the distributions have been calculated, the resulting output files can be merged into one using hadd, and read back to calculate the limit, specifying the merged file with --grid=merged.root.
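
A minimal sketch of such a grid production is shown below; the range and spacing of the r values, the -n labels, the number of toys and the merged file name are illustrative only, and the hadd wildcard assumes the default output file naming.

# one job per grid point (repeat each point with different random seeds to accumulate more toys)\nfor r in 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2; do\n    combine datacard.txt -M HybridNew --LHCmode LHC-limits --singlePoint ${r} --saveToys --saveHybridResult -T 500 --clsAcc 0 -s -1 -n .POINT.${r}\ndone\n# merge all points (and all repetitions) into a single grid file\nhadd merged.root higgsCombine.POINT.*.HybridNew.*.root\n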

\n

The observed limit can be obtained with

\n
combine datacard.txt -M HybridNew --LHCmode LHC-limits --readHybridResults --grid=merged.root\n
\n

and similarly, the median expected and quantiles can be determined using

\n
combine datacard.txt -M HybridNew --LHCmode LHC-limits --readHybridResults --grid=merged.root --expectedFromGrid <quantile>\n
\n

substituting <quantile> with 0.5 for the median, 0.84 for the +ve side of the 68% band, 0.16 for the -ve side of the 68% band, 0.975 for the +ve side of the 95% band, and 0.025 for the -ve side of the 95% band. You should note that Combine will update the grid to improve the accuracy on the extracted limit by default. If you want to avoid this, you can use the option --noUpdateGrid. This will mean only the toys/points you produced in the grid will be used to compute the limit.
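
For example, the median and the band boundaries can all be extracted from the merged grid with a short loop over the quantiles, as sketched below.

for q in 0.025 0.16 0.5 0.84 0.975; do\n    combine datacard.txt -M HybridNew --LHCmode LHC-limits --readHybridResults --grid=merged.root --expectedFromGrid ${q}\ndone\n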

\n\n

Warning

\n

Make sure that if you specified a particular mass value (-m or --mass) in the commands for calculating the toys, you also specify the same mass when reading in the grid of distributions.

\n\n

The splitting of the jobs can be left to the user's preference. However, users may wish to use the combineTool for automating this, as described in the section on combineTool for job submission

"},{"location":"part3/commonstatsmethods/#plotting","title":"Plotting","text":"

A plot of the CLs (or \\(p_{\\mu}\\)) as a function of r, which is used to find the crossing, can be produced using the option --plot=limit_scan.png. This can be useful for judging if the chosen grid was sufficient for determining the upper limit.

\n

If we use our realistic-counting-experiment.txt datacard and generate a grid of points \\(r \\in [1.4,2.2]\\) in steps of 0.1, with 5000 toys for each point, the plot of the observed CLs vs r should look like the following,

\n

\n

You should judge in each case whether the limit is accurate given the spacing of the points and the precision of CLs at each point. If it is not sufficient, simply generate more points closer to the limit and/or more toys at each point.

\n

The distributions of the test statistic can also be plotted, at each value in the grid, using

\n
python test/plotTestStatCLs.py --input mygrid.root --poi r --val all --mass MASS\n
\n

The resulting output file will contain a canvas showing the distribution of the test statistic for the background-only and signal+background hypotheses at each value of r. Use --help to see more options for this script.

\n\n

Info

\n

If you used the TEV or LEP style test statistic (using the commands as described above), then you should include the option --doublesided, which will also take care of defining the correct integrals for \\(p_{\\mu}\\) and \\(p_{b}\\). Click on the examples below to see what a typical output of this plotting tool will look like when using the LHC test statistic, or the TEV test statistic.

\n\n\nqLHC test stat example\n

\n\n\nqTEV test stat example\n

"},{"location":"part3/commonstatsmethods/#computing-significances-with-toys","title":"Computing Significances with toys","text":"

Computation of the expected significance with toys is a two-step procedure: first you need to run one or more jobs to construct the expected distribution of the test statistic. As for setting limits, there are a number of different possible configurations for generating toys. However, we will use the most commonly used option,

\n
    \n
  • LHC-style: --LHCmode LHC-significance\n, which is the shortcut for --testStat LHC --generateNuisances=0 --generateExternalMeasurements=1 --fitNuisances=1 --significance
      \n
    • The test statistic is defined using the ratio of likelihoods \\(q_{0} = -2\\ln[\\mathcal{L}(\\mu=0,\\hat{\\hat{\\nu}}(0))/\\mathcal{L}(\\hat{\\mu},\\hat{\\nu})]\\), in which the nuisance parameters are profiled separately for \\(\\mu=\\hat{\\mu}\\) and \\(\\mu=0\\).
    • \n
    • The value of the test statistic is set to 0 when \\(\\hat{\\mu}<0\\)
    • \n
    • For the purposes of toy generation, the nuisance parameters are fixed to their post-fit values from the data assuming no signal, while the constraint terms are randomized for the evaluation of the likelihood.
    • \n
    \n
  • \n
"},{"location":"part3/commonstatsmethods/#observed-significance","title":"Observed significance","text":"

To construct the distribution of the test statistic, the following command should be run as many times as necessary

\n
combine -M HybridNew datacard.txt --LHCmode LHC-significance  --saveToys --fullBToys --saveHybridResult -T toys -i iterations -s seed\n
\n

with different seeds, or using -s -1 for random seeds, then merge all those results into a single ROOT file with hadd. The toys can then be read back into combine using the option --toysFile=input.root --readHybridResult.
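
A sketch of this procedure is shown below; the numbers of jobs, toys and iterations are placeholders to be adapted to the analysis, and the hadd wildcard assumes the default output file naming.

# build the test statistic distribution in several jobs with random seeds\nfor i in $(seq 1 10); do\n    combine -M HybridNew datacard.txt --LHCmode LHC-significance --saveToys --fullBToys --saveHybridResult -T 500 -i 10 -s -1\ndone\n# merge the toys into a single file to be read back\nhadd input.root higgsCombineTest.HybridNew.*.root\n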

\n

The observed significance can be calculated as

\n
combine -M HybridNew datacard.txt --LHCmode LHC-significance --readHybridResult --toysFile=input.root [--pvalue ]\n
\n

where the option --pvalue will cause the result stored in the limit branch of the output tree to be the p-value instead of the significance.

"},{"location":"part3/commonstatsmethods/#expected-significance-assuming-some-signal","title":"Expected significance, assuming some signal","text":"

The expected significance, assuming a signal with r=X, can be calculated by including the option --expectSignal X when generating the distribution of the test statistic, and using the option --expectedFromGrid=0.5 when calculating the significance for the median. To get the \u00b11\u03c3 bands, use 0.16 and 0.84 instead of 0.5, and so on.
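
Putting this together with the commands above, a sketch of the median expected significance calculation is shown below; the injected signal strength of r=1 and the number of toys are only examples.

# generate the test statistic distribution with an injected signal of r=1\ncombine -M HybridNew datacard.txt --LHCmode LHC-significance --saveToys --fullBToys --saveHybridResult -T 500 -s -1 --expectSignal 1\n# compute the median expected significance from the merged toys\ncombine -M HybridNew datacard.txt --LHCmode LHC-significance --readHybridResult --toysFile=input.root --expectedFromGrid=0.5\n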

\n

The total number of background toys needs to be large enough to compute the value of the significance, but you need fewer signal toys (especially when you are only computing the median expected significance). For large significances, you can run most of the toys without the --fullBToys option, which will be about a factor 2 faster. Only a small part of the toys needs to be run with that option turned on.

\n

As with calculating limits with toys, these jobs can be submitted to the grid or batch systems with the help of the combineTool, as described in the section on combineTool for job submission

"},{"location":"part3/commonstatsmethods/#goodness-of-fit-tests","title":"Goodness of fit tests","text":"

The GoodnessOfFit method can be used to evaluate how compatible the observed data are with the model PDF.

\n

This method implements several algorithms, and will compute a goodness of fit indicator for the chosen algorithm and the data. The procedure is therefore to first run on the real data

\n
combine -M GoodnessOfFit datacard.txt --algo=<some-algo>\n
\n

and then to run on many toy MC data sets to determine the distribution of the goodness-of-fit indicator

\n
combine -M GoodnessOfFit datacard.txt --algo=<some-algo> -t <number-of-toys> -s <seed>\n
\n

When computing the goodness-of-fit, by default the signal strength is left floating in the fit, so that the measure is independent of the presence or absence of a signal. It is possible to fix the signal strength to some value by passing the option --fixedSignalStrength=<value>.
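
For example, to evaluate the goodness-of-fit under the background-only hypothesis, one could fix the signal strength to zero:

combine -M GoodnessOfFit datacard.txt --algo=<some-algo> --fixedSignalStrength=0\n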

\n

The following algorithms are implemented:

\n
    \n
  • \n

    saturated: Compute a goodness-of-fit measure for binned fits based on the saturated model, as prescribed by the Statistics Committee (note). This quantity is similar to a chi-square, but can be computed for an arbitrary combination of binned channels with arbitrary constraints.

    \n
  • \n
  • \n

    KS: Compute a goodness-of-fit measure for binned fits using the Kolmogorov-Smirnov test. It is based on the largest difference between the cumulative distribution function and the empirical distribution function of any bin.

    \n
  • \n
  • \n

    AD: Compute a goodness-of-fit measure for binned fits using the Anderson-Darling test. It is based on the integral of the difference between the cumulative distribution function and the empirical distribution function over all bins. It also gives the tail ends of the distribution a higher weighting.

    \n
  • \n
\n

The output tree will contain a branch called limit, which contains the value of the test statistic in each toy. You can make a histogram of this test statistic \\(t\\). From the distribution that is obtained in this way (\\(f(t)\\)) and the single value obtained by running on the observed data (\\(t_{0}\\)) you can calculate the p-value \\(p = \\int_{t=t_{0}}^{\\mathrm{+inf}} f(t) dt\\). Note: in rare cases the test statistic value for the toys can be undefined (for AS and KD). In this case we set the test statistic value to -1. When plotting the test statistic distribution, those toys should be excluded. This is automatically taken care of if you use the GoF collection script in CombineHarvester, which is described below.

\n

When generating toys, the default behavior will be used. See the section on toy generation for options that control how nuisance parameters are generated and fitted in these tests. It is recommended to use frequentist toys (--toysFreq) when running the saturated model, and the default toys for the other two tests.
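
For instance, a typical pair of commands for the saturated test (the observed value plus a toy distribution generated with frequentist toys; the number of toys is only an example) would be:

combine -M GoodnessOfFit datacard.txt --algo=saturated\ncombine -M GoodnessOfFit datacard.txt --algo=saturated -t 500 -s -1 --toysFreq\n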

\n

Further goodness-of-fit methods could be added on request, especially if volunteers are available to code them.\nThe output limit tree will contain the value of the test statistic in each toy (or the data)

\n\n

Warning

\n

The above algorithms are all concerned with one-sample tests. For two-sample tests, you can follow an example CMS HIN analysis described in this Twiki

"},{"location":"part3/commonstatsmethods/#masking-analysis-regions-in-the-saturated-model","title":"Masking analysis regions in the saturated model","text":"

For analyses that employ a simultaneous fit across signal and control regions, it may be useful to mask one or more analysis regions, either when the likelihood is maximized (fit) or when the test statistic is computed. This can be done by using the options --setParametersForFit and --setParametersForEval, respectively. The former will set parameters before each fit, while the latter is used to set parameters after each fit, but before the NLL is evaluated. Note, of course, that if the parameter in the list is floating, it will still be floating in each fit. Therefore, it will not affect the results when using --setParametersForFit.

\n

A realistic example for a binned shape analysis performed in one signal region and two control samples can be found in this directory of the Combine package Datacards-shape-analysis-multiple-regions.

\n

First of all, one needs to combine the individual datacards to build a single model and to introduce the channel masking variables, as follows:

\n
combineCards.py signal_region.txt dimuon_control_region.txt singlemuon_control_region.txt > combined_card.txt\ntext2workspace.py combined_card.txt --channel-masks\n
\n

More information about the channel masking can be found in this\nsection Channel Masking. The saturated test statistic value for a simultaneous fit across all the analysis regions can be calculated as:

\n
combine -M GoodnessOfFit -d combined_card.root --algo=saturated -n _result_sb\n
\n

In this case, signal and control regions are included both in the fit and in the evaluation of the test statistic, and the signal strength is freely floating. This measures the compatibility between the signal+background fit and the observed data. Moreover, it can be interesting to assess the level of compatibility between the observed data in all the regions and the background prediction obtained by only fitting the control regions (CR-only fit). This can be evaluated as follows:

\n
combine -M GoodnessOfFit -d combined_card.root --algo=saturated -n _result_bonly_CRonly --setParametersForFit mask_ch1=1 --setParametersForEval mask_ch1=0 --freezeParameters r --setParameters r=0\n
\n

where the signal strength is frozen and the signal region is not considered in the fit (--setParametersForFit mask_ch1=1), but it is included in the test statistic computation (--setParametersForEval mask_ch1=0). To show the differences between the two models being tested, one can perform a fit to the data using the FitDiagnostics method as:

\n
combine -M FitDiagnostics -d combined_card.root -n _fit_result --saveShapes --saveWithUncertainties\ncombine -M FitDiagnostics -d combined_card.root -n _fit_CRonly_result --saveShapes --saveWithUncertainties --setParameters mask_ch1=1\n
\n

By taking the total background, the total signal, and the data shapes from the FitDiagnostics output, we can compare the post-fit predictions from the S+B fit (first case) and the CR-only fit (second case) with the observation as reported below:

\n\nFitDiagnostics S+B fit\n

\n\n\nFitDiagnostics CR-only fit\n

\n\n

To compute a p-value for the two results, one needs to compare the observed goodness-of-fit value previously computed with the expected distribution of the test statistic obtained in toys:

\n
    combine -M GoodnessOfFit combined_card.root --algo=saturated -n result_toy_sb --toysFrequentist -t 500\n    combine -M GoodnessOfFit -d combined_card.root --algo=saturated -n _result_bonly_CRonly_toy --setParametersForFit mask_ch1=1 --setParametersForEval mask_ch1=0 --freezeParameters r --setParameters r=0,mask_ch1=1 -t 500 --toysFrequentist\n
\n

where the former gives the result for the S+B model, while the latter gives the test statistic for the CR-only fit. The option --setParameters r=0,mask_ch1=1 is needed to ensure that toys are thrown using the nuisance parameters estimated from the CR-only fit to the data. The comparison between the observation and the expected distribution should look like the following two plots:

\n\nGoodness-of-fit for S+B model\n

\n\n\nGoodness-of-fit for CR-only model\n

"},{"location":"part3/commonstatsmethods/#making-a-plot-of-the-gof-test-statistic-distribution","title":"Making a plot of the GoF test statistic distribution","text":"

If you have also checked out the combineTool, you can use this to run batch jobs or on the grid (see here) and produce a plot of the results. Once the jobs have completed, you can hadd them together and run (e.g for the saturated model),

\n
combineTool.py -M CollectGoodnessOfFit --input data_run.root toys_run.root -m 125.0 -o gof.json\nplotGof.py gof.json --statistic saturated --mass 125.0 -o gof_plot --title-right=\"my label\"\n
"},{"location":"part3/commonstatsmethods/#channel-compatibility","title":"Channel Compatibility","text":"

The ChannelCompatibilityCheck method can be used to evaluate how compatible the measurements of the signal strength from the separate channels of a combination are with each other.

\n

The method performs two fits of the data, first with the nominal model in which all channels are assumed to have the same signal strength modifier \\(r\\), and then another allowing separate signal strengths \\(r_{i}\\) in each channel. A chisquare-like quantity is computed as \\(-2\\ln\\left[\\mathcal{L}(\\mathrm{data}|r)/\\mathcal{L}(\\mathrm{data}|\\{r_{i}\\}_{i=1}^{N_{\\mathrm{chan}}})\\right]\\). Just like for the goodness-of-fit indicators, the expected distribution of this quantity under the nominal model can be computed from toy MC data sets.

\n

By default, the signal strength is kept floating in the fit with the nominal model. It can however be fixed to a given value by passing the option --fixedSignalStrength=<value>.

\n

In the default model built from the datacards, the signal strengths in all channels are constrained to be non-negative. One can allow negative signal strengths in the fits by changing the bound on the variable (option --rMin=<value>), which should make the quantity more chisquare-like under the hypothesis of zero signal; this however can create issues in channels with small backgrounds, since total expected yields and PDFs in each channel must be positive.

\n

Optionally, channels can be grouped together by using the option -g <name_fragment>, where <name_fragment> is a string which is common to all channels to be grouped together. The -g option can also be used to set the range for each POI separately via -g <name>=<min>,<max>.
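
For example, using the comb_hww.txt card from the output shown below, the 0-jet and 1-jet channels could each be grouped into a single signal strength by matching their common name fragments (a sketch; the fragments are taken from the channel names in that example):

combine -M ChannelCompatibilityCheck comb_hww.txt -m 160 -g hww_0j -g hww_1j\n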

\n

When run with a verbosity of 1, as is the default, the program also prints out the best fit signal strengths in all channels. As the fit to all channels is done simultaneously, the correlations between the systematic uncertainties are taken into account. Therefore, these results can differ from the ones obtained when fitting each channel separately.

\n

Below is an example output from Combine,

\n
$ combine -M ChannelCompatibilityCheck comb_hww.txt -m 160 -n HWW\n <<< Combine >>>\n>>> including systematics\n>>> method used to compute upper limit is ChannelCompatibilityCheck\n>>> random number generator seed is 123456\n\nSanity checks on the model: OK\nComputing limit starting from observation\n\n--- ChannelCompatibilityCheck ---\nNominal fit : r = 0.3431 -0.1408/+0.1636\nAlternate fit: r = 0.4010 -0.2173/+0.2724 in channel hww_0jsf_shape\nAlternate fit: r = 0.2359 -0.1854/+0.2297 in channel hww_0jof_shape\nAlternate fit: r = 0.7669 -0.4105/+0.5380 in channel hww_1jsf_shape\nAlternate fit: r = 0.3170 -0.3121/+0.3837 in channel hww_1jof_shape\nAlternate fit: r = 0.0000 -0.0000/+0.5129 in channel hww_2j_cut\nChi2-like compatibility variable: 2.16098\nDone in 0.08 min (cpu), 0.08 min (real)\n
\n

The output tree will contain the value of the compatibility (chi-square variable) in the limit branch. If the option --saveFitResult is specified, the output ROOT file also contains two RooFitResult objects fit_nominal and fit_alternate with the results of the two fits.

\n

This can be read and used to extract the best fit value for each channel, and the overall best fit value, using

\n
$ root -l\nTFile* _file0 = TFile::Open(\"higgsCombineTest.ChannelCompatibilityCheck.mH120.root\");\nfit_alternate->floatParsFinal().selectByName(\"*ChannelCompatibilityCheck*\")->Print(\"v\");\nfit_nominal->floatParsFinal().selectByName(\"r\")->Print(\"v\");\n
\n

The macro cccPlot.cxx can be used to produce a comparison plot of the best fit signal strengths from all channels.

"},{"location":"part3/commonstatsmethods/#likelihood-fits-and-scans","title":"Likelihood Fits and Scans","text":"

The MultiDimFit method can be used to perform multi-dimensional fits and likelihood-based scans/contours using models with several parameters of interest.

\n

Taking a toy datacard data/tutorials/multiDim/toy-hgg-125.txt (counting experiment which vaguely resembles an early H\u2192\u03b3\u03b3 analysis at 125 GeV), we need to convert the datacard into a workspace with 2 parameters, the ggH and qqH cross sections:

\n
text2workspace.py toy-hgg-125.txt -m 125 -P HiggsAnalysis.CombinedLimit.PhysicsModel:floatingXSHiggs --PO modes=ggH,qqH\n
\n

A number of different algorithms can be used with the option --algo <algo>,

\n
    \n
  • \n

    none (default): Perform a maximum likelihood fit combine -M MultiDimFit toy-hgg-125.root; The output ROOT tree will contain two columns, one for each parameter, with the fitted values.

    \n
  • \n
  • \n

    singles: Perform a fit of each parameter separately, treating the other parameters of interest as unconstrained nuisance parameters: combine -M MultiDimFit toy-hgg-125.root --algo singles --cl=0.68 . The output ROOT tree will contain two columns, one for each parameter, with the fitted values; there will be one row with the best fit point (and quantileExpected set to -1) and two rows for each fitted parameter, where the corresponding column will contain the maximum and minimum of that parameter in the 68% CL interval, according to a one-dimensional chi-square (i.e. uncertainties on each fitted parameter do not increase when adding other parameters if they are uncorrelated). Note that if you run, for example, with --cminDefaultMinimizerStrategy=0, these uncertainties will be derived from the Hessian, while --cminDefaultMinimizerStrategy=1 will invoke Minos to derive them.

    \n
  • \n
  • \n

    cross: Perform a joint fit of all parameters: combine -M MultiDimFit toy-hgg-125.root --algo=cross --cl=0.68. The output ROOT tree will have one row with the best fit point, and two rows for each parameter, corresponding to the minimum and maximum of that parameter on the likelihood contour corresponding to the specified CL, according to an N-dimensional chi-square (i.e. the uncertainties on each fitted parameter do increase when adding other parameters, even if they are uncorrelated). Note that this method does not produce 1D uncertainties on each parameter, and should not be taken as such.

    \n
  • \n
  • \n

    contour2d: Make a 68% CL contour \u00e0 la minos combine -M MultiDimFit toy-hgg-125.root --algo contour2d --points=20 --cl=0.68. The output will contain values corresponding to the best fit point (with quantileExpected set to -1) and for a set of points on the contour (with quantileExpected set to 1-CL, or something larger than that if the contour hits the boundary of the parameters). Probabilities are computed from the n-dimensional \\(\\chi^{2}\\) distribution. For slow models, this method can be split by running several times with a different number of points, and merging the outputs. The contourPlot.cxx macro can be used to make plots out of this algorithm.

    \n
  • \n
  • \n

    random: Scan N random points and compute the probability out of the profile likelihood ratio combine -M MultiDimFit toy-hgg-125.root --algo random --points=20 --cl=0.68. Again, the best fit will have quantileExpected set to -1, while each random point will have quantileExpected set to the probability given by the profile likelihood ratio at that point.

    \n
  • \n
  • \n

    fixed: Compare the log-likelihood at a fixed point to that at the best fit. combine -M MultiDimFit toy-hgg-125.root --algo fixed --fixedPointPOIs r=r_fixed,MH=MH_fixed. The output tree will contain the difference in the negative log-likelihood between the points (\\(\\hat{r},\\hat{m}_{H}\\)) and (\\(r_{\\mathrm{fixed}},m_{H,\\mathrm{fixed}}\\)) in the branch deltaNLL.

    \n
  • \n
  • \n

    grid: Scan a fixed grid of points with approximately N points in total. combine -M MultiDimFit toy-hgg-125.root --algo grid --points=10000.

    \n
      \n
    • You can partition the job in multiple tasks by using the options --firstPoint and --lastPoint. For complicated scans, the points can be split as described in the combineTool for job submission section. The output file will contain a column deltaNLL with the difference in negative log-likelihood with respect to the best fit point. Ranges/contours can be evaluated by filling TGraphs or TH2 histograms with these points.
    • \n
    • By default the \"min\" and \"max\" of the POI ranges are not included and the points that are in the scan are centred, e.g. combine -M MultiDimFit --algo grid --rMin 0 --rMax 5 --points 5 will scan at the points \\(r=0.5, 1.5, 2.5, 3.5, 4.5\\). You can include the option --alignEdges 1, which causes the points to be aligned with the end-points of the parameter ranges - e.g. combine -M MultiDimFit --algo grid --rMin 0 --rMax 5 --points 6 --alignEdges 1 will scan at the points \\(r=0, 1, 2, 3, 4, 5\\). Note that the number of points must be increased by 1 to ensure both end points are included.
    • \n
    \n
  • \n
\n

With the algorithms none and singles you can save the RooFitResult from the initial fit using the option --saveFitResult. The fit result is saved into a new file called multidimfit.root.

\n

As usual, any floating nuisance parameters will be profiled. This behaviour can be modified by using the --freezeParameters option.

\n

For most of the methods, for lower-precision results you can turn off the profiling of the nuisance parameters by using the option --fastScan, which for complex models speeds up the process by several orders of magnitude. All nuisance parameters will be kept fixed at the value corresponding to the best fit point.

\n

As an example, let's produce the \\(-2\\Delta\\ln{\\mathcal{L}}\\) scan as a function of r_ggH and r_qqH from the toy H\u2192\u03b3\u03b3 datacard, with the nuisance parameters fixed to their global best fit values.

\n
combine toy-hgg-125.root -M MultiDimFit --algo grid --points 2000 --setParameterRanges r_qqH=0,10:r_ggH=0,4 -m 125 --fastScan\n
\n\nShow output\n
\n <<< Combine >>>\n>>> including systematics\n>>> method used is MultiDimFit\n>>> random number generator seed is 123456\nModelConfig 'ModelConfig' defines more than one parameter of interest. This is not supported in some statistical methods.\nSet Range of Parameter r_qqH To : (0,10)\nSet Range of Parameter r_ggH To : (0,4)\nComputing results starting from observation (a-posteriori)\n POI: r_ggH= 0.88152 -> [0,4]\n POI: r_qqH= 4.68297 -> [0,10]\nPoint 0/2025, (i,j) = (0,0), r_ggH = 0.044444, r_qqH = 0.111111\nPoint 11/2025, (i,j) = (0,11), r_ggH = 0.044444, r_qqH = 2.555556\nPoint 22/2025, (i,j) = (0,22), r_ggH = 0.044444, r_qqH = 5.000000\nPoint 33/2025, (i,j) = (0,33), r_ggH = 0.044444, r_qqH = 7.444444\nPoint 55/2025, (i,j) = (1,10), r_ggH = 0.133333, r_qqH = 2.333333\nPoint 66/2025, (i,j) = (1,21), r_ggH = 0.133333, r_qqH = 4.777778\nPoint 77/2025, (i,j) = (1,32), r_ggH = 0.133333, r_qqH = 7.222222\nPoint 88/2025, (i,j) = (1,43), r_ggH = 0.133333, r_qqH = 9.666667\nPoint 99/2025, (i,j) = (2,9), r_ggH = 0.222222, r_qqH = 2.111111\nPoint 110/2025, (i,j) = (2,20), r_ggH = 0.222222, r_qqH = 4.555556\nPoint 121/2025, (i,j) = (2,31), r_ggH = 0.222222, r_qqH = 7.000000\nPoint 132/2025, (i,j) = (2,42), r_ggH = 0.222222, r_qqH = 9.444444\nPoint 143/2025, (i,j) = (3,8), r_ggH = 0.311111, r_qqH = 1.888889\nPoint 154/2025, (i,j) = (3,19), r_ggH = 0.311111, r_qqH = 4.333333\nPoint 165/2025, (i,j) = (3,30), r_ggH = 0.311111, r_qqH = 6.777778\nPoint 176/2025, (i,j) = (3,41), r_ggH = 0.311111, r_qqH = 9.222222\nPoint 187/2025, (i,j) = (4,7), r_ggH = 0.400000, r_qqH = 1.666667\nPoint 198/2025, (i,j) = (4,18), r_ggH = 0.400000, r_qqH = 4.111111\nPoint 209/2025, (i,j) = (4,29), r_ggH = 0.400000, r_qqH = 6.555556\nPoint 220/2025, (i,j) = (4,40), r_ggH = 0.400000, r_qqH = 9.000000\n[...]\n\nDone in 0.00 min (cpu), 0.02 min (real)\n
\n\n

The scan, along with the best fit point can be drawn using root,

\n
$ root -l higgsCombineTest.MultiDimFit.mH125.root\n\nlimit->Draw(\"2*deltaNLL:r_ggH:r_qqH>>h(44,0,10,44,0,4)\",\"2*deltaNLL<10\",\"prof colz\")\n\nlimit->Draw(\"r_ggH:r_qqH\",\"quantileExpected == -1\",\"P same\")\nTGraph *best_fit = (TGraph*)gROOT->FindObject(\"Graph\")\n\nbest_fit->SetMarkerSize(3); best_fit->SetMarkerStyle(34); best_fit->Draw(\"p same\")\n
\n

\n

To make the full profiled scan, just remove the --fastScan option from the Combine command.

\n

Similarly, 1D scans can be drawn directly from the tree. However, for 1D likelihood scans there is a python script from the CombineHarvester/CombineTools package, plot1DScan.py, that can be used to make plots and extract the crossings of 2*deltaNLL - e.g. the 1\u03c3/2\u03c3 boundaries.

"},{"location":"part3/commonstatsmethods/#useful-options-for-likelihood-scans","title":"Useful options for likelihood scans","text":"

A number of common, useful options (especially for computing likelihood scans with the grid algo) are,

\n
    \n
  • --autoBoundsPOIs arg: Adjust bounds for the POIs if they end up close to the boundary. This can be a comma-separated list of POIs, or \"*\" to get all of them.
  • \n
  • --autoMaxPOIs arg: Adjust maxima for the POIs if they end up close to the boundary. Can be a list of POIs, or \"*\" to get all.
  • \n
  • --autoRange X: Set to any X >= 0 to do the scan in the \\(\\hat{p}\\) \\(\\pm\\) X\u03c3 range, where \\(\\hat{p}\\) and \u03c3 are the best fit parameter value and uncertainty from the initial fit (so it may be fairly approximate). In case you do not trust the estimate of the error from the initial fit, you can just centre the range on the best fit value by using the option --centeredRange X to do the scan in the \\(\\hat{p}\\) \\(\\pm\\) X range centered on the best fit value.
  • \n
  • --squareDistPoiStep: POI step size based on distance from the midpoint (either (max-min)/2, or the best fit if used with --autoRange or --centeredRange) rather than linear separation.
  • \n
  • --skipInitialFit: Skip the initial fit (saves time if, for example, a snapshot is loaded from a previous fit)
  • \n
\n

Below is a comparison of a likelihood scan with 20 points, as a function of r_qqH, using our toy-hgg-125.root workspace with and without some of these options. The options added tell Combine to scan more points closer to the minimum (best-fit) than with the default.
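
A sketch of such a scan, combining a few of the options above, is shown below; the choice of POI, the number of points and the --autoRange multiplier are illustrative only.

combine toy-hgg-125.root -M MultiDimFit --algo grid --points 20 -P r_qqH --floatOtherPOIs 1 -m 125 --autoRange 2 --squareDistPoiStep\n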

\n

\n

You may find it useful to use the --robustFit=1 option to turn on robust (brute-force) fits for likelihood scans (and other algorithms). You can set the algorithm, strategy and tolerance used when the --robustFit option is enabled via the options --setRobustFitAlgo (default is Minuit2,migrad), --setRobustFitStrategy (default is 0) and --setRobustFitTolerance (default is 0.1). If these options are not set, the defaults (set using the cminDefaultMinimizerX options) will be used.

\n

If running --robustFit=1 with the algo singles, you can tune the accuracy of the routine used to find the crossing points of the likelihood using the option --setCrossingTolerance (the default is set to 0.0001).
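
For example, a robust fit with the singles algorithm and a tightened crossing tolerance might be configured as sketched below; the specific strategy and tolerance values are only illustrative.

combine -M MultiDimFit toy-hgg-125.root -m 125 --algo singles --robustFit 1 --setRobustFitStrategy 1 --setRobustFitTolerance 0.05 --setCrossingTolerance 0.00001\n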

\n

If you suspect your fits/uncertainties are not stable, you may also try to run a custom HESSE-style calculation of the covariance matrix. This is enabled by running MultiDimFit with the --robustHesse=1 option. A simple example of how this changes the default behaviour for a simple datacard is given here.

\n

For a full list of options use combine -M MultiDimFit --help

"},{"location":"part3/commonstatsmethods/#fitting-only-some-parameters","title":"Fitting only some parameters","text":"

If your model contains more than one parameter of interest, you can still decide to fit a smaller number of them, using the option --parameters (or -P), with a syntax like this:

\n
combine -M MultiDimFit [...] -P poi1 -P poi2 ... --floatOtherPOIs=(0|1)\n
\n

If --floatOtherPOIs is set to 0, the other parameters of interest (POIs), which are not included as a -P option, are kept fixed to their nominal values. If it's set to 1, they are kept floating, which has different consequences depending on algo:

\n
    \n
  • When running with --algo=singles, the other floating POIs are treated as unconstrained nuisance parameters.
  • \n
  • When running with --algo=cross or --algo=contour2d, the other floating POIs are treated as other POIs, and so they increase the number of dimensions of the chi-square.
  • \n
\n

As a result, when running with --floatOtherPOIs set to 1, the uncertainties on each fitted parameter do not depend on the selection of POIs passed to MultiDimFit, but only on the number of parameters of the model.

\n\n

Info

\n

Note that the POIs given to the option -P can also be any nuisance parameters. However, by default, the other nuisance parameters are left floating, so in general this does not need to be specified.

\n\n

You can save the values of the other parameters of interest in the output tree by passing the option --saveInactivePOI=1. You can additionally save the post-fit values of any nuisance parameter, function, or discrete index (RooCategory) defined in the workspace using the following options (see the example after this list):

\n
    \n
  • --saveSpecifiedNuis=arg1,arg2,... will store the fitted value of any specified constrained nuisance parameter. Use all to save every constrained nuisance parameter. Note that if you want to store the values of flatParams (or floating parameters that are not defined in the datacard) or rateParams, which are unconstrained, you should instead use the generic option --trackParameters as described here.
  • \n
  • --saveSpecifiedFunc=arg1,arg2,... will store the value of any function (eg RooFormulaVar) in the model.
  • \n
  • --saveSpecifiedIndex=arg1,arg2,... will store the index of any RooCategory object - eg a discrete nuisance.
  • \n
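As a rough illustration of how these saved values can be read back, the following pyROOT sketch loops over the limit tree of the MultiDimFit output. The file name higgsCombineTest.MultiDimFit.mH120.root and the saved branch lumi_8TeV are only assumptions for this example (they depend on your -n, -m and --saveSpecified* options); check the available branches with tree.Print().

import ROOT

# Example output file name; adjust it to match your -n and -m options
f = ROOT.TFile.Open("higgsCombineTest.MultiDimFit.mH120.root")
tree = f.Get("limit")

# Print the available branches to see which parameters were actually saved
tree.Print()

for entry in tree:
    # "r" is the POI here; "lumi_8TeV" is assumed to have been saved,
    # e.g. via --saveSpecifiedNuis=lumi_8TeV (use your own parameter names)
    print(entry.r, entry.lumi_8TeV)

f.Close()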
"},{"location":"part3/commonstatsmethods/#using-best-fit-snapshots","title":"Using best fit snapshots","text":"

This can be used to save time when performing scans so that the best fit does not need to be repeated. It can also be used to perform scans with some nuisance parameters frozen to their best-fit values. This can be done as follows,

\n
    \n
  • Create a workspace for a floating \\(r,m_{H}\\) fit
  • \n
\n
text2workspace.py hgg_datacard_mva_8TeV_bernsteins.txt -m 125 -P HiggsAnalysis.CombinedLimit.PhysicsModel:floatingHiggsMass --PO higgsMassRange=120,130 -o testmass.root\n
\n
    \n
  • Perform the fit, saving the workspace
  • \n
\n
combine -m 123 -M MultiDimFit --saveWorkspace -n teststep1 testmass.root  --verbose 9\n
\n

Now we can load the best fit \\(\\hat{r},\\hat{m}_{H}\\) and fit for \\(r\\) freezing \\(m_{H}\\) and lumi_8TeV to their best-fit values,

\n
combine -m 123 -M MultiDimFit -d higgsCombineteststep1.MultiDimFit.mH123.root -w w --snapshotName \"MultiDimFit\" -n teststep2  --verbose 9 --freezeParameters MH,lumi_8TeV\n
"},{"location":"part3/commonstatsmethods/#feldman-cousins","title":"Feldman-Cousins","text":"

The Feldman-Cousins (FC) procedure for computing confidence intervals for a generic model is,

\n
    \n
  • use the profile likelihood ratio as the test statistic, \\(q(x) = - 2 \\ln \\mathcal{L}(x,\\hat{\\hat{\\nu}}(x))/\\mathcal{L}(\\hat{x},\\hat{\\nu})\\) where \\(x\\) is a point in the (N-dimensional) parameter space, and \\(\\hat{x}\\) is the point corresponding to the best fit. In this test statistic, the nuisance parameters are profiled, both in the numerator and denominator.
  • \n
  • for each point \\(x\\):
      \n
    • compute the observed test statistic \\(q_{\\mathrm{obs}}(x)\\)
    • \n
    • compute the expected distribution of \\(q(x)\\) under the hypothesis of \\(x\\) as the true value.
    • \n
    • accept the point in the region if \\(p_{x}=P\\left[q(x) > q_{\\mathrm{obs}}(x)| x\\right] > \\alpha\\)
    • \n
    \n
  • \n
\n

Here \(\alpha\) is the chosen critical value, corresponding to a confidence level of \(1-\alpha\).

\n

In Combine, you can perform this test on each individual point (param1, param2,...) = (value1,value2,...) by doing,

\n
combine workspace.root -M HybridNew --LHCmode LHC-feldman-cousins --clsAcc 0 --singlePoint  param1=value1,param2=value2,param3=value3,... --saveHybridResult [Other options for toys, iterations etc as with limits]\n
\n

The point belongs to your confidence region if \\(p_{x}\\) is larger than \\(\\alpha\\) (e.g. 0.3173 for a 1\u03c3 region, \\(1-\\alpha=0.6827\\)).
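As a minimal numerical sketch of this acceptance criterion (outside of Combine itself), given the observed test statistic and a set of toy test-statistic values for a point \(x\), the p-value and the accept/reject decision can be computed as follows; the toy values below are made up for illustration.

# Toy values of q(x) generated under the hypothesis x (illustrative numbers)
q_toys = [0.1, 0.4, 0.9, 1.3, 2.2, 3.5, 4.1, 5.0]
q_obs = 1.0        # observed test statistic q_obs(x)
alpha = 0.3173     # e.g. a 1 sigma region, 1 - alpha = 0.6827

# p_x is the fraction of toys with q(x) > q_obs(x)
p_x = sum(q > q_obs for q in q_toys) / len(q_toys)

# The point x is inside the confidence region if p_x > alpha
print(p_x, "accepted" if p_x > alpha else "rejected")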

\n\n

Warning

\n

You should not use this method without the option --singlePoint. Although Combine will not complain, the algorithm to find the crossing will only find a single crossing and therefore not find the correct interval. Instead you should calculate the Feldman-Cousins intervals as described above.

"},{"location":"part3/commonstatsmethods/#physical-boundaries","title":"Physical boundaries","text":"

Imposing physical boundaries (such as requiring \\(\\mu>0\\) for a signal strength) is achieved by setting the ranges of the physics model parameters using

\n
--setParameterRanges param1=param1_min,param1_max:param2=param2_min,param2_max ....\n
\n

The boundary is imposed by restricting the parameter range(s) to those set by the user, in the fits. Note that this is a trick! The actual fitted value, as one of an ensemble of outcomes, can fall outside of the allowed region, while the boundary should be imposed on the physical parameter. The effect of restricting the parameter value in the fit is such that the test statistic is modified as follows:

\n

\\(q(x) = - 2 \\ln \\mathcal{L}(x,\\hat{\\hat{\\theta}}(x))/\\mathcal{L}(\\hat{x},\\hat{\\nu})\\), if \\(\\hat{x}\\) in contained in the bounded range

\n

and,

\n

\\(q(x) = - 2 \\ln \\mathcal{L}(x,\\hat{\\hat{\\nu}}(x))/\\mathcal{L}(x_{B},\\hat{\\hat{\\nu}}(x_{B}))\\), if \\(\\hat{x}\\) is outside of the bounded range. Here \\(x_{B}\\) and \\(\\hat{\\hat{\\nu}}(x_{B})\\) are the values of \\(x\\) and \\(\\nu\\) which maximise the likelihood excluding values outside of the bounded region for \\(x\\) - typically, \\(x_{B}\\) will be found at one of the boundaries which is imposed. For example, if the boundary \\(x>0\\) is imposed, you will typically expect \\(x_{B}=0\\), when \\(\\hat{x}\\leq 0\\), and \\(x_{B}=\\hat{x}\\) otherewise.

\n

This can sometimes be an issue as Minuit may not know if it has successfully converged when the minimum lies outside of that range. If there is no upper/lower boundary, just set that value to something far from the region of interest.

\n\n

Info

\n

One can also imagine imposing the boundaries by first allowing Minuit to find the minimum in the unrestricted region and then setting the test statistic to that in the case that minimum lies outside the physical boundary. This would avoid potential issues of convergence. If you are interested in implementing this version in Combine, please contact the development team.

"},{"location":"part3/commonstatsmethods/#extracting-contours-from-results-files","title":"Extracting contours from results files","text":"

As in general for HybridNew, you can split the task into multiple tasks (grid and/or batch) and then merge the outputs with hadd. You can also refer to the combineTool for job submission section for submitting the jobs to the grid/batch or, if you have more than one parameter of interest, see the instructions for running HybridNew on a grid of parameter points in the CombineHarvester - HybridNewGrid documentation.

"},{"location":"part3/commonstatsmethods/#extracting-1d-intervals","title":"Extracting 1D intervals","text":"

For one-dimensional models only, and if the parameter behaves like a cross section, the code is able to interpolate and determine the values of your parameter on the contour (just like it does for the limits). As with limits, read in the grid of points and extract 1D intervals using,

\n
combine workspace.root -M HybridNew --LHCmode LHC-feldman-cousins --readHybridResults --grid=mergedfile.root --cl <1-alpha>\n
\n

The output tree will contain the values of the POI at which \(p_{x}\) crosses the critical value (\(\alpha\)) - i.e. the boundaries of the confidence intervals.

\n

You can produce a plot of the value of \\(p_{x}\\) vs the parameter of interest \\(x\\) by adding the option --plot <plotname>.
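As a rough sketch of how to read these boundaries back, the pyROOT snippet below loops over the output tree. The file name is just the default naming convention for the command above and may differ if you pass -n or -m.

import ROOT

# Example output file name from the interval-extraction step above
f = ROOT.TFile.Open("higgsCombineTest.HybridNew.mH120.root")
tree = f.Get("limit")

# Each entry holds a POI value at which p_x crosses the critical value
for entry in tree:
    print("POI boundary:", entry.limit, " quantileExpected:", entry.quantileExpected)

f.Close()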

"},{"location":"part3/commonstatsmethods/#extracting-2d-contours","title":"Extracting 2D contours","text":"

There is a tool for extracting 2D contours from the output of HybridNew located in test/makeFCcontour.py. This can be used provided the option --saveHybridResult was included when running HybridNew. It can be run with the usual Combine output files (or several of them) as input,

\n
./test/makeFCcontour.py  toysfile1.root toysfile2.root .... [options] -out outputfile.root\n
\n

To extract 2D contours, the names of each parameter must be given, using --xvar poi_x --yvar poi_y. The output will be a ROOT file containing a 2D histogram of the value of \(p_{x,y}\) for each point \((x,y)\), which can be used to draw 2D contours. There will also be a histogram containing the number of toys found for each point.

\n

There are several options for reducing the running time, such as setting limits on the region of interest or the minimum number of toys required for a point to be included. Finally, adding the option --storeToys in this script will add histograms for each point to the output file of the test statistic distribution. This will increase the memory usage, as all of the toys will be kept in memory.
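A short pyROOT sketch for inspecting the makeFCcontour.py output is given below. The histogram name used here is a placeholder assumption: list the contents of the file first to find the actual names of the p-value and number-of-toys histograms.

import ROOT

f = ROOT.TFile.Open("outputfile.root")
f.ls()  # print the names of the histograms stored by makeFCcontour.py

# Placeholder name: replace with the 2D p-value histogram printed by ls()
h_pval = f.Get("h_confidence")

# Draw contours of the p-value map, e.g. at p = 0.05 (95% CL) and p = 0.32 (68% CL)
h_pval.SetContour(2)
h_pval.SetContourLevel(0, 0.05)
h_pval.SetContourLevel(1, 0.32)
h_pval.Draw("CONT1")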

"},{"location":"part3/debugging/","title":"Debugging fits","text":"

When a fit fails there are several things you can do to investigate. CMS users can have a look at these slides from a previous Combine tutorial. This section contains a few pointers for some of the methods mentioned in the slides.

"},{"location":"part3/debugging/#analyzing-the-nll-shape-in-each-parameter","title":"Analyzing the NLL shape in each parameter","text":"

The FastScan mode of combineTool.py can be used to analyze the shape of the NLL as a function of each parameter in the fit model. The NLL is evaluated varying a single parameter at a time, the other parameters stay at the default values they have in the workspace. This produces a file with the NLL, plus its first and second derivatives, as a function of each parameter. Discontinuities in the derivatives, particularly if they are close to the minimum of the parameter, can be the source of issues with the fit.

The usage is as follows:

combineTool.py -M FastScan -w workspace.root:w

Note that this will make use of the data in the workspace for evaluating the NLL. To run this on an Asimov dataset, with r=1 injected, you can do the following:

combine -M GenerateOnly workspace.root -t -1 --saveToys --setParameters r=1\n\ncombineTool.py -M FastScan -w workspace.root:w -d higgsCombineTest.GenerateOnly.mH120.123456.root:toys/toy_asimov\n

higgsCombineTest.GenerateOnly.mH120.123456.root is generated by the first command; if you pass a value for -m or change the default output file name with -n the file name will be different and you should change the combineTool call accordingly.

"},{"location":"part3/nonstandard/","title":"Advanced Use Cases","text":"

This section will cover some of the more specific use cases for Combine that are not necessarily related to the main results of the analysis.

"},{"location":"part3/nonstandard/#fit-diagnostics","title":"Fit Diagnostics","text":"

If you want to diagnose your limits/fit results, you may first want to look at the HIG PAG standard checks, which are applied to all datacards and can be found here.

If you have already found the Higgs boson but it's an exotic one, instead of computing a limit or significance you might want to extract its cross section by performing a maximum-likelihood fit. Alternatively, you might want to know how compatible your data and your model are, e.g. how strongly your nuisance parameters are constrained, to what extent they are correlated, etc. These general diagnostic tools are contained in the method FitDiagnostics.

    combine -M FitDiagnostics datacard.txt\n

The program will print out the result of two fits. The first one is performed with the signal strength r (or the first POI in the list, in models with multiple POIs) set to zero and a second with floating r. The output ROOT tree will contain the best fit value for r and its uncertainty. You will also get a fitDiagnostics.root file containing the following objects:

  • nuisances_prefit: RooArgSet containing the pre-fit values of the nuisance parameters, and their uncertainties from the external constraint terms only
  • fit_b: RooFitResult object containing the outcome of the fit of the data with signal strength set to zero
  • fit_s: RooFitResult object containing the outcome of the fit of the data with floating signal strength
  • tree_prefit: TTree of pre-fit nuisance parameter values and constraint terms (_In)
  • tree_fit_sb: TTree of fitted nuisance parameter values and constraint terms (_In) with floating signal strength
  • tree_fit_b: TTree of fitted nuisance parameter values and constraint terms (_In) with signal strength set to 0
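A minimal pyROOT sketch for inspecting these objects is shown below; it assumes the default output name fitDiagnostics.root (i.e. no --name option).

import ROOT

f = ROOT.TFile.Open("fitDiagnostics.root")

# Print the full result of the fit with floating signal strength
fit_s = f.Get("fit_s")
fit_s.Print()

# Best fit r and its asymmetric uncertainties from the s+b tree
tree = f.Get("tree_fit_sb")
tree.GetEntry(0)
print("r =", tree.r, "+", tree.rHiErr, "-", tree.rLoErr)

f.Close()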

by including the option --plots, you will additionally find the following contained in the ROOT file:

  • covariance_fit_s: TH2D covariance matrix of the parameters in the fit with floating signal strength
  • covariance_fit_b: TH2D covariance matrix of the parameters in the fit with signal strength set to zero
  • category_variable_prefit: RooPlot of the pre-fit PDFs/templates with the data (or toy if running with -t) overlaid
  • category_variable_fit_b: RooPlot of the PDFs/templates from the background-only fit with the data (or toy if running with -t) overlaid
  • category_variable_fit_s: RooPlot of the PDFs/templates from the signal+background fit with the data (or toy if running with -t) overlaid

There will be one RooPlot object per category in the likelihood, and one per variable if using a multi-dimensional dataset. For each of these additional objects a png file will also be produced.

Info

If you use the option --name, this additional name will be inserted into the file name for this output file.

As well as the values of the constrained nuisance parameters (and their constraints), you will also find branches for the number of "bad" nll calls (which you should check is not too large) and the status of the fit, fit_status. The fit status is computed as follows:

fit_status = 100 * hesse_status + 10 * minos_status +  minuit_summary_status\n

The minuit_summary_status is the usual status from Minuit, details of which can be found here. For the other status values, check these documentation links for the hesse_status and the minos_status.

A fit status of -1 indicates that the fit failed (Minuit summary was not 0 or 1) and hence the fit result is not valid.
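Given the convention above, the individual status codes can be unpacked from fit_status with simple integer arithmetic, for example:

def decode_fit_status(fit_status):
    """Split fit_status = 100*hesse + 10*minos + minuit_summary into its parts."""
    if fit_status < 0:
        # e.g. -1 means the fit failed and the result is not valid
        return None
    hesse_status = fit_status // 100
    minos_status = (fit_status // 10) % 10
    minuit_summary_status = fit_status % 10
    return hesse_status, minos_status, minuit_summary_status

print(decode_fit_status(110))  # prints (1, 1, 0)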

"},{"location":"part3/nonstandard/#fit-options","title":"Fit options","text":"
  • If you only want to run the signal+background fit, and do not need the output file, you can run with --justFit. In case you would like to run only the signal+background fit but would like to produce the output file, you should use the option --skipBOnlyFit instead.
  • You can use --rMin and --rMax to set the range of the first POI; a range that is not too large compared with the uncertainties you expect from the fit usually gives more stable and accurate results.
  • By default, the uncertainties are computed using MINOS for the first POI and HESSE for all other parameters. For the nuisance parameters the uncertainties will therefore be symmetric. You can run MINOS for all parameters using the option --minos all, or for none of the parameters using --minos none. Note that running MINOS is slower so you should only consider using it if you think the HESSE uncertainties are not accurate.
  • If MINOS or HESSE fails to converge, you can try running with --robustFit=1. This will do a slower, but more robust, likelihood scan, which can be further controlled with the parameter --stepSize (the default value is 0.1, and is relative to the range of the parameter).
  • The strategy and tolerance when using the --robustFit option can be set using the options --setRobustFitAlgo (default is Minuit2,migrad), --setRobustFitStrategy (default is 0) and --setRobustFitTolerance (default is 0.1). If these options are not set, the defaults (set using cminDefaultMinimizerX options) will be used. You can also tune the accuracy of the routine used to find the crossing points of the likelihood using the option --setCrossingTolerance (the default is set to 0.0001).
  • If you find the covariance matrix provided by HESSE is not accurate (i.e. fit_s->Print() reports this was forced positive-definite) then a custom HESSE-style calculation of the covariance matrix can be used instead. This is enabled by running FitDiagnostics with the --robustHesse 1 option. Please note that the status reported by RooFitResult::Print() will contain covariance matrix quality: Unknown, matrix was externally provided when robustHesse is used, this is normal and does not indicate a problem. NB: one feature of the robustHesse algorithm is that if it still cannot calculate a positive-definite covariance matrix it will try to do so by dropping parameters from the hessian matrix before inverting. If this happens it will be reported in the output to the screen.
  • For other fitting options see the generic minimizer options section.
"},{"location":"part3/nonstandard/#fit-parameter-uncertainties","title":"Fit parameter uncertainties","text":"

If you get a warning message when running FitDiagnostics that says Unable to determine uncertainties on all fit parameters, this means the covariance matrix calculated in FitDiagnostics was not correct.

The most common problem is that the covariance matrix is forced positive-definite. In this case the constraints on fit parameters as taken from the covariance matrix are incorrect and should not be used. In particular, if you want to make post-fit plots of the distribution used in the signal extraction fit and are extracting the uncertainties on the signal and background expectations from the covariance matrix, the resulting values will not reflect the truth if the covariance matrix was incorrect. By default if this happens and you passed the --saveWithUncertainties flag when calling FitDiagnostics, this option will be ignored as calculating the uncertainties would lead to incorrect results. This behaviour can be overridden by passing --ignoreCovWarning.

Such problems with the covariance matrix can be caused by a number of things, for example:

  • Parameters being close to their boundaries after the fit.

  • Strong (anti-) correlations between some parameters.

  • A discontinuity in the NLL function or its derivatives at or near the minimum.

If you are aware that your analysis has any of these features you could try resolving these. Setting --cminDefaultMinimizerStrategy 0 can also help with this problem.

"},{"location":"part3/nonstandard/#pre-and-post-fit-nuisance-parameters","title":"Pre- and post-fit nuisance parameters","text":"

It is possible to compare pre-fit and post-fit nuisance parameter values with the script diffNuisances.py. Taking as input a fitDiagnostics.root file, the script will by default print out the parameters that have changed significantly with respect to their initial estimate. For each of those parameters, it will print out the shift in value and the post-fit uncertainty, both normalized to the initial (pre-fit) value. The linear correlation between the parameter and the signal strength will also be printed.

python diffNuisances.py fitDiagnostics.root\n

The script has several options to toggle the thresholds used to decide whether a parameter has changed significantly, to get the printout of the absolute value of the nuisance parameters, and to get the output in another format for use on a webpage or in a note (the supported formats are html, latex, twiki). To print all of the parameters, use the option --all.

By default, the changes in the nuisance parameter values and uncertainties are given relative to their initial (pre-fit) values (for most nuisance parameter types the initial value is 0 and the initial uncertainty is 1).

The values in the output will be \\((\\nu-\\nu_{I})/\\sigma_{I}\\) if the nuisance has a pre-fit uncertainty, otherwise they will be \\(\\nu-\\nu_{I}\\) (for example, a flatParam has no pre-fit uncertainty).

The reported uncertainty will be the ratio \\(\\sigma/\\sigma_{I}\\) - i.e the ratio of the post-fit to the pre-fit uncertainty. If there is no pre-fit uncertainty (as for flatParam nuisances), the post-fit uncertainty is shown.

To print the pre-fit and post-fit values and (asymmetric) uncertainties, rather than the ratios, the option --abs can be used.

Info

We recommend that you include the options --abs and --all to get the full information on all of the parameters (including unconstrained nuisance parameters) at least once when checking your datacards.

If instead of the nuisance parameter values, you wish to report the pulls, you can do so using the option --pullDef X, with X being one of the options listed below (a small numerical sketch of two of these definitions is given after the list). You should note that since the pulls below are only defined when the pre-fit uncertainty exists, nothing will be reported for parameters that have no prior constraint (except in the case of the unconstPullAsym choice as described below). You may want to run without this option and --all to get information about those parameters.

  • relDiffAsymErrs: This is the same as the default output of the tool, except that only constrained parameters (i.e. where the pre-fit uncertainty is defined) are reported. The uncertainty is also reported and calculated as \\(\\sigma/\\sigma_{I}\\).

  • unconstPullAsym: Report the pull as \\(\\frac{\\nu-\\nu_{I}}{\\sigma}\\), where \\(\\nu_{I}\\) and \\(\\sigma\\) are the initial value and post-fit uncertainty of that nuisance parameter. The pull defined in this way will have no error bar, but all nuisance parameters will have a result in this case.

  • compatAsym: The pull is defined as \\(\\frac{\\nu-\\nu_{D}}{\\sqrt{\\sigma^{2}+\\sigma_{D}^{2}}}\\), where \\(\\nu_{D}\\) and \\(\\sigma_{D}\\) are calculated as \\(\\sigma_{D} = (\\frac{1}{\\sigma^{2}} - \\frac{1}{\\sigma_{I}^{2}})^{-1}\\) and \\(\\nu_{D} = \\sigma_{D}(\\nu - \\frac{\\nu_{I}}{\\sigma_{I}^{2}})\\). In this expression \\(\\nu_{I}\\) and \\(\\sigma_{I}\\) are the initial value and uncertainty of that nuisance parameter. This can be thought of as a compatibility between the initial measurement (prior) and an imagined measurement where only the data (with no constraint on the nuisance parameter) is used to measure the nuisance parameter. There is no error bar associated with this value.

  • diffPullAsym: The pull is defined as \\(\\frac{\\nu-\\nu_{I}}{\\sqrt{\\sigma_{I}^{2}-\\sigma^{2}}}\\), where \\(\\nu_{I}\\) and \\(\\sigma_{I}\\) are the pre-fit value and uncertainty (from L. Demortier and L. Lyons). If the denominator is close to 0 or the post-fit uncertainty is larger than the pre-fit (usually due to some failure in the calculation), the pull is not defined and the result will be reported as 0 +/- 999.
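As a purely numerical illustration of two of these definitions (the script computes them for you, so this is only to make the formulas concrete), using made-up pre-fit and post-fit numbers:

import math

nu_I, sigma_I = 0.0, 1.0   # pre-fit (initial) value and uncertainty
nu, sigma = 0.45, 0.8      # post-fit value and uncertainty (example numbers)

# unconstPullAsym: (nu - nu_I) / sigma, using the post-fit uncertainty
pull_unconst = (nu - nu_I) / sigma

# diffPullAsym: (nu - nu_I) / sqrt(sigma_I^2 - sigma^2); undefined if sigma >= sigma_I
denom2 = sigma_I**2 - sigma**2
pull_diff = (nu - nu_I) / math.sqrt(denom2) if denom2 > 0 else None

print(pull_unconst, pull_diff)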

If using --pullDef, the results for all parameters for which the pull can be calculated will be shown (i.e --all will be set to true), not just those that have moved by some metric.

This script has the option (-g outputfile.root) to produce plots of the fitted values of the nuisance parameters and their post-fit, asymmetric uncertainties. Alternatively, the pulls defined using one of the options above can be plotted using the option --pullDef X. In addition, this will produce a plot showing a comparison between the post-fit and pre-fit (symmetrized) uncertainties on the nuisance parameters.

Info

In the above options, if an asymmetric uncertainty is associated with the nuisance parameter, then the choice of which uncertainty is used in the definition of the pull will depend on the sign of \\(\\nu-\\nu_{I}\\).

"},{"location":"part3/nonstandard/#normalizations","title":"Normalizations","text":"

For a certain class of models, like those made from datacards for shape-based analysis, the tool can also compute and save the best fit yields of all processes to the output ROOT file. If this feature is turned on with the option --saveNormalizations, the file will also contain three RooArgSet objects norm_prefit, norm_fit_s, and norm_fit_b. These each contain one RooConstVar for each channel xxx and process yyy with name xxx/yyy and value equal to the best fit yield. You can use RooRealVar::getVal and RooRealVar::getError to estimate both the post-fit (or pre-fit) values and uncertainties of these normalizations.
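For example, a short pyROOT sketch to print the post-fit yields from the signal+background fit could look like the following (it assumes the default fitDiagnostics.root file name; getError() can be called in the same way to retrieve the uncertainties, as described above):

import ROOT

f = ROOT.TFile.Open("fitDiagnostics.root")
norms = f.Get("norm_fit_s")  # RooArgSet of best fit yields from the s+b fit

# Loop over the entries, named "channel/process", and print their values
it = norms.createIterator()
var = it.Next()
while var:
    print(var.GetName(), var.getVal())
    var = it.Next()

f.Close()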

The sample pyROOT macro mlfitNormsToText.py can be used to convert the ROOT file into a text table with four columns: channel, process, yield from the signal+background fit, and yield from the background-only fit. To include the uncertainties in the table, add the option --uncertainties.

Warning

Note that when running with multiple toys, the norm_fit_s, norm_fit_b, and norm_prefit objects will be stored for the last toy dataset generated and so may not be useful to you.

Note that this procedure works only for \"extended likelihoods\" like the ones used in shape-based analysis, not for counting experiment datacards. You can however convert a counting experiment datacard to an equivalent shape-based one by adding a line shapes * * FAKE in the datacard after the imax, jmax, kmax lines. Alternatively, you can use combineCards.py countingcard.txt -S > shapecard.txt to do this conversion.

"},{"location":"part3/nonstandard/#per-bin-norms-for-shape-analyses","title":"Per-bin norms for shape analyses","text":"

If you have a shape-based analysis, you can include the option --savePredictionsPerToy. With this option, additional branches will be filled in the three output trees contained in fitDiagnostics.root.

The normalization values for each toy will be stored in the branches inside the TTrees named n_exp[_final]_binxxx_proc_yyy. The _final will only be there if there are systematic uncertainties affecting this process.

Additionally, there will be branches that provide the value of the expected bin content for each process, in each channel. These are named n_exp[_final]_binxxx_proc_yyy_i (where _final will only be in the name if there are systematic uncertainties affecting this process) for channel xxx, process yyy, bin number i. In the case of the post-fit trees (tree_fit_s/b), these will be the expectations from the fitted models, while for the pre-fit tree, they will be the expectation from the generated model (i.e if running toys with -t N and using --genNuisances, they will be randomized for each toy). These can be useful, for example, for calculating correlations/covariances between different bins, in different channels or processes, within the model from toys.
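As an illustration, the covariance between two of these bin expectations across toys can be estimated directly from the post-fit tree. The branch names below are hypothetical placeholders; replace them with the actual channel, process and bin names printed by tree.Print().

import ROOT

f = ROOT.TFile.Open("fitDiagnostics.root")
tree = f.Get("tree_fit_sb")

# Hypothetical branch names: adjust to your own channel/process/bin
b1 = "n_exp_final_binsignal_proc_background_1"
b2 = "n_exp_final_binsignal_proc_background_2"

vals1, vals2 = [], []
for entry in tree:
    vals1.append(getattr(entry, b1))
    vals2.append(getattr(entry, b2))

n = len(vals1)
m1, m2 = sum(vals1) / n, sum(vals2) / n
cov = sum((a - m1) * (b - m2) for a, b in zip(vals1, vals2)) / n
print("covariance over toys:", cov)

f.Close()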

Info

Be aware that for unbinned models, a binning scheme is adopted based on the RooRealVar::getBinning for the observable defining the shape, if it exists, or Combine will adopt some appropriate binning for each observable.

"},{"location":"part3/nonstandard/#plotting","title":"Plotting","text":"

FitDiagnostics can also produce pre- and post-fit plots of the model along with the data. They will be stored in the same directory as fitDiagnostics.root. To obtain these, you have to specify the option --plots, and then optionally specify the names of the signal and background PDFs/templates, e.g. --signalPdfNames='ggH*,vbfH*' and --backgroundPdfNames='*DY*,*WW*,*Top*' (by default, the definitions of signal and background are taken from the datacard). For models with more than 1 observable, a separate projection onto each observable will be produced.

An alternative is to use the option --saveShapes. This will add additional folders in fitDiagnostics.root for each category, with pre- and post-fit distributions of the signals and backgrounds as TH1s, and the data as TGraphAsymmErrors (with Poisson intervals as error bars).

Info

If you want to save post-fit shapes at a specific r value, add the options --customStartingPoint and --skipSBFit, and set the r value. The result will appear in shapes_fit_b, as described below.

Three additional folders (shapes_prefit, shapes_fit_sb and shapes_fit_b ) will contain the following distributions:

  • data: TGraphAsymmErrors containing the observed data (or toy data if using -t). The vertical error bars correspond to the 68% interval for a Poisson distribution centered on the observed count (Garwood intervals), following the recipe provided by the CMS Statistics Committee.
  • $PROCESS (id <= 0): TH1F for each signal process in each channel, named as in the datacard
  • $PROCESS (id > 0): TH1F for each background process in each channel, named as in the datacard
  • total_signal: TH1F sum over the signal components
  • total_background: TH1F sum over the background components
  • total: TH1F sum over all of the signal and background components

The above distributions are provided for each channel included in the datacard, in separate subfolders, named as in the datacard; there will be one subfolder per channel.

Warning

The pre-fit signal is evaluated for r=1 by default, but this can be modified using the option --preFitValue.

The distributions and normalizations are guaranteed to give the correct interpretation:

  • For shape datacards whose inputs are TH1, the histograms/data points will have the bin number as the x-axis and the content of each bin will be a number of events.

  • For datacards whose inputs are RooAbsPdf/RooDataHists, the x-axis will correspond to the observable and the bin content will be the PDF density / events divided by the bin width. This means the absolute number of events in a given bin, i, can be obtained from h.GetBinContent(i)*h.GetBinWidth(i) or similar for the data graphs (see the short sketch after this list). Note that for unbinned analyses Combine will make a reasonable guess as to an appropriate binning.
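A short pyROOT sketch of this conversion is given below. The folder (shapes_fit_s) and channel name (signal) are only examples and should be adapted to your own datacard.

import ROOT

f = ROOT.TFile.Open("fitDiagnostics.root")

# Example path: total post-fit expectation in a channel called "signal"
h = f.Get("shapes_fit_s/signal/total")

for i in range(1, h.GetNbinsX() + 1):
    # For RooAbsPdf/RooDataHist inputs the bin content is a density,
    # so multiply by the bin width to recover the number of events
    n_events = h.GetBinContent(i) * h.GetBinWidth(i)
    print(i, n_events)

f.Close()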

Uncertainties on the shapes will be added with the option --saveWithUncertainties. These uncertainties are generated by re-sampling of the fit covariance matrix, thereby accounting for the full correlation between the parameters of the fit.

Warning

It may be tempting to sum up the uncertainties in each bin (in quadrature) to get the total uncertainty on a process. However, this is (usually) incorrect, as doing so would not account for correlations between the bins. Instead you can refer to the uncertainties which will be added to the post-fit normalizations described above.

Additionally, the covariance matrix between bin yields (or yields/bin-widths) in each channel will also be saved as a TH2F named total_covar. If the covariance between all bins across all channels is desired, this can be added using the option --saveOverallShapes. Each folder will now contain additional distributions (and covariance matrices) corresponding to the concatenation of the bins in each channel (and therefore the covariance between every bin in the analysis). The bin labels should make it clear as to which bin corresponds to which channel.

"},{"location":"part3/nonstandard/#toy-by-toy-diagnostics","title":"Toy-by-toy diagnostics","text":"

FitDiagnostics can also be used to diagnose the fitting procedure in toy experiments, to identify potentially problematic nuisance parameters when running the full limits/p-values. This can be done by adding the option -t <num toys>. In the output file, fitDiagnostics.root, the three TTrees will contain the fitted result for each constrained parameter in each toy, as a separate entry. It is recommended to use the following options when investigating toys to reduce the running time: --toysFrequentist --noErrors --minos none

The results can be plotted using the macro test/plotParametersFromToys.C

$ root -l\n.L plotParametersFromToys.C+\nplotParametersFromToys(\"fitDiagnosticsToys.root\",\"fitDiagnosticsData.root\",\"workspace.root\",\"r<0\")\n

The first argument is the name of the output file from running with toys, and the second and third (optional) arguments are the name of the file containing the result from a fit to the data and the workspace (created from text2workspace.py). The fourth argument can be used to specify a cut string applied to one of the branches in the tree, which can be used to correlate strange behaviour with specific conditions. The output will be 2 pdf files (tree_fit_(s)b.pdf) and 2 ROOT files (tree_fit_(s)b.root) containing canvases of the fit results of the tool. For details on the output plots, consult AN-2012/317.

"},{"location":"part3/nonstandard/#scaling-constraints","title":"Scaling constraints","text":"

It is possible to scale the constraints on the nuisance parameters when converting the datacard to a workspace (see the section on physics models) with text2workspace.py. This can be useful for projection studies of the analysis to higher luminosities or with different assumptions about the sizes of certain systematics without changing the datacard by hand.

We consider two kinds of scaling:

  • A constant scaling factor to scale the constraints
  • A functional scale factor that depends on some other parameters in the workspace, eg a luminosity scaling parameter (as a rateParam affecting all processes).

In both cases these scalings can be introduced by adding some extra options at the text2workspace.py step.

To add a constant scaling factor we use the option --X-rescale-nuisance, eg

text2workspace.py datacard.txt --X-rescale-nuisance '[some regular expression]' 0.5\n

will create the workspace in which every nuisance parameter whose name matches the specified regular expression will have the width of the gaussian constraint scaled by a factor 0.5.

Multiple --X-rescale-nuisance options can be specified to set different scalings for different nuisances (note that you actually have to write --X-rescale-nuisance each time as in --X-rescale-nuisance 'theory.*' 0.5 --X-rescale-nuisance 'exp.*' 0.1).

To add a functional scaling factor we use the option --X-nuisance-function, which works in a similar way. Instead of a constant value you should specify a RooFit factory expression.

A typical case would be scaling by \\(1/\\sqrt{L}\\), where \\(L\\) is a luminosity scale factor. For example, assuming there is some parameter in the datacard/workspace called lumiscale,

text2workspace.py datacard.txt --X-nuisance-function '[some regular expression]' 'expr::lumisyst(\"1/sqrt(@0)\",lumiscale[1])'\n

This factory syntax is flexible, but for our use case the typical format will be: expr::[function name](\"[formula]\", [arg0], [arg1], ...). The arg0, arg1 ... are represented in the formula by @0, @1,... placeholders.

Warning

We are playing a slight trick here with the lumiscale parameter. At the point at which text2workspace.py is building these scaling terms the lumiscale for the rateParam has not yet been created. By writing lumiscale[1] we are telling RooFit to create this variable with an initial value of 1, and then later this will be re-used by the rateParam creation.

A similar option, --X-nuisance-group-function, can be used to scale whole groups of nuisances (see groups of nuisances). Instead of a regular expression just give the group name instead,

text2workspace.py datacard.txt --X-nuisance-group-function [group name] 'expr::lumisyst(\"1/sqrt(@0)\",lumiscale[1])'\n
"},{"location":"part3/nonstandard/#nuisance-parameter-impacts","title":"Nuisance parameter impacts","text":"

The impact of a nuisance parameter (NP) \u03b8 on a parameter of interest (POI) \u03bc is defined as the shift \u0394\u03bc that is induced as \u03b8 is fixed and brought to its +1\u03c3 or \u22121\u03c3 post-fit values, with all other parameters profiled as normal (see JHEP 01 (2015) 069 for a description of this method).

This is effectively a measure of the correlation between the NP and the POI, and is useful for determining which NPs have the largest effect on the POI uncertainty.

It is possible to use the FitDiagnostics method of Combine with the option --algo impact -P parameter to calculate the impact of a particular nuisance parameter on the parameter(s) of interest. We will use the combineTool.py script to automate the fits (see the combineTool section to check out the tool).

We will use an example workspace from the \\(H\\rightarrow\\tau\\tau\\) datacard,

$ cp HiggsAnalysis/CombinedLimit/data/tutorials/htt/125/htt_tt.txt .\n$ text2workspace.py htt_tt.txt -m 125\n

Calculating the impacts is done in a few stages. First we just fit for each POI, using the --doInitialFit option with combineTool.py, and adding the --robustFit 1 option that will be passed through to Combine,

combineTool.py -M Impacts -d htt_tt.root -m 125 --doInitialFit --robustFit 1\n

Have a look at the options for likelihood scans described above when using --robustFit 1.

Next we perform a similar scan for each nuisance parameter with the --doFits options,

combineTool.py -M Impacts -d htt_tt.root -m 125 --robustFit 1 --doFits\n

Note that this will run approximately 60 scans, and to speed things up the option --parallel X can be given to run X Combine jobs simultaneously. The batch and grid submission methods described in the combineTool for job submission section can also be used.

Once all jobs are completed, the output can be collected and written into a json file:

combineTool.py -M Impacts -d htt_tt.root -m 125 -o impacts.json\n

A plot summarizing the nuisance parameter values and impacts can be made with plotImpacts.py,

plotImpacts.py -i impacts.json -o impacts\n

The first page of the output is shown below. Note that in these figures, the nuisance parameters are labelled as \\(\\theta\\) instead of \\(\\nu\\).

The direction of the +1\u03c3 and -1\u03c3 impacts (i.e. when the NP is moved to its +1\u03c3 or -1\u03c3 values) on the POI indicates whether the parameter is correlated or anti-correlated with it.

For models with multiple POIs, the Combine option --redefineSignalPOIs X,Y,Z... should be specified in all three of the combineTool.py -M Impacts [...] steps above. The final step will produce the impacts.json file which will contain the impacts for all the specified POIs. In the plotImpacts.py script, a particular POI can be specified with --POI X.

Warning

The plot also shows the best fit value of the POI at the top and its uncertainty. You may wish to allow the range to go negative (i.e using --setParameterRanges or --rMin) to avoid getting one-sided impacts!

This script also accepts an optional json-file argument with -t, which can be used to provide a dictionary for renaming parameters. A simple example would be to create a file rename.json,

{\n  \"r\" : \"#mu\"\n}\n

that will rename the POI label on the plot.

Info

Since combineTool accepts the usual options for combine you can also generate the impacts on an Asimov or toy dataset.

The left panel in the summary plot shows the value of \\((\\nu-\\nu_{0})/\\Delta_{\\nu}\\) where \\(\\nu\\) and \\(\\nu_{0}\\) are the post and pre-fit values of the nuisance parameter and \\(\\Delta_{\\nu}\\) is the pre-fit uncertainty. The asymmetric error bars show the post-fit uncertainty divided by the pre-fit uncertainty meaning that parameters with error bars smaller than \\(\\pm 1\\) are constrained in the fit. The pull will additionally be shown. As with the diffNuisances.py script, the option --pullDef can be used (to modify the definition of the pull that is shown).
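As a tiny numerical example of the quantities shown in that left panel (using made-up numbers for a single nuisance parameter):

nu0, delta_nu = 0.0, 1.0     # pre-fit value and uncertainty
nu = 0.30                    # post-fit value
sig_hi, sig_lo = 0.70, 0.65  # post-fit asymmetric uncertainties (example numbers)

pull = (nu - nu0) / delta_nu         # position of the point in the left panel
constraint_hi = sig_hi / delta_nu    # error bars in units of the pre-fit uncertainty;
constraint_lo = sig_lo / delta_nu    # values below 1 mean the parameter is constrained by the fit

print(pull, constraint_hi, constraint_lo)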

"},{"location":"part3/nonstandard/#breakdown-of-uncertainties","title":"Breakdown of uncertainties","text":"

Often you will want to report the breakdown of your total (systematic) uncertainty on a measured parameter due to one or more groups of nuisance parameters. For example, these groups could be theory uncertainties, trigger uncertainties, ... The procedure to do this in Combine is to sequentially freeze groups of nuisance parameters and subtract (in quadrature) from the total uncertainty. Below are the steps to do so. We will use the data/tutorials/htt/125/htt_tt.txt datacard for this.

  1. Add groups to the datacard to group nuisance parameters. Nuisance parameters not in groups will be considered as \"rest\" in the later steps. The lines should look like the following and you should add them to the end of the datacard
theory      group = QCDscale_VH QCDscale_ggH1in QCDscale_ggH2in QCDscale_qqH UEPS pdf_gg pdf_qqbar\ncalibration group = CMS_scale_j_8TeV CMS_scale_t_tautau_8TeV CMS_htt_scale_met_8TeV\nefficiency  group = CMS_eff_b_8TeV   CMS_eff_t_tt_8TeV CMS_fake_b_8TeV\n
  2. Create the workspace with text2workspace.py data/tutorials/htt/125/htt_tt.txt -m 125.

  3. Run a fit with all nuisance parameters floating and store the workspace in an output file - combine data/tutorials/htt/125/htt_tt.root -M MultiDimFit --saveWorkspace -n htt.postfit

  4. Run a scan from the postfit workspace

combine higgsCombinehtt.postfit.MultiDimFit.mH120.root -M MultiDimFit -n htt.total --algo grid --snapshotName MultiDimFit --setParameterRanges r=0,4\n
  5. Run additional scans using the post-fit workspace, sequentially adding another group to the list of groups to freeze
combine higgsCombinehtt.postfit.MultiDimFit.mH120.root -M MultiDimFit --algo grid --snapshotName MultiDimFit --setParameterRanges r=0,4  --freezeNuisanceGroups theory -n htt.freeze_theory\n\ncombine higgsCombinehtt.postfit.MultiDimFit.mH120.root -M MultiDimFit --algo grid --snapshotName MultiDimFit --setParameterRanges r=0,4  --freezeNuisanceGroups theory,calibration -n htt.freeze_theory_calibration\n\ncombine higgsCombinehtt.postfit.MultiDimFit.mH120.root -M MultiDimFit --algo grid --snapshotName MultiDimFit --setParameterRanges r=0,4  --freezeNuisanceGroups theory,calibration,efficiency -n htt.freeze_theory_calibration_efficiency\n
  6. Run one last scan freezing all of the constrained nuisance parameters (this represents the statistical uncertainty only).
combine higgsCombinehtt.postfit.MultiDimFit.mH120.root -M MultiDimFit --algo grid --snapshotName MultiDimFit --setParameterRanges r=0,4  --freezeParameters allConstrainedNuisances -n htt.freeze_all\n
  7. Use the combineTool script plot1DScan.py to report the breakdown of uncertainties.
plot1DScan.py higgsCombinehtt.total.MultiDimFit.mH120.root --main-label \"Total Uncert.\"  --others higgsCombinehtt.freeze_theory.MultiDimFit.mH120.root:\"freeze theory\":4 higgsCombinehtt.freeze_theory_calibration.MultiDimFit.mH120.root:\"freeze theory+calibration\":7 higgsCombinehtt.freeze_theory_calibration_efficiency.MultiDimFit.mH120.root:\"freeze theory+calibration+efficiency\":2 higgsCombinehtt.freeze_all.MultiDimFit.mH120.root:\"stat only\":6  --output breakdown --y-max 10 --y-cut 40 --breakdown \"theory,calibration,efficiency,rest,stat\"\n

The final step calculates the contribution of each group of nuisance parameters as the subtraction in quadrature of each scan from the previous one. This procedure guarantees that the sum in quadrature of the individual components is the same as the total uncertainty.
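Numerically, this last step amounts to subtracting successive uncertainties in quadrature. A minimal sketch, with made-up uncertainty values in the same order as the scans above, is:

import math

# Uncertainty from each scan: total first, then the successively frozen scans,
# ending with the stat-only scan (made-up numbers)
sigmas = [0.50, 0.42, 0.38, 0.35, 0.30]
labels = ["theory", "calibration", "efficiency", "rest"]

for label, prev, cur in zip(labels, sigmas[:-1], sigmas[1:]):
    # Contribution of each group = quadrature difference of consecutive scans
    print(label, math.sqrt(prev**2 - cur**2))
print("stat", sigmas[-1])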

The plot below is produced.

Warning

While the above procedure is guaranteed to have the effect that the sum in quadrature of the breakdown will equal the total uncertainty, the order in which you freeze the groups can make a difference due to correlations induced by the fit. You should check if the answers change significantly if changing the order and we recommend you start with the largest group (in terms of overall contribution to the uncertainty) first, working down the list in order of the size of the contribution.

"},{"location":"part3/nonstandard/#channel-masking","title":"Channel Masking","text":"

The Combine tool has a number of features for diagnostics and plotting results of fits. It can often be useful to turn off particular channels in a combined analysis to see how constraints/shifts in parameter values can vary. It can also be helpful to plot the post-fit shapes and uncertainties of a particular channel (for example a signal region) without including the constraints from the data in that region.

This can in some cases be achieved by removing a specific datacard when running combineCards.py. However, when doing so, the information of particular nuisance parameters and PDFs in that region will be lost. Instead, it is possible to mask that channel from the likelihood. This is achieved at the text2workspace.py step using the option --channel-masks.

"},{"location":"part3/nonstandard/#example-removing-constraints-from-the-signal-region","title":"Example: removing constraints from the signal region","text":"

We will take the control region example from the rate parameters tutorial from data/tutorials/rate_params/.

The first step is to combine the cards combineCards.py signal=signal_region.txt dimuon=dimuon_control_region.txt singlemuon=singlemuon_control_region.txt > datacard.txt

Note that we use the directive CHANNELNAME=CHANNEL_DATACARD.txt so that the names of the channels are under our control and easier to interpret. Next, we make a workspace and tell Combine to create the parameters used to mask channels

text2workspace.py datacard.txt --channel-masks\n

Now we will try to do a fit ignoring the signal region. We can turn off the signal region by setting the corresponding channel mask parameter to 1: --setParameters mask_signal=1. Note that text2workspace has created a masking parameter for every channel with the naming scheme mask_CHANNELNAME. By default, every parameter is set to 0 so that the channel is unmasked by default.

combine datacard.root -M FitDiagnostics --saveShapes --saveWithUncertainties --setParameters mask_signal=1\n

Warning

There will be a lot of warnings from Combine. These are safe to ignore as they are due to the s+b fit not converging. This is expected as the free signal parameter cannot be constrained because the data in the signal region is being ignored.

We can compare the post-fit background and uncertainties with and without the signal region included by re-running with --setParameters mask_signal=0 (or just removing that option completely). Below is a comparison of the background in the signal region with and without masking the data in the signal region. We take these from the shapes folder shapes_fit_b/signal/total_background in the fitDiagnostics.root output.
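For example, the two total_background shapes can be overlaid with a short pyROOT snippet. The file names below are placeholders, assuming the two FitDiagnostics outputs have been renamed after each run; the channel name signal matches the combineCards.py directive used above.

import ROOT

f_masked = ROOT.TFile.Open("fitDiagnostics.masked.root")      # run with mask_signal=1
f_unmasked = ROOT.TFile.Open("fitDiagnostics.unmasked.root")  # run with mask_signal=0

h_masked = f_masked.Get("shapes_fit_b/signal/total_background")
h_unmasked = f_unmasked.Get("shapes_fit_b/signal/total_background")

c = ROOT.TCanvas()
h_unmasked.SetLineColor(ROOT.kBlue)
h_masked.SetLineColor(ROOT.kRed)
h_unmasked.Draw("HIST")
h_masked.Draw("HIST SAME")
c.SaveAs("signal_region_comparison.png")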

Clearly the background shape is different and much less constrained without including the signal region, as expected. Channel masking can be used with any method in Combine.

"},{"location":"part3/nonstandard/#roomultipdf-conventional-bias-studies","title":"RooMultiPdf conventional bias studies","text":"

Several analyses in CMS use a functional form to describe the background. This functional form is fit to the data. Often however, there is some uncertainty associated with the choice of which background function to use, and this choice will impact the fit results. It is therefore often the case that in these analyses, a bias study is performed. This study will give an indication of the size of the potential bias in the result, given a certain choice of functional form. These studies can be conducted using Combine.

Below is an example script that will produce a workspace based on a simplified Higgs to diphoton (Hgg) analysis with a single category. It will produce the data and PDFs necessary for this example, and you can use it as a basis to construct your own studies.

void makeRooMultiPdfWorkspace(){\n\n   // Load the combine Library\n   gSystem->Load(\"libHiggsAnalysisCombinedLimit.so\");\n\n   // mass variable\n   RooRealVar mass(\"CMS_hgg_mass\",\"m_{#gamma#gamma}\",120,100,180);\n\n\n   // create 3 background pdfs\n   // 1. exponential\n   RooRealVar expo_1(\"expo_1\",\"slope of exponential\",-0.02,-0.1,-0.0001);\n   RooExponential exponential(\"exponential\",\"exponential pdf\",mass,expo_1);\n\n   // 2. polynomial with 2 parameters\n   RooRealVar poly_1(\"poly_1\",\"T1 of chebychev polynomial\",0,-3,3);\n   RooRealVar poly_2(\"poly_2\",\"T2 of chebychev polynomial\",0,-3,3);\n   RooChebychev polynomial(\"polynomial\",\"polynomial pdf\",mass,RooArgList(poly_1,poly_2));\n\n   // 3. A power law function\n   RooRealVar pow_1(\"pow_1\",\"exponent of power law\",-3,-6,-0.0001);\n   RooGenericPdf powerlaw(\"powerlaw\",\"TMath::Power(@0,@1)\",RooArgList(mass,pow_1));\n\n   // Generate some data (lets use the power lay function for it)\n   // Here we are using unbinned data, but binning the data is also fine\n   RooDataSet *data = powerlaw.generate(mass,RooFit::NumEvents(1000));\n\n   // First we fit the pdfs to the data (gives us a sensible starting value of parameters for, e.g - blind limits)\n   exponential.fitTo(*data);   // index 0\n   polynomial.fitTo(*data);   // index 1\n   powerlaw.fitTo(*data);     // index 2\n\n   // Make a plot (data is a toy dataset)\n   RooPlot *plot = mass.frame();   data->plotOn(plot);\n   exponential.plotOn(plot,RooFit::LineColor(kGreen));\n   polynomial.plotOn(plot,RooFit::LineColor(kBlue));\n   powerlaw.plotOn(plot,RooFit::LineColor(kRed));\n   plot->SetTitle(\"PDF fits to toy data\");\n   plot->Draw();\n\n   // Make a RooCategory object. This will control which of the pdfs is \"active\"\n   RooCategory cat(\"pdf_index\",\"Index of Pdf which is active\");\n\n   // Make a RooMultiPdf object. The order of the pdfs will be the order of their index, ie for below\n   // 0 == exponential\n   // 1 == polynomial\n   // 2 == powerlaw\n   RooArgList mypdfs;\n   mypdfs.add(exponential);\n   mypdfs.add(polynomial);\n   mypdfs.add(powerlaw);\n\n   RooMultiPdf multipdf(\"roomultipdf\",\"All Pdfs\",cat,mypdfs);\n   // By default the multipdf will tell combine to add 0.5 to the nll for each parameter (this is the penalty for the discrete profiling method)\n   // It can be changed with\n   //   multipdf.setCorrectionFactor(penalty)\n   // For bias-studies, this isn;t relevant however, so lets just leave the default\n\n   // As usual make an extended term for the background with _norm for freely floating yield\n   RooRealVar norm(\"roomultipdf_norm\",\"Number of background events\",1000,0,10000);\n\n   // We will also produce a signal model for the bias studies\n   RooRealVar sigma(\"sigma\",\"sigma\",1.2); sigma.setConstant(true);\n   RooRealVar MH(\"MH\",\"MH\",125); MH.setConstant(true);\n   RooGaussian signal(\"signal\",\"signal\",mass,MH,sigma);\n\n\n   // Save to a new workspace\n   TFile *fout = new TFile(\"workspace.root\",\"RECREATE\");\n   RooWorkspace wout(\"workspace\",\"workspace\");\n\n   data->SetName(\"data\");\n   wout.import(*data);\n   wout.import(cat);\n   wout.import(norm);\n   wout.import(multipdf);\n   wout.import(signal);\n   wout.Print();\n   wout.Write();\n}\n

The signal is modelled as a simple Gaussian with a width approximately that of the diphoton resolution. For the background there is a choice of 3 functions: an exponential, a power-law, and a 2nd order polynomial. This choice is accessible within Combine through the use of the RooMultiPdf object, which can switch between the functions by setting their associated indices (herein called pdf_index). This (as with all parameters in Combine) can be set via the --setParameters option.

To assess the bias, one can throw toys using one function and fit with another. To do this, only a single datacard is needed: hgg_toy_datacard.txt.

The bias studies are performed in two stages. The first is to generate toys using one of the functions, under some value of the signal strength r (or \\(\\mu\\)). This can be repeated for several values of r and also at different masses, but in this example the Higgs boson mass is fixed to 125 GeV.

    combine hgg_toy_datacard.txt -M GenerateOnly --setParameters pdf_index=0 --toysFrequentist -t 100 --expectSignal 1 --saveToys -m 125 --freezeParameters pdf_index\n

Warning

It is important to freeze pdf_index, otherwise Combine will try to iterate over the index in the frequentist fit.

Now we have 100 toys which, by setting pdf_index=0, sets the background PDF to the exponential function. This means we assume that the exponential is the true function. Note that the option --toysFrequentist is added; this first performs a fit of the PDF, assuming a signal strength of 1, to the data before generating the toys. This is the most obvious choice as to where to throw the toys from.

The next step is to fit the toys under a different background PDF hypothesis. This time we set the pdf_index to 1, which selects the powerlaw, and run fits with the FitDiagnostics method, again freezing pdf_index.

    combine hgg_toy_datacard.txt -M FitDiagnostics  --setParameters pdf_index=1 --toysFile higgsCombineTest.GenerateOnly.mH125.123456.root  -t 100 --rMin -10 --rMax 10 --freezeParameters pdf_index --cminDefaultMinimizerStrategy=0\n

Note how we add the option --cminDefaultMinimizerStrategy=0. This is because we do not need the Hessian, as FitDiagnostics will run MINOS to get the uncertainty on r. If we do not do this, Minuit will think the fit failed as we have parameters (those not attached to the current PDF) for which the likelihood is flat.

Warning

You may get warnings about non-accurate errors such as [WARNING]: Unable to determine uncertainties on all fit parameters in b-only fit - These can be ignored since they are related to the free parameters of the background PDFs which are not active.

In the output file fitDiagnostics.root there is a tree that contains the best fit results under the signal+background hypothesis. One measure of the bias is the pull defined as the difference between the measured value of \(\mu\) and the generated value (here we used 1) relative to the uncertainty on \(\mu\). The pull distribution can be drawn and its mean provides an estimate of the bias. In this example, we are averaging the positive and negative uncertainties, but we could do something smarter if the uncertainties are very asymmetric.

root -l fitDiagnostics.root\ntree_fit_sb->Draw(\"(r-1)/(0.5*(rHiErr+rLoErr))>>h(20,-5,5)\")\nh->Fit(\"gaus\")\n

From the fitted Gaussian, we see the mean is at -1.29, which would indicate a bias of 129% of the uncertainty on mu from choosing the polynomial when the true function is an exponential.

"},{"location":"part3/nonstandard/#discrete-profiling","title":"Discrete profiling","text":"

If the discrete nuisance is left floating, it will be profiled by looping through the possible index values and finding the PDF that gives the best fit. This allows for the discrete profiling method to be applied for any method which involves a profiled likelihood (frequentist methods).

Warning

You should be careful since MINOS knows nothing about the discrete nuisances and hence estimations of uncertainties will be incorrect via MINOS. Instead, uncertainties from scans and limits will correctly account for these nuisance parameters. Currently the Bayesian methods will not properly treat the nuisance parameters, so some care should be taken when interpreting Bayesian results.

As an example, we can perform a likelihood scan as a function of the Higgs boson signal strength in the toy Hgg datacard. By leaving the object pdf_index non-constant, at each point in the likelihood scan, the PDFs will be iterated over and the one that gives the lowest -2 times log-likelihood, including the correction factor \(c\) (as defined in the paper linked above) will be stored in the output tree. We can also check the scan when we fix at each PDF individually to check that the envelope is achieved. For this, you will need to include the option --X-rtd REMOVE_CONSTANT_ZERO_POINT=1. In this way, we can take a look at the absolute value to compare the curves, if we also include --saveNLL.

For example for a full scan, you can run

    combine -M MultiDimFit -d hgg_toy_datacard.txt --algo grid --setParameterRanges r=-1,3 --cminDefaultMinimizerStrategy 0 --saveNLL -n Envelope -m 125 --setParameters myIndex=-1 --X-rtd REMOVE_CONSTANT_ZERO_POINT=1\n

and for the individual pdf_index set to X,

    combine -M MultiDimFit -d hgg_toy_datacard.txt --algo grid --setParameterRanges r=-1,3 --cminDefaultMinimizerStrategy 0 --saveNLL --freezeParameters pdf_index --setParameters pdf_index=X -n fixed_pdf_X -m 125 --X-rtd REMOVE_CONSTANT_ZERO_POINT=1\n

for X=0,1,2

You can then plot the value of 2*(deltaNLL+nll+nll0) to obtain the absolute value of (twice) the negative log-likelihood, including the correction term for extra parameters in the different PDFs.
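For instance, this quantity can be drawn directly from the limit tree of each output file with pyROOT. The file name below follows the usual higgsCombine<name>.<Method>.mH<mass>.root convention for the -n Envelope scan above; adapt it for the fixed_pdf_X outputs.

import ROOT

f = ROOT.TFile.Open("higgsCombineEnvelope.MultiDimFit.mH125.root")
tree = f.Get("limit")

c = ROOT.TCanvas()
# Absolute value of twice the negative log-likelihood, including the
# correction term for the extra parameters of each pdf, vs the POI r
tree.Draw("2*(deltaNLL+nll+nll0):r", "", "L")
c.SaveAs("envelope_scan.png")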

The above output will produce the following scans.

As expected, the curve obtained by allowing the pdf_index to float (labelled \"Envelope\") picks out the best function (maximum corrected likelihood) for each value of the signal strength.

In general, the performance of Combine can be improved when using the discrete profiling method by including the option --X-rtd MINIMIZER_freezeDisassociatedParams. This will stop parameters not associated to the current PDF from floating in the fits. Additionally, you can include the following options:

  • --X-rtd MINIMIZER_multiMin_hideConstants: hide the constant terms in the likelihood when recreating the minimizer
  • --X-rtd MINIMIZER_multiMin_maskConstraints: hide the constraint terms during the discrete minimization process
  • --X-rtd MINIMIZER_multiMin_maskChannels=<choice> mask the channels that are not needed from the NLL:
  • <choice> 1: keeps unmasked all channels that are participating in the discrete minimization.
  • <choice> 2: keeps unmasked only the channel whose index is being scanned at the moment.

You may want to check with the Combine development team if you are using these options, as they are somewhat for expert use.

"},{"location":"part3/nonstandard/#roosplinend-multidimensional-splines","title":"RooSplineND multidimensional splines","text":"

RooSplineND can be used to interpolate from a tree of points to produce a continuous function in N-dimensions. This function can then be used as input to workspaces allowing for parametric rates/cross-sections/efficiencies. It can also be used to up-scale the resolution of likelihood scans (i.e like those produced from Combine) to produce smooth contours.

The spline makes use of a radial basis decomposition to produce a continuous \(N \to 1\) map (function) from \(M\) provided sample points. The function of the \(N\) variables \(\vec{x}\) is assumed to be of the form,

\\[ f(\\vec{x}) = \\sum_{i=1}^{M}w_{i}\\phi(||\\vec{x}-\\vec{x}_{i}||), \\]

where \\(\\phi(||\\vec{z}||) = e^{-\\frac{||\\vec{z}||}{\\epsilon^{2}}}\\). The distance \\(||.||\\) between two points is given by,

\\[ ||\\vec{x}-\\vec{y}|| = \\sum_{j=1}^{N}(x_{j}-y_{j})^{2}, \\]

if the option rescale=false and,

\\[ ||\\vec{x}-\\vec{y}|| = \\sum_{j=1}^{N} M^{1/N} \\cdot \\left( \\frac{ x_{j}-y_{j} }{ \\mathrm{max_{i=1,M}}(x_{i,j})-\\mathrm{min_{i=1,M}}(x_{i,j}) }\\right)^{2}, \\]

if the option rescale=true. Given the sample points, it is possible to determine the weights \\(w_{i}\\) as the solution of the set of equations,

\\[ \\sum_{i=1}^{M}w_{i}\\phi(||\\vec{x}_{j}-\\vec{x}_{i}||) = f(\\vec{x}_{j}). \\]

The solution is obtained using the Eigen C++ package.

The typical constructor of the object is as follows:

RooSplineND(const char *name, const char *title, RooArgList &vars, TTree *tree, const char* fName=\"f\", double eps=3., bool rescale=false, std::string cutstring=\"\" ) ;\n

where the arguments are:

  • vars: a RooArgList of RooRealVars representing the \\(N\\) dimensions of the spline. The length of this list determines the dimension \\(N\\) of the spline.
  • tree: a TTree pointer where each entry represents a sample point used to construct the spline. The branch names must correspond to the names of the variables in vars.
  • fName: a string giving the name of the branch to interpret as the target function \\(f\\).
  • eps: the value of \\(\\epsilon\\), which sets the width of the basis functions \\(\\phi\\).
  • rescale: an option to rescale the input sample points so that each variable has roughly the same range (see the definition of \\(||.||\\) above).
  • cutstring: a string used to remove sample points from the tree. Can be any typical cut string (eg \"var1>10 && var2<3\").

The object can be treated as a RooAbsArg; its value for the current values of the parameters is obtained as usual by using the getVal() method.

Warning

You should not include in the tree more variable branches than those contained in vars, as the spline will interpret them as additional sample points. If two of the input sample points lie very close to each other, you will get a warning and the determination of the weights will fail. If you cannot create a reduced tree, you can remove such entries by using the cutstring.

The following script is an example that produces a 2D spline (N=2) from a set of 400 points (M=400) generated from a function.

Show script
void splinend(){\n   // library containing the RooSplineND\n   gSystem->Load(\"libHiggsAnalysisCombinedLimit.so\");\n\n   TTree *tree = new TTree(\"tree_vals\",\"tree_vals\");\n   float xb,yb,fb;\n\n   tree->Branch(\"f\",&fb,\"f/F\");\n   tree->Branch(\"x\",&xb,\"x/F\");\n   tree->Branch(\"y\",&yb,\"y/F\");\n\n   TRandom3 *r = new TRandom3();\n   int nentries = 20; // just use a regular grid of 20x20=400 points\n\n   double xmin = -3.2;\n   double xmax = 3.2;\n   double ymin = -3.2;\n   double ymax = 3.2;\n\n   for (int n=0;n<nentries;n++){\n    for (int k=0;k<nentries;k++){\n\n      xb=xmin+n*((xmax-xmin)/nentries);\n      yb=ymin+k*((ymax-ymin)/nentries);\n      // Gaussian * cosine function radial in \"F(x^2+y^2)\"\n      double R = (xb*xb)+(yb*yb);\n      fb = 0.1*TMath::Exp(-1*(R)/9)*TMath::Cos(2.5*TMath::Sqrt(R));\n      tree->Fill();\n     }\n   }\n\n   // 2D graph of points in tree\n   TGraph2D *p0 = new TGraph2D();\n   p0->SetMarkerSize(0.8);\n   p0->SetMarkerStyle(20);\n\n   int c0=0;\n   for (int p=0;p<tree->GetEntries();p++){\n        tree->GetEntry(p);\n        p0->SetPoint(c0,xb,yb,fb);\n        c0++;\n        }\n\n\n   // ------------------------------ THIS IS WHERE WE BUILD THE SPLINE ------------------------ //\n   // Create 2 Real-vars, one for each of the parameters of the spline\n   // The variables MUST be named the same as the corresponding branches in the tree\n   RooRealVar x(\"x\",\"x\",0.1,xmin,xmax);\n   RooRealVar y(\"y\",\"y\",0.1,ymin,ymax);\n\n\n   // And the spline - arguments are\n   // Required ->   name, title, arglist of dependants, input tree,\n   // Optional ->  function branch name, interpolation width (tunable parameter), rescale Axis bool, cutstring\n   // The tunable parameter gives the radial basis a \"width\", over which the interpolation will be effectively taken\n\n   // the reascale Axis bool (if true) will first try to rescale the points so that they are of order 1 in range\n   // This can be helpful if for example one dimension is in much larger units than another.\n\n   // The cutstring is just a ROOT string which can be used to apply cuts to the tree in case only a sub-set of the points should be used\n\n   RooArgList args(x,y);\n   RooSplineND *spline = new RooSplineND(\"spline\",\"spline\",args,tree,\"f\",1,true);\n      // ----------------------------------------------------------------------------------------- //\n\n\n   //TGraph *gr = spline->getGraph(\"x\",0.1); // Return 1D graph. Will be a slice of the spline for fixed y generated at steps of 0.1\n\n   // Plot the 2D spline\n   TGraph2D *gr = new TGraph2D();\n   int pt = 0;\n   for (double xx=xmin;xx<xmax;xx+=0.1){\n     for (double yy=xmin;yy<ymax;yy+=0.1){\n        x.setVal(xx);\n        y.setVal(yy);\n        gr->SetPoint(pt,xx,yy,spline->getVal());\n        pt++;\n     }\n   }\n\n   gr->SetTitle(\"\");\n\n   gr->SetLineColor(1);\n   //p0->SetTitle(\"0.1 exp(-(x{^2}+y{^2})/9) #times Cos(2.5#sqrt{x^{2}+y^{2}})\");\n   gr->Draw(\"surf\");\n   gr->GetXaxis()->SetTitle(\"x\");\n   gr->GetYaxis()->SetTitle(\"y\");\n   p0->Draw(\"Pcolsame\");\n\n   //p0->Draw(\"surfsame\");\n   TLegend *leg = new TLegend(0.2,0.82,0.82,0.98);\n   leg->SetFillColor(0);\n   leg->AddEntry(p0,\"0.1 exp(-(x{^2}+y{^2})/9) #times Cos(2.5#sqrt{x^{2}+y^{2}})\",\"p\");\n   leg->AddEntry(gr,\"RooSplineND (N=2) interpolation\",\"L\");\n   leg->Draw();\n}\n
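
To run the macro (a minimal sketch, assuming it has been saved as splinend.C and that Combine is set up so that libHiggsAnalysisCombinedLimit.so can be loaded):

root -l splinend.C\n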

Running the script will produce the following plot. The plot shows the sampled points and the spline produced from them.

"},{"location":"part3/nonstandard/#rooparametrichist-gamman-for-shapes","title":"RooParametricHist gammaN for shapes","text":"

Currently, there is no straightforward implementation of using per-bin gmN-like uncertainties with shape (histogram) analyses. Instead, it is possible to tie control regions (written as datacards) with the signal region using three methods.

For analyses that take the normalization of some process from a control region, it is possible to use either lnU or rateParam directives to float the normalization of that process in a correlated way between the two regions. If instead each bin is intended to be determined from a control region, one can use a number of RooFit histogram PDFs/functions to accomplish this. The example below shows a simple implementation of a RooParametricHist to achieve this.

Copy the script below into a file called examplews.C and create the input workspace using root -l examplews.C...

Show script
void examplews(){\n    // As usual, load the combine library to get access to the RooParametricHist\n    gSystem->Load(\"libHiggsAnalysisCombinedLimit.so\");\n\n    // Output file and workspace\n    TFile *fOut = new TFile(\"param_ws.root\",\"RECREATE\");\n    RooWorkspace wspace(\"wspace\",\"wspace\");\n\n    // better to create the bins rather than use the \"nbins,min,max\" to avoid spurious warning about adding bins with different\n    // ranges in combine - see https://root-forum.cern.ch/t/attempt-to-divide-histograms-with-different-bin-limits/17624/3 for why!\n    const int nbins = 4;\n    double xmin=200.;\n    double xmax=1000.;\n    double xbins[5] = {200.,400.,600.,800.,1000.};\n\n    // A search in a MET tail, define MET as our variable\n\n    RooRealVar met(\"met\",\"E_{T}^{miss}\",200,xmin,xmax);\n    RooArgList vars(met);\n\n\n    // ---------------------------- SIGNAL REGION -------------------------------------------------------------------//\n    // Make a dataset, this will be just four bins in MET.\n    // its easiest to make this from a histogram. Set the contents to \"somehting\"\n    TH1F data_th1(\"data_obs_SR\",\"Data observed in signal region\",nbins,xbins);\n\n    data_th1.SetBinContent(1,100);\n    data_th1.SetBinContent(2,50);\n    data_th1.SetBinContent(3,25);\n    data_th1.SetBinContent(4,10);\n    RooDataHist data_hist(\"data_obs_SR\",\"Data observed\",vars,&data_th1);\n    wspace.import(data_hist);\n\n    // In the signal region, our background process will be freely floating,\n    // Create one parameter per bin representing the yield. (note of course we can have multiple processes like this)\n    RooRealVar bin1(\"bkg_SR_bin1\",\"Background yield in signal region, bin 1\",100,0,500);\n    RooRealVar bin2(\"bkg_SR_bin2\",\"Background yield in signal region, bin 2\",50,0,500);\n    RooRealVar bin3(\"bkg_SR_bin3\",\"Background yield in signal region, bin 3\",25,0,500);\n    RooRealVar bin4(\"bkg_SR_bin4\",\"Background yield in signal region, bin 4\",10,0,500);\n    RooArgList bkg_SR_bins;\n    bkg_SR_bins.add(bin1);\n    bkg_SR_bins.add(bin2);\n    bkg_SR_bins.add(bin3);\n    bkg_SR_bins.add(bin4);\n\n    // Create a RooParametericHist which contains those yields, last argument is just for the binning,\n    // can use the data TH1 for that\n    RooParametricHist p_bkg(\"bkg_SR\", \"Background PDF in signal region\",met,bkg_SR_bins,data_th1);\n    // Always include a _norm term which should be the sum of the yields (thats how combine likes to play with pdfs)\n    RooAddition p_bkg_norm(\"bkg_SR_norm\",\"Total Number of events from background in signal region\",bkg_SR_bins);\n\n    // Every signal region needs a signal\n    TH1F signal_th1(\"signal_SR\",\"Signal expected in signal region\",nbins,xbins);\n\n    signal_th1.SetBinContent(1,1);\n    signal_th1.SetBinContent(2,2);\n    signal_th1.SetBinContent(3,3);\n    signal_th1.SetBinContent(4,8);\n    RooDataHist signal_hist(\"signal\",\"Data observed\",vars,&signal_th1);\n    wspace.import(signal_hist);\n\n    // -------------------------------------------------------------------------------------------------------------//\n    // ---------------------------- CONTROL REGION -----------------------------------------------------------------//\n    TH1F data_CRth1(\"data_obs_CR\",\"Data observed in control region\",nbins,xbins);\n\n    data_CRth1.SetBinContent(1,200);\n    data_CRth1.SetBinContent(2,100);\n    data_CRth1.SetBinContent(3,50);\n    data_CRth1.SetBinContent(4,20);\n\n    RooDataHist 
data_CRhist(\"data_obs_CR\",\"Data observed\",vars,&data_CRth1);\n    wspace.import(data_CRhist);\n\n    // This time, the background process will be dependent on the yields of the background in the signal region.\n    // The transfer factor TF must account for acceptance/efficiency etc differences in the signal to control\n    // In this example lets assume the control region is populated by the same process decaying to clean daughters with 2xBR\n    // compared to the signal region\n\n    // NB You could have a different transfer factor for each bin represented by a completely different RooRealVar\n\n    // We can imagine that the transfer factor could be associated with some uncertainty - lets say a 1% uncertainty due to efficiency and 2% due to acceptance.\n    // We need to make these nuisance parameters ourselves and give them a nominal value of 0\n\n\n    RooRealVar efficiency(\"efficiency\", \"efficiency nuisance parameter\",0);\n    RooRealVar acceptance(\"acceptance\", \"acceptance nuisance parameter\",0);\n\n    // We would need to make the transfer factor a function of those too. Here we've assumed Log-normal effects (i.e the same as putting lnN in the CR datacard)\n    // but note that we could use any function which could be used to parameterise the effect - eg if the systematic is due to some alternate template, we could\n    // use polynomials for example.\n\n\n    RooFormulaVar TF(\"TF\",\"Trasnfer factor\",\"2*TMath::Power(1.01,@0)*TMath::Power(1.02,@1)\",RooArgList(efficiency,acceptance) );\n\n    // Finally, we need to make each bin of the background in the control region a function of the background in the signal and the transfer factor\n    // N_CR = N_SR x TF\n\n    RooFormulaVar CRbin1(\"bkg_CR_bin1\",\"Background yield in control region, bin 1\",\"@0*@1\",RooArgList(TF,bin1));\n    RooFormulaVar CRbin2(\"bkg_CR_bin2\",\"Background yield in control region, bin 2\",\"@0*@1\",RooArgList(TF,bin2));\n    RooFormulaVar CRbin3(\"bkg_CR_bin3\",\"Background yield in control region, bin 3\",\"@0*@1\",RooArgList(TF,bin3));\n    RooFormulaVar CRbin4(\"bkg_CR_bin4\",\"Background yield in control region, bin 4\",\"@0*@1\",RooArgList(TF,bin4));\n\n    RooArgList bkg_CR_bins;\n    bkg_CR_bins.add(CRbin1);\n    bkg_CR_bins.add(CRbin2);\n    bkg_CR_bins.add(CRbin3);\n    bkg_CR_bins.add(CRbin4);\n    RooParametricHist p_CRbkg(\"bkg_CR\", \"Background PDF in control region\",met,bkg_CR_bins,data_th1);\n    RooAddition p_CRbkg_norm(\"bkg_CR_norm\",\"Total Number of events from background in control region\",bkg_CR_bins);\n    // -------------------------------------------------------------------------------------------------------------//\n\n\n    // we can also use the standard interpolation from combine by providing alternative shapes (as RooDataHists)\n    // here we're adding two of them (JES and ISR)\n    TH1F background_up(\"tbkg_CR_JESUp\",\"\",nbins,xbins);\n    background_up.SetBinContent(1,CRbin1.getVal()*1.01);\n    background_up.SetBinContent(2,CRbin2.getVal()*1.02);\n    background_up.SetBinContent(3,CRbin3.getVal()*1.03);\n    background_up.SetBinContent(4,CRbin4.getVal()*1.04);\n    RooDataHist bkg_CRhist_sysUp(\"bkg_CR_JESUp\",\"Bkg sys up\",vars,&background_up);\n    wspace.import(bkg_CRhist_sysUp);\n\n    TH1F background_down(\"bkg_CR_JESDown\",\"\",nbins,xbins);\n    background_down.SetBinContent(1,CRbin1.getVal()*0.90);\n    background_down.SetBinContent(2,CRbin2.getVal()*0.98);\n    background_down.SetBinContent(3,CRbin3.getVal()*0.97);\n    
background_down.SetBinContent(4,CRbin4.getVal()*0.96);\n    RooDataHist bkg_CRhist_sysDown(\"bkg_CR_JESDown\",\"Bkg sys down\",vars,&background_down);\n    wspace.import(bkg_CRhist_sysDown);\n\n    TH1F background_2up(\"tbkg_CR_ISRUp\",\"\",nbins,xbins);\n    background_2up.SetBinContent(1,CRbin1.getVal()*0.85);\n    background_2up.SetBinContent(2,CRbin2.getVal()*0.9);\n    background_2up.SetBinContent(3,CRbin3.getVal()*0.95);\n    background_2up.SetBinContent(4,CRbin4.getVal()*0.99);\n    RooDataHist bkg_CRhist_sys2Up(\"bkg_CR_ISRUp\",\"Bkg sys 2up\",vars,&background_2up);\n    wspace.import(bkg_CRhist_sys2Up);\n\n    TH1F background_2down(\"bkg_CR_ISRDown\",\"\",nbins,xbins);\n    background_2down.SetBinContent(1,CRbin1.getVal()*1.15);\n    background_2down.SetBinContent(2,CRbin2.getVal()*1.1);\n    background_2down.SetBinContent(3,CRbin3.getVal()*1.05);\n    background_2down.SetBinContent(4,CRbin4.getVal()*1.01);\n    RooDataHist bkg_CRhist_sys2Down(\"bkg_CR_ISRDown\",\"Bkg sys 2down\",vars,&background_2down);\n    wspace.import(bkg_CRhist_sys2Down);\n\n    // import the pdfs\n    wspace.import(p_bkg);\n    wspace.import(p_bkg_norm,RooFit::RecycleConflictNodes());\n    wspace.import(p_CRbkg);\n    wspace.import(p_CRbkg_norm,RooFit::RecycleConflictNodes());\n    fOut->cd();\n    wspace.Write();\n\n    // Clean up\n    fOut->Close();\n    fOut->Delete();\n\n\n}\n

We will now discuss what the script is doing. First, the observable for the search is the missing energy, so we create a parameter to represent this observable.

   RooRealVar met(\"met\",\"E_{T}^{miss}\",xmin,xmax);\n

The following lines create a freely floating parameter for each of our bins (in this example, there are only 4 bins, defined for our observable met).

   RooRealVar bin1(\"bkg_SR_bin1\",\"Background yield in signal region, bin 1\",100,0,500);\n   RooRealVar bin2(\"bkg_SR_bin2\",\"Background yield in signal region, bin 2\",50,0,500);\n   RooRealVar bin3(\"bkg_SR_bin3\",\"Background yield in signal region, bin 3\",25,0,500);\n   RooRealVar bin4(\"bkg_SR_bin4\",\"Background yield in signal region, bin 4\",10,0,500);\n\n   RooArgList bkg_SR_bins;\n   bkg_SR_bins.add(bin1);\n   bkg_SR_bins.add(bin2);\n   bkg_SR_bins.add(bin3);\n   bkg_SR_bins.add(bin4);\n

They are put into a list so that we can create a RooParametricHist and its normalisation from that list.

  RooParametricHist p_bkg(\"bkg_SR\", \"Background PDF in signal region\",met,bkg_SR_bins,data_th1);\n\n  RooAddition p_bkg_norm(\"bkg_SR_norm\",\"Total Number of events from background in signal region\",bkg_SR_bins);\n

For the control region, the background process will be dependent on the yields of the background in the signal region via a transfer factor. The transfer factor TF must account for acceptance/efficiency differences, etc., between the signal region and the control region.

In this example we will assume the control region is populated by the same process, decaying to a different final state with twice the branching fraction of the one in the signal region.

We can imagine that the transfer factor is associated with some uncertainty - for example, a 1% uncertainty due to efficiency and a 2% uncertainty due to acceptance differences. We need to create nuisance parameters ourselves to model this, and give them a nominal value of 0.

   RooRealVar efficiency(\"efficiency\", \"efficiency nuisance parameter\",0);\n   RooRealVar acceptance(\"acceptance\", \"acceptance nuisance parameter\",0);\n

We need to make the transfer factor a function of these parameters, since variations in these uncertainties will lead to variations of the transfer factor. Here we have assumed Log-normal effects (i.e. the same as putting lnN in the CR datacard), but we could use any function to parameterize the effect - for example, if the systematic uncertainty is due to some alternate template, we could use polynomials.

   RooFormulaVar TF(\"TF\",\"Trasnfer factor\",\"2*TMath::Power(1.01,@0)*TMath::Power(1.02,@1)\",RooArgList(efficiency,acceptance) );\n

Then, we need to make each bin of the background in the control region a function of the background in the signal region and the transfer factor, i.e. \\(N_{CR} = N_{SR} \\times TF\\).

   RooFormulaVar CRbin1(\"bkg_CR_bin1\",\"Background yield in control region, bin 1\",\"@0*@1\",RooArgList(TF,bin1));\n   RooFormulaVar CRbin2(\"bkg_CR_bin2\",\"Background yield in control region, bin 2\",\"@0*@1\",RooArgList(TF,bin2));\n   RooFormulaVar CRbin3(\"bkg_CR_bin3\",\"Background yield in control region, bin 3\",\"@0*@1\",RooArgList(TF,bin3));\n   RooFormulaVar CRbin4(\"bkg_CR_bin4\",\"Background yield in control region, bin 4\",\"@0*@1\",RooArgList(TF,bin4));\n

As before, we also need to create the RooParametricHist for this process in the control region but this time the bin yields will be the RooFormulaVars we just created instead of freely floating parameters.

   RooArgList bkg_CR_bins;\n   bkg_CR_bins.add(CRbin1);\n   bkg_CR_bins.add(CRbin2);\n   bkg_CR_bins.add(CRbin3);\n   bkg_CR_bins.add(CRbin4);\n\n   RooParametricHist p_CRbkg(\"bkg_CR\", \"Background PDF in control region\",met,bkg_CR_bins,data_th1);\n   RooAddition p_CRbkg_norm(\"bkg_CR_norm\",\"Total Number of events from background in control region\",bkg_CR_bins);\n

Finally, we can also create alternative shape variations (Up/Down) that can be fed to Combine as we do with TH1 or RooDataHist type workspaces. These need to be of type RooDataHist. The example below is for a Jet Energy Scale type shape uncertainty.

   TH1F background_up(\"tbkg_CR_JESUp\",\"\",nbins,xbins);\n   background_up.SetBinContent(1,CRbin1.getVal()*1.01);\n   background_up.SetBinContent(2,CRbin2.getVal()*1.02);\n   background_up.SetBinContent(3,CRbin3.getVal()*1.03);\n   background_up.SetBinContent(4,CRbin4.getVal()*1.04);\n   RooDataHist bkg_CRhist_sysUp(\"bkg_CR_JESUp\",\"Bkg sys up\",vars,&background_up);\n   wspace.import(bkg_CRhist_sysUp);\n\n   TH1F background_down(\"bkg_CR_JESDown\",\"\",nbins,xbins);\n   background_down.SetBinContent(1,CRbin1.getVal()*0.90);\n   background_down.SetBinContent(2,CRbin2.getVal()*0.98);\n   background_down.SetBinContent(3,CRbin3.getVal()*0.97);\n   background_down.SetBinContent(4,CRbin4.getVal()*0.96);\n   RooDataHist bkg_CRhist_sysDown(\"bkg_CR_JESDown\",\"Bkg sys down\",vars,&background_down);\n   wspace.import(bkg_CRhist_sysDown);\n

Below are datacards (for signal and control regions) which can be used in conjunction with the workspace built above. In order to \"use\" the control region, simply combine the two cards as usual using combineCards.py.

Show Signal Region Datacard
Signal Region Datacard -- signal category\n\nimax * number of bins\njmax * number of processes minus 1\nkmax * number of nuisance parameters\n\n---------------------------------------------------------------------------\n\nshapes data_obs    signal param_ws.root wspace:data_obs_SR\nshapes background  signal param_ws.root wspace:bkg_SR    # the background model pdf which is freely floating, note other backgrounds can be added as usual\nshapes signal      signal param_ws.root wspace:signal\n\n---------------------------------------------------------------------------\n\nbin          signal\nobservation  -1\n\n---------------------------------------------------------------------------\n\n# background rate must be taken from _norm param x 1\n\nbin      signal      signal\nprocess  background  signal\nprocess  1           0\nrate     1           -1\n\n---------------------------------------------------------------------------\n\n# Normal uncertainties in the signal region\nlumi_8TeV  lnN  -  1.026\n\n# free floating parameters, we do not need to declare them, but it's a good idea to\nbkg_SR_bin1 flatParam\nbkg_SR_bin2 flatParam\nbkg_SR_bin3 flatParam\nbkg_SR_bin4 flatParam\n
Show Control Region Datacard
\nControl Region Datacard -- control category\n\nimax * number of bins\njmax * number of processes minus 1\nkmax * number of nuisance parameters\n\n---------------------------------------------------------------------------\n\nshapes data_obs    control param_ws.root wspace:data_obs_CR\nshapes background  control param_ws.root wspace:bkg_CR wspace:bkg_CR_$SYSTEMATIC    # the background model pdf which is dependent on that in the SR, note other backgrounds can be added as usual\n\n---------------------------------------------------------------------------\n\nbin          control\nobservation  -1\n\n---------------------------------------------------------------------------\n\n# background rate must be taken from _norm param x 1\n\nbin      control\nprocess  background\nprocess  1\nrate     1\n\n---------------------------------------------------------------------------\n\nJES         shape  1\nISR         shape  1\nefficiency  param  0 1\nacceptance  param  0 1\n

Note that for the control region, our nuisance parameters appear as param types, so that Combine will correctly constrain them.

If we combine the two cards and fit the result with -M MultiDimFit -v 3, we can see that the parameters that give the rate of background in each bin of the signal region, along with the nuisance parameters and the signal strength, are determined by the fit - i.e. we have properly included the constraint from the control region, just as with the 1-bin gmN.
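
A sketch of the corresponding commands, assuming the two datacards shown above have been saved as datacard_SR.txt and datacard_CR.txt (the file names here are just placeholders), could be:

    combineCards.py datacard_SR.txt datacard_CR.txt > datacard_combined.txt\n    text2workspace.py datacard_combined.txt -o datacard_combined.root\n    combine -M MultiDimFit datacard_combined.root -v 3\n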

\nacceptance = 0.00374312 +/- 0.964632 (limited)\nbkg_SR_bin1 = 99.9922 +/- 5.92062 (limited)\nbkg_SR_bin2 = 49.9951 +/- 4.13535 (limited)\nbkg_SR_bin3 = 24.9915 +/- 2.9267 (limited)\nbkg_SR_bin4 = 9.96478 +/- 2.1348 (limited)\nefficiency = 0.00109195 +/- 0.979334 (limited)\nlumi_8TeV = -0.0025911 +/- 0.994458\nr = 0.00716347 +/- 12.513 (limited)\n\n

The example given here is extremely basic. Note that additional complexity in the transfer factors, as well as additional uncertainties/backgrounds in the cards, is supported as usual.

Danger

If trying to implement parametric uncertainties in this setup (eg on transfer factors) that are correlated with other channels and implemented separately, you MUST normalize the uncertainty effect so that the datacard line can read param name X 1. That is, the uncertainty on this parameter must be 1. Without this, there will be inconsistency with other nuisances of the same name in other channels implemented as shape or lnN.

"},{"location":"part3/nonstandard/#look-elsewhere-effect-for-one-parameter","title":"Look-elsewhere effect for one parameter","text":"

In case you see an excess somewhere in your analysis, you can evaluate the look-elsewhere effect (LEE) of that excess. For an explanation of the LEE, take a look at the CMS Statistics Committee Twiki here.

To calculate the look-elsewhere effect for a single parameter (in this case the mass of the resonance), you can follow the instructions below. Note that these instructions assume you have a workspace that is parametric in your resonance mass \\(m\\), otherwise you need to fit each background toy with separate workspaces. We will assume the local significance for your excess is \\(\\sigma\\).

  • Generate background-only toys: combine ws.root -M GenerateOnly --toysFrequentist -m 16.5 -t 100 --saveToys --expectSignal=0. The output will be something like higgsCombineTest.GenerateOnly.mH16.5.123456.root.

  • For each toy, calculate the significance for a predefined range (e.g. \\(m\\in [10,35]\\) GeV) in steps suitable to the resolution (e.g. 1 GeV). For toy_1 the procedure would be: for i in $(seq 10 35); do combine ws.root -M Significance --redefineSignalPOI r --freezeParameters MH --setParameters MH=$i -n $i -D higgsCombineTest.GenerateOnly.mH16.5.123456.root:toys/toy_1; done. Then calculate the maximum significance over all of these mass points - call this \\(\\sigma_{max}\\). A consolidated sketch of this loop is given after this list.

  • Count how many toys have a maximum significance larger than the local one for your observed excess. This fraction of toys with \\(\\sigma_{max}>\\sigma\\) is the global p-value.
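
Putting the steps together, the per-toy scan can be scripted as in the following sketch, which assumes the toy file generated above, 100 toys and a scan range of 10 to 35 GeV in 1 GeV steps (the -n labels are just placeholders); the maximum significance per toy then still needs to be extracted from the resulting output trees:

    for t in $(seq 1 100); do\n      for i in $(seq 10 35); do\n        combine ws.root -M Significance --redefineSignalPOI r --freezeParameters MH --setParameters MH=$i -n _toy${t}_mH${i} -D higgsCombineTest.GenerateOnly.mH16.5.123456.root:toys/toy_${t}\n      done\n    done\n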

You can find more tutorials on the LEE here.

"},{"location":"part3/regularisation/","title":"Unfolding & regularization","text":"

This section details how to perform an unfolded cross-section measurement, including regularization, within Combine.

There are many resources available that describe unfolding, including when to use it (or not), and what the common issues surrounding it are. For CMS users, a useful summary is available in the CMS Statistics Committee pages on unfolding. You can also find an overview of unfolding and its usage in Combine in these slides.

The basic idea behind the unfolding technique is to describe the smearing introduced through the reconstruction (e.g. of the particle energy) in a given truth-level bin \\(x_{i}\\) through a linear relationship with the effects in the nearby truth bins. We can make statements about the probability \\(p_{j}\\) that an event falling in the truth bin \\(x_{i}\\) is reconstructed in the bin \\(y_{j}\\) via the linear relationship,

\\[ y_{obs} = \\tilde{\\boldsymbol{R}}\\cdot x_{true} + b \\]

or, if the truth bins are expressed relative to some particular model, we use the usual signal strength terminology,

\\[ y_{obs} = \\boldsymbol{R}\\cdot \\mu + b \\]

Unfolding aims to find the distribution at truth level \\(x\\), given the observations \\(y\\) at reco-level.

"},{"location":"part3/regularisation/#likelihood-based-unfolding","title":"Likelihood-based unfolding","text":"

Since Combine has access to the full likelihood for any analysis written in the usual datacard format, we will use likelihood-based unfolding throughout - for other approaches, there are many other tools available (eg RooUnfold or TUnfold), which can be used instead.

The benefits of the likelihood-based approach are that,

  • Background subtraction is accounted for directly in the likelihood
  • Systematic uncertainties are accounted for directly during the unfolding as nuisance parameters
  • We can profile the nuisance parameters during the unfolding to make the most of the data available

In practice, one must construct the response matrix and unroll it in the reconstructed bins:

  • First, one derives the truth distribution, e.g. after the generator-level selection only, \\(x_{i}\\).
  • Each reconstructed bin (e.g. each datacard) should describe the contribution from each truth bin - this is how Combine knows about the response matrix \\(\\boldsymbol{R}\\) and folds in the acceptance/efficiency effects as usual.
  • The out-of-acceptance contributions can also be included in the above.

The model we use for this is then just the usual PhysicsModel:multiSignalModel, where each signal refers to a particular truth level bin. The results can be extracted through a simple maximum-likelihood fit with,

    text2workspace.py -m 125 --X-allow-no-background -o datacard.root datacard.txt\n       -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel --PO map='.*GenBin0.*:r_Bin0[1,-1,20]' --PO map='.*GenBin1.*:r_Bin1[1,-1,20]' --PO map='.*GenBin2.*:r_Bin2[1,-1,20]' --PO map='.*GenBin3.*:r_Bin3[1,-1,20]' --PO map='.*GenBin4.*:r_Bin4[1,-1,20]'\n\n    combine -M MultiDimFit --setParameters=r_Bin0=1,r_Bin1=1,r_Bin2=1,r_Bin3=1,r_Bin4=1 -t -1 -m 125 datacard.root\n    combine -M MultiDimFit --setParameters=r_Bin0=1,r_Bin1=1,r_Bin2=1,r_Bin3=1,r_Bin4=1 -t -1 -m 125 --algo=grid --points=100 -P r_Bin1 --setParameterRanges r_Bin1=0.5,1.5 --floatOtherPOIs=1 datacard.root\n

Notice that one can also perform the so-called bin-by-bin unfolding (though it is strongly discouraged, except for testing) with,

    text2workspace.py -m 125 --X-allow-no-background -o datacard.root datacard.txt\n      -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel --PO map='.*RecoBin0.*:r_Bin0[1,-1,20]' --PO map='.*RecoBin1.*:r_Bin1[1,-1,20]' --PO map='.*RecoBin2.*:r_Bin2[1,-1,20]' --PO map='.*RecoBin3.*:r_Bin3[1,-1,20]' --PO map='.*RecoBin4.*:r_Bin4[1,-1,20]'\n

Nuisance parameters can be added to the likelihood function and profiled in the usual way via the datacards. Theory uncertainties on the inclusive cross section are typically not included in unfolded measurements.

The figure below shows a comparison of likelihood-based unfolding and a least-squares based unfolding as implemented in RooUnfold.

Show comparison

"},{"location":"part3/regularisation/#regularization","title":"Regularization","text":"

The main difference with respect to other models with multiple signal contributions is the introduction of Regularization, which is used to stabilize the unfolding process.

An example of unfolding in Combine with and without regularization, can be found under data/tutorials/regularization.

Running python createWs.py [-r] will create a simple datacard and perform a fit both with and without including regularization.

The simplest way to introduce regularization in the likelihood-based approach is to add to the likelihood function a penalty term that depends on the values of the truth bins (so-called Tikhonov regularization):

\\[ -2\\ln L \\to -2\\ln L + P(\\vec{x}) \\]

Here, \\(P\\) is a penalty term built from a linear operator acting on the truth bins. There are two different approaches that are supported to construct \\(P\\). If you run python makeModel.py, you will create a more complex datacard with the two regularization schemes implemented. You will need to uncomment the relevant sections of code to activate SVD or TUnfold-type regularization.

Warning

When using any unfolding method with regularization, you must perform studies of the potential bias/coverage properties introduced through the

inclusion of regularization, and how strong the associated regularization is. Advice on this can be found in the CMS Statistics Committee pages.

"},{"location":"part3/regularisation/#singular-value-decomposition-svd","title":"Singular Value Decomposition (SVD)","text":"

In the SVD approach - as described in the SVD paper - the penalty term is constructed directly based on the strengths (\\(\\vec{\\mu}=\\{\\mu_{i}\\}_{i=1}^{N}\\)),

\\[ P = \\tau\\left| A\\cdot \\vec{\\mu} \\right|^{2}, \\]

where \\(A\\) is typically the discrete curvature matrix, with

\\[ A = \\begin{bmatrix} 1 & -1 & ... \\\\ 1 & -2 & 1 & ... \\\\ ... \\end{bmatrix} \\]

Penalty terms on the derivatives can also be included. Such a penalty term is introduced by modifying the likelihood to include one constraint for each row of the product \\(A\\cdot\\vec{\\mu}\\), written as lines in the datacard of the form,

    name constr formula dependents delta\n

where the regularization strength is \\(\\delta=\\frac{1}{\\sqrt{\\tau}}\\), which can be given either as a fixed value (e.g. by directly putting 0.01) or as a modifiable parameter (e.g. delta[0.01]).

For example, for 3 bins and a regularization strength of 0.03, the first line would be

    name constr @0-2*@1+@2 r_Bin0,r_Bin1,r_Bin2 0.03\n

Alternative valid syntaxes are

    constr1 constr r_bin0-r_bin1 0.01\n    constr1 constr r_bin0-r_bin1 delta[0.01]\n    constr1 constr r_bin0+r_bin1 r_bin0,r_bin1 0.01\n    constr1 constr r_bin0+r_bin1 {r_bin0,r_bin1} delta[0.01]\n

The figure below shows an example unfolding using the \"SVD regularization\" approach with the least squares method (as implemented by RooUnfold) and implemented as a penalty term added to the likelihood using the maximum likelihood approach in Combine.

Show comparison

"},{"location":"part3/regularisation/#tunfold-method","title":"TUnfold method","text":"

The Tikhonov regularization as implemented in TUnfold uses the MC information, or rather the density prediction, as a bias vector. In order to give this information to Combine, a single datacard for each reconstruction-level bin needs to be produced, so that we have access to the proper normalization terms during the minimization. In this case the bias vector is \\(\\vec{x}_{obs}-\\vec{x}_{true}\\).

Then one can write a constraint term in the datacard via, for example,

    constr1 constr (r_Bin0-1.)*(shapeSig_GenBin0_RecoBin0__norm+shapeSig_GenBin0_RecoBin1__norm+shapeSig_GenBin0_RecoBin2__norm+shapeSig_GenBin0_RecoBin3__norm+shapeSig_GenBin0_RecoBin4__norm)+(r_Bin2-1.)*(shapeSig_GenBin2_RecoBin0__norm+shapeSig_GenBin2_RecoBin1__norm+shapeSig_GenBin2_RecoBin2__norm+shapeSig_GenBin2_RecoBin3__norm+shapeSig_GenBin2_RecoBin4__norm)-2*(r_Bin1-1.)*(shapeSig_GenBin1_RecoBin0__norm+shapeSig_GenBin1_RecoBin1__norm+shapeSig_GenBin1_RecoBin2__norm+shapeSig_GenBin1_RecoBin3__norm+shapeSig_GenBin1_RecoBin4__norm) {r_Bin0,r_Bin1,r_Bin2,shapeSig_GenBin1_RecoBin0__norm,shapeSig_GenBin0_RecoBin0__norm,shapeSig_GenBin2_RecoBin0__norm,shapeSig_GenBin1_RecoBin1__norm,shapeSig_GenBin0_RecoBin1__norm,shapeSig_GenBin2_RecoBin1__norm,shapeSig_GenBin1_RecoBin2__norm,shapeSig_GenBin0_RecoBin2__norm,shapeSig_GenBin2_RecoBin2__norm,shapeSig_GenBin1_RecoBin3__norm,shapeSig_GenBin0_RecoBin3__norm,shapeSig_GenBin2_RecoBin3__norm,shapeSig_GenBin1_RecoBin4__norm,shapeSig_GenBin0_RecoBin4__norm,shapeSig_GenBin2_RecoBin4__norm} delta[0.03]\n
"},{"location":"part3/runningthetool/","title":"How to run the tool","text":"

The executable Combine provided by the package is used to invoke the tools via the command line. The statistical analysis method, as well as user settings, are also specified on the command line. To see the full list of available options, you can run:

combine --help\n

The option -M is used to choose the statistical evaluation method. There are several groups of statistical methods:

  • Asymptotic likelihood methods:
    • AsymptoticLimits: limits calculated according to the asymptotic formulae in arxiv:1007.1727.
    • Significance: simple profile likelihood approximation, for calculating significances.
  • Bayesian methods:
    • BayesianSimple: performing a classical numerical integration (for simple models only).
    • MarkovChainMC: performing Markov Chain integration, for arbitrarily complex models.
  • Frequentist or hybrid bayesian-frequentist methods:
    • HybridNew: compute modified frequentist limits, significance/p-values and confidence intervals according to several possible prescriptions with toys.
  • Fitting
    • FitDiagnostics: performs maximum likelihood fits to extract the signal rate, and provides diagnostic tools such as pre- and post-fit figures and correlations
    • MultiDimFit: performs maximum likelihood fits and likelihood scans with an arbitrary number of parameters of interest.
  • Miscellaneous other modules that do not compute limits or confidence intervals, but use the same framework:
    • GoodnessOfFit: perform a goodness of fit test for models including shape information. Several GoF tests are implemented.
    • ChannelConsistencyCheck: study the consistency between individual channels in a combination.
    • GenerateOnly: generate random or Asimov toy datasets for use as input to other methods.

The command help is organized into five parts:

  • The Main options section indicates how to pass the datacard as input to the tool (-d datacardName), how to choose the statistical method (-M MethodName), and how to set the verbosity level (-v).
  • Under Common statistics options, options common to different statistical methods are given. Examples are --cl, to specify the confidence level (default is 0.95), or -t, to give the number of toy MC extractions required.
  • The Common input-output options section includes, for example, the options to specify the mass hypothesis under study (-m) or to include a specific string in the output filename (--name).
  • Common miscellaneous options.
  • Further method-specific options are available for each method. By passing the method name via the -M option, along with --help, the options for that specific method are shown in addition to the common options.

Not all the available options are discussed in this online documentation; use --help to get the documentation of all options.

"},{"location":"part3/runningthetool/#common-command-line-options","title":"Common command-line options","text":"

There are a number of useful command-line options that can be used to alter the model (or parameters of the model) at run time. The most commonly used, generic options, are:

  • -H: first run a different, faster, algorithm (e.g. the ProfileLikelihood described below) to obtain an approximate indication of the limit, which will allow the precise chosen algorithm to converge more quickly. We strongly recommend to use this option when using the MarkovChainMC, HybridNew or FeldmanCousins calculators, unless you know in which range your limit lies and you set this range manually (the default is [0, 20])

  • --rMax, --rMin: manually restrict the range of signal strengths to consider. For Bayesian limits with MCMC, a rule of thumb is that rMax should be 3-5 times the limit (a too small value of rMax will bias your limit towards low values, since you are restricting the integration range, while a too large value will bias you to higher limits)

  • --setParameters name=value[,name2=value2,...] sets the starting values of the parameters, useful e.g. when generating toy MC or when setting the parameters as fixed. This option supports the use of regular expressions by replacing name with rgx{some regular expression}.

  • --setParameterRanges name=min,max[:name2=min2,max2:...] sets the ranges of the parameters (useful e.g. for scans in MultiDimFit, or for Bayesian integration). This option supports the use of regular expressions by replacing name with rgx{some regular expression}.

  • --redefineSignalPOIs name[,name2,...] redefines the set of parameters of interest.

    • If the parameters were constant in the input workspace, they are set to be floating.
    • Nuisance parameters promoted to parameters of interest are removed from the list of nuisances, and thus they are not randomized in methods that randomize nuisances (e.g. HybridNew in non-frequentist mode, or BayesianToyMC, or in toy generation with -t but without --toysFreq). This does not have any impact on algorithms that do not randomize nuisance parameters (e.g. fits, AsymptoticLimits, or HybridNew in frequentist mode) or on algorithms that treat all parameters in the same way (e.g. MarkovChainMC).
    • Note that constraint terms for the nuisances are dropped after promotion to a POI using --redefineSignalPOIs. To produce a likelihood scan for a nuisance parameter using MultiDimFit with --algo grid, you should instead use the --parameters (-P) option, which will not cause the loss of the constraint term when scanning.
    • Parameters of interest of the input workspace that are not selected by this command become unconstrained nuisance parameters, but they are not added to the list of nuisances so they will not be randomized (see above).
  • --freezeParameters name1[,name2,...] Will freeze the parameters with the given names to their set values. This option supports the use of regular expression by replacing name with rgx{some regular expression} for matching to constrained nuisance parameters or var{some regular expression} for matching to any parameter. For example --freezeParameters rgx{CMS_scale_j.*} will freeze all constrained nuisance parameters with the prefix CMS_scale_j, while --freezeParameters var{.*rate_scale} will freeze any parameter (constrained nuisance parameter or otherwise) with the suffix rate_scale.

    • Use the option --freezeParameters allConstrainedNuisances to freeze all nuisance parameters that have a constraint term (i.e not flatParams or rateParams or other freely floating parameters).
    • Similarly, the option --floatParameters name1[,name2,...] sets the parameter(s) floating and also accepts regular expressions.
    • Groups of nuisance parameters (constrained or otherwise), as defined in the datacard, can be frozen using --freezeNuisanceGroups. You can also freeze all nuisances that are not contained in a particular group using a ^ before the group name (--freezeNuisanceGroups=^group_name will freeze everything except nuisance parameters in the group \"group_name\".)
    • All constrained nuisance parameters (not flatParam or rateParam) can be set floating using --floatAllNuisances.

Warning

Note that the floating/freezing options have a priority ordering from lowest to highest as floatParameters < freezeParameters < freezeNuisanceGroups < floatAllNuisances. Options with higher priority will take precedence over those with lower priority.

  • --trackParameters name1[,name2,...] will add a branch to the output tree for each of the named parameters. This option supports the use of regular expressions by replacing name with rgx{some regular expression}

    • The name of the branch will be trackedParam_name.
    • The exact behaviour depends on the method used. For example, when using MultiDimFit with --algo scan, the value of the parameter at each point in the scan will be saved, while for FitDiagnostics, only the value at the end of the fit will be saved.
  • --trackErrors name1[,name2,...] will add a branch to the output tree for the error of each of the named parameters. This option supports the use of regular expressions by replacing name with rgx{some regular expression}

    • The name of the branch will be trackedError_name.
    • The behaviour, in terms of which values are saved, is the same as --trackParameters above.

By default, the data set used by Combine will be the one listed in the datacard. You can tell Combine to use a different data set (for example a toy data set that you generated) by using the option --dataset. The argument should be rootfile.root:workspace:location or rootfile.root:location. In order to use this option, you must first convert your datacard to a binary workspace and use this binary workspace as the input to Combine.
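
For example, a minimal sketch of reusing a toy data set generated with -M GenerateOnly (assuming the default output file name and a binary workspace called workspace.root, both placeholders) could be:

combine workspace.root -M FitDiagnostics --dataset higgsCombineTest.GenerateOnly.mH120.123456.root:toys/toy_1\n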

"},{"location":"part3/runningthetool/#generic-minimizer-options","title":"Generic Minimizer Options","text":"

Combine uses its own minimizer class, which is used to steer Minuit (via RooMinimizer), named the CascadeMinimizer. This allows for sequential minimization, which can help in case a particular setting or algorithm fails. The CascadeMinimizer also knows about extra features of Combine such as discrete nuisance parameters.

All of the fits that are performed in Combine's methods use this minimizer. This means that the fits can be tuned using these common options,

  • --cminPoiOnlyFit: First, perform a fit floating only the parameters of interest. This can be useful to find, roughly, where the global minimum is.
  • --cminPreScan: Do a scan before the first minimization.
  • --cminPreFit arg: If set to a value N > 0, the minimizer will perform a pre-fit with strategy (N-1), with the nuisance parameters frozen.
    • --cminApproxPreFitTolerance arg: If non-zero, first do a pre-fit with this tolerance (or 10 times the final tolerance, whichever is largest)
    • --cminApproxPreFitStrategy arg: Strategy to use in the pre-fit. The default is strategy 0.
  • --cminDefaultMinimizerType arg: Set the default minimizer type. By default this is set to Minuit2.
  • --cminDefaultMinimizerAlgo arg: Set the default minimizer algorithm. The default algorithm is Migrad.
  • --cminDefaultMinimizerTolerance arg: Set the default minimizer tolerance, the default is 0.1.
  • --cminDefaultMinimizerStrategy arg: Set the default minimizer strategy between 0 (speed), 1 (balance - default), 2 (robustness). The Minuit documentation for this is pretty sparse but in general, 0 means evaluate the function less often, while 2 will waste function calls to get precise answers. An important note is that the Hesse algorithm (for error and correlation estimation) will be run only if the strategy is 1 or 2.
  • --cminFallbackAlgo arg: Provides a list of fallback algorithms, to be used in case the default minimizer fails. You can provide multiple options using the syntax Type[,algo],strategy[:tolerance]: eg --cminFallbackAlgo Minuit2,Simplex,0:0.1 will fall back to the simplex algorithm of Minuit2 with strategy 0 and a tolerance 0.1, while --cminFallbackAlgo Minuit2,1 will use the default algorithm (Migrad) of Minuit2 with strategy 1.
  • --cminSetZeroPoint (0/1): Set the reference of the NLL to 0 when minimizing, this can help faster convergence to the minimum if the NLL itself is large. The default is true (1), set to 0 to turn off.

The allowed combinations of minimizer types and minimizer algorithms are as follows:

  • Minuit: Migrad, Simplex, Combined, Scan
  • Minuit2: Migrad, Simplex, Combined, Scan
  • GSLMultiMin: ConjugateFR, ConjugatePR, BFGS, BFGS2, SteepestDescent

You can find details about these in the Minuit2 documentation here.

More of these options can be found in the Cascade Minimizer options section when running --help.

"},{"location":"part3/runningthetool/#output-from-combine","title":"Output from combine","text":"

Most methods will print the results of the computation to the screen. However, in addition, Combine will also produce a root file containing a tree called limit with these results. The name of this file will be of the format,

higgsCombineTest.MethodName.mH$MASS.[word$WORD].root\n

where $WORD is any user defined keyword from the datacard which has been set to a particular value.

A few command-line options can be used to control this output:

  • The option -n allows you to specify part of the name of the root file. e.g. if you pass -n HWW the root file will be called higgsCombineHWW.... instead of higgsCombineTest
  • The option -m allows you to specify the (Higgs boson) mass hypothesis, which gets written in the filename and in the output tree. This simplifies the bookkeeping, as it becomes possible to merge multiple trees corresponding to different (Higgs boson) masses using hadd. Quantities can then be plotted as a function of the mass. The default value is m=120.
  • The option -s can be used to specify the seed (eg -s 12345) used in toy generation. If this option is given, the name of the file will be extended by this seed, eg higgsCombineTest.AsymptoticLimits.mH120.12345.root
  • The option --keyword-value allows you to specify the value of a keyword in the datacard such that $WORD (in the datacard) will be given the value of VALUE in the command --keyword-value WORD=VALUE, eg higgsCombineTest.AsymptoticLimits.mH120.WORDVALUE.12345.root

The output file will contain a TDirectory named toys, which will be empty if no toys are generated (see below for details) and a TTree called limit with the following branches:

  • limit (Double_t): main result of the Combine run, with method-dependent meaning
  • limitErr (Double_t): estimated uncertainty on the result
  • mh (Double_t): value of MH, specified with the -m option
  • iToy (Int_t): toy number identifier if running with -t
  • iSeed (Int_t): seed specified with -s
  • t_cpu (Float_t): estimated CPU time for the algorithm
  • t_real (Float_t): estimated real time for the algorithm
  • quantileExpected (Float_t): quantile identifier for methods that calculate expected (quantiles) and observed results (eg conversions from \\(\\Delta\\ln L\\) values), with method-dependent meaning. Negative values are reserved for entries that do not relate to quantiles of a calculation, with the default being -1 (usually meaning the observed result)

The value of any user-defined keyword $WORD that is set using keyword-value described above will also be included as a branch with type string named WORD. The option can be repeated multiple times for multiple keywords.

In some cases, the precise meanings of the branches will depend on the method being used. In this case, it will be specified in this documentation.
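
As a quick way to inspect these branches interactively, you can open the file in ROOT and scan the tree. This is a sketch assuming an AsymptoticLimits output file with the default name:

root -l higgsCombineTest.AsymptoticLimits.mH120.root\nlimit->Scan(\"limit:limitErr:quantileExpected\")\n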

"},{"location":"part3/runningthetool/#toy-data-generation","title":"Toy data generation","text":"

By default, each of the methods described so far will be run using the observed data as the input. In several cases (as detailed below), it is useful to run the tool using toy datasets, including Asimov data sets.

The option -t is used to tell Combine to first generate one or more toy data sets, which will be used instead of the observed data. There are two versions,

  • -t N with N > 0. Combine will generate N toy datasets from the model and re-run the method once per toy. The seed for the toy generation can be modified with the option -s (use -s -1 for a random seed). The output file will contain one entry in the tree for each of these toys.

  • -t -1 will produce an Asimov data set, in which statistical fluctuations are suppressed. The procedure for generating this Asimov data set depends on the type of analysis you are using. More details are given below.

Warning

The default values of the nuisance parameters (or any parameter) are used to generate the toy. This means that if, for example, you are using parametric shapes and the parameters inside the workspace are set to arbitrary values, those arbitrary values will be used to generate the toy. This behaviour can be modified through the use of the option --setParameters x=value_x,y=value_y..., which will set the values of the parameters (x and y) before toy generation. You can also load a snapshot from a previous fit to set the nuisance parameters to their post-fit values (see below).

The output file will contain the toys (as RooDataSets for the observables, including global observables) in the toys directory if the option --saveToys is provided. If you include this option, the limit TTree in the output will have an entry corresponding to the state of the POI used for the generation of the toy, with the value of quantileExpected set to -2.

The branches that are created by methods like MultiDimFit will not show the values used to generate the toy. If you also want the TTree to show the values of the POIs used to generate the toy, you should add additional branches using the --trackParameters option as described in the common command-line options section above. These branches will behave as expected when adding the option --saveToys.

Warning

For statistical methods that make use of toys (including HybridNew, MarkovChainMC and running with -t N), the results of repeated Combine commands will not be identical when using the datacard as the input. This is due to a feature in the tool that allows one to run concurrent commands that do not interfere with one another. In order to produce reproducible results with toy-based methods, you should first convert the datacard to a binary workspace using text2workspace.py and then use the resulting file as input to the Combine commands
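
For example, a reproducible toy-based run could first convert the datacard and then pass the resulting workspace together with a fixed seed; this is just a sketch with placeholder file names:

text2workspace.py datacard.txt -o workspace.root\ncombine workspace.root -M FitDiagnostics -t 50 -s 12345\n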

"},{"location":"part3/runningthetool/#asimov-datasets","title":"Asimov datasets","text":"

If you are using either -t -1 or AsymptoticLimits, Combine will calculate results based on an Asimov data set.

  • For counting experiments, the Asimov data set will just be the total number of expected events (given the values of the nuisance parameters and POIs of the model)

  • For shape analyses with templates, the Asimov data set will be constructed as a histogram using the same binning that is defined for your analysis.

  • If your model uses parametric shapes, there are some options as to what Asimov data set to produce. By default, Combine will produce the Asimov data set as a histogram using the binning that is associated with each observable (ie as set using RooRealVar::setBins). If this binning does not exist, Combine will guess a suitable binning - it is therefore best to use RooRealVar::setBins to associate a binning with each observable, even if your data is unbinned, if you intend to use Asimov data sets.

You can also ask Combine to use a Pseudo-Asimov dataset, which is created from many weighted unbinned events.

Setting --X-rtd TMCSO_AdaptivePseudoAsimov=\\(\\beta\\) with \\(\\beta>0\\) will trigger the internal logic of whether to produce a Pseudo-Asimov dataset. This logic is as follows;

  1. For each observable in your dataset, the number of bins, \\(n_{b}\\) is determined either from the value of RooRealVar::getBins, if it exists, or assumed to be 100.

  2. If \\(N_{b}=\\prod_{b}n_{b}>5000\\), the number of expected events \\(N_{ev}\\) is determined. Note if you are combining multiple channels, \\(N_{ev}\\) refers to the number of expected events in a single channel. The logic is separate for each channel. If \\(N_{ev}/N_{b}<0.01\\) then a Pseudo-Asimov data set is created with the number of events equal to \\(\\beta \\cdot \\mathrm{max}\\{100*N_{ev},1000\\}\\). If \\(N_{ev}/N_{b}\\geq 0.01\\) , then a normal Asimov data set is produced.

  3. If \\(N_{b}\\leq 5000\\) then a normal Asimov data set will be produced

The production of a Pseudo-Asimov data set can be forced by using the option --X-rtd TMCSO_PseudoAsimov=X where X>0 will determine the number of weighted events for the Pseudo-Asimov data set. You should try different values of X, since larger values lead to more events in the Pseudo-Asimov data set, resulting in higher precision. However, in general, the fit will be slower.
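
For example, to force a Pseudo-Asimov data set with 2000 weighted events (a sketch with a placeholder workspace name):

combine workspace.root -M MultiDimFit -t -1 --expectSignal=1 --X-rtd TMCSO_PseudoAsimov=2000\n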

You can turn off the internal logic by setting --X-rtd TMCSO_AdaptivePseudoAsimov=0 --X-rtd TMCSO_PseudoAsimov=0, thereby forcing histograms to be generated.

Info

If you set --X-rtd TMCSO_PseudoAsimov=X with X>0 and also turn on --X-rtd TMCSO_AdaptivePseudoAsimov=\\(\\beta\\), with \\(\\beta>0\\), the internal logic will be used, but this time the default will be to generate Pseudo-Asimov data sets, rather than the standard Asimov ones.

"},{"location":"part3/runningthetool/#nuisance-parameter-generation","title":"Nuisance parameter generation","text":"

The default method of handling systematics is to generate random values for the nuisance parameters, according to their prior PDFs centred around their default values (see above), before generating the data. The unconstrained nuisance parameters (eg flatParam or rateParam), or those with flat priors, are not randomized before the data generation. If you wish to also randomize these parameters, you must declare them as flatParam in your datacard and, when running text2workspace, you must add the option --X-assign-flatParam-prior to the command line.

The following options define how the toys will be generated,

  • --toysNoSystematics the nuisance parameters in each toy are not randomized when generating the toy data sets - i.e their nominal values are used to generate the data. Note that for methods which profile (fit) the nuisances, the parameters are still floating when evaluating the likelihood.

  • --toysFrequentist the nuisance parameters in each toy are set to their nominal values which are obtained after first fitting to the observed data, with the POIs fixed, before generating the toy data sets. For evaluating likelihoods, the constraint terms are instead randomized within their PDFs around the post-fit nuisance parameter values.

If you are using toysFrequentist, be aware that the values set by --setParameters will be ignored for the toy generation as the post-fit values will instead be used (except for any parameter that is also a parameter of interest). You can override this behaviour and choose the nominal values for toy generation for any parameter by adding the option --bypassFrequentistFit, which will skip the initial fit to data, or by loading a snapshot (see below).

Warning

For methods such as AsymptoticLimits and HybridNew --LHCmode LHC-limits, the \"nominal\" nuisance parameter values are taken from fits to the data and are, therefore, not \"blind\" to the observed data by default (following the fully frequentist paradigm). See the detailed documentation on these methods for how to run in fully \"blinded\" mode.

"},{"location":"part3/runningthetool/#generate-only","title":"Generate only","text":"

It is also possible to generate the toys first, and then feed them to the methods in Combine. This can be done using -M GenerateOnly --saveToys. The toys can then be read and used with the other methods by specifying --toysFile=higgsCombineTest.GenerateOnly... and using the same options for the toy generation.
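
For example, the following sketch (with placeholder file names) generates 100 toys and then reuses them with FitDiagnostics; note that the same -t and -s values are passed to both steps:

combine workspace.root -M GenerateOnly -t 100 -s 12345 --saveToys\ncombine workspace.root -M FitDiagnostics -t 100 -s 12345 --toysFile=higgsCombineTest.GenerateOnly.mH120.12345.root\n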

Warning

Some methods also use toys within the method itself (eg AsymptoticLimits and HybridNew). For these, you should not specify the toy generation with -t or the options above. Instead, you should follow the method-specific instructions.

"},{"location":"part3/runningthetool/#loading-snapshots","title":"Loading snapshots","text":"

Snapshots from workspaces can be loaded and used in order to generate toys using the option --snapshotName <name of snapshot>. This will first set the parameters to the values in the snapshot, before any other parameter options are set and toys are generated.

See the section on saving post-fit workspaces for creating workspaces with post-fit snapshots from MultiDimFit.

Here are a few examples of calculations with toys from post-fit workspaces using a workspace with \\(r, m_{H}\\) as parameters of interest

  • Throw post-fit toy with b from s+b(floating \\(r,m_{H}\\)) fit, s with r=1.0, m=best fit MH, using nuisance parameter values and constraints re-centered on s+b(floating \\(r,m_{H}\\)) fit values (aka frequentist post-fit expected) and compute post-fit expected r uncertainty profiling MH combine higgsCombinemumhfit.MultiDimFit.mH125.root --snapshotName MultiDimFit -M MultiDimFit --verbose 9 -n randomtest --toysFrequentist --bypassFrequentistFit -t -1 --expectSignal=1 -P r --floatOtherPOIs=1 --algo singles

  • Throw post-fit toy with b from s+b(floating \\(r,m_{H}\\)) fit, s with r=1.0, m=128.0, using nuisance parameter values and constraints re-centered on s+b(floating \\(r,m_{H}\\)) fit values (aka frequentist post-fit expected) and compute post-fit expected significance (with MH fixed at 128 implicitly) combine higgsCombinemumhfit.MultiDimFit.mH125.root -m 128 --snapshotName MultiDimFit -M ProfileLikelihood --significance --verbose 9 -n randomtest --toysFrequentist --bypassFrequentistFit --overrideSnapshotMass -t -1 --expectSignal=1 --redefineSignalPOIs r --freezeParameters MH

  • Throw post-fit toy with b from s+b(floating \(r,m_{H}\)) fit, s with r=0.0, using nuisance parameter values and constraints re-centered on s+b(floating \(r,m_{H}\)) fit values (aka frequentist post-fit expected) and compute post-fit expected and observed asymptotic limit (with MH fixed at 128 implicitly) combine higgsCombinemumhfit.MultiDimFit.mH125.root -m 128 --snapshotName MultiDimFit -M AsymptoticLimits --verbose 9 -n randomtest --bypassFrequentistFit --overrideSnapshotMass --redefineSignalPOIs r --freezeParameters MH

"},{"location":"part3/runningthetool/#combinetool-for-job-submission","title":"combineTool for job submission","text":"

For longer tasks that cannot be run locally, several methods in Combine can be split to run on a batch system or on the Grid. The splitting and submission is handled using the combineTool (see this getting started section to check out the tool)

"},{"location":"part3/runningthetool/#submission-to-condor","title":"Submission to Condor","text":"

The syntax for running on condor with the tool is

combineTool.py -M ALGO [options] --job-mode condor --sub-opts='CLASSADS' --task-name NAME [--dry-run]\n

with options being the usual list of Combine options. The help option -h will give a list of both Combine and combineTool options. It is possible to use this tool with several different methods from Combine.

The --sub-opts option takes a string with the different ClassAds that you want to set, separated by \\n as argument (e.g. '+JobFlavour=\"espresso\"\\nRequestCpus=1').

The --dry-run option will show what will be run without actually doing so / submitting the jobs.

For example, to generate toys (eg for use with limit setting) users running on lxplus at CERN can use the condor mode:

combineTool.py -d workspace.root -M HybridNew --LHCmode LHC-limits --clsAcc 0  -T 2000 -s -1 --singlePoint 0.2:2.0:0.05 --saveHybridResult -m 125 --job-mode condor --task-name condor-test --sub-opts='+JobFlavour=\"tomorrow\"'\n

The --singlePoint option is overridden, so that this will produce a script for each value of the POI in the range 0.2 to 2.0 in steps of 0.05. You can merge multiple points into a script using --merge - e.g. adding --merge 10 to the above command will mean that each job contains at most 10 of the values, as in the sketch below. The scripts are labelled by the --task-name option. They will be submitted directly to condor, adding any options in --sub-opts to the condor submit script. Make sure multiple options are separated by \n. The jobs will run and produce output in the current directory.
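
For instance, the command above with --merge 10 added (a sketch; all other options are unchanged):

combineTool.py -d workspace.root -M HybridNew --LHCmode LHC-limits --clsAcc 0  -T 2000 -s -1 --singlePoint 0.2:2.0:0.05 --saveHybridResult -m 125 --job-mode condor --task-name condor-test --merge 10 --sub-opts='+JobFlavour=\"tomorrow\"'\n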

Below is an example for splitting points in a multi-dimensional likelihood scan.

"},{"location":"part3/runningthetool/#splitting-jobs-for-a-multi-dimensional-likelihood-scan","title":"Splitting jobs for a multi-dimensional likelihood scan","text":"

The option --split-points splits the jobs for MultiDimFit when using --algo grid. The following example will split the jobs such that there are 10 points in each job, which will be submitted to the workday queue.

combineTool.py datacard.txt -M MultiDimFit --algo grid --points 50 --rMin 0 --rMax 1 --job-mode condor --split-points 10 --sub-opts='+JobFlavour=\"workday\"' --task-name mytask -n mytask\n

Remember, any usual options (such as redefining POIs or freezing parameters) are passed to Combine and can be added to the command line for combineTool.

Info

The option -n NAME should be included to avoid overwriting output files, as the jobs will be run inside the directory from which the command is issued.

"},{"location":"part3/runningthetool/#grid-submission-with-combinetool","title":"Grid submission with combineTool","text":"

For more CPU-intensive tasks, for example determining limits for complex models using toys, it is generally not feasible to compute all the results interactively. Instead, these jobs can be submitted to the Grid.

In this example we will use the HybridNew method of Combine to determine an upper limit for a sub-channel of the Run 1 SM \\(H\\rightarrow\\tau\\tau\\) analysis. For full documentation, see the section on computing limits with toys.

With this model it would take too long to find the limit in one go, so instead we create a set of jobs in which each one throws toys and builds up the test statistic distributions for a fixed value of the signal strength. These jobs can then be submitted to a batch system or to the Grid using crab3. From the set of output distributions it is possible to extract the expected and observed limits.

For this we will use combineTool.py

First we need to build a workspace from the \\(H\\rightarrow\\tau\\tau\\) datacard,

$ text2workspace.py data/tutorials/htt/125/htt_mt.txt -m 125\n$ mv data/tutorials/htt/125/htt_mt.root ./\n

To get an idea of the range of signal strength values we will need to build test-statistic distributions for, we will first use the AsymptoticLimits method of Combine,

$ combine -M Asymptotic htt_mt.root -m 125\n << Combine >>\n[...]\n -- AsymptoticLimits (CLs) --\nObserved Limit: r < 1.7384\nExpected  2.5%: r < 0.4394\nExpected 16.0%: r < 0.5971\nExpected 50.0%: r < 0.8555\nExpected 84.0%: r < 1.2340\nExpected 97.5%: r < 1.7200\n

Based on this, a range of 0.2 to 2.0 should be suitable.

We can use the same command for generating the distribution of test statistics with combineTool. The --singlePoint option is now enhanced to support expressions that generate a set of calls to Combine with different values. The accepted syntax is of the form MIN:MAX:STEPSIZE, and multiple comma-separated expressions can be specified.

The script also adds an option --dry-run, which will not actually call Combine but just prints out the commands that would be run, e.g.,

combineTool.py -M HybridNew -d htt_mt.root --LHCmode LHC-limits --singlePoint 0.2:2.0:0.2 -T 2000 -s -1 --saveToys --saveHybridResult -m 125 --dry-run\n...\n[DRY-RUN]: combine -d htt_mt.root --LHCmode LHC-limits -T 2000 -s -1 --saveToys --saveHybridResult -M HybridNew -m 125 --singlePoint 0.2 -n .Test.POINT.0.2\n[DRY-RUN]: combine -d htt_mt.root --LHCmode LHC-limits -T 2000 -s -1 --saveToys --saveHybridResult -M HybridNew -m 125 --singlePoint 0.4 -n .Test.POINT.0.4\n[...]\n[DRY-RUN]: combine -d htt_mt.root --LHCmode LHC-limits -T 2000 -s -1 --saveToys --saveHybridResult -M HybridNew -m 125 --singlePoint 2.0 -n .Test.POINT.2.0\n

When the --dry-run option is removed each command will be run in sequence.
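
As an illustration of the comma-separated syntax, the sketch below scans a fine grid at low signal strengths and a coarser one at higher values (the chosen ranges and step sizes are arbitrary):

combineTool.py -M HybridNew -d htt_mt.root --LHCmode LHC-limits --singlePoint 0.2:1.0:0.1,1.2:2.0:0.2 -T 2000 -s -1 --saveToys --saveHybridResult -m 125 --dry-run\n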

"},{"location":"part3/runningthetool/#grid-submission-with-crab3","title":"Grid submission with crab3","text":"

Submission to the grid with crab3 works in a similar way. Before doing so, ensure that the crab3 environment has been sourced in addition to the CMSSW environment. We will use the example of generating a grid of test-statistic distributions for limits.

$ cmsenv; source /cvmfs/cms.cern.ch/crab3/crab.sh\n$ combineTool.py -d htt_mt.root -M HybridNew --LHCmode LHC-limits --clsAcc 0 -T 2000 -s -1 --singlePoint 0.2:2.0:0.05 --saveToys --saveHybridResult -m 125 --job-mode crab3 --task-name grid-test --custom-crab custom_crab.py\n

The option --custom-crab should point to a python file containing a function of the form custom_crab(config) that will be used to modify the default crab configuration. You can use this to set the output site to your local grid site, or modify other options such as the voRole, or the site blacklist/whitelist.

For example

def custom_crab(config):\n  print '>> Customising the crab config'\n  config.Site.storageSite = 'T2_CH_CERN'\n  config.Site.blacklist = ['SOME_SITE', 'SOME_OTHER_SITE']\n

Again it is possible to use the option --dry-run to see what the complete crab config will look like before actually submitting it.

Once submitted, the progress can be monitored using the standard crab commands. When all jobs are completed, copy the output from your site's storage element to the local output folder.

$ crab getoutput -d crab_grid-test\n# Now we have to un-tar the output files\n$ cd crab_grid-test/results/\n$ for f in *.tar; do tar xf $f; done\n$ mv higgsCombine*.root ../../\n$ cd ../../\n

These output files should be combined with hadd, after which we invoke Combine to calculate the observed and expected limits from the merged grid, as usual.
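
A sketch of these last steps, assuming the default output naming from the combineTool example above; the exact options for extracting limits from a merged grid (--readHybridResults, --grid) are described in the section on computing limits with toys:

# the wildcard assumes the default .Test.POINT naming shown earlier\nhadd -f merged.root higgsCombine.Test.POINT.*.root\ncombine htt_mt.root -M HybridNew --LHCmode LHC-limits -m 125 --readHybridResults --grid=merged.root\n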

"},{"location":"part3/simplifiedlikelihood/","title":"Procedure for creating and validating simplified likelihood inputs","text":"

This page is to give a brief outline for the creation of (potentially aggregated) predictions and their covariance to facilitate external reinterpretation using the simplified likelihood (SL) approach. Instructions for validating the simplified likelihood method (detailed in the CMS note here and \"The Simplified Likelihood Framework\" paper) are also given.

"},{"location":"part3/simplifiedlikelihood/#requirements","title":"Requirements","text":"

You need an up-to-date version of Combine. Note: you should use the latest release of Combine for the exact commands on this page. You should be using Combine tag v9.0.0 or higher, or the latest version of the 112x branch, to follow these instructions.

You will find the python scripts needed to convert Combine outputs into simplified likelihood inputs under test/simplifiedLikelihoods

If you're using the 102x branch (not recommended), then you can obtain these scripts from here by running:

curl -s https://raw.githubusercontent.com/nucleosynthesis/work-tools/master/sparse-checkout-SL-ssh.sh > checkoutSL.sh\nbash checkoutSL.sh\nls work-tools/stats-tools\n

If you also want to validate your inputs and perform fits/scans using them, you can use the package SLtools from The Simplified Likelihood Framework paper for this.

git clone https://gitlab.cern.ch/SimplifiedLikelihood/SLtools.git\n
"},{"location":"part3/simplifiedlikelihood/#producing-covariance-for-recasting","title":"Producing covariance for recasting","text":"

Producing the necessary predictions and covariance for recasting varies depending on whether or not control regions are explicitly included in the datacard when running fits. Instructions for cases where the control regions are and are not included are detailed below.

Warning

The instructions below will calculate moments based on the assumption that \(E[x]=\hat{x}\), i.e. they will use the maximum likelihood estimators for the yields as the expectation values. If instead you want to use the full definition of the moments, you can run the FitDiagnostics method with the -t option, including --savePredictionsPerToy and removing the other options; this will produce a tree of the toys in the output, from which the moments can be calculated.
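
A sketch of such a toy-based run (the workspace name is hypothetical and the number of toys is arbitrary):

# datacard.root is a hypothetical workspace\ncombine datacard.root -M FitDiagnostics -t 2000 --savePredictionsPerToy -n Name\n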

"},{"location":"part3/simplifiedlikelihood/#type-a-control-regions-included-in-datacard","title":"Type A - Control regions included in datacard","text":"

For an example datacard 'datacard.txt' including two signal channels 'Signal1' and 'Signal2', make the workspace including the masking flags

text2workspace.py --channel-masks --X-allow-no-signal --X-allow-no-background datacard.txt -o datacard.root\n

Run the fit that produces the covariance (output saved as fitDiagnosticsName.root), masking the signal channels. Note that all signal channels must be masked!

combine datacard.root -M FitDiagnostics --saveShapes --saveWithUnc --numToysForShape 2000 --setParameters mask_Signal1=1,mask_Signal2=1 --saveOverall  -n Name\n

Where \"Name\" can be specified by you.

Outputs, including the predictions and covariance, will be saved in the folder shapes_fit_b inside fitDiagnosticsName.root

"},{"location":"part3/simplifiedlikelihood/#type-b-control-regions-not-included-in-datacard","title":"Type B - Control regions not included in datacard","text":"

For an example datacard 'datacard.txt' including two signal channels 'Signal1' and 'Signal2', make the workspace

text2workspace.py --X-allow-no-signal --X-allow-no-background datacard.txt -o datacard.root\n

Run the fit making the covariance (output saved as fitDiagnosticsName.root), setting no pre-fit signal contribution. Note that we must set --preFitValue 0 in this case, since we will be using the pre-fit uncertainties for the covariance calculation and we do not want to include the uncertainties on the signal.

combine datacard.root -M FitDiagnostics --saveShapes --saveWithUnc --numToysForShape 2000 --saveOverall --preFitValue 0   -n Name\n

Where \"Name\" can be specified by you.

Outputs, including the predictions and covariance, will be saved in the folder shapes_prefit inside fitDiagnosticsName.root

In order to also extract the signal yields corresponding to r=1 (in case you want to run the validation step later), you need to produce a second file with the pre-fit value set to 1. For this you do not need to run many toys; to save time you can set --numToysForShape to a low value.

combine datacard.root -M FitDiagnostics --saveShapes --saveWithUnc --numToysForShape 1 --saveOverall --preFitValue 1   -n Name2\n

You should check that the order of the bins in the covariance matrix is as expected.

"},{"location":"part3/simplifiedlikelihood/#produce-simplified-likelihood-inputs","title":"Produce simplified likelihood inputs","text":"

Head over to the test/simplifiedLikelihoods directory inside your Combine area. The following instructions depend on whether you are aggregating or not aggregating your signal regions. Choose the instructions for your case.

"},{"location":"part3/simplifiedlikelihood/#not-aggregating","title":"Not Aggregating","text":"

Run the makeLHInputs.py script to prepare the inputs for the simplified likelihood. The filter flag can be used to select only signal regions based on the channel names. To include all channels do not include the filter flag.

The SL input must NOT include any control regions that were not masked in the fit.

If your analysis is Type B (i.e everything in the datacard is a signal region), then you can just run

python makeLHInputs.py -i fitDiagnosticsName.root -o SLinput.root \n

If necessary (i.e as in Type B analyses) you may also need to run the same on the output of the run where the pre-fit value was set to 1.

python makeLHInputs.py -i fitDiagnosticsName2.root -o SLinput2.root \n

If you instead have a Type A analysis (some of the regions are control regions that were used to fit but not masked) then you should add the option --filter SignalName where SignalName is some string that defines the signal regions in your datacards (for example, \"SR\" is a common name for these).

Note: If your signal regions cannot be easily identified by a string, follow the instructions below for aggregating, but define only one channel for each aggregate region. This will maintain the full information and will not actually aggregate any regions.

"},{"location":"part3/simplifiedlikelihood/#aggregating","title":"Aggregating","text":"

If aggregating based on covariance, edit the config file aggregateCFG.py to define aggregate regions based on channel names. Note that wildcards are supported. You can then make likelihood inputs using

python makeLHInputs.py -i fitDiagnosticsName.root -o SLinput.root --config aggregateCFG.py\n

At this point you have the inputs as ROOT files necessary to publish and run the simplified likelihood.

"},{"location":"part3/simplifiedlikelihood/#validating-the-simplified-likelihood-approach","title":"Validating the simplified likelihood approach","text":"

The simplified likelihood relies on several assumptions (detailed in the documentation at the top). To test the validity for your analysis, statistical results between Combine and the simplified likelihood can be compared.

We will use the package SLtools from the Simplified Likelihood Paper for this. The first step is to convert the ROOT files into python configs to run in the tool.

"},{"location":"part3/simplifiedlikelihood/#convert-root-to-python","title":"Convert ROOT to Python","text":"

If you followed the steps above, you have all of the histograms already necessary to generate the python configs. The script test/simplifiedLikelihoods/convertSLRootToPython.py can be used to do the conversion. Just provide the following options when running with python.

  • -O/--outname : The output python file containing the model (default is test.py)
  • -s/--signal : The signal histogram, should be of format file.root:location/to/histogram
  • -b/--background : The background histogram, should be of format file.root:location/to/histogram
  • -d/--data : The data TGraph, should be of format file.root:location/to/graph
  • -c/--covariance : The covariance TH2 histogram, should be of format file.root:location/to/histogram

For example, to get the correct output from a Type B analysis with no aggregating, you can run

python test/simplifiedLikelihoods/convertSLRootToPython.py -O mymodel.py -s SLinput.root:shapes_prefit/total_signal  -b SLinput.root:shapes_prefit/total_M1 -d SLinput.root:shapes_prefit/total_data -c SLinput.root:shapes_prefit/total_M2\n

The output will be a python file with the right format for the SL tool. You can mix different ROOT files for these inputs. Note that the SLtools package also has some tools to convert .yaml-based inputs into the python config for you.

"},{"location":"part3/simplifiedlikelihood/#run-a-likelihood-scan-with-the-sl","title":"Run a likelihood scan with the SL","text":"

If you have checked out the SLtools, you can create a simple python script as the one below to produce a scan of the simplified likelihood from your inputs.

#! /usr/bin/env python\nimport simplike as sl\n\nexec(open(\"mymodel.py\").read())\nslp1 = sl.SLParams(background, covariance, obs=data, sig=signal)\n\nimport numpy as np\nnpoints = 50\nmus = np.arange(-0.5, 2, (2+0.5)/npoints)\ntmus1 = [slp1.tmu(mu) for mu in mus]\nfrom matplotlib import pyplot as plt\nplt.plot(mus,tmus1)\nplt.show()\n

Where the mymodel.py config is a simple python file defined as:

  • data : A python array of observed data, one entry per bin.
  • background : A python array of expected background, one entry per bin.
  • covariance : A python array of the covariance between expected backgrounds. The format is a flat array which is converted into a 2D array inside the tool
  • signal : A python array of the expected signal, one entry per bin. This should be replaced with whichever signal model you are testing.

This model.py can also just be the output of the previous section converted from the ROOT files for you.

The example below is from the note CMS-NOTE-2017-001

Show example
\nimport numpy\nimport array\n\nname = \"CMS-NOTE-2017-001 dummy model\"\nnbins = 8\ndata = array.array('d',[1964,877,354,182,82,36,15,11])\nbackground = array.array('d',[2006.4,836.4,350.,147.1,62.0,26.2,11.1,4.7])\nsignal = array.array('d',[47,29.4,21.1,14.3,9.4,7.1,4.7,4.3])\ncovariance = array.array('d', [ 18774.2, -2866.97, -5807.3, -4460.52, -2777.25, -1572.97, -846.653, -442.531, -2866.97, 496.273, 900.195, 667.591, 403.92, 222.614, 116.779, 59.5958, -5807.3, 900.195, 1799.56, 1376.77, 854.448, 482.435, 258.92, 134.975, -4460.52, 667.591, 1376.77, 1063.03, 664.527, 377.714, 203.967, 106.926, -2777.25, 403.92, 854.448, 664.527, 417.837, 238.76, 129.55, 68.2075, -1572.97, 222.614, 482.435, 377.714, 238.76, 137.151, 74.7665, 39.5247, -846.653, 116.779, 258.92, 203.967, 129.55, 74.7665, 40.9423, 21.7285, -442.531, 59.5958, 134.975, 106.926, 68.2075, 39.5247, 21.7285, 11.5732])\n"},{"location":"part3/simplifiedlikelihood/#example-using-tutorial-datacard","title":"Example using tutorial datacard","text":"

For this example, we will use the tutorial datacard data/tutorials/longexercise/datacard_part3.txt. This datacard is of Type B since there are no control regions (all regions are signal regions).

\n

First, we will create the binary file (run text2workspace)

\n
text2workspace.py --X-allow-no-signal --X-allow-no-background data/tutorials/longexercise/datacard_part3.txt  -m 200\n
\n

And next, we will generate the covariance between the bins of the background model.

\n
combine data/tutorials/longexercise/datacard_part3.root -M FitDiagnostics --saveShapes --saveWithUnc --numToysForShape 10000 --saveOverall --preFitValue 0   -n SimpleTH1 -m 200\n\ncombine data/tutorials/longexercise/datacard_part3.root -M FitDiagnostics --saveShapes --saveWithUnc --numToysForShape 1 --saveOverall --preFitValue 1   -n SimpleTH1_Signal1 -m 200\n
\n

We will also want to compare our scan to that from the full likelihood, which we can get as usual from Combine.

\n
combine -M MultiDimFit data/tutorials/longexercise/datacard_part3.root --rMin -0.5 --rMax 2 --algo grid -n SimpleTH1 -m 200\n
\n

Next, since we do not plan to aggregate any of the bins, we will follow the instructions for this and pick out the right covariance matrix.

\n
python test/simplifiedLikelihoods/makeLHInputs.py -i fitDiagnosticsSimpleTH1.root -o SLinput.root \n\npython test/simplifiedLikelihoods/makeLHInputs.py -i fitDiagnosticsSimpleTH1_Signal1.root -o SLinput_Signal1.root \n
\n

We now have everything we need to provide the simplified likelihood inputs:

\n
$ root -l SLinput.root\nroot [0] .ls\n\nAttaching file SLinput.root as _file0...\n(TFile *) 0x3667820\nroot [1] .ls\nTFile**         SLinput.root\n TFile*         SLinput.root\n  KEY: TDirectoryFile   shapes_fit_b;1  shapes_fit_b\n  KEY: TDirectoryFile   shapes_prefit;1 shapes_prefit\n  KEY: TDirectoryFile   shapes_fit_s;1  shapes_fit_s\n
\n

We can convert this to a python module that we can use to run a scan with the SLtools package. Note, since we have a Type B datacard, we will be using the pre-fit covariance matrix. Also, this means we want to take the signal from the file where the prefit value of r was 1.

\n
python test/simplifiedLikelihoods/convertSLRootToPython.py -O mymodel.py -s SLinput_Signal1.root:shapes_prefit/total_signal  -b SLinput.root:shapes_prefit/total_M1 -d SLinput.root:shapes_prefit/total_data -c SLinput.root:shapes_prefit/total_M2\n
\n

We can compare the profiled likelihood scans from our simplified likelihood (using the python file we just created) and from the full likelihood (that we created with Combine). For the former, we first need to check out the SLtools package

\n
git clone https://gitlab.cern.ch/SimplifiedLikelihood/SLtools.git\nmv higgsCombineSimpleTH1.MultiDimFit.mH200.root SLtools/ \nmv mymodel.py SLtools/\ncd SLtools\n
\n

The script below will create a plot of the comparison for us.

\n
#! /usr/bin/env python\nimport simplike as sl\n\nexec(open(\"mymodel.py\").read())\n\nslp1 = sl.SLParams(background, covariance, obs=data, sig=signal)\n\nimport ROOT \nfi = ROOT.TFile.Open(\"higgsCombineSimpleTH1.MultiDimFit.mH200.root\")\ntr = fi.Get(\"limit\")\n\npoints = []\nfor i in range(tr.GetEntries()):\n  tr.GetEntry(i)\n  points.append([tr.r,2*tr.deltaNLL])\npoints.sort()\n\nmus2=[pt[0] for pt in points]\ntmus2=[pt[1] for pt in points]\n\nimport numpy as np\nnpoints = 50\nmus1 = np.arange(-0.5, 2, (2+0.5)/npoints)\ntmus1 = [slp1.tmu(mu) for mu in mus1]\n\nfrom matplotlib import pyplot as plt\nplt.plot(mus1,tmus1,label='simplified likelihood')\nplt.plot(mus2,tmus2,label='full likelihood')\nplt.legend()\nplt.xlabel(\"$\\mu$\")\nplt.ylabel(\"$-2\\Delta \\ln L$\")\n\nplt.savefig(\"compareLH.pdf\")\n
\n

This will produce a figure like the one below.

\n

\n

It is also possible to include the third moment of each bin to improve the precision of the simplified likelihood [ JHEP 64 2019 ]. The necessary information is stored in the outputs from Combine, therefore you just need to include the option -t SLinput.root:shapes_prefit/total_M3 in the options list for convertSLRootToPython.py to include this in the model file. The third moment information can be included in SLtools by using sl.SLParams(background, covariance, third_moment, obs=data, sig=signal)
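
For the tutorial example above, the conversion command including the third moment would then read (a sketch that simply appends the extra option to the earlier command):

python test/simplifiedLikelihoods/convertSLRootToPython.py -O mymodel.py -s SLinput_Signal1.root:shapes_prefit/total_signal -b SLinput.root:shapes_prefit/total_M1 -d SLinput.root:shapes_prefit/total_data -c SLinput.root:shapes_prefit/total_M2 -t SLinput.root:shapes_prefit/total_M3\n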

"},{"location":"part3/validation/","title":"Validating datacards","text":"

This section covers the main features of the datacard validation tool that helps you spot potential problems with your datacards at an early stage. The tool is implemented in the CombineHarvester/CombineTools subpackage. See the combineTool section of the documentation for checkout instructions.

The datacard validation tool contains a number of checks. It is possible to call subsets of these checks when creating datacards within CombineHarvester. However, for now we will only describe the usage of the validation tool on already existing datacards. If you create your datacards with CombineHarvester and would like to include the checks at the datacard creation stage, please contact us via https://cms-talk.web.cern.ch/c/physics/cat/cat-stats/279.

"},{"location":"part3/validation/#how-to-use-the-tool","title":"How to use the tool","text":"

The basic syntax is:

ValidateDatacards.py datacard.txt\n

This will write the results of the checks to a json file (default: validation.json), and will print a summary to the screen, for example:

================================\n=======Validation results=======\n================================\n>>>There were  7800 warnings of type  'up/down templates vary the yield in the same direction'\n>>>There were  5323 warnings of type  'up/down templates are identical'\n>>>There were no warnings of type  'At least one of the up/down systematic uncertainty templates is empty'\n>>>There were  4406 warnings of type  'Uncertainty has normalisation effect of more than 10.0%'\n>>>There were  8371 warnings of type  'Uncertainty probably has no genuine shape effect'\n>>>There were no warnings of type 'Empty process'\n>>>There were no warnings of type 'Bins of the template empty in background'\n>>>INFO: there were  169  alerts of type  'Small signal process'\n

The meaning of each of these warnings/alerts is discussed below.

The following arguments are possible:

usage: ValidateDatacards.py [-h] [--printLevel PRINTLEVEL] [--readOnly]\n                            [--checkUncertOver CHECKUNCERTOVER]\n                            [--reportSigUnder REPORTSIGUNDER]\n                            [--jsonFile JSONFILE] [--mass MASS]\n                            cards\n\npositional arguments:\n  cards                 Specifies the full path to the datacards to check\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --printLevel PRINTLEVEL, -p PRINTLEVEL\n                        Specify the level of info printing (0-3, default:1)\n  --readOnly            If this is enabled, skip validation and only read the\n                        output json\n  --checkUncertOver CHECKUNCERTOVER, -c CHECKUNCERTOVER\n                        Report uncertainties which have a normalization effect\n                        larger than this fraction (default:0.1)\n  --reportSigUnder REPORTSIGUNDER, -s REPORTSIGUNDER\n                        Report signals contributing less than this fraction of\n                        the total in a channel (default:0.001)\n  --jsonFile JSONFILE   Path to the json file to read/write results from\n                        (default:validation.json)\n  --mass MASS           Signal mass to use (default:*)\n

printLevel adjusts how much information is printed to the screen. When set to 0, the results are only written to the json file, but not to the screen. When set to 1 (default), the number of warnings/alerts of a given type is printed to the screen. Setting this option to 2 prints the same information as level 1, and additionally prints which uncertainties are affected (if the check is related to uncertainties) or which processes are affected (if the check is related only to processes). When printLevel is set to 3, the information from level 2 is printed, and additionally for checks related to uncertainties it prints which processes are affected.

To print information to screen, the script parses the json file that contains the results of the validation checks. Therefore, if you have already run the validation tool and produced this json file, you can simply change the printLevel by re-running the tool with printLevel set to a different value, and enabling the --readOnly option.
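
For example, to re-print the results of an earlier run with the most detailed output level (a sketch using the options listed above):

ValidateDatacards.py datacard.txt --printLevel 3 --readOnly\n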

The options --checkUncertOver and --reportSigUnder will be described in more detail in the section that discusses the checks for which they are relevant.

Note: the --mass argument should only be set if you normally use it when running Combine, otherwise you can leave it at the default.

The datacard validation tool is primarily intended for shape (histogram) based analyses. However, when running on a parametric model or counting experiment the checks for small signal processes, empty processes, and uncertainties with large normalization effects can still be performed.

"},{"location":"part3/validation/#details-on-checks","title":"Details on checks","text":""},{"location":"part3/validation/#uncertainties-with-large-normalization-effect","title":"Uncertainties with large normalization effect","text":"

This check highlights nuisance parameters that have a normalization effect larger than the fraction set by the option --checkUncertOver. The default value is 0.1, meaning that any uncertainties with a normalization effect larger than 10% are flagged up.

The output file contains the following information for this check:

largeNormEff: {\n  <Uncertainty name>: {\n    <analysis category>: {\n      <process>: {\n        \"value_d\":<value>\n        \"value_u\":<value>\n      } \n    }\n  }\n}\n

Where value_u and value_d are the values of the 'up' and 'down' normalization effects.

"},{"location":"part3/validation/#at-least-one-of-the-updown-systematic-templates-is-empty","title":"At least one of the Up/Down systematic templates is empty","text":"

For shape uncertainties, this check reports all cases where the up and/or down template(s) are empty, when the nominal template is not.

The output file contains the following information for this check:

emptySystematicShape: {\n  <Uncertainty name>: {\n    <analysis category>: {\n      <process>: {\n        \"value_d\":<value>\n        \"value_u\":<value>\n      } \n    }\n  }\n}\n

Where value_u and value_d are the values of the 'up' and 'down' normalization effects.

"},{"location":"part3/validation/#identical-updown-templates","title":"Identical Up/Down templates","text":"

This check applies to shape uncertainties only, and will highlight cases where the shape uncertainties have identical Up and Down templates (identical in shape and in normalization).

The information given in the output file for this check is:

uncertTemplSame: {\n  <Uncertainty name>: {\n    <analysis category>: {\n      <process>: {\n        \"value_d\":<value>\n        \"value_u\":<value>\n      } \n    }\n  }\n}\n

Where value_u and value_d are the values of the 'up' and 'down' normalization effects.

"},{"location":"part3/validation/#up-and-down-templates-vary-the-yield-in-the-same-direction","title":"Up and Down templates vary the yield in the same direction","text":"

Again, this check only applies to shape uncertainties - it highlights cases where the 'Up' template and the 'Down' template both have the effect of increasing or decreasing the normalization of a process.

The information given in the output file for this check is:

uncertVarySameDirect: {\n  <Uncertainty name>: {\n    <analysis category>: {\n      <process>: {\n        \"value_d\":<value>\n        \"value_u\":<value>\n      } \n    }\n  }\n}\n

Where value_u and value_d are the values of the 'up' and 'down' normalization effects.

"},{"location":"part3/validation/#uncertainty-probably-has-no-genuine-shape-effect","title":"Uncertainty probably has no genuine shape effect","text":"

In this check, applying only to shape uncertainties, the normalized nominal templates are compared with the normalized templates for the 'up' and 'down' systematic variations. The script calculates \(\sum_i \frac{2|\text{up}(i) - \text{nominal}(i)|}{|\text{up}(i)| + |\text{nominal}(i)|}\) and \(\sum_i \frac{2|\text{down}(i) - \text{nominal}(i)|}{|\text{down}(i)| + |\text{nominal}(i)|}\)

where the sums run over all bins in the histograms, and 'nominal', 'up', and 'down' are the central template and up and down varied templates, all normalized.

If both sums are smaller than 0.001, the uncertainty is flagged up as probably not having a genuine shape effect. This means a 0.1% variation in one bin is enough to avoid being reported, but many smaller variations can also sum to be large enough to pass the threshold. It should be noted that the chosen threshold is somewhat arbitrary: if an uncertainty is flagged up as probably having no genuine shape effect you should take this as a starting point to investigate.

The information given in the output file for this check is:

smallShapeEff: {\n  <Uncertainty name>: {\n    <analysis category>: {\n      <process>: {\n        \"diff_d\":<value>\n        \"diff_u\":<value>\n      } \n    }\n  }\n}\n

Where diff_d and diff_u are the values of the sums described above for the 'down' variation and the 'up' variation.

"},{"location":"part3/validation/#empty-process","title":"Empty process","text":"

If a process is listed in the datacard, but the yield is 0, it is flagged up by this check.

The information given in the output file for this check is:

emptyProcessShape: {\n  <analysis category>: {\n    <process1>,\n    <process2>,\n    <process3>\n  }\n}\n
"},{"location":"part3/validation/#bins-that-have-signal-but-no-background","title":"Bins that have signal but no background","text":"

For shape-based analyses, this checks whether there are any bins in the nominal templates that have signal contributions, but no background contributions.

The information given in the output file for this check is:

emptyBkgBin: {\n  <analysis category>: {\n    <bin_nr1>,\n    <bin_nr2>,\n    <bin_nr3>\n  }\n}\n
"},{"location":"part3/validation/#small-signal-process","title":"Small signal process","text":"

This reports signal processes that contribute less than the fraction specified by --reportSigUnder (default 0.001 = 0.1%) of the total signal in a given category. This produces an alert, not a warning, as it does not hint at a potential problem. However, in analyses with many signal contributions and with long fitting times, it can be helpful to remove signals from a category in which they do not contribute a significant amount.

The information given in the output file for this check is:

smallSignalProc: {\n  <analysis category>: {\n    <process>: {\n      \"sigrate_tot\":<value>\n      \"procrate\":<value>\n    } \n  }\n}\n

Where sigrate_tot is the total signal yield in the analysis category and procrate is the yield of signal process <process>.

"},{"location":"part3/validation/#what-to-do-in-case-of-a-warning","title":"What to do in case of a warning","text":"

These checks are mostly a tool to help you investigate your datacards: a warning does not necessarily mean there is a mistake in your datacard, but you should use it as a starting point to investigate. Empty processes and empty shape uncertainties connected to nonempty processes will most likely be unintended. The same holds for cases where the 'up' and 'down' shape templates are identical. If there are bins that contain signal but no background contributions, this should be corrected. See the FAQ for more information on that point.

For other checks it depends on the situation whether there is a problem or not. Some examples:

  • An analysis-specific nonclosure uncertainty could well be larger than 10%; a theoretical uncertainty on the ttbar normalization probably should not be.
  • In an analysis with a selection that requires the presence of exactly 1 jet, 'up' and 'down' variations in the jet energy uncertainty could both change the process normalization in the same direction. (But they do not have to!)

As always: think about whether you expect a check to yield a warning in case of your analysis, and if not, investigate to make sure there are no issues.

"},{"location":"part4/usefullinks/","title":"Useful links and further reading","text":""},{"location":"part4/usefullinks/#tutorials-and-reading-material","title":"Tutorials and reading material","text":"

There are several tutorials that have been run over the last few years with instructions and examples for running the Combine tool.

Tutorial Sessions:

  • 1st tutorial 17th Nov 2015.
  • 2nd tutorial 30th Nov 2016.
  • 3rd tutorial 29th Nov 2017
  • 4th tutorial 31st Oct 2018 - Latest for 81x-root606 branch.
  • 5th tutorial 2nd-4th Dec 2019
  • 6th tutorial 14th-16th Dec 2020 - Latest for 102x branch
  • 7th tutorial 3rd Feb 2023 - Uses 113x branch

Worked examples from Higgs analyses using Combine:

  • The CMS DAS at CERN 2014
  • The CMS DAS at DESY 2018

Higgs combinations procedures

  • Conventions to be used when preparing inputs for Higgs combinations

  • CMS AN-2011/298 Procedure for the LHC Higgs boson search combination in summer 2011. This describes in more detail some of the methods used in Combine.

"},{"location":"part4/usefullinks/#citations","title":"Citations","text":"

There is currently no document that can be cited for using the Combine tool; however, you can use the following publications for the procedures we use:

  • Summer 2011 public ATLAS-CMS note for any Frequentist limit setting procedures with toys or Bayesian limits, constructing likelihoods, descriptions of nuisance parameter options (like log-normals (lnN) or gamma (gmN)), and for definitions of test statistics.

  • CCGV paper if you use any of the asymptotic approximations (e.g. with -M AsymptoticLimits or -M Significance) for limits/p-values.

  • If you use the Barlow-Beeston approach to MC stat (bin-by-bin) uncertainties, please cite their paper Barlow-Beeston. You should also cite this note if you use the autoMCStats directive to produce a single parameter per bin.

  • If you use shape uncertainties for template (TH1 or RooDataHist) based datacards, you can cite this note from J. Conway.

  • If you are extracting uncertainties from LH scans - i.e. using \(-2\Delta\ln L=1\) etc. for the 1\(\sigma\) intervals, you can cite either the ATLAS+CMS or CMS Higgs paper.

  • There is also a long list of citation recommendations from the CMS Statistics Committee pages.

"},{"location":"part4/usefullinks/#combine-based-packages","title":"Combine based packages","text":"
  • SWGuideHiggs2TauLimits (Deprecated)

  • ATGCRooStats

  • CombineHarvester

"},{"location":"part4/usefullinks/#contacts","title":"Contacts","text":"
  • CMStalk forum: https://cms-talk.web.cern.ch/c/physics/cat/cat-stats/279
"},{"location":"part4/usefullinks/#cms-statistics-committee","title":"CMS Statistics Committee","text":"
  • You can find much more statistics theory and recommendations on various statistical procedures in the CMS Statistics Committee Twiki Pages
"},{"location":"part4/usefullinks/#faq","title":"FAQ","text":"
  • Why does Combine have trouble with bins that have zero expected contents?
    • If you are computing only upper limits, and your zero-prediction bins are all empty in data, then you can just set the background to a very small value instead of zero as the computation is regular for background going to zero (e.g. a counting experiment with \\(B\\leq1\\) will have essentially the same expected limit and observed limit as one with \\(B=0\\)). If you are computing anything else, e.g. p-values, or if your zero-prediction bins are not empty in data, you're out of luck, and you should find a way to get a reasonable background prediction there (and set an uncertainty on it, as per the point above)
  • How can an uncertainty be added to a zero quantity?
    • You can put an uncertainty even on a zero event yield if you use a gamma distribution. That is in fact the more proper way of doing it if the prediction of zero comes from the limited size of your MC or data sample used to compute it.
  • Why does changing the observation in data affect my expected limit?
    • The expected limit (if using either the default behaviour of -M AsymptoticLimits or using the LHC-limits style limit setting with toys) uses the post-fit expectation of the background model to generate toys. This means that first the model is fit to the observed data before toy generation. See the sections on blind limits and toy generation to avoid this behavior.
  • How can I deal with an interference term which involves a negative contribution?
    • You will need to set up a specific PhysicsModel to deal with this; however, you can see this section to implement such a model that can incorporate a negative contribution to the physics process
  • How does Combine work?
    • That is not a question that can be answered without someone's head exploding; please try to formulate something specific.
  • What does fit status XYZ mean?
    • Combine reports the fit status in some routines (for example in the FitDiagnostics method). These are typically the status of the last call from Minuit. For details on the meanings of these status codes see the Minuit2Minimizer documentation page.
  • Why does my fit not converge?
    • There are several reasons why some fits may not converge. Often some indication can be obtained from the RooFitResult or the fit status that is printed when using the --verbose X (with \(X>2\)) option. Sometimes, however, it can be that the likelihood for your data is very unusual. You can get a rough idea of what the likelihood looks like as a function of your parameters (POIs and nuisances) using combineTool.py -M FastScan -w myworkspace.root (use --help for options).
    • We have often seen that fits in Combine using RooCBShape as a parametric function will fail. This is related to an optimization that fails. You can try to fix the problem as described in this issue: issues#347 (i.e add the option --X-rtd ADDNLL_CBNLL=0).
  • Why does the fit/fits take so long?
    • The minimization routines are common to many methods in Combine. You can tune the fits using the generic optimization command line options described here. For example, setting the default minimizer strategy to 0 can greatly improve the speed, since this avoids running HESSE. In calculations such as AsymptoticLimits, HESSE is not needed and hence this can be done, however, for FitDiagnostics the uncertainties and correlations are part of the output, so using strategy 0 may not be particularly accurate.
  • Why are the results for my counting experiment so slow or unstable?
    • There is a known issue with counting experiments with large numbers of events that will cause unstable fits or even the fit to fail. You can avoid this by creating a \"fake\" shape datacard (see this section from the setting up the datacards page). The simplest way to do this is to run combineCards.py -S mycountingcard.txt > myshapecard.txt. You may still find that your parameter uncertainties are not correct when you have large numbers of events. This can be often fixed using the --robustHesse option. An example of this issue is detailed here.
  • Why do some of my nuisance parameters have uncertainties > 1?
    • When running -M FitDiagnostics you may find that the post-fit uncertainties of the nuisances are \\(> 1\\) (or larger than their pre-fit values). If this is the case, you should first check if the same is true when adding the option --minos all, which will invoke MINOS to scan the likelihood as a function of these parameters to determine the crossing at \\(-2\\times\\Delta\\log\\mathcal{L}=1\\) rather than relying on the estimate from HESSE. However, this is not guaranteed to succeed, in which case you can scan the likelihood yourself using MultiDimFit (see here ) and specifying the option --poi X where X is your nuisance parameter.
  • How can I avoid using the data?
    • For almost all methods, you can use toy data (or an Asimov dataset) in place of the real data for your results to be blind. You should be careful however as in some methods, such as -M AsymptoticLimits or -M HybridNew --LHCmode LHC-limits or any other method using the option --toysFrequentist, the data will be used to determine the most likely nuisance parameter values (to determine the so-called a-posteriori expectation). See the section on toy data generation for details on this.
  • What if my nuisance parameters have correlations which are not 0 or 1?
    • Combine is designed under the assumption that each source of nuisance parameter is uncorrelated with the other sources. If you have a case where some pair (or set) of nuisances have some known correlation structure, you can compute the eigenvectors of their correlation matrix and provide these diagonalised nuisances to Combine. You can also model partial correlations, between different channels or data taking periods, of a given nuisance parameter using the combineTool as described in this page.
  • My nuisances are (artificially) constrained and/or the impact plot show some strange behaviour, especially after including MC statistical uncertainties. What can I do?
    • Depending on the details of the analysis, several solutions can be adopted to mitigate these effects. We advise running the validation tools first, to identify possible redundant shape uncertainties that can be safely eliminated or replaced with lnN ones. Any remaining artificial constraints should be studied. Possible mitigating strategies are to (a) smooth the templates or (b) adopt some rebinning in order to reduce statistical fluctuations in the templates. A description of possible strategies and effects can be found in this talk by Margaret Eminizer
  • What do CLs, CLs+b and CLb in the code mean?
    • The names CLs+b and CLb that are found within some of the RooStats tools are rather outdated and should instead be referred to as p-values - \(p_{\mu}\) and \(1-p_{b}\), respectively. We often use the CLs criterion (which itself is not a p-value) in high energy physics as it is designed to avoid excluding a signal model when the sensitivity is low (and protects against exclusions due to underfluctuations in the data). Typically, when excluding a signal model, the p-value \(p_{\mu}\) refers to the p-value under the signal+background hypothesis, assuming a particular value of the signal strength (\(\mu\)), while \(p_{b}\) is the p-value under the background-only hypothesis. You can find more details and definitions of the CLs criterion and of \(p_{\mu}\) and \(p_{b}\) in section 39.4.2.4 of the 2016 PDG review.
"},{"location":"part5/longexercise/","title":"Main Features of Combine (Long Exercises)","text":"

This exercise is designed to give a broad overview of the tools available for statistical analysis in CMS using the combine tool. Combine is a high-level tool for building RooFit/RooStats models and running common statistical methods. We will cover the typical aspects of setting up an analysis and producing the results, as well as look at ways in which we can diagnose issues and get a deeper understanding of the statistical model. This is a long exercise - expect to spend some time on it especially if you are new to Combine. If you get stuck while working through this exercise or have questions specifically about the exercise, you can ask them on this mattermost channel. Finally, we also provide some solutions to some of the questions that are asked as part of the exercise. These are available here.

For the majority of this course we will work with a simplified version of a real analysis that nonetheless has many features of the full analysis. The analysis is a search for an additional heavy neutral Higgs boson decaying to tau lepton pairs. Such a signature is predicted in many extensions of the standard model, in particular the minimal supersymmetric standard model (MSSM). You can read about the analysis in the paper here. The statistical inference makes use of a variable called the total transverse mass (\(M_{\mathrm{T}}^{\mathrm{tot}}\)) that provides good discrimination between the resonant high-mass signal and the main backgrounds, which have a falling distribution in this high-mass region. The events selected in the analysis are split into several categories which target the main di-tau final states as well as the two main production modes: gluon-fusion (ggH) and b-jet associated production (bbH). One example is given below for the fully-hadronic final state in the b-tag category which targets the bbH signal:

Initially we will start with the simplest analysis possible: a one-bin counting experiment using just the high \\(M_{\\mathrm{T}}^{\\mathrm{tot}}\\) region of this distribution, and from there each section of this exercise will expand on this, introducing a shape-based analysis and adding control regions to constrain the backgrounds.

"},{"location":"part5/longexercise/#background","title":"Background","text":"

You can find a presentation with some more background on likelihoods and extracting confidence intervals here. A presentation that discusses limit setting in more detail can be found here. If you are not yet familiar with these concepts, or would like to refresh your memory, we recommend that you have a look at these presentations before you start with the exercise.

"},{"location":"part5/longexercise/#getting-started","title":"Getting started","text":"

We need to set up a new CMSSW area and checkout the Combine package:

cmsrel CMSSW_11_3_4\ncd CMSSW_11_3_4/src\ncmsenv\ngit clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit\ncd HiggsAnalysis/CombinedLimit\n\ncd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit\ngit fetch origin\ngit checkout v9.0.0\n

We will also make use of another package, CombineHarvester, which contains some high-level tools for working with Combine. The following command will download the repository and checkout just the parts of it we need for this tutorial:

bash <(curl -s https://raw.githubusercontent.com/cms-analysis/CombineHarvester/main/CombineTools/scripts/sparse-checkout-https.sh)\n

Now make sure the CMSSW area is compiled:

scramv1 b clean; scramv1 b\n

Now we will move to the working directory for this tutorial, which contains all the inputs needed to run the exercises below:

cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/tutorials/longexercise/\n
"},{"location":"part5/longexercise/#part-1-a-one-bin-counting-experiment","title":"Part 1: A one-bin counting experiment","text":"

Topics covered in this section:

  • A: Computing limits using the asymptotic approximation
  • Advanced section: B: Computing limits with toys

We will begin with a simplified version of a datacard from the MSSM \\(\\phi\\rightarrow\\tau\\tau\\) analysis that has been converted to a one-bin counting experiment, as described above. While the full analysis considers a range of signal mass hypotheses, we will start by considering just one: \\(m_{\\phi}\\)=800GeV. Click the text below to study the datacard (datacard_part1.txt in the longexercise directory):

Show datacard
imax    1 number of bins\njmax    4 number of processes minus 1\nkmax    * number of nuisance parameters\n--------------------------------------------------------------------------------\n--------------------------------------------------------------------------------\nbin          signal_region\nobservation  10.0\n--------------------------------------------------------------------------------\nbin                      signal_region   signal_region   signal_region   signal_region   signal_region\nprocess                  ttbar           diboson         Ztautau         jetFakes        bbHtautau\nprocess                  1               2               3               4               0\nrate                     4.43803         3.18309         3.7804          1.63396         0.711064\n--------------------------------------------------------------------------------\nCMS_eff_b          lnN   1.02            1.02            1.02            -               1.02\nCMS_eff_t          lnN   1.12            1.12            1.12            -               1.12\nCMS_eff_t_highpt   lnN   1.1             1.1             1.1             -               1.1\nacceptance_Ztautau lnN   -               -               1.08            -               -\nacceptance_bbH     lnN   -               -               -               -               1.05\nacceptance_ttbar   lnN   1.005           -               -               -               -\nnorm_jetFakes      lnN   -               -               -               1.2             -\nxsec_diboson       lnN   -               1.05            -               -               -\n

The layout of the datacard is as follows:

  • At the top are the numbers imax, jmax and kmax representing the number of bins, processes and nuisance parameters respectively. Here a \"bin\" can refer to a literal single event count as in this example, or a full distribution we are fitting, in general with many histogram bins, as we will see later. We will refer to both as \"channels\" from now on. It is possible to replace these numbers with * and they will be deduced automatically.
  • The first line starting with bin gives a unique label to each channel, and the following line starting with observation gives the number of events observed in data.
  • In the remaining part of the card there are several columns: each one represents one process in one channel. The first four lines labelled bin, process, process and rate give the channel label, the process label, a process identifier (<=0 for signal, >0 for background) and the number of expected events respectively.
  • The remaining lines describe sources of systematic uncertainty. Each line gives the name of the uncertainty, (which will become the name of the nuisance parameter inside our RooFit model), the type of uncertainty (\"lnN\" = log-normal normalisation uncertainty) and the effect on each process in each channel. E.g. a 20% uncertainty on the yield is written as 1.20.
  • It is also possible to add a hash symbol (#) at the start of a line, which Combine will then ignore when it reads the card.

We can now run Combine directly using this datacard as input. The general format for running Combine is:

combine -M [method] [datacard] [additional options...]\n
"},{"location":"part5/longexercise/#a-computing-limits-using-the-asymptotic-approximation","title":"A: Computing limits using the asymptotic approximation","text":"

As we are searching for a signal process that does not exist in the standard model, it's natural to set an upper limit on the cross section times branching fraction of the process (assuming our dataset does not contain a significant discovery of new physics). Combine has a dedicated method for calculating upper limits. The most commonly used one is AsymptoticLimits, which implements the CLs criterion and uses the profile likelihood ratio as the test statistic. As the name implies, the test statistic distributions are determined analytically in the asymptotic approximation, so there is no need for more time-intensive toy throwing and fitting. Try running the following command:

combine -M AsymptoticLimits datacard_part1.txt -n .part1A\n

You should see the results of the observed and expected limit calculations printed to the screen. Here we have added an extra option, -n .part1A, which is short for --name, and is used to label the output file Combine produces, which in this case will be called higgsCombine.part1A.AsymptoticLimits.mH120.root. The file name depends on the options we ran with, and is of the form: higgsCombine[name].[method].mH[mass].root. The file contains a TTree called limit which stores the numerical values returned by the limit computation. Note that in our case we did not set a signal mass when running Combine (i.e. -m 800), so the output file just uses the default value of 120. This does not affect our result in any way though, just the label that is used on the output file.

The limits are given on a parameter called r. This is the default parameter of interest (POI) that is added to the model automatically. It is a linear scaling of the normalization of all signal processes given in the datacard, i.e. if \\(s_{i,j}\\) is the nominal number of signal events in channel \\(i\\) for signal process \\(j\\), then the normalization of that signal in the model is given as \\(r\\cdot s_{i,j}(\\vec{\\theta})\\), where \\(\\vec{\\theta}\\) represents the set of nuisance parameters which may also affect the signal normalization. We therefore have some choice in the interpretation of r: for the measurement of a process with a well-defined SM prediction we may enter this as the nominal yield in the datacard, such that \\(r=1\\) corresponds to this SM expectation, whereas for setting limits on BSM processes we may choose the nominal yield to correspond to some cross section, e.g. 1 pb, such that we can interpret the limit as a cross section limit directly. In this example the signal has been normalised to a cross section times branching fraction of 1 fb.

The expected limit is given under the background-only hypothesis. The median value under this hypothesis as well as the quantiles needed to give the 68% and 95% intervals are also calculated. These are all the ingredients needed to produce the standard limit plots you will see in many CMS results, for example the \\(\\sigma \\times \\mathcal{B}\\) limits for the \\(\\text{bb}\\phi\\rightarrow\\tau\\tau\\) process:

In this case we only computed the values for one signal mass hypothesis, indicated by a red dashed line.

Tasks and questions:

  • There are some important uncertainties missing from the datacard above. Add the uncertainty on the luminosity (name: lumi_13TeV) which has a 2.5% effect on all processes (except the jetFakes, which are taken from data), and uncertainties on the inclusive cross sections of the Ztautau and ttbar processes (with names xsec_Ztautau and xsec_ttbar) which are 4% and 6% respectively.
  • Try changing the values of some uncertainties (up or down, or removing them altogether) - how do the expected and observed limits change?
  • Now try changing the number of observed events. The observed limit will naturally change, but the expected does too - why might this be?

There are other command line options we can supply to Combine which will change its behaviour when run. You can see the full set of supported options by doing combine -h. Many options are specific to a given method, but others are more general and are applicable to all methods. Throughout this tutorial we will highlight some of the most useful options you may need to use, for example:

  • The range on the signal strength modifier: --rMin=X and --rMax=Y: In RooFit parameters can optionally have a range specified. The implication of this is that their values cannot be adjusted beyond the limits of this range. The min and max values can be adjusted though, and we might need to do this for our POI r if the order of magnitude of our measurement is different from the default range of [0, 20]. This will be discussed again later in the tutorial.
  • Verbosity: -v X: By default combine does not produce much output on the screen other than the main result at the end. However, much more detailed information can be printed by setting -v N with N larger than zero. For example, at -v 3 the logs from the minimizer, Minuit, will also be printed. These are very useful for debugging problems with the fit.
"},{"location":"part5/longexercise/#advanced-section-b-computing-limits-with-toys","title":"Advanced section: B: Computing limits with toys","text":"

Now we will look at computing limits without the asymptotic approximation, so instead using toy datasets to determine the test statistic distributions under the signal+background and background-only hypotheses. This can be necessary if we are searching for signal in bins with a small number of events expected. In Combine we will use the HybridNew method to calculate limits using toys. This mode is capable of calculating limits with several different test statistics and with fine-grained control over how the toy datasets are generated internally. To calculate LHC-style profile likelihood limits (i.e. the same as we did with the asymptotic) we set the option --LHCmode LHC-limits. You can read more about the different options in the Combine documentation.

Run the following command:

combine -M HybridNew datacard_part1.txt --LHCmode LHC-limits -n .part1B --saveHybridResult\n

In contrast to AsymptoticLimits this will only determine the observed limit, and will take a few minutes. There will not be much output to the screen while combine is running. You can add the option -v 1 to get a better idea of what is going on. You should see Combine stepping around in r, trying to find the value for which CLs = 0.05, i.e. the 95% CL limit. The --saveHybridResult option will cause the test statistic distributions that are generated at each tested value of r to be saved in the output ROOT file.

To get an expected limit add the option --expectedFromGrid X, where X is the desired quantile, e.g. for the median:

combine -M HybridNew datacard_part1.txt --LHCmode LHC-limits -n .part1B --saveHybridResult --expectedFromGrid 0.500\n

Calculate the median expected limit and the 68% range. The 95% range could also be done, but note it will take much longer to run the 0.025 quantile. While Combine is running you can move on to the next steps below.

Tasks and questions:

  • In contrast to AsymptoticLimits, with HybridNew each limit comes with an uncertainty. What is the origin of this uncertainty?
  • How good is the agreement between the asymptotic and toy-based methods?
  • Why does it take longer to calculate the lower expected quantiles (e.g. 0.025, 0.16)? Think about how the statistical uncertainty on the CLs value depends on Pmu and Pb.

Next plot the test statistic distributions stored in the output file:

python3 $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/test/plotTestStatCLs.py --input higgsCombine.part1B.HybridNew.mH120.root --poi r --val all --mass 120\n

This produces a new ROOT file cls_qmu_distributions.root containing the plots. To save them as pdf/png files, run this small script and look at the resulting figures:

python3 printTestStatPlots.py cls_qmu_distributions.root\n
"},{"location":"part5/longexercise/#advanced-section-b-asymptotic-approximation-limitations","title":"Advanced section: B: Asymptotic approximation limitations","text":"

These distributions can be useful in understanding features in the CLs limits, especially in the low statistics regime. To explore this, try reducing the observed and expected yields in the datacard by a factor of 10, and rerun the above steps to compare the observed and expected limits with the asymptotic approach, and plot the test statistic distributions.

Tasks and questions:

  • Is the asymptotic limit still a good approximation?
  • You might notice that the test statistic distributions are not smooth but rather have several \"bump\" structures. Where might this come from? Try reducing the size of the systematic uncertainties to make the bumps more pronounced.

Note that for more complex models the fitting time can increase significantly, making it infeasible to run all the toy-based limits interactively like this. An alternative strategy is documented here.

"},{"location":"part5/longexercise/#part-2-a-shape-based-analysis","title":"Part 2: A shape-based analysis","text":"

Topics covered in this section:

  • A: Setting up the datacard
  • B: Running Combine for a blind analysis
  • C: Using FitDiagnostics
  • D: MC statistical uncertainties
"},{"location":"part5/longexercise/#a-setting-up-the-datacard","title":"A: Setting up the datacard","text":"

Now we move to the next step: instead of a one-bin counting experiment we will fit a binned distribution. In a typical analysis we will produce TH1 histograms of some variable sensitive to the presence of signal: one for the data and one for each of the signal and background processes. Then we add a few extra lines to the datacard to link the declared processes to these shapes, which are saved in a ROOT file, for example:

Show datacard
imax 1\njmax 1\nkmax *\n---------------\nshapes * * simple-shapes-TH1_input.root $PROCESS $PROCESS_$SYSTEMATIC\nshapes signal * simple-shapes-TH1_input.root $PROCESS$MASS $PROCESS$MASS_$SYSTEMATIC\n---------------\nbin bin1\nobservation 85\n------------------------------\nbin             bin1       bin1\nprocess         signal     background\nprocess         0          1\nrate            10         100\n--------------------------------\nlumi     lnN    1.10       1.0\nbgnorm   lnN    1.00       1.3\nalpha  shape    -          1\n

Note that as with the one-bin card, the total nominal rate of a given process must be specified in the rate line of the datacard. This should agree with the value returned by TH1::Integral. However, we can also put a value of -1 and the Integral value will be substituted automatically.

There are two other differences with respect to the one-bin card:

  • A new block of lines at the top defining how channels and processes are mapped to the histograms (more than one line can be used)
  • In the list of systematic uncertainties some are marked as shape instead of lnN

The syntax of the \"shapes\" line is: shapes [process] [channel] [file] [histogram] [histogram_with_systematics]. It is possible to use the * wildcard to map multiple processes and/or channels with one line. The histogram entries can contain the $PROCESS, $CHANNEL and $MASS place-holders which will be substituted when searching for a given (process, channel) combination. The value of $MASS is specified by the -m argument when combine. By default the observed data process name will be data_obs.

Shape uncertainties can be added by supplying two additional histograms for a process, corresponding to the distribution obtained by shifting that parameter up and down by one standard deviation. These shapes will be interpolated (see the template shape uncertainties section for details) for shifts within \\(\\pm1\\sigma\\) and linearly extrapolated beyond. The normalizations are interpolated linearly in log scale just like we do for log-normal uncertainties.

The final argument of the \"shapes\" line above should contain the $SYSTEMATIC place-holder which will be substituted by the systematic name given in the datacard.

In the list of uncertainties the interpretation of the values for shape lines is a bit different from lnN. The effect can be \"-\" or 0 for no effect, 1 for normal effect, and possibly something different from 1 to test larger or smaller effects (in that case, the unit Gaussian is scaled by that factor before using it as parameter for the interpolation).
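
To make the histogram mapping concrete, here is a minimal PyROOT sketch that writes a nominal template plus Up/Down templates for a hypothetical background process and a hypothetical shape uncertainty called alpha, following the $PROCESS and $PROCESS_$SYSTEMATIC pattern used in the shapes lines above (a complete input file would also need the data_obs and signal histograms):

import ROOT\n\n# Sketch: write nominal and +/-1 sigma templates for a hypothetical 'background' process\n# and a hypothetical shape uncertainty called 'alpha'\nfout = ROOT.TFile('example-shapes.root', 'RECREATE')\n\nnominal = ROOT.TH1F('background', 'background', 10, 0, 100)\nup = ROOT.TH1F('background_alphaUp', 'background_alphaUp', 10, 0, 100)\ndown = ROOT.TH1F('background_alphaDown', 'background_alphaDown', 10, 0, 100)\n\nfor i in range(1, 11):\n    nominal.SetBinContent(i, 10.0)\n    up.SetBinContent(i, 10.0 + 0.2 * i)    # illustrative +1 sigma shape variation\n    down.SetBinContent(i, 10.0 - 0.2 * i)  # illustrative -1 sigma shape variation\n\nfout.Write()\nfout.Close()\n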

In this section we will use a datacard corresponding to the full distribution that was shown at the start of section 1, not just the high mass region. Have a look at datacard_part2.txt: this is still currently a one-bin counting experiment, however the yields are much higher since we now consider the full range of \\(M_{\\mathrm{T}}^{\\mathrm{tot}}\\). If you run the asymptotic limit calculation on this you should find the sensitivity is significantly worse than before.

The first task is to convert this to a shape analysis: the file datacard_part2.shapes.root contains all the necessary histograms, including those for the relevant shape systematic uncertainties. Add the relevant shapes lines to the top of the datacard (after the kmax line) to map the processes to the correct TH1s in this file. Hint: you will need a different line for the signal process.

Compared to the counting experiment we must also consider the effect of uncertainties that change the shape of the distribution. Some, like CMS_eff_t_highpt, were present before, as they have both a shape and a normalisation effect. Others are primarily shape effects, so were not included before.

Add the following shape uncertainties: top_pt_ttbar_shape affecting ttbar, the tau energy scale uncertainties CMS_scale_t_1prong0pi0_13TeV, CMS_scale_t_1prong1pi0_13TeV and CMS_scale_t_3prong0pi0_13TeV affecting all processes except jetFakes, and CMS_eff_t_highpt also affecting the same processes.

Once this is done you can run the asymptotic limit calculation on this datacard. From now on we will convert the text datacard into a RooFit workspace ourselves instead of combine doing it internally every time we run. This is a good idea for more complex analyses since the conversion step can take a notable amount of time. For this we use the text2workspace.py command:

text2workspace.py datacard_part2.txt -m 800 -o workspace_part2.root\n

And then we can use this as input to combine instead of the text datacard:

combine -M AsymptoticLimits workspace_part2.root -m 800\n

Tasks and questions:

  • Verify that the sensitivity of the shape analysis is indeed improved over the counting analysis in the first part.
  • Advanced task: You can open the workspace ROOT file interactively and print the contents: w->Print();. Each process is represented by a PDF object that depends on the shape morphing nuisance parameters. From the workspace, choose a process and shape uncertainty, and make a plot overlaying the nominal shape with different values of the shape morphing nuisance parameter. You can change the value of a parameter with w->var(\"X\")->setVal(Y), and access a particular pdf with w->pdf(\"Z\"). PDF objects in RooFit have a createHistogram method that requires the name of the observable (the variable defining the x-axis) - this is called CMS_th1x in combine datacards. Feel free to ask for help with this! A starting-point sketch is given below.
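
Here is a rough PyROOT sketch to get started with this advanced task. The pdf and parameter names are only examples (use w.Print() to find the ones actually present in your workspace); the observable is the CMS_th1x variable mentioned above:

import ROOT\n\nf = ROOT.TFile.Open('workspace_part2.root')\nw = f.Get('w')\n\npdf = w.pdf('shapeSig_bbHtautau_signal_region_morph')  # hypothetical name - check w.Print()\nobs = w.var('CMS_th1x')                                # observable used for the templates\npar = w.var('CMS_scale_t_1prong0pi0_13TeV')            # one of the shape morphing parameters\n\npar.setVal(0.0)\nh_nom = pdf.createHistogram('h_nom', obs)\npar.setVal(1.0)\nh_up = pdf.createHistogram('h_up', obs)\n\nh_nom.SetLineColor(ROOT.kBlack)\nh_up.SetLineColor(ROOT.kRed)\nc = ROOT.TCanvas()\nh_nom.Draw('HIST')\nh_up.Draw('HIST SAME')\nc.SaveAs('morphing_check.png')\n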
"},{"location":"part5/longexercise/#b-running-combine-for-a-blind-analysis","title":"B: Running combine for a blind analysis","text":"

Most analyses are developed and optimised while we are \"blind\" to the region of data where we expect our signal to be. With AsymptoticLimits we can choose just to run the expected limit (--run expected), so as not to calculate the observed. However the data is still used, even for the expected, since in the frequentist approach a background-only fit to the data is performed to define the Asimov dataset used to calculate the expected limits. To skip this fit to data and use the pre-fit state of the model the option --run blind or --noFitAsimov can be used. Task: Compare the expected limits calculated with --run expected and --run blind. Why are they different?

A more general way of blinding is to use combine's toy and Asimov dataset generating functionality. You can read more about this here. These options can be used with any method in combine, not just AsymptoticLimits.

Task: Calculate a blind limit by generating a background-only Asimov with the -t -1 option instead of using the AsymptoticLimits specific options. You should find the observed limit is the same as the expected. Then see what happens if you inject a signal into the Asimov dataset using the --expectSignal [X] option.

"},{"location":"part5/longexercise/#c-using-fitdiagnostics","title":"C: Using FitDiagnostics","text":"

We will now explore one of the most commonly used modes of Combine: FitDiagnostics. As well as allowing us to make a measurement of some physical quantity (as opposed to just setting a limit on it), this method is useful to gain additional information about the model and the behaviour of the fit. It performs two fits:

  • A \"background-only\" (b-only) fit: first POI (usually \"r\") fixed to zero
  • A \"signal+background\" (s+b) fit: all POIs are floating

With the s+b fit Combine will report the best-fit value of our signal strength modifier r. As well as the usual output file, a file named fitDiagnosticsTest.root is produced which contains additional information. In particular it includes two RooFitResult objects, one for the b-only and one for the s+b fit, which store the fitted values of all the nuisance parameters (NPs) and POIs as well as estimates of their uncertainties. The covariance matrix from both fits is also included, from which we can learn about the correlations between parameters. Run the FitDiagnostics method on our workspace:

combine -M FitDiagnostics workspace_part2.root -m 800 --rMin -20 --rMax 20\n

Open the resulting fitDiagnosticsTest.root interactively and print the contents of the s+b RooFitResult:

root [1] fit_s->Print()\n
Show output
RooFitResult: minimized FCN value: -2.55338e-05, estimated distance to minimum: 7.54243e-06\n                covariance matrix quality: Full, accurate covariance matrix\n                Status : MINIMIZE=0 HESSE=0\n\n    Floating Parameter    FinalValue +/-  Error\n  --------------------  --------------------------\n             CMS_eff_b   -4.5380e-02 +/-  9.93e-01\n             CMS_eff_t   -2.6311e-01 +/-  7.33e-01\n      CMS_eff_t_highpt   -4.7146e-01 +/-  9.62e-01\n  CMS_scale_t_1prong0pi0_13TeV   -1.5989e-01 +/-  5.93e-01\n  CMS_scale_t_1prong1pi0_13TeV   -1.6426e-01 +/-  4.94e-01\n  CMS_scale_t_3prong0pi0_13TeV   -3.0698e-01 +/-  6.06e-01\n    acceptance_Ztautau   -3.1262e-01 +/-  8.62e-01\n        acceptance_bbH   -2.8676e-05 +/-  1.00e+00\n      acceptance_ttbar    4.9981e-03 +/-  1.00e+00\n            lumi_13TeV   -5.6366e-02 +/-  9.89e-01\n         norm_jetFakes   -9.3327e-02 +/-  2.56e-01\n                     r   -2.7220e+00 +/-  2.59e+00\n    top_pt_ttbar_shape    1.7586e-01 +/-  7.00e-01\n          xsec_Ztautau   -1.6007e-01 +/-  9.66e-01\n          xsec_diboson    3.9758e-02 +/-  1.00e+00\n            xsec_ttbar    5.7794e-02 +/-  9.46e-01\n

There are several useful pieces of information here. At the top the status codes from the fits that were performed are given. In this case we can see that two algorithms were run: MINIMIZE and HESSE, both of which returned a successful status code (0). Both of these are routines in the Minuit2 minimization package - the default minimizer used in RooFit. The first performs the main fit to the data, and the second calculates the covariance matrix at the best-fit point. It is important to always check this second step was successful and the message \"Full, accurate covariance matrix\" is printed, otherwise the parameter uncertainties can be very inaccurate, even if the fit itself was successful.

Underneath this the best-fit values (\\(\\theta\\)) and symmetrised uncertainties for all the floating parameters are given. For all the constrained nuisance parameters a convention is used by which the nominal value (\\(\\theta_I\\)) is zero, corresponding to the mean of a Gaussian constraint PDF with width 1.0, such that the parameter values \\(\\pm 1.0\\) correspond to the \\(\\pm 1\\sigma\\) input uncertainties.
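
The same information can also be accessed programmatically. For example, a short PyROOT sketch like the following prints the best-fit POI value and the correlation between r and one nuisance parameter (the parameter chosen here is just an example):

import ROOT\n\nf = ROOT.TFile.Open('fitDiagnosticsTest.root')\nfit_s = f.Get('fit_s')\n\n# Best-fit value and symmetric uncertainty of the POI\nr = fit_s.floatParsFinal().find('r')\nprint('r = %.2f +/- %.2f' % (r.getVal(), r.getError()))\n\n# Correlation between the POI and one of the nuisance parameters\nprint('corr(r, norm_jetFakes) = %.2f' % fit_s.correlation('r', 'norm_jetFakes'))\n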

A more useful way of looking at this is to compare the pre- and post-fit values of the parameters, to see how much the fit to data has shifted and constrained these parameters with respect to the input uncertainty. The script diffNuisances.py can be used for this:

python diffNuisances.py fitDiagnosticsTest.root --all\n
Show output
name                                              b-only fit            s+b fit         rho\nCMS_eff_b                                        -0.04, 0.99        -0.05, 0.99       +0.01\nCMS_eff_t                                     * -0.24, 0.73*     * -0.26, 0.73*       +0.06\nCMS_eff_t_highpt                              * -0.56, 0.94*     * -0.47, 0.96*       +0.02\nCMS_scale_t_1prong0pi0_13TeV                  * -0.17, 0.58*     * -0.16, 0.59*       -0.04\nCMS_scale_t_1prong1pi0_13TeV                  ! -0.12, 0.45!     ! -0.16, 0.49!       +0.20\nCMS_scale_t_3prong0pi0_13TeV                  * -0.31, 0.61*     * -0.31, 0.61*       +0.02\nacceptance_Ztautau                            * -0.31, 0.86*     * -0.31, 0.86*       -0.05\nacceptance_bbH                                   +0.00, 1.00        -0.00, 1.00       +0.05\nacceptance_ttbar                                 +0.01, 1.00        +0.00, 1.00       +0.00\nlumi_13TeV                                       -0.05, 0.99        -0.06, 0.99       +0.01\nnorm_jetFakes                                 ! -0.09, 0.26!     ! -0.09, 0.26!       -0.05\ntop_pt_ttbar_shape                            * +0.24, 0.69*     * +0.18, 0.70*       +0.22\nxsec_Ztautau                                     -0.16, 0.97        -0.16, 0.97       -0.02\nxsec_diboson                                     +0.03, 1.00        +0.04, 1.00       -0.02\nxsec_ttbar                                       +0.08, 0.95        +0.06, 0.95       +0.02\n

The numbers in each column are respectively \\(\\frac{\\theta-\\theta_I}{\\sigma_I}\\) (This is often called the pull, but note that this is a misnomer. In this tutorial we will refer to it as the fitted value of the nuisance parameter relative to the input uncertainty. The true pull is defined as discussed under diffPullAsym here ), where \\(\\sigma_I\\) is the input uncertainty; and the ratio of the post-fit to the pre-fit uncertainty \\(\\frac{\\sigma}{\\sigma_I}\\).

Tasks and questions:

  • Which parameter has the largest shift from the nominal value (0) in the fitted value of the nuisance parameter relative to the input uncertainty? Which has the tightest constraint?
  • Should we be concerned when a parameter is more strongly constrained than the input uncertainty (i.e. \\(\\frac{\\sigma}{\\sigma_I}<1.0\\))?
  • Check the fitted values of the nuisance parameters and constraints on a b-only and s+b Asimov dataset instead. This check is required for all analyses in the Higgs PAG. It serves both as a closure test (do we fit exactly what signal strength we input?) and a way to check whether there are any infeasibly strong constraints while the analysis is still blind (typical example: something has probably gone wrong if we constrain the luminosity uncertainty to 10% of the input!)
  • Advanced task: Sometimes there are problems in the fit model that aren't apparent from only fitting the Asimov dataset, but will appear when fitting randomised data. Follow the exercise on toy-by-toy diagnostics here to explore the tools available for this.
"},{"location":"part5/longexercise/#d-mc-statistical-uncertainties","title":"D: MC statistical uncertainties","text":"

So far there is an important source of uncertainty we have neglected. Our estimates of the backgrounds come either from MC simulation or from sideband regions in data, and in both cases these estimates are subject to a statistical uncertainty on the number of simulated or data events. In principle we should include an independent statistical uncertainty for every bin of every process in our model. It's important to note that Combine/RooFit does not take this into account automatically - statistical fluctuations of the data are implicitly accounted for in the likelihood formalism, but statistical uncertainties in the model must be specified by us.

One way to implement these uncertainties is to create a shape uncertainty for each bin of each process, in which the up and down histograms have the contents of the bin shifted up and down by the \(1\sigma\) uncertainty. However this makes the likelihood evaluation computationally inefficient, and can lead to a large number of nuisance parameters in more complex models. Instead we will use a feature in Combine called autoMCStats that creates these automatically from the datacard, and uses a technique called \"Barlow-Beeston-lite\" to reduce the number of systematic uncertainties that are created. This works on the assumption that for high MC event counts we can model the uncertainty with a Gaussian distribution. Given that the uncertainties of different processes are independent, the total uncertainty of several processes in a particular bin is just the sum of \(N\) individual Gaussians, which is itself a Gaussian distribution. So instead of \(N\) nuisance parameters we need only one. This breaks down when the number of events is small and we are not in the Gaussian regime. The autoMCStats tool has a threshold setting on the number of events below which the Barlow-Beeston-lite approach is not used, and instead a Poisson PDF is used to model per-process uncertainties in that bin.
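
As a small illustration of the Gaussian argument above, the combined MC statistical uncertainty in one bin is simply the quadrature sum of the per-process uncertainties (the numbers below are made up):

import math\n\n# Made-up absolute MC statistical uncertainties for three processes in a single bin\nmc_stat_unc = {'ttbar': 0.8, 'Ztautau': 1.1, 'jetFakes': 2.0}\n\n# Barlow-Beeston-lite replaces the individual parameters by a single one with this combined width\ntotal = math.sqrt(sum(s ** 2 for s in mc_stat_unc.values()))\nprint('Combined per-bin uncertainty: %.2f events' % total)\n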

After reading the full documentation on autoMCStats here, add the corresponding line to your datacard. Start by setting a threshold of 0, i.e. [channel] autoMCStats 0, to force the use of Barlow-Beeston-lite in all bins.

Tasks and questions:

  • Check how much the cross section measurement and uncertainties change using FitDiagnostics.
  • It is also useful to check how the expected uncertainty changes using an Asimov dataset, say with r=10 injected.
  • Advanced task: See what happens if the Poisson threshold is increased. Based on your results, what threshold would you recommend for this analysis?
"},{"location":"part5/longexercise/#part-3-adding-control-regions","title":"Part 3: Adding control regions","text":"

Topics covered in this section:

  • A: Use of rateParams
  • B: Nuisance parameter impacts
  • C: Post-fit distributions
  • D: Calculating the significance
  • E: Signal strength measurement and uncertainty breakdown
  • F: Use of channel masking

In a modern analysis it is typical for some or all of the backgrounds to be estimated using the data, instead of relying purely on MC simulation. This can take many forms, but a common approach is to use \"control regions\" (CRs) that are pure and/or have higher statistics for a given process. These are defined by event selections that are similar to, but non-overlapping with, the signal region. In our \(\phi\rightarrow\tau\tau\) example the \(\text{Z}\rightarrow\tau\tau\) background normalisation can be calibrated using a \(\text{Z}\rightarrow\mu\mu\) CR, and the \(\text{t}\bar{\text{t}}\) background using an \(e+\mu\) CR. By comparing the number of data events in these CRs to our MC expectation we can obtain scale factors to apply to the corresponding backgrounds in the signal region (SR). The idea is that the data will give us a more accurate prediction of the background, with fewer systematic uncertainties. For example, we can remove the cross section and acceptance uncertainties in the SR, since we are no longer using the MC prediction (with a caveat discussed below). While we could simply derive these correction factors and apply them to our signal region datacard, a better way is to include these regions in our fit model and tie the normalisations of the backgrounds in the CR and SR together. This has a number of advantages:

  • Automatically handles the statistical uncertainty due to the number of data events in the CR
  • Allows for the presence of some signal contamination in the CR to be handled correctly
  • The CRs are typically not 100% pure in the background they're meant to control - other backgrounds may be present, with their own systematic uncertainties, some of which may be correlated with the SR or other CRs. Propagating these effects through to the SR \"by hand\" can become very challenging.

In this section we will continue to use the same SR as in the previous one, however we will switch to a lower signal mass hypothesis, \(m_{\phi}=200\) GeV, as its sensitivity depends more strongly on the background prediction than that of the high mass signal, making it better for illustrating the use of CRs. Here the nominal signal (r=1) has been normalised to a cross section of 1 pb.

The SR datacard for the 200 GeV signal is datacard_part3.txt. Two further datacards are provided: datacard_part3_ttbar_cr.txt and datacard_part3_DY_cr.txt, which represent the CRs for the \(\text{t}\bar{\text{t}}\) and Drell-Yan processes described above. The cross section and acceptance uncertainties for these processes have pre-emptively been removed from the SR card. However we cannot get away with neglecting acceptance effects altogether. We are still implicitly using the MC simulation to predict the ratio of events in the CR and SR, and this ratio will in general carry a theoretical acceptance uncertainty. However, if the CRs are well chosen, this uncertainty should be smaller than the direct acceptance uncertainty in the SR. The uncertainties acceptance_ttbar_cr and acceptance_DY_cr have been added to these datacards to cover this effect. Task: Calculate the ratio of CR to SR events for these two processes, as well as their CR purity, to verify that these are useful CRs.

The next step is to combine these datacards into one, which is done with the combineCards.py script:

combineCards.py signal_region=datacard_part3.txt ttbar_cr=datacard_part3_ttbar_cr.txt DY_cr=datacard_part3_DY_cr.txt &> part3_combined.txt\n

Each argument is of the form [new channel name]=[datacard.txt]. The new datacard is written to the screen by default, so we redirect the output into our new datacard file. The output looks like:

Show datacard
imax 3 number of bins\njmax 8 number of processes minus 1\nkmax 15 number of nuisance parameters\n----------------------------------------------------------------------------------------------------------------------------------\nshapes *              DY_cr          datacard_part3_DY_cr.shapes.root DY_control_region/$PROCESS DY_control_region/$PROCESS_$SYSTEMATIC\nshapes *              signal_region  datacard_part3.shapes.root signal_region/$PROCESS signal_region/$PROCESS_$SYSTEMATIC\nshapes bbHtautau      signal_region  datacard_part3.shapes.root signal_region/bbHtautau$MASS signal_region/bbHtautau$MASS_$SYSTEMATIC\nshapes *              ttbar_cr       datacard_part3_ttbar_cr.shapes.root tt_control_region/$PROCESS tt_control_region/$PROCESS_$SYSTEMATIC\n----------------------------------------------------------------------------------------------------------------------------------\nbin          signal_region  ttbar_cr       DY_cr        \nobservation  3416           79251          365754       \n----------------------------------------------------------------------------------------------------------------------------------\nbin                                               signal_region  signal_region  signal_region  signal_region  signal_region  ttbar_cr       ttbar_cr       ttbar_cr       ttbar_cr       ttbar_cr       DY_cr          DY_cr          DY_cr          DY_cr          DY_cr          DY_cr        \nprocess                                           bbHtautau      ttbar          diboson        Ztautau        jetFakes       W              QCD            ttbar          VV             Ztautau        W              QCD            Zmumu          ttbar          VV             Ztautau      \nprocess                                           0              1              2              3              4              5              6              1              7              3              5              6              8              1              7              3            \nrate                                              198.521        683.017        96.5185        742.649        2048.94        597.336        308.965        67280.4        10589.6        150.025        59.9999        141.725        305423         34341.1        5273.43        115.34       \n----------------------------------------------------------------------------------------------------------------------------------\nCMS_eff_b               lnN                       1.02           1.02           1.02           1.02           -              -              -              -              -              -              -              -              -              -              -              -            \nCMS_eff_e               lnN                       -              -              -              -              -              1.02           -              -              1.02           1.02           -              -              -              -              -              -            \n...\n

The [new channel name]= part of the input arguments is not required, but it gives us control over how the channels in the combined card will be named, otherwise default values like ch1, ch2 etc will be used.

"},{"location":"part5/longexercise/#a-use-of-rateparams","title":"A: Use of rateParams","text":"

We now have a combined datacard that we can run text2workspace.py on and start doing fits, however there is still one important ingredient missing. Right now the yields of the Ztautau process in the SR and Zmumu in the CR are not connected to each other in any way, and similarly for the ttbar processes. In the fit both would be adjusted by the nuisance parameters only, and constrained to the nominal yields. To remedy this we introduce rateParam directives to the datacard. A rateParam is a new free parameter that multiplies the yield of a given process, in just the same way the signal strength r multiplies the signal yield. The syntax of a rateParam line in the datacard is

[name] rateParam [channel] [process] [init] [min,max]\n

where name is the chosen name for the parameter, channel and process specify which (channel, process) combination it should affect, init gives the initial value, and optionally [min,max] specifies the ranges on the RooRealVar that will be created. The channel and process arguments support the use of the wildcard * to match multiple entries. Task: Add two rateParams with nominal values of 1.0 to the end of the combined datacard named rate_ttbar and rate_Zll. The former should affect the ttbar process in all channels, and the latter should affect the Ztautau and Zmumu processes in all channels. Set ranges of [0,5] to both. Note that a rateParam name can be repeated to apply it to multiple processes, e.g.:

rateScale rateParam * procA 1.0\nrateScale rateParam * procB 1.0\n

is perfectly valid and only one rateParam will be created. These parameters will allow the yields to float in the fit without prior constraint (unlike a regular lnN or shape systematic), with the yields in the CRs and SR tied together.

Tasks and questions:

  • Run text2workspace.py on this combined card (don't forget to set the mass and output name -m 200 -o workspace_part3.root) and then use FitDiagnostics on an Asimov dataset with r=1 to get the expected uncertainty. Suggested command line options: --rMin 0 --rMax 2
  • Using the RooFitResult in the fitDiagnosticsTest.root file, check the post-fit value of the rateParams. To what level are the normalisations of the DY and ttbar processes constrained?
  • To compare to the previous approach of fitting the SR only, with cross section and acceptance uncertainties restored, an additional card is provided: datacard_part3_nocrs.txt. Run the same fit on this card to verify the improvement of the SR+CR approach
"},{"location":"part5/longexercise/#b-nuisance-parameter-impacts","title":"B: Nuisance parameter impacts","text":"

It is often useful to examine in detail the effects the systematic uncertainties have on the signal strength measurement. This is often referred to as calculating the \"impact\" of each uncertainty. What this means is to determine the shift in the signal strength, with respect to the best-fit, that is induced if a given nuisance parameter is shifted by its \(\pm1\sigma\) post-fit uncertainty values. If the signal strength shifts a lot, it tells us that it has a strong dependency on this systematic uncertainty. In fact, what we are measuring here is strongly related to the correlation coefficient between the signal strength and the nuisance parameter. The MultiDimFit method has an algorithm for calculating the impact for a given systematic: --algo impact -P [parameter name], but it is typical to use a higher-level script, combineTool.py (part of the CombineHarvester package you checked out at the beginning), to automatically run the impacts for all parameters. Full documentation on this is given here. This is a three-step process. First we perform an initial fit for the signal strength and its uncertainty:

combineTool.py -M Impacts -d workspace_part3.root -m 200 --rMin -1 --rMax 2 --robustFit 1 --doInitialFit\n

Then we run the impacts for all the nuisance parameters:

combineTool.py -M Impacts -d workspace_part3.root -m 200 --rMin -1 --rMax 2 --robustFit 1 --doFits\n

This will take a little bit of time. When finished we collect all the output and convert it to a json file:

combineTool.py -M Impacts -d workspace_part3.root -m 200 --rMin -1 --rMax 2 --robustFit 1 --output impacts.json\n

We can then make a plot showing the fitted values of the nuisance parameters, relative to the input uncertainty, and parameter impacts, sorted by the largest impact:

plotImpacts.py -i impacts.json -o impacts\n
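
If you prefer to rank the parameters programmatically rather than reading them off the plot, a sketch along these lines can be used. It assumes the usual layout of the JSON file written by combineTool.py, with a params list containing per-parameter impact_r values; the key names may differ between versions:

import json\n\nwith open('impacts.json') as f:\n    data = json.load(f)\n\n# Sort nuisance parameters by the size of their impact on r and print the top ten\nparams = sorted(data['params'], key=lambda p: abs(p.get('impact_r', 0.0)), reverse=True)\nfor p in params[:10]:\n    print('%-40s impact on r: %+.3f' % (p['name'], p.get('impact_r', 0.0)))\n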

Tasks and questions:

  • Identify the most important uncertainties using the impacts tool.
  • In the plot, some parameters do not show a fitted value of the nuisance parameter relative to the input uncertainty, but rather just a numerical value - why?
"},{"location":"part5/longexercise/#c-post-fit-distributions","title":"C: Post-fit distributions","text":"

Another thing the FitDiagnostics mode can help us with is visualising the distributions we are fitting, and the uncertainties on those distributions, both before the fit is performed (\"pre-fit\") and after (\"post-fit\"). The pre-fit can give us some idea of how well our uncertainties cover any data-MC discrepancy, and the post-fit can show whether any discrepancies remain after the fit to data (as well as possibly letting us see the presence of a significant signal!).

To produce these distributions add the --saveShapes and --saveWithUncertainties options when running FitDiagnostics:

combine -M FitDiagnostics workspace_part3.root -m 200 --rMin -1 --rMax 2 --saveShapes --saveWithUncertainties -n .part3B\n

Combine will produce pre- and post-fit distributions (for fit_s and fit_b) in the fitDiagnostics.part3B.root output file (the .part3B label follows from the -n option we used).
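
As a starting point for the plotting task in the list below, the saved shapes can be retrieved with PyROOT. The directory layout assumed here (shapes_prefit, shapes_fit_b and shapes_fit_s, each with one folder per channel containing per-process histograms and totals) is the usual one, but it is worth browsing the file first to confirm it:

import ROOT\n\nf = ROOT.TFile.Open('fitDiagnostics.part3B.root')\n\n# Total expected background in the signal region, before and after the b-only fit\nh_pre = f.Get('shapes_prefit/signal_region/total_background')\nh_post = f.Get('shapes_fit_b/signal_region/total_background')\n\nprint('Pre-fit  background in bin 1: %.1f +/- %.1f' % (h_pre.GetBinContent(1), h_pre.GetBinError(1)))\nprint('Post-fit background in bin 1: %.1f +/- %.1f' % (h_post.GetBinContent(1), h_post.GetBinError(1)))\n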

Tasks and questions:

  • Make a plot showing the expected background and signal contributions using the output from FitDiagnostics - do this for both the pre-fit and post-fit. You will find a script postFitPlot.py in the longexercise directory that can help you get started. The bin errors on the TH1s in the fitDiagnostics file are determined from the systematic uncertainties. In the post-fit these take into account the additional constraints on the nuisance parameters as well as any correlations.

  • Why is the uncertainty on the post-fit so much smaller than on the pre-fit?

"},{"location":"part5/longexercise/#d-calculating-the-significance","title":"D: Calculating the significance","text":"

In the event that you observe a deviation from your null hypothesis, in this case the b-only hypothesis, Combine can be used to calculate the p-value or significance. To do this using the asymptotic approximation simply do:

combine -M Significance workspace_part3.root -m 200 --rMin -1 --rMax 2\n

To calculate the expected significance for a given signal strength we can just generate an Asimov dataset first:

combine -M Significance workspace_part3.root -m 200 --rMin -1 --rMax 5 -t -1 --expectSignal 1.5\n

Note that the Asimov dataset generated this way uses the nominal values of all model parameters to define the dataset. Another option is to add --toysFrequentist, which causes a fit to the data to be performed first (with r frozen to the --expectSignal value) and then any subsequent Asimov datasets or toys are generated using the post-fit values of the model parameters. In general this will result in a different value for the expected significance due to changes in the background normalisation and shape induced by the fit to data:

combine -M Significance workspace_part3.root -m 200 --rMin -1 --rMax 5 -t -1 --expectSignal 1.5 --toysFrequentist\n

Tasks and questions:

  • Note how much the expected significance changes with the --toysFrequentist option. Does the change make sense given the difference in the post-fit and pre-fit distributions you looked at in the previous section?
  • Advanced task It is also possible to calculate the significance using toys with HybridNew (details here) if we are in a situation where the asymptotic approximation is not reliable or if we just want to verify the result. Why might this be challenging for a high significance, say larger than \\(5\\sigma\\)?
"},{"location":"part5/longexercise/#e-signal-strength-measurement-and-uncertainty-breakdown","title":"E: Signal strength measurement and uncertainty breakdown","text":"

We have seen that with FitDiagnostics we can make a measurement of the best-fit signal strength and uncertainty. In the asymptotic approximation we find an interval at the \(\alpha\) CL around the best fit by identifying the parameter values at which our test statistic \(q = -2\Delta \ln L\) equals a critical value. This value is the \(\alpha\) quantile of the \(\chi^2\) distribution with one degree of freedom. In the expression for q we calculate the difference in the profile likelihood between some fixed point and the best-fit.
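
As a quick numerical check, the critical values of \(q\) for the 68% and 95% CL intervals can be computed from the \(\chi^2\) quantile function, for example using the one available in ROOT:

import ROOT\n\n# Critical values of q = -2*Delta(lnL) for 1D intervals, from the chi2 distribution with 1 dof\nfor cl in (0.68, 0.95):\n    print('%.0f%% CL: q_crit = %.2f' % (100 * cl, ROOT.Math.chisquared_quantile(cl, 1)))\n# Expect roughly 0.99 (approximately 1) and 3.84 respectively\n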

Depending on what we want to do with the measurement, e.g. whether it will be published in a journal, we may want to choose a more precise method for finding these intervals. There are a number of ways that parameter uncertainties are estimated in combine, and some are more precise than others:

  • Covariance matrix: calculated by the Minuit HESSE routine, this gives a symmetric uncertainty by definition and is only accurate when the profile likelihood for this parameter is symmetric and parabolic.
  • Minos error: calculated by the Minuit MINOS routine - performs a search for the upper and lower values of the parameter that give the critical value of \(q\) for the desired CL. Returns an asymmetric interval. This is what FitDiagnostics does by default, but only for the parameter of interest. Usually accurate, but prone to failure on more complex models, and it is not easy to control the tolerance for terminating the search.
  • RobustFit error: a custom implementation in combine similar to Minos that returns an asymmetric interval, but with more control over the precision. Enabled by adding --robustFit 1 when running FitDiagnostics.
  • Explicit scan of the profile likelihood on a chosen grid of parameter values, with interpolation between the points to find the parameter values corresponding to the appropriate critical values of \(q\). It is a good idea to use this for important measurements, since we can see by eye that there are no unexpected features in the shape of the likelihood curve.

In this section we will look at the last approach, using the MultiDimFit mode of combine. By default this mode just performs a single fit to the data:

combine -M MultiDimFit workspace_part3.root -n .part3E -m 200 --rMin -1 --rMax 2\n

You should see the best-fit value of the signal strength reported and nothing else. By adding the --algo X option combine will run an additional algorithm after this best fit. Here we will use --algo grid, which performs a scan of the likelihood with r fixed to a set of different values. The set of points will be equally spaced between the --rMin and --rMax values, and the number of points is controlled with --points N:

combine -M MultiDimFit workspace_part3.root -n .part3E -m 200 --rMin -1 --rMax 2 --algo grid --points 30\n

The results of the scan are written into the output file; if you open it interactively you should see:

Show output
root [1] limit->Scan(\"r:deltaNLL\")\n************************************\n*    Row   *         r *  deltaNLL *\n************************************\n*        0 * 0.5399457 *         0 *\n*        1 * -0.949999 * 5.6350698 *\n*        2 * -0.850000 * 4.9482779 *\n*        3 *     -0.75 * 4.2942519 *\n*        4 * -0.649999 * 3.6765284 *\n*        5 * -0.550000 * 3.0985388 *\n*        6 * -0.449999 * 2.5635135 *\n*        7 * -0.349999 * 2.0743820 *\n*        8 *     -0.25 * 1.6337506 *\n*        9 * -0.150000 * 1.2438088 *\n*       10 * -0.050000 * 0.9059833 *\n*       11 * 0.0500000 * 0.6215767 *\n*       12 * 0.1500000 * 0.3910581 *\n*       13 *      0.25 * 0.2144184 *\n*       14 * 0.3499999 * 0.0911308 *\n*       15 * 0.4499999 * 0.0201983 *\n*       16 * 0.5500000 * 0.0002447 *\n*       17 * 0.6499999 * 0.0294311 *\n*       18 *      0.75 * 0.1058298 *\n*       19 * 0.8500000 * 0.2272539 *\n*       20 * 0.9499999 * 0.3912534 *\n*       21 * 1.0499999 * 0.5952836 *\n*       22 * 1.1499999 * 0.8371513 *\n*       23 *      1.25 * 1.1142146 *\n*       24 * 1.3500000 * 1.4240909 *\n*       25 * 1.4500000 * 1.7644306 *\n*       26 * 1.5499999 * 2.1329684 *\n*       27 * 1.6499999 * 2.5273966 *\n*       28 *      1.75 * 2.9458723 *\n*       29 * 1.8500000 * 3.3863399 *\n*       30 * 1.9500000 * 3.8469560 *\n************************************\n

To turn this into a plot run:

python plot1DScan.py higgsCombine.part3E.MultiDimFit.mH200.root -o single_scan\n

This script will also perform a spline interpolation of the points to give accurate values for the uncertainties.
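
For reference, here is a rough PyROOT sketch of the idea: read the r and deltaNLL branches from the scan output and find the points where deltaNLL crosses 0.5 using simple linear interpolation between neighbouring scan points:

import ROOT\n\nf = ROOT.TFile.Open('higgsCombine.part3E.MultiDimFit.mH200.root')\ntree = f.Get('limit')\n\n# Collect the scan points and find where deltaNLL crosses 0.5 (the 68% CL interval)\npoints = sorted((entry.r, entry.deltaNLL) for entry in tree)\ncrossings = []\nfor (r1, d1), (r2, d2) in zip(points[:-1], points[1:]):\n    if (d1 - 0.5) * (d2 - 0.5) < 0:\n        crossings.append(r1 + (0.5 - d1) * (r2 - r1) / (d2 - d1))\nprint('Approximate 68%% CL interval on r: [%.2f, %.2f]' % (crossings[0], crossings[-1]))\n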

In the next step we will split this total uncertainty into two components. It is typical to separate the contribution from statistics and systematics, and sometimes even split the systematic part into different components. This gives us an idea of which aspects of the uncertainty dominate. The statistical component is usually defined as the uncertainty we would have if all the systematic uncertainties went to zero. We can emulate this effect by freezing all the nuisance parameters when we do the scan in r, such that they do not vary in the fit. This is achieved by adding the --freezeParameters allConstrainedNuisances option. It would also work if the parameters are specified explicitly, e.g. --freezeParameters CMS_eff_t,lumi_13TeV,..., but the allConstrainedNuisances option is more concise. Run the scan again with the systematics frozen, and use the plotting script to overlay this curve with the previous one:

combine -M MultiDimFit workspace_part3.root -n .part3E.freezeAll -m 200 --rMin -1 --rMax 2 --algo grid --points 30 --freezeParameters allConstrainedNuisances\npython plot1DScan.py higgsCombine.part3E.MultiDimFit.mH200.root --others 'higgsCombine.part3E.freezeAll.MultiDimFit.mH200.root:FreezeAll:2' -o freeze_first_attempt\n

This doesn't look quite right - the best-fit has been shifted because unfortunately the --freezeParameters option acts before the initial fit, whereas we only want to add it for the scan after this fit. To remedy this we can use a feature of Combine that lets us save a \"snapshot\" of the best-fit parameter values, and reuse this snapshot in subsequent fits. First we perform a single fit, adding the --saveWorkspace option:

combine -M MultiDimFit workspace_part3.root -n .part3E.snapshot -m 200 --rMin -1 --rMax 2 --saveWorkspace\n

The output file will now contain a copy of our workspace from the input, and this copy will contain a snapshot of the best-fit parameter values. We can now run the frozen scan again, but instead using this copy of the workspace as input, and restoring the snapshot that was saved:

combine -M MultiDimFit higgsCombine.part3E.snapshot.MultiDimFit.mH200.root -n .part3E.freezeAll -m 200 --rMin -1 --rMax 2 --algo grid --points 30 --freezeParameters allConstrainedNuisances --snapshotName MultiDimFit\npython plot1DScan.py higgsCombine.part3E.MultiDimFit.mH200.root --others 'higgsCombine.part3E.freezeAll.MultiDimFit.mH200.root:FreezeAll:2' -o freeze_second_attempt --breakdown Syst,Stat\n

Now the plot should look correct:

We added the --breakdown Syst,Stat option to the plotting script to make it calculate the systematic component, which is defined simply as \\(\\sigma_{\\text{syst}} = \\sqrt{\\sigma^2_{\\text{tot}} - \\sigma^2_{\\text{stat}}}\\). To split the systematic uncertainty into different components we just need to run another scan with a subset of the systematics frozen. For example, say we want to split this into experimental and theoretical uncertainties, we would calculate the uncertainties as:

\\(\\sigma_{\\text{theory}} = \\sqrt{\\sigma^2_{\\text{tot}} - \\sigma^2_{\\text{fr.theory}}}\\)

\\(\\sigma_{\\text{expt}} = \\sqrt{\\sigma^2_{\\text{fr.theory}} - \\sigma^2_{\\text{fr.theory+expt}}}\\)

\\(\\sigma_{\\text{stat}} = \\sigma_{\\text{fr.theory+expt}}\\)

where fr.=freeze.
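
A tiny worked example of these relations, using made-up uncertainties from three hypothetical scans:

import math\n\n# Made-up uncertainties from three scans: nothing frozen, theory frozen, theory+expt frozen\nsigma_tot = 0.50\nsigma_fr_theory = 0.45\nsigma_fr_theory_expt = 0.30  # this is also the statistical component\n\nsigma_theory = math.sqrt(sigma_tot ** 2 - sigma_fr_theory ** 2)\nsigma_expt = math.sqrt(sigma_fr_theory ** 2 - sigma_fr_theory_expt ** 2)\nprint('theory: %.3f, experimental: %.3f, statistical: %.3f' % (sigma_theory, sigma_expt, sigma_fr_theory_expt))\n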

While it is perfectly fine to just list the relevant nuisance parameters in the --freezeParameters argument for the \\(\\sigma_{\\text{fr.theory}}\\) scan, a convenient way can be to define a named group of parameters in the text datacard and then freeze all parameters in this group with --freezeNuisanceGroups. The syntax for defining a group is:

[group name] group = uncertainty_1 uncertainty_2 ... uncertainty_N\n

Tasks and questions:

  • Take our stat+syst split one step further and separate the systematic part into two: one part for hadronic tau uncertainties and one for all others.
  • Do this by defining a tauID group in the datacard including the following parameters: CMS_eff_t, CMS_eff_t_highpt, and the three CMS_scale_t_X uncertainties.
  • To plot this and calculate the split via the relations above you can just add further arguments to the --others option in the plot1DScan.py script. Each is of the form: '[file]:[label]:[color]'. The --breakdown argument should also be extended to three terms.
  • How important are these tau-related uncertainties compared to the others?
"},{"location":"part5/longexercise/#f-use-of-channel-masking","title":"F: Use of channel masking","text":"

We will now return briefly to the topic of blinding. We've seen that we can compute expected results by performing any Combine method on an Asimov dataset generated using -t -1. This is useful, because we can optimise our analysis without introducing any accidental bias that might come from looking at the data in the signal region. However our control regions have been chosen specifically to be signal-free, and it would be useful to use the data here to set the normalisation of our backgrounds even while the signal region remains blinded. Unfortunately there's no easy way to generate a partial Asimov dataset just for the signal region, but instead we can use a feature called \"channel masking\" to remove specific channels from the likelihood evaluation. One useful application of this feature is to make post-fit plots of the signal region from a control-region-only fit.

To use the masking we first need to rerun text2workspace.py with an extra option that will create variables named like mask_[channel] in the workspace:

text2workspace.py part3_combined.txt -m 200 -o workspace_part3_with_masks.root --channel-masks\n

These parameters have a default value of 0 which means the channel is not masked. By setting it to 1 the channel is masked from the likelihood evaluation. Task: Run the same FitDiagnostics command as before to save the post-fit shapes, but add an option --setParameters mask_signal_region=1. Note that the s+b fit will probably fail in this case, since we are no longer fitting a channel that contains signal, however the b-only fit should work fine. Task: Compare the expected background distribution and uncertainty to the pre-fit, and to the background distribution from the full fit you made before.

"},{"location":"part5/longexercise/#part-4-physics-models","title":"Part 4: Physics models","text":"

Topics covered in this section:

  • A: Writing a simple physics model
  • B: Performing and plotting 2D likelihood scans

With Combine we are not limited to parametrising the signal with a single scaling parameter r. In fact we can define any arbitrary scaling using whatever functions and parameters we would like. For example, when measuring the couplings of the Higgs boson to the different SM particles we would introduce a POI for each coupling parameter, for example \\(\\kappa_{\\text{W}}\\), \\(\\kappa_{\\text{Z}}\\), \\(\\kappa_{\\tau}\\) etc. We would then generate scaling terms for each \\(i\\rightarrow \\text{H}\\rightarrow j\\) process in terms of how the cross section (\\(\\sigma_i(\\kappa)\\)) and branching ratio (\\(\\frac{\\Gamma_i(\\kappa)}{\\Gamma_{\\text{tot}}(\\kappa)}\\)) scale relative to the SM prediction.

This parametrisation of the signal (and possibly backgrounds too) is specified in a physics model. This is a python class that is used by text2workspace.py to construct the model in terms of RooFit objects. There is documentation on using physics models here.

"},{"location":"part5/longexercise/#a-writing-a-simple-physics-model","title":"A: Writing a simple physics model","text":"

An example physics model that just implements a single parameter r is given in DASModel.py:

Show DASModel.py
from HiggsAnalysis.CombinedLimit.PhysicsModel import PhysicsModel\n\n\nclass DASModel(PhysicsModel):\n    def doParametersOfInterest(self):\n        \"\"\"Create POI and other parameters, and define the POI set.\"\"\"\n        self.modelBuilder.doVar(\"r[0,0,10]\")\n        self.modelBuilder.doSet(\"POI\", \",\".join([\"r\"]))\n\n    def getYieldScale(self, bin, process):\n        \"Return the name of a RooAbsReal to scale this yield by or the two special values 1 and 0 (don't scale, and set to zero)\"\n        if self.DC.isSignal[process]:\n            print(\"Scaling %s/%s by r\" % (bin, process))\n            return \"r\"\n        return 1\n\n\ndasModel = DASModel()\n

In this we override two methods of the basic PhysicsModel class: doParametersOfInterest and getYieldScale. In the first we define our POI variables, using the doVar function which accepts the RooWorkspace factory syntax for creating variables, and then define all our POIs in a set via the doSet function. The second function will be called for every process in every channel (bin), and using the corresponding strings we have to specify how that process should be scaled. Here we check if the process was declared as signal in the datacard, and if so scale it by r, otherwise if it is a background no scaling is applied (1). To use the physics model with text2workspace.py first copy it to the python directory in the Combine package:

cp DASModel.py $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/python/\n

In this section we will use the full datacards from the MSSM analysis. Have a look in part4/200/combined.txt. You will notice that there are now two signal processes declared: ggH and bbH. In the MSSM these cross sections can vary independently depending on the exact parameters of the model, so it is useful to be able to measure them independently too. First run text2workspace.py as follows, adding the -P option to specify the physics model, then verify the result of the fit:

text2workspace.py part4/200/combined.txt -P HiggsAnalysis.CombinedLimit.DASModel:dasModel -m 200 -o workspace_part4.root\ncombine -M MultiDimFit workspace_part4.root -n .part4A -m 200 --rMin 0 --rMax 2\n

Tasks and questions:

  • Modify the physics model to scale the ggH and bbH processes by r_ggH and r_bbH separately. A sketch of one possible implementation is shown after this list.
  • Then rerun the MultiDimFit command - you should see the result for both signal strengths printed.
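
For reference, here is a sketch of one possible implementation of the modified model (the parameter names and ranges are just one choice); it follows exactly the same pattern as DASModel.py above:

from HiggsAnalysis.CombinedLimit.PhysicsModel import PhysicsModel\n\n\nclass TwoSignalModel(PhysicsModel):\n    def doParametersOfInterest(self):\n        self.modelBuilder.doVar('r_ggH[1,0,10]')\n        self.modelBuilder.doVar('r_bbH[1,0,10]')\n        self.modelBuilder.doSet('POI', 'r_ggH,r_bbH')\n\n    def getYieldScale(self, bin, process):\n        # Scale each signal process by its own POI; backgrounds are left untouched\n        if process == 'ggH':\n            return 'r_ggH'\n        if process == 'bbH':\n            return 'r_bbH'\n        return 1\n\n\ntwoSignalModel = TwoSignalModel()\n

As with DASModel.py, the file would be copied to the python directory and passed to text2workspace.py with -P, using the module name that matches the file name you choose.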
"},{"location":"part5/longexercise/#b-performing-and-plotting-2d-likelihood-scans","title":"B: Performing and plotting 2D likelihood scans","text":"

For a model with two POIs it is often useful to look at how well we are able to measure both simultaneously. A natural extension of determining 1D confidence intervals on a single parameter, as we did in part 3E, is to determine confidence level regions in 2D. To do this we also use combine in a similar way, with -M MultiDimFit --algo grid. When two POIs are found, Combine will scan a 2D grid of points instead of a 1D array.

Tasks and questions:

  • Run a 2D likelihood scan in r_ggH and r_bbH. You can start with around 100 points but may need to increase this later to see more detail in the resulting plot.
  • Have a look at the output limit tree: it should have branches for each POI as well as the usual deltaNLL value. You can use TTree::Draw to plot a 2D histogram of deltaNLL with r_ggH and r_bbH on the axes; a PyROOT sketch of this is given below.
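
As a starting point for the 2D plot, here is a rough PyROOT sketch that fills a TGraph2D with 2*deltaNLL as a function of the two POIs; the input file name depends on the -n label you used, and the branch names assume the POIs are called r_ggH and r_bbH:

import ROOT\n\nf = ROOT.TFile.Open('higgsCombineTest.MultiDimFit.mH200.root')  # adjust to the -n label you used\ntree = f.Get('limit')\n\n# Fill a TGraph2D with 2*deltaNLL as a function of the two signal strengths\ngraph = ROOT.TGraph2D()\nfor entry in tree:\n    graph.SetPoint(graph.GetN(), entry.r_ggH, entry.r_bbH, 2 * entry.deltaNLL)\n\nc = ROOT.TCanvas()\ngraph.Draw('COLZ')\nc.SaveAs('scan_2d.png')\n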
"},{"location":"part5/longexerciseanswers/","title":"Answers to tasks and questions","text":""},{"location":"part5/longexerciseanswers/#part-1-a-one-bin-counting-experiment","title":"Part 1: A one-bin counting experiment","text":""},{"location":"part5/longexerciseanswers/#a-computing-limits-using-the-asymptotic-approximation","title":"A: Computing limits using the asymptotic approximation","text":"

Tasks and questions:

  • There are some important uncertainties missing from the datacard above. Add the uncertainty on the luminosity (name: lumi_13TeV) which has a 2.5% effect on all processes (except the jetFakes, which are taken from data), and uncertainties on the inclusive cross sections of the Ztautau and ttbar processes (with names xsec_Ztautau and xsec_ttbar) which are 4% and 6% respectively.
  • Try changing the values of some uncertainties (up or down, or removing them altogether) - how do the expected and observed limits change?
Show answer Larger uncertainties make the limits worse (i.e., higher values of the limit); smaller uncertainties improve the limit (lower values of the limit).
  • Now try changing the number of observed events. The observed limit will naturally change, but the expected does too - why might this be?
Show answer This is because the expected limit relies on a background-only Asimov dataset that is created after a background-only fit to the data. By changing the observed data, the pulls on the NPs in this fit also change, and therefore so does the expected sensitivity."},{"location":"part5/longexerciseanswers/#advanced-section-b-computing-limits-with-toys","title":"Advanced section: B: Computing limits with toys","text":"

Tasks and questions:

  • In contrast to AsymptoticLimits, with HybridNew each limit comes with an uncertainty. What is the origin of this uncertainty?
Show answer The uncertainty is caused by the limited number of toys: the values of Pmu and Pb come from counting the number of toys in the tails of the test statistic distributions. The number of toys used can be adjusted with the option --toysH
  • How good is the agreement between the asymptotic and toy-based methods?
Show answer The agreement should be pretty good in this example, but will generally break down once we get to the level of 0-5 events.
  • Why does it take longer to calculate the lower expected quantiles (e.g. 0.025, 0.16)? Think about how the statistical uncertainty on the CLs value depends on Pmu and Pb.
Show answer For this we need the definition of CLs = Pmu / (1-Pb). The 0.025 expected quantile is by definition where Pb = 0.025, so for a 95% CL limit we have CLs = 0.05, implying we are looking for the value of r where Pmu = 0.00125. With 1000 s+b toys we would then only expect `1000 * 0.00125 = 1.25` toys in the tail region we have to integrate over. Contrast this to the median limit where 25 toys would be in this region. This means we have to generate a much larger number of toys to get the same statistical power."},{"location":"part5/longexerciseanswers/#advanced-section-b-asymptotic-approximation-limitations","title":"Advanced section: B: Asymptotic approximation limitations","text":"

Tasks and questions:

  • Is the asymptotic limit still a good approximation?
Show answer A \"good\" approximation is not well defined, but the difference is clearly larger here.
  • You might notice that the test statistic distributions are not smooth but rather have several \"bump\" structures. Where might this come from? Try reducing the size of the systematic uncertainties to make them more pronounced.
Show answer This bump structure comes from the discreteness of the Poisson sampling of the toy datasets. Systematic uncertainties then smear these bumps out, but without systematics we would see delta functions corresponding to the possible integer number of events that could be observed. Once we go to more typical multi-bin analyses with more events and systematic uncertainties this discreteness washes out very quickly."},{"location":"part5/longexerciseanswers/#part-2-a-shape-based-analysis","title":"Part 2: A shape-based analysis","text":""},{"location":"part5/longexerciseanswers/#a-setting-up-the-datacard","title":"A: Setting up the datacard","text":"

Only tasks, no questions in this section

"},{"location":"part5/longexerciseanswers/#b-running-combine-for-a-blind-analysis","title":"B: Running combine for a blind analysis","text":"

Tasks and questions:

  • Compare the expected limits calculated with --run expected and --run blind. Why are they different?
Show answer When using --run blind combine will create a background-only Asimov dataset without performing a fit to data first. With --run expected, the observed limit isn't shown, but the background-only Asimov dataset used for the limit calculation is still created after a background-only fit to the data.
  • Calculate a blind limit by generating a background-only Asimov with the -t option instead of using the AsymptoticLimits specific options. You should find the observed limit is the same as the expected. Then see what happens if you inject a signal into the Asimov dataset using the --expectSignal [X] option.
Show answer You should see that with a signal injected the observed limit is worse (has a higher value) than the expected limit: for the expected limit the b-only Asimov dataset is still used, but the observed limit is now calculated on the signal + background Asimov dataset, with a signal at the specified cross section [X]."},{"location":"part5/longexerciseanswers/#c-using-fitdiagnostics","title":"C: Using FitDiagnostics","text":"

Tasks and questions:

  • Which parameter has the largest shift from the nominal value? Which has the tightest constraint?
Show answer CMS_eff_t_highpt should have the largest shift from the nominal value (around 0.47); norm_jetFakes has the tightest constraint (to 25% of the input uncertainty).
  • Should we be concerned when a parameter is more strongly constrained than the input uncertainty (i.e. \\(\\frac{\\sigma}{\\sigma_I}<1.0\\))?
Show answer This is still a hot topic in CMS analyses today, and there isn't a right or wrong answer. Essentially we have to judge if our analysis should really be able to provide more information about this parameter than the external measurement that gave us the input uncertainty. So we would not expect to be able to constrain the luminosity uncertainty for example, but uncertainties specific to the analysis might legitimately be constrained."},{"location":"part5/longexerciseanswers/#d-mc-statistical-uncertainties","title":"D: MC statistical uncertainties","text":"

Tasks and questions:

  • Check how much the cross section measurement and uncertainties change using FitDiagnostics.
Show answer Without autoMCStats we find: Best fit r: -2.73273 -2.13428/+3.38185, with autoMCStats: Best fit r: -3.07825 -3.17742/+3.7087
  • It is also useful to check how the expected uncertainty changes using an Asimov dataset, say with r=10 injected.
Show answer Without autoMCStats we find: Best fit r: 9.99978 -4.85341/+6.56233 , with autoMCStats: Best fit r: 9.99985 -5.24634/+6.98266
  • Advanced task: See what happens if the Poisson threshold is increased. Based on your results, what threshold would you recommend for this analysis?
Show answer At first the uncertainties increase as the threshold increases, and at some point they stabilise. A Poisson threshold at 10 is probably reasonable for this analysis."},{"location":"part5/longexerciseanswers/#part-3-adding-control-regions","title":"Part 3: Adding control regions","text":""},{"location":"part5/longexerciseanswers/#a-use-of-rateparams","title":"A: Use of rateParams","text":"

Tasks and questions:

  • Run text2workspace.py on this combined card and then use FitDiagnostics on an Asimov dataset with r=1 to get the expected uncertainty. Suggested command line options: --rMin 0 --rMax 2
Show answer As expected uncertainty you should get -0.417238/+0.450593
  • Using the RooFitResult in the fitDiagnosticsTest.root file, check the post-fit value of the rateParams. To what level are the normalisations of the DY and ttbar processes constrained?
Show answer They are constrained to around 1-2%
  • To compare to the previous approach of fitting the SR only, with cross section and acceptance uncertainties restored, an additional card is provided: datacard_part3_nocrs.txt. Run the same fit on this card to verify the improvement of the SR+CR approach
Show answer The expected uncertainty is larger with only the SR: -0.465799/+0.502088 compared with -0.417238/+0.450593 in the SR+CR approach."},{"location":"part5/longexerciseanswers/#b-nuisance-parameter-impacts","title":"B: Nuisance parameter impacts","text":"

Tasks and questions:

  • Identify the most important uncertainties using the impacts tool.
Show answer The most important uncertainty is norm_jetFakes, followed by two MC statistical uncertainties (prop_binsignal_region_bin8 and prop_binsignal_region_bin9).
  • In the plot, some parameters do not show a plotted point for the fitted value, but rather just a numerical value - why?
Show answer These are freely floating parameters ( rate_ttbar and rate_Zll ). They have no prior constraint (and so no shift from the nominal value relative to the input uncertainty) - we show the best-fit value + uncertainty directly."},{"location":"part5/longexerciseanswers/#c-post-fit-distributions","title":"C: Post-fit distributions","text":"

Tasks and questions:

The bin errors on the TH1s in the fitdiagnostics file are determined from the systematic uncertainties. In the post-fit these take into account the additional constraints on the nuisance parameters as well as any correlations.

  • Why is the uncertainty on the post-fit so much smaller than on the pre-fit?
Show answer There are two effects at play here: the nuisance parameters get constrained, and there are anti-correlations between the parameters which also have the effect of reducing the total uncertainty. Note: the post-fit uncertainty could become larger when rateParams are present as they are not taken into account in the pre-fit uncertainty but do enter in the post-fit uncertainty."},{"location":"part5/longexerciseanswers/#d-calculating-the-significance","title":"D: Calculating the significance","text":"

Tasks and questions:

  • Advanced task It is also possible to calculate the significance using toys with HybridNew (details here) if we are in a situation where the asymptotic approximation is not reliable or if we just want to verify the result. Why might this be challenging for a high significance, say larger than \\(5\\sigma\\)?
Show answer A significance of \(5\sigma\) corresponds to a p-value of around \(3\cdot 10^{-7}\) - so we need to populate the very tail of the test statistic distribution and this requires generating a large number of toys."},{"location":"part5/longexerciseanswers/#e-signal-strength-measurement-and-uncertainty-breakdown","title":"E: Signal strength measurement and uncertainty breakdown","text":"

Tasks and questions:

  • Take our stat+syst split one step further and separate the systematic part into two: one part for hadronic tau uncertainties and one for all others. Do this by defining a tauID group in the datacard including the following parameters: CMS_eff_t, CMS_eff_t_highpt, and the three CMS_scale_t_X uncertainties.
Show datacard line You should add this line to the end of the datacard:
tauID group = CMS_eff_t CMS_eff_t_highpt CMS_scale_t_1prong0pi0_13TeV CMS_scale_t_1prong1pi0_13TeV CMS_scale_t_3prong0pi0_13TeV\n
  • To plot this and calculate the split via the relations above you can just add further arguments to the --others option in the plot1DScan.py script. Each is of the form: '[file]:[label]:[color]'. The --breakdown argument should also be extended to three terms.
Show code This can be done as:
python plot1DScan.py higgsCombine.part3E.MultiDimFit.mH200.root --others 'higgsCombine.part3E.freezeTauID.MultiDimFit.mH200.root:FreezeTauID:4' 'higgsCombine.part3E.freezeAll.MultiDimFit.mH200.root:FreezeAll:2' -o freeze_third_attempt --breakdown TauID,OtherSyst,Stat\n\n
  • How important are these tau-related uncertainties compared to the others?
Show answer They are smaller than both the statistical uncertainty and the remaining systematic uncertainties"},{"location":"part5/roofit/","title":"RooFit Basics","text":"

RooFit is an OO analysis environment built on ROOT. It has a collection of classes designed to augment ROOT for data modeling.

This section covers a few of the basics of RooFit. There are many more tutorials available at this link: https://root.cern.ch/root/html600/tutorials/roofit/index.html

"},{"location":"part5/roofit/#objects","title":"Objects","text":"

In RooFit, any variable, data point, function, PDF (etc.) is represented by a C++ object. The most basic of these is the RooRealVar. We will create one that will represent the mass of some hypothetical particle; we name it and give it an initial starting value and range.

RooRealVar MH(\"MH\",\"mass of the Hypothetical Boson (H-boson) in GeV\",125,120,130);\nMH.Print();\n
RooRealVar::MH = 125  L(120 - 130)\n

Ok, great. This variable is now an object we can play around with. We can access this object and modify its properties, such as its value.

MH.setVal(130);\nMH.getVal();\n

In particle detectors we typically do not observe this particle mass, but usually define some observable which is sensitive to this mass. We will assume we can detect and reconstruct the decay products of the H-boson and measure the invariant mass of those particles. We need to make another variable that represents that invariant mass.

RooRealVar mass(\"m\",\"m (GeV)\",100,80,200);\n

In a perfect world we would measure the exact mass of the particle in every single event. However, our detectors are usually far from perfect, so there will be some resolution effect. We will assume the resolution of our measurement of the invariant mass is 10 GeV and call it \"sigma\".

RooRealVar sigma(\"resolution\",\"#sigma\",10,0,20);\n

More exotic variables can be constructed out of these RooRealVars using RooFormulaVars. For example, suppose we wanted to make a function out of the variables that represented the relative resolution as a function of the hypothetical mass MH.

RooFormulaVar func(\"R\",\"@0/@1\",RooArgList(sigma,mass));\nfunc.Print(\"v\");\n
Show
--- RooAbsArg ---\n  Value State: DIRTY\n  Shape State: DIRTY\n  Attributes: \n  Address: 0x10e878068\n  Clients: \n  Servers: \n    (0x10dcd47b0,V-) RooRealVar::resolution \"#sigma\"\n    (0x10dcd4278,V-) RooRealVar::m \"m (GeV)\"\n  Proxies: \n    actualVars -> \n      1)  resolution\n      2)           m\n--- RooAbsReal ---\n\n  Plot label is \"R\"\n    --- RooFormula ---\n    Formula: \"@0/@1\"\n    (resolution,m)\n

Notice how there is a list of the variables we passed (the servers or \"actual vars\"). We can now plot the function. RooFit has a special plotting object RooPlot which keeps track of the objects (and their normalisations) that we want to draw. Since RooFit does not know the difference between objects that are and are not dependent, we need to tell it.

Right now, we have the relative resolution as \\(R(m,\\sigma)\\), whereas we want to plot \\(R(m,\\sigma(m))\\)!

TCanvas *can = new TCanvas();\n\n//make the x-axis the \"mass\"\nRooPlot *plot = mass.frame(); \nfunc.plotOn(plot);\n\nplot->Draw();\ncan->Draw();\n

The main objects we are interested in using from RooFit are probability density functions (PDFs). We can construct the PDF,

\\[ f(m|M_{H},\\sigma) \\]

as a simple Gaussian shape for example or a RooGaussian in RooFit language (think McDonald's logic, everything is a RooSomethingOrOther)

RooGaussian gauss(\"gauss\",\"f(m|M_{H},#sigma)\",mass,MH,sigma);\ngauss.Print(\"V\");\n
Show
--- RooAbsArg ---\n  Value State: DIRTY\n  Shape State: DIRTY\n  Attributes: \n  Address: 0x10ecf4188\n  Clients: \n  Servers: \n    (0x10dcd4278,V-) RooRealVar::m \"m (GeV)\"\n    (0x10a08a9d8,V-) RooRealVar::MH \"mass of the Hypothetical Boson (H-boson) in GeV\"\n    (0x10dcd47b0,V-) RooRealVar::resolution \"#sigma\"\n  Proxies: \n    x -> m\n    mean -> MH\n    sigma -> resolution\n--- RooAbsReal ---\n\n  Plot label is \"gauss\"\n--- RooAbsPdf ---\nCached value = 0\n

Notice how the Gaussian PDF, like the RooFormulaVar, depends on our RooRealVar objects; these are its servers. Its evaluation will depend on their values.

The main difference between PDFs and functions in RooFit is that PDFs are automatically normalised to unity, hence they represent a probability density and you do not need to normalise them yourself. Let's plot it for different values of \(M_{H}\).

plot = mass.frame();\n\ngauss.plotOn(plot);\n\nMH.setVal(120);\ngauss.plotOn(plot,RooFit::LineColor(kBlue));\n\nMH.setVal(125);\ngauss.plotOn(plot,RooFit::LineColor(kRed));\n\nMH.setVal(135);\ngauss.plotOn(plot,RooFit::LineColor(kGreen));\n\nplot->Draw();\n\ncan->Update();\ncan->Draw();\n

Note that as we change the value of MH, the PDF gets updated at the same time.

PDFs can be used to generate Monte Carlo data. One of the benefits of RooFit is that doing so only takes a single line of code! As before, we have to tell RooFit which variables to generate in (e.g. which are the observables for an experiment). In this case, each of our events will be a single value of \"mass\" \(m\). The arguments for the function are the set of observables, followed by the number of events,

RooDataSet *gen_data = (RooDataSet*) gauss.generate(RooArgSet(mass),500); \n

Now we can plot the data as with other RooFit objects.

plot = mass.frame();\n\ngen_data->plotOn(plot);\ngauss.plotOn(plot);\ngauss.paramOn(plot);\n\nplot->Draw();\ncan->Update();\ncan->Draw();\n

Of course we are not in the business of generating MC events, but collecting real data! Next we will look at using real data in RooFit.

"},{"location":"part5/roofit/#datasets","title":"Datasets","text":"

A dataset is essentially just a collection of points in N-dimensional (N-observables) space. There are two basic implementations in RooFit,

1) an \"unbinned\" dataset - RooDataSet

2) a \"binned\" dataset - RooDataHist

both of these use the same basic structure as below

We will create an empty dataset where the only observable is the mass. Points can be added to the dataset one by one ...

RooDataSet mydata(\"dummy\",\"My dummy dataset\",RooArgSet(mass)); \n// We've made a dataset with one observable (mass)\n\nmass.setVal(123.4);\nmydata.add(RooArgSet(mass));\nmass.setVal(145.2);\nmydata.add(RooArgSet(mass));\nmass.setVal(170.8);\nmydata.add(RooArgSet(mass));\n\nmydata.Print();\n
RooDataSet::dummy[m] = 3 entries\n

There are also other ways to manipulate datasets in this way as shown in the diagram below

Luckily there are also constructors for a RooDataSet from a TTree and for a RooDataHist from a TH1, so it is simple to convert from your usual ROOT objects.

We will take an example dataset put together already. The file tutorial.root can be downloaded here.

TFile *file = TFile::Open(\"tutorial.root\");\nfile->ls();\n
Show file contents
TFile**     tutorial.root\n TFile*     tutorial.root\n  KEY: RooWorkspace workspace;1 Tutorial Workspace\n  KEY: TProcessID   ProcessID0;1    48737500-e7e5-11e6-be6f-0d0011acbeef\n

Inside the file, there is something called a RooWorkspace. This is just the RooFit way of keeping a persistent link between the objects for a model. It is a very useful way to share data and PDFs/functions etc among CMS collaborators.

We will now take a look at it. It contains a RooDataSet and one variable. This time we called our variable (or observable) CMS_hgg_mass; we will assume that this is the invariant mass of photon pairs, where our H-boson decays to photons.

RooWorkspace *wspace = (RooWorkspace*) file->Get(\"workspace\");\nwspace->Print(\"v\");\n
Show
RooWorkspace(workspace) Tutorial Workspace contents\n\nvariables\n---------\n(CMS_hgg_mass)\n\ndatasets\n--------\nRooDataSet::dataset(CMS_hgg_mass)\n

Now we will have a look at the data. The RooWorkspace has several accessor functions, we will use the RooWorkspace::data one. There are also RooWorkspace::var, RooWorkspace::function and RooWorkspace::pdf with (hopefully) obvious purposes.

RooDataSet *hgg_data = (RooDataSet*) wspace->data(\"dataset\");\nRooRealVar *hgg_mass = (RooRealVar*) wspace->var(\"CMS_hgg_mass\");\n\nplot = hgg_mass->frame();\n\nhgg_data->plotOn(plot,RooFit::Binning(160)); \n// Here we've picked a certain number of bins just for plotting purposes \n\nTCanvas *hggcan = new TCanvas();\nplot->Draw();\nhggcan->Update();\nhggcan->Draw();\n

"},{"location":"part5/roofit/#likelihoods-and-fitting-to-data","title":"Likelihoods and Fitting to data","text":"

The data we have in our file does not look like a Gaussian distribution. Instead, we could probably use something like an exponential to describe it.

There is an exponential PDF already in RooFit (yes, you guessed it) RooExponential. For a PDF, we only need one parameter which is the exponential slope \\(\\alpha\\) so our pdf is,

\\[ f(m|\\alpha) = \\dfrac{1}{N} e^{-\\alpha m}\\]

Where of course, \\(N = \\int_{110}^{150} e^{-\\alpha m} dm\\) is the normalisation constant.

You can find several available RooFit functions here: https://root.cern.ch/root/html/ROOFIT_ROOFIT_Index.html

There is also support for a generic PDF in the form of a RooGenericPdf, check this link: https://root.cern.ch/doc/v608/classRooGenericPdf.html

Now we will create an exponential PDF for our background,

RooRealVar alpha(\"alpha\",\"#alpha\",-0.05,-0.2,0.01);\nRooExponential expo(\"exp\",\"exponential function\",*hgg_mass,alpha);\n

We can use RooFit to estimate the value of \(\alpha\) from this dataset. You will learn more about parameter estimation, but for now we will just assume you know about maximizing likelihoods. This maximum likelihood estimator is common in HEP and is known to give unbiased estimates for things like distribution means etc.

This also introduces the other main use of PDFs in RooFit. They can be used to construct likelihoods easily.

The likelihood \\(\\mathcal{L}\\) is defined for a particluar dataset (and model) as being proportional to the probability to observe the data assuming some pdf. For our data, the probability to observe an event with a value in an interval bounded by a and b is given by,

\\[ P\\left(m~\\epsilon~[a,b] \\right) = \\int_{a}^{b} f(m|\\alpha)dm \\]

As that interval shrinks we can say this probability just becomes equal to \\(f(m|\\alpha)dm\\).

The probability to observe the dataset we have is given by the product of such probabilities for each of our data points, so that

\\[\\mathcal{L}(\\alpha) \\propto \\prod_{i} f(m_{i}|\\alpha)\\]

Note that for a specific dataset, the \(dm\) factors which should be there are constant. They can therefore be absorbed into the constant of proportionality!

The maximum likelihood estimator for \(\alpha\), usually written as \(\hat{\alpha}\), is found by maximising \(\mathcal{L}(\alpha)\).

Note that this will not depend on the value of the constant of proportionality so we can ignore it. This is true in most scenarios because usually only the ratio of likelihoods is needed, in which the constant factors out.

Obviously this multiplication of exponentials can lead to very large (or very small) numbers, which can lead to numerical instabilities. To avoid this, we can take the log of the likelihood. It is also common to multiply this by -1 and minimize the resulting Negative Log Likelihood: \(\mathrm{-Log}\mathcal{L}(\alpha)\).

RooFit can construct the NLL for us.

RooNLLVar *nll = (RooNLLVar*) expo.createNLL(*hgg_data);\nnll->Print(\"v\");\n
Show
--- RooAbsArg ---\n  Value State: DIRTY\n  Shape State: DIRTY\n  Attributes:\n  Address: 0x7fdddbe46200\n  Clients:\n  Servers:\n    (0x11eab5638,V-) RooRealVar::alpha \"#alpha\"\n  Proxies:\n    paramSet ->\n      1)  alpha\n--- RooAbsReal ---\n\n  Plot label is \"nll_exp_dataset\"\n

Notice that the NLL object knows which RooRealVar is the parameter because it doesn't find that one in the dataset. This is how RooFit distinguishes between observables and parameters.

RooFit has an interface to Minuit via the RooMinimizer class, which takes the NLL as an argument. To minimize, we just call the RooMinimizer::minimize() function. Minuit2 is the program and migrad is the minimization routine, which uses a gradient-based algorithm.

RooMinimizer minim(*nll);\nminim.minimize(\"Minuit2\",\"migrad\");  \n
Show
 **********\n **    1 **SET PRINT           1\n **********\n **********\n **    2 **SET NOGRAD\n **********\n PARAMETER DEFINITIONS:\n    NO.   NAME         VALUE      STEP SIZE      LIMITS\n     1 alpha       -5.00000e-02  2.10000e-02   -2.00000e-01  1.00000e-02\n **********\n **    3 **SET ERR         0.5\n **********\n **********\n **    4 **SET PRINT           1\n **********\n **********\n **    5 **SET STR           1\n **********\n NOW USING STRATEGY  1: TRY TO BALANCE SPEED AGAINST RELIABILITY\n **********\n **    6 **MIGRAD         500           1\n **********\n FIRST CALL TO USER FUNCTION AT NEW START POINT, WITH IFLAG=4.\n START MIGRAD MINIMIZATION.  STRATEGY  1.  CONVERGENCE WHEN EDM .LT. 1.00e-03\n FCN=3589.52 FROM MIGRAD    STATUS=INITIATE        4 CALLS           5 TOTAL\n                     EDM= unknown      STRATEGY= 1      NO ERROR MATRIX\n  EXT PARAMETER               CURRENT GUESS       STEP         FIRST\n  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE\n   1  alpha       -5.00000e-02   2.10000e-02   2.24553e-01  -9.91191e+01\n                               ERR DEF= 0.5\n MIGRAD MINIMIZATION HAS CONVERGED.\n MIGRAD WILL VERIFY CONVERGENCE AND ERROR MATRIX.\n COVARIANCE MATRIX CALCULATED SUCCESSFULLY\n FCN=3584.68 FROM MIGRAD    STATUS=CONVERGED      18 CALLS          19 TOTAL\n                     EDM=1.4449e-08    STRATEGY= 1      ERROR MATRIX ACCURATE\n  EXT PARAMETER                                   STEP         FIRST\n  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE\n   1  alpha       -4.08262e-02   2.91959e-03   1.33905e-03  -3.70254e-03\n                               ERR DEF= 0.5\n EXTERNAL ERROR MATRIX.    NDIM=  25    NPAR=  1    ERR DEF=0.5\n  8.527e-06\n

RooFit has found the best fit value of alpha for this dataset. It also estimates an uncertainty on alpha using the Hessian matrix from the fit.

alpha.Print(\"v\");\n
--- RooAbsArg ---\n  Value State: clean\n  Shape State: clean\n  Attributes:\n  Address: 0x11eab5638\n  Clients:\n    (0x11eab5978,V-) RooExponential::exp \"exponential function\"\n    (0x7fdddbe46200,V-) RooNLLVar::nll_exp_dataset \"-log(likelihood)\"\n    (0x7fdddbe95600,V-) RooExponential::exp \"exponential function\"\n    (0x7fdddbe5a400,V-) RooRealIntegral::exp_Int[CMS_hgg_mass] \"Integral of exponential function\"\n  Servers:\n  Proxies:\n--- RooAbsReal ---\n\n  Plot label is \"alpha\"\n--- RooAbsRealLValue ---\n  Fit range is [ -0.2 , 0.01 ]\n--- RooRealVar ---\n  Error = 0.00291959\n

We will plot the resulting exponential on top of the data. Notice that the value of \\(\\hat{\\alpha}\\) is used for the exponential.

expo.plotOn(plot);\nexpo.paramOn(plot);\nplot->Draw();\nhggcan->Update();\nhggcan->Draw();\n

It looks like there could be a small region near 125 GeV for which our fit does not quite go through the points. Maybe our hypothetical H-boson is not so hypothetical after all!

We will now see what happens if we include some resonant signal into the fit. We can take our Gaussian function again and use that as a signal model. A reasonable value for the resolution of a resonant signal with a mass around 125 GeV decaying to a pair of photons is around a GeV.

sigma.setVal(1.);\nsigma.setConstant();\n\nMH.setVal(125);\nMH.setConstant();\n\nRooGaussian hgg_signal(\"signal\",\"Gaussian PDF\",*hgg_mass,MH,sigma);\n

By setting these parameters constant, RooFit knows (either when creating the NLL by hand or when using fitTo) that there is no need to fit for these parameters.

We need to add this to our exponential model and fit a \"Signal+Background model\" by creating a RooAddPdf. In RooFit there are two ways to add PDFs: recursively, where the fraction of yields for the signal and background is a parameter, or absolutely, where each PDF has its own normalization. We're going to use the second one.

RooRealVar norm_s(\"norm_s\",\"N_{s}\",10,100);\nRooRealVar norm_b(\"norm_b\",\"N_{b}\",0,1000);\n\nconst RooArgList components(hgg_signal,expo);\nconst RooArgList coeffs(norm_s,norm_b);\n\nRooAddPdf model(\"model\",\"f_{s+b}\",components,coeffs);\nmodel.Print(\"v\");\n
Show
--- RooAbsArg ---\n  Value State: DIRTY\n  Shape State: DIRTY\n  Attributes: \n  Address: 0x11ed5d7a8\n  Clients: \n  Servers: \n    (0x11ed5a0f0,V-) RooGaussian::signal \"Gaussian PDF\"\n    (0x11ed5d058,V-) RooRealVar::norm_s \"N_{s}\"\n    (0x11eab5978,V-) RooExponential::exp \"exponential function\"\n    (0x11ed5d398,V-) RooRealVar::norm_b \"N_{b}\"\n  Proxies: \n    !refCoefNorm -> \n    !pdfs -> \n      1)  signal\n      2)     exp\n    !coefficients -> \n      1)  norm_s\n      2)  norm_b\n--- RooAbsReal ---\n\n  Plot label is \"model\"\n--- RooAbsPdf ---\nCached value = 0\n

Ok, now we will fit the model. Note this time we add the option Extended(), which tells RooFit that we care about the overall number of observed events in the data \\(n\\) too. It will add an additional Poisson term in the likelihood to account for this so our likelihood this time looks like,

\[L_{s+b}(N_{s},N_{b},\alpha) = \dfrac{ (N_{s}+N_{b})^{n} e^{-(N_{s}+N_{b})} }{n!} \cdot \prod_{i}^{n} \left[ c f_{s}(m_{i}|M_{H},\sigma)+ (1-c)f_{b}(m_{i}|\alpha) \right] \]

where \\(c = \\dfrac{ N_{s} }{ N_{s} + N_{b} }\\), \\(f_{s}(m|M_{H},\\sigma)\\) is the Gaussian signal pdf and \\(f_{b}(m|\\alpha)\\) is the exponential pdf. Remember that \\(M_{H}\\) and \\(\\sigma\\) are fixed so that they are no longer parameters of the likelihood.

There is a simpler interface for maximum-likelihood fits which is the RooAbsPdf::fitTo method. With this simple method, RooFit will construct the negative log-likelihood function, from the pdf, and minimize all of the free parameters in one step.

model.fitTo(*hgg_data,RooFit::Extended());\n\nmodel.plotOn(plot,RooFit::Components(\"exp\"),RooFit::LineColor(kGreen));\nmodel.plotOn(plot,RooFit::LineColor(kRed));\nmodel.paramOn(plot);\n\nhggcan->Clear();\nplot->Draw();\nhggcan->Update();\nhggcan->Draw();\n

What if we also fit for the mass (\(M_{H}\))? We can easily do this by removing the constant setting on MH.

MH.setConstant(false);\nmodel.fitTo(*hgg_data,RooFit::Extended());\n
Show output
[#1] INFO:Minization -- RooMinimizer::optimizeConst: activating const optimization\n[#1] INFO:Minization --  The following expressions will be evaluated in cache-and-track mode: (signal,exp)\n **********\n **    1 **SET PRINT           1\n **********\n **********\n **    2 **SET NOGRAD\n **********\n PARAMETER DEFINITIONS:\n    NO.   NAME         VALUE      STEP SIZE      LIMITS\n     1 MH           1.25000e+02  1.00000e+00    1.20000e+02  1.30000e+02\n     2 alpha       -4.08793e-02  2.96856e-03   -2.00000e-01  1.00000e-02\n     3 norm_b       9.67647e+02  3.25747e+01    0.00000e+00  1.00000e+03\n MINUIT WARNING IN PARAMETR\n ============== VARIABLE3 BROUGHT BACK INSIDE LIMITS.\n     4 norm_s       3.22534e+01  1.16433e+01    1.00000e+01  1.00000e+02\n **********\n **    3 **SET ERR         0.5\n **********\n **********\n **    4 **SET PRINT           1\n **********\n **********\n **    5 **SET STR           1\n **********\n NOW USING STRATEGY  1: TRY TO BALANCE SPEED AGAINST RELIABILITY\n **********\n **    6 **MIGRAD        2000           1\n **********\n FIRST CALL TO USER FUNCTION AT NEW START POINT, WITH IFLAG=4.\n START MIGRAD MINIMIZATION.  STRATEGY  1.  CONVERGENCE WHEN EDM .LT. 1.00e-03\n FCN=-2327.53 FROM MIGRAD    STATUS=INITIATE       10 CALLS          11 TOTAL\n                     EDM= unknown      STRATEGY= 1      NO ERROR MATRIX       \n  EXT PARAMETER               CURRENT GUESS       STEP         FIRST   \n  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE \n   1  MH           1.25000e+02   1.00000e+00   2.01358e-01   1.12769e+01\n   2  alpha       -4.08793e-02   2.96856e-03   3.30048e-02  -1.22651e-01\n   3  norm_b       9.67647e+02   3.25747e+01   2.56674e-01  -1.96463e-02\n   4  norm_s       3.22534e+01   1.16433e+01   3.10258e-01  -8.97036e-04\n                               ERR DEF= 0.5\n MIGRAD MINIMIZATION HAS CONVERGED.\n MIGRAD WILL VERIFY CONVERGENCE AND ERROR MATRIX.\n COVARIANCE MATRIX CALCULATED SUCCESSFULLY\n FCN=-2327.96 FROM MIGRAD    STATUS=CONVERGED      65 CALLS          66 TOTAL\n                     EDM=1.19174e-05    STRATEGY= 1      ERROR MATRIX ACCURATE \n  EXT PARAMETER                                   STEP         FIRST   \n  NO.   NAME      VALUE            ERROR          SIZE      DERIVATIVE \n   1  MH           1.24628e+02   3.98153e-01   2.66539e-03   2.46327e-02\n   2  alpha       -4.07708e-02   2.97195e-03   1.10093e-03   8.33780e-02\n   3  norm_b       9.66105e+02   3.25772e+01   5.96627e-03   1.83523e-03\n   4  norm_s       3.39026e+01   1.17380e+01   9.60816e-03  -2.32681e-03\n                               ERR DEF= 0.5\n EXTERNAL ERROR MATRIX.    NDIM=  25    NPAR=  4    ERR DEF=0.5\n  1.589e-01 -3.890e-05  1.462e-01 -1.477e-01 \n -3.890e-05  8.836e-06 -2.020e-04  2.038e-04 \n  1.462e-01 -2.020e-04  1.073e+03 -1.072e+02 \n -1.477e-01  2.038e-04 -1.072e+02  1.420e+02 \n PARAMETER  CORRELATION COEFFICIENTS  \n       NO.  
GLOBAL      1      2      3      4\n        1  0.04518   1.000 -0.033  0.011 -0.031\n        2  0.03317  -0.033  1.000 -0.002  0.006\n        3  0.27465   0.011 -0.002  1.000 -0.275\n        4  0.27610  -0.031  0.006 -0.275  1.000\n **********\n **    7 **SET ERR         0.5\n **********\n **********\n **    8 **SET PRINT           1\n **********\n **********\n **    9 **HESSE        2000\n **********\n COVARIANCE MATRIX CALCULATED SUCCESSFULLY\n FCN=-2327.96 FROM HESSE     STATUS=OK             23 CALLS          89 TOTAL\n                     EDM=1.19078e-05    STRATEGY= 1      ERROR MATRIX ACCURATE \n  EXT PARAMETER                                INTERNAL      INTERNAL  \n  NO.   NAME      VALUE            ERROR       STEP SIZE       VALUE   \n   1  MH           1.24628e+02   3.98106e-01   5.33077e-04  -7.45154e-02\n   2  alpha       -4.07708e-02   2.97195e-03   2.20186e-04   5.42722e-01\n   3  norm_b       9.66105e+02   3.26003e+01   2.38651e-04   1.20047e+00\n   4  norm_s       3.39026e+01   1.17445e+01   3.84326e-04  -4.87967e-01\n                               ERR DEF= 0.5\n EXTERNAL ERROR MATRIX.    NDIM=  25    NPAR=  4    ERR DEF=0.5\n  1.588e-01 -3.888e-05  1.304e-01 -1.304e-01 \n -3.888e-05  8.836e-06 -1.954e-04  1.954e-04 \n  1.304e-01 -1.954e-04  1.074e+03 -1.082e+02 \n -1.304e-01  1.954e-04 -1.082e+02  1.421e+02 \n PARAMETER  CORRELATION COEFFICIENTS  \n       NO.  GLOBAL      1      2      3      4\n        1  0.04274   1.000 -0.033  0.010 -0.027\n        2  0.03314  -0.033  1.000 -0.002  0.006\n        3  0.27694   0.010 -0.002  1.000 -0.277\n        4  0.27806  -0.027  0.006 -0.277  1.000\n[#1] INFO:Minization -- RooMinimizer::optimizeConst: deactivating const optimization\n

Notice the result for the fitted MH is not 125 and is included in the list of fitted parameters. We can get more information about the fit, via the RooFitResult, using the option Save().

RooFitResult *fit_res = (RooFitResult*) model.fitTo(*hgg_data,RooFit::Extended(),RooFit::Save());\n

For example, we can get the correlation matrix from the fit result... Note that the order of the parameters is the same as listed in the \"Floating Parameter\" list above.

TMatrixDSym cormat = fit_res->correlationMatrix();\ncormat.Print();\n
4x4 matrix is as follows\n\n     |      0    |      1    |      2    |      3    |\n---------------------------------------------------------\n   0 |          1    -0.03282    0.009538    -0.02623 \n   1 |   -0.03282           1   -0.001978    0.005439 \n   2 |   0.009538   -0.001978           1     -0.2769 \n   3 |   -0.02623    0.005439     -0.2769           1 \n

A nice feature of RooFit is that once we have a PDF, data and results like this, we can import this new model into our RooWorkspace and show off our new discovery to our LHC friends (if we weren't already too late!). We can also save the \"state\" of our parameters for later, by creating a snapshot of the current values.

wspace->import(model);  \nRooArgSet *params = model.getParameters(*hgg_data);\nwspace->saveSnapshot(\"nominal_values\",*params);\n\nwspace->Print(\"V\");\n
Show output
RooWorkspace(workspace) Tutorial Workspace contents\n\nvariables\n---------\n(CMS_hgg_mass,MH,alpha,norm_b,norm_s,resolution)\n\np.d.f.s\n-------\nRooExponential::exp[ x=CMS_hgg_mass c=alpha ] = 0.00248636\nRooAddPdf::model[ norm_s * signal + norm_b * exp ] = 0.00240205\nRooGaussian::signal[ x=CMS_hgg_mass mean=MH sigma=resolution ] = 5.34013e-110\n\ndatasets\n--------\nRooDataSet::dataset(CMS_hgg_mass)\n\nparameter snapshots\n-------------------\nnominal_values = (MH=124.627 +/- 0.398094,resolution=1[C],norm_s=33.9097 +/- 11.7445,alpha=-0.040779 +/- 0.00297195,norm_b=966.109 +/- 32.6025)\n

This is exactly what needs to be done when you want to use shape based datacards in Combine with parametric models.
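
For example, once this workspace has been written to a file (e.g. with wspace->writeToFile(\"workspace_hgg.root\")), a shape-based datacard could point to the objects inside it with shapes lines of the form below. This is only a sketch: the file, channel and process names are illustrative, and a complete parametric datacard is shown in the parametric fitting exercise later in these pages.

shapes data_obs     hgg   workspace_hgg.root   workspace:dataset\nshapes background   hgg   workspace_hgg.root   workspace:exp\nshapes signal       hgg   workspace_hgg.root   workspace:signal\n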

"},{"location":"part5/roofit/#a-likelihood-for-a-counting-experiment","title":"A likelihood for a counting experiment","text":"

An introductory presentation about likelihoods and interval estimation is available here.

**Note: We will use python syntax in this section; you should use a .py script. Make sure to do import ROOT at the top of your script**

We have seen how to create variables and PDFs, and how to fit a PDF to data. But what if we have a counting experiment, or a histogram template shape? And what about systematic uncertainties? We are going to build a likelihood for this:

\\(\\mathcal{L} \\propto p(\\text{data}|\\text{parameters})\\)

where our parameters are parameters of interest, \\(\\mu\\), and nuisance parameters, \\(\\theta\\). The nuisance parameters are constrained by external measurements, so we add constraint terms \\(\\pi(\\vec{\\theta}_0|\\vec{\\theta})\\)

So we have \\(\\mathcal{L} \\propto p(\\text{data}|\\mu,\\vec{\\theta})\\cdot \\pi(\\vec{\\theta}_0|\\vec{\\theta})\\)

Now we will try to build the likelihood by hand for a 1-bin counting experiment. The data is the number of observed events \(N\), and the probability is just a Poisson probability \(p(N|\lambda) = \frac{\lambda^N e^{-\lambda}}{N!}\), where \(\lambda\) is the number of events expected in our signal+background model: \(\lambda = \mu\cdot s(\vec{\theta}) + b(\vec{\theta})\).

In the expression, s and b are the numbers of expected signal and background events, which both depend on the nuisance parameters. We will start by building a simple likelihood function with one signal process and one background process. We will assume there are no nuisance parameters for now. The number of observed events in data is 15, the expected number of signal events is 5 and the expected number of background events is 8.1.

It is easiest to use the RooFit workspace factory to build our model (this tutorial has more information on the factory syntax).

import ROOT\nw = ROOT.RooWorkspace(\"w\")\n

We need to create an expression for the number of events in our model, \\(\\mu s +b\\):

w.factory('expr::n(\"mu*s +b\", mu[1.0,0,4], s[5],b[8.1])')\n

Now we can build the likelihood, which is just our Poisson PDF:

w.factory('Poisson::poisN(N[15],n)')\n

To find the best fit value for our parameter of interest \\(\\mu\\) we need to maximize the likelihood. In practice it is actually easier to minimize the Negative log of the likelihood, or NLL:

w.factory('expr::NLL(\"-log(@0)\",poisN)')\n

We can now use the RooMinimizer to find the minimum of the NLL

nll = w.function(\"NLL\")\nminim = ROOT.RooMinimizer(nll)\nminim.setErrorLevel(0.5)\nminim.minimize(\"Minuit2\",\"migrad\")\nbestfitnll = nll.getVal()\n

Notice that we need to set the error level to 0.5 to get the uncertainties (relying on Wilks' theorem!) - note that there is a more reliable way of extracting the confidence interval (explicitly rather than relying on migrad). We will discuss this a bit later in this section.

Now we will add a nuisance parameter, lumi, which represents the luminosity uncertainty. It has a 2.5% effect on both the signal and the background. The parameter will be log-normally distributed: when it is 0, the normalizations of the signal and background are not modified; at \(+1\sigma\) the signal and background normalizations will be multiplied by 1.025 and at \(-1\sigma\) they will be divided by 1.025. We should modify the expression for the number of events in our model:

w.factory('expr::n(\"mu*s*pow(1.025,lumi) +b*pow(1.025,lumi)\", mu[1.0,0,4], s[5],b[8.1],lumi[0,-4,4])')\n

And we add a unit gaussian constraint

w.factory('Gaussian::lumiconstr(lumi,0,1)')\n

Our full likelihood will now be

w.factory('PROD::likelihood(poisN,lumiconstr)')\n

and the NLL

w.factory('expr::NLL(\"-log(@0)\",likelihood)')\n

Which we can minimize in the same way as before.
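
For example (a sketch only; note that the workspace factory may refuse to overwrite the n and NLL objects defined earlier, so when following along interactively it can be simplest to rebuild the model in a fresh workspace or a new script):

nll = w.function(\"NLL\")\nminim = ROOT.RooMinimizer(nll)\nminim.setErrorLevel(0.5)  # 0.5 gives 1-sigma (68% CL) errors for a negative log-likelihood\nminim.setPrintLevel(-1)\nminim.minimize(\"Minuit2\",\"migrad\")\nprint(\"Best-fit mu = %.3f +/- %.3f\"%(w.var(\"mu\").getVal(), w.var(\"mu\").getError()))\n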

Now we will extend our model a bit.

  • Expanding on what was demonstrated above, build the likelihood for \\(N=15\\), a signal process s with expectation 5 events, a background ztt with expectation 3.7 events and a background tt with expectation 4.4 events. The luminosity uncertainty applies to all three processes. The signal process is further subject to a 5% log-normally distributed uncertainty sigth, tt is subject to a 6% log-normally distributed uncertainty ttxs, and ztt is subject to a 4% log-normally distributed uncertainty zttxs. Find the best-fit value and the associated uncertainty
  • Also perform an explicit scan of the \\(\\Delta\\) NLL ( = log of profile likelihood ratio) and make a graph of the scan. Some example code can be found below to get you started. Hint: you'll need to perform fits for different values of mu, where mu is fixed. In RooFit you can set a variable to be constant as var(\"VARNAME\").setConstant(True)
  • From the curve that you have created by performing an explicit scan, we can extract the 68% CL interval. You can do so by eye or by writing some code to find the relevant intersections of the curve.
gr = ROOT.TGraph()\n\nnpoints = 0\nfor i in range(0,60):\n  npoints+=1\n  mu=0.05*i\n  ...\n  [perform fits for different values of mu with mu fixed]\n  ...\n  deltanll = ...\n  gr.SetPoint(npoints,mu,deltanll)\n\n\ncanv = ROOT.TCanvas()\ngr.Draw(\"ALP\")\ncanv.SaveAs(\"likelihoodscan.pdf\")\n
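
One possible sketch of the extended model, following the same patterns as above (the parameter ranges and constraint names are illustrative, and a fresh workspace is used to avoid name clashes with the objects defined earlier):

import ROOT\n\nw2 = ROOT.RooWorkspace(\"w2\")\n# Expected yield: signal (5 events, 5% sigth), ztt (3.7 events, 4% zttxs), tt (4.4 events, 6% ttxs),\n# with the 2.5% luminosity uncertainty applied to all three processes\nw2.factory('expr::n(\"mu*s*pow(1.025,lumi)*pow(1.05,sigth) + ztt*pow(1.025,lumi)*pow(1.04,zttxs) + tt*pow(1.025,lumi)*pow(1.06,ttxs)\", mu[1.0,0,4], s[5], ztt[3.7], tt[4.4], lumi[0,-4,4], sigth[0,-4,4], zttxs[0,-4,4], ttxs[0,-4,4])')\nw2.factory('Poisson::poisN(N[15],n)')\n# Unit Gaussian constraint for each nuisance parameter\nw2.factory('Gaussian::lumiconstr(lumi,0,1)')\nw2.factory('Gaussian::sigthconstr(sigth,0,1)')\nw2.factory('Gaussian::zttxsconstr(zttxs,0,1)')\nw2.factory('Gaussian::ttxsconstr(ttxs,0,1)')\nw2.factory('PROD::likelihood(poisN,lumiconstr,sigthconstr,zttxsconstr,ttxsconstr)')\nw2.factory('expr::NLL(\"-log(@0)\",likelihood)')\n

The scan loop can then be completed along the following lines (again just a sketch), after which the graph can be drawn as in the snippet above. The 68% CL interval corresponds to the points where the curve crosses \(\Delta\)NLL = 0.5.

nll = w2.function(\"NLL\")\nminim = ROOT.RooMinimizer(nll)\nminim.setPrintLevel(-1)\nminim.minimize(\"Minuit2\",\"migrad\")  # global fit with mu floating\nbestfitnll = nll.getVal()\n\ngr = ROOT.TGraph()\nfor i in range(0,60):\n  mu = 0.05*i\n  w2.var(\"mu\").setVal(mu)\n  w2.var(\"mu\").setConstant(True)     # fix mu at this scan point\n  scan = ROOT.RooMinimizer(nll)\n  scan.setPrintLevel(-1)\n  scan.minimize(\"Minuit2\",\"migrad\")  # profile the nuisance parameters\n  gr.SetPoint(i, mu, nll.getVal() - bestfitnll)\nw2.var(\"mu\").setConstant(False)\n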

Well, this is doable - but we were only looking at a simple one-bin counting experiment. This might become rather cumbersome for large models... \\([*]\\)

For the next set of tutorials, we will switch to working with Combine, which will help in building the statistical model and performing the statistical analysis, instead of building the likelihood with RooFit.

Info

RooFit does have additional functionality to help with statistical model building, but we will not go into detail in these tutorials.

"},{"location":"tutorial2023/parametric_exercise/","title":"Parametric Models in Combine","text":""},{"location":"tutorial2023/parametric_exercise/#getting-started","title":"Getting started","text":"

By now you should have a working setup of Combine v9 from the pre-tutorial exercise. If so, then move on to the cloning of the parametric fitting exercise gitlab repo below. If not, then you need to set up a CMSSW area and check out the combine package:

cmssw-el7\ncmsrel CMSSW_11_3_4\ncd CMSSW_11_3_4/src\ncmsenv\ngit clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit\ncd HiggsAnalysis/CombinedLimit\n\ncd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit\ngit fetch origin\ngit checkout v9.1.0\n

We will also make use of another package, CombineHarvester, which contains some high-level tools for working with combine. The following command will download the repository and checkout just the parts of it we need for this exercise:

cd $CMSSW_BASE/src/\nbash <(curl -s https://raw.githubusercontent.com/cms-analysis/CombineHarvester/main/CombineTools/scripts/sparse-checkout-https.sh)\n

Now let's compile the CMSSW area:

scramv1 b clean; scramv1 b\ncmsenv\n

Finally, let's move to the working directory for this tutorial which contains all of the inputs and scripts needed to run the parametric fitting exercise:

cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/tutorials/parametric_exercise\n
"},{"location":"tutorial2023/parametric_exercise/#session-structure","title":"Session structure","text":"

The exercise is split into six parts which cover:

1) Parametric model building

2) Simple fits

3) Systematic uncertainties

4) Toy generation

5) Discrete profiling

6) Multi-signal hypothesis

Throughout the tutorial there are a number of questions and exercises for you to complete. These are shown by the bullet points in this markdown file.

All the code required to run the different parts is available in python scripts. We have purposely commented out the code to encourage you to open the scripts and take a look at what is inside. Each block is separated by a divider and a blank line. When you are happy that you understand the code, you can uncomment it (block by block) and then run the scripts (using python3), e.g.:

python3 construct_models_part1.py\n

A number of scripts will produce plots (as .png files). The default path to store these plots is in the current working directory. You can change this (e.g. pipe to an eos webpage) by changing the plot_dir variable in the config.py script.

There's also a set of combine (.txt) datacards which will help you get through the various parts of the exercise. The exercises should help you become familiar with the structure of parametric fitting datacards.

Finally, this exercise is heavily based on the RooFit package. So if you find yourself using the python interpreter for any checks, don't forget to...

import ROOT\n
"},{"location":"tutorial2023/parametric_exercise/#jupyter-notebooks","title":"Jupyter notebooks","text":"

Alternatively, we have provided Jupyter notebooks to run the different parts of the exercise e.g. part1.ipynb. You will have already downloaded these notebooks when cloning the tutorial gitlab repo. To open Jupyter notebooks on lxplus within a CMSSW environment, you can add the following option when you ssh into lxplus:

ssh -X -Y username@lxplus.cern.ch -L8xxx:localhost:8xxx\n

where you should replace xxx with some three digit number. Then cd into the combinetutorial-2023-parametric directory and set up the CMSSW environment with:

cmsenv\n

You can then open the Jupyter notebook inside the environment with:

jupyter notebook --no-browser --port 8xxx\n

replacing xxx with the same three digit number. You should now be able to copy the url it provides into a browser and access the various exercise notebooks.

"},{"location":"tutorial2023/parametric_exercise/#analysis-overview","title":"Analysis overview","text":"

In this exercise we will look at one of the most famous parametric fitting analyses at the LHC: the Higgs boson decaying to two photons (H \(\rightarrow \gamma\gamma\)). This decay channel is key in understanding the properties of the Higgs boson due to its clean final state topology. The excellent energy resolution of the CMS electromagnetic calorimeter leads to a narrow signal peak in the diphoton invariant mass spectrum, \(m_{\gamma\gamma}\), above a smoothly falling background continuum. The mass spectrum for the legacy Run 2 analysis is shown below.

In the analysis, we construct parametric models (analytic functions) of both signal and background events to fit the \\(m_{\\gamma\\gamma}\\) spectrum in data. From the fit we can extract measurements of Higgs boson properties including its rate of production, its mass (\\(m_H\\)), its coupling behaviour, to name a few. This exercise will show how to construct parametric models using RooFit, and subsequently how to use combine to extract the results.

"},{"location":"tutorial2023/parametric_exercise/#part-1-parametric-model-building","title":"Part 1: Parametric model building","text":"

As with any fitting exercise, the first step is to understand the format of the input data, explore its contents and construct a model. The python script which performs the model construction is construct_models_part1.py. This section will explain what the various lines of code are doing.

"},{"location":"tutorial2023/parametric_exercise/#signal-modelling","title":"Signal modelling","text":"

Firstly, we will construct a model to fit the signal (H \\(\\rightarrow\\gamma\\gamma\\)) mass peak using a Monte Carlo simulation sample of gluon-gluon fusion production (ggH) events with \\(m_H=125\\) GeV. This production mode has the largest cross section in the SM, and the LO Feynman diagram is shown below.

There has already been a dedicated selection performed on the events to increase the signal-to-background ratio (e.g. using some ML event classifier). Events passing this selection enter the analysis category, Tag0. Events entering Tag0 are used for the parametric fitting of the \\(m_{\\gamma\\gamma}\\) spectrum.

The events are stored in a ROOT TTree, where the diphoton mass, CMS_hgg_mass, and the event weight, weight, are saved. Let's begin by loading the MC and converting the TTree data into a RooDataSet:

import ROOT\nROOT.gROOT.SetBatch(True)\n\nf = ROOT.TFile(\"mc_part1.root\",\"r\")\n# Load TTree\nt = f.Get(\"ggH_Tag0\")\n\n# Define mass and weight variables\nmass = ROOT.RooRealVar(\"CMS_hgg_mass\", \"CMS_hgg_mass\", 125, 100, 180)\nweight = ROOT.RooRealVar(\"weight\",\"weight\",0,0,1)\n\n# Convert to RooDataSet\nmc = ROOT.RooDataSet(\"ggH_Tag0\",\"ggH_Tag0\", t, ROOT.RooArgSet(mass,weight), \"\", \"weight\" )\n\n# Lets plot the signal mass distribution\ncan = ROOT.TCanvas()\nplot = mass.frame()\nmc.plotOn(plot)\nplot.Draw()\ncan.Update()\ncan.SaveAs(\"part1_signal_mass.png\")\n

The plot shows a peak centred on the Higgs mass at 125 GeV. Let's use a simple Gaussian to model the peak.

# Introduce a RooRealVar into the workspace for the Higgs mass\nMH = ROOT.RooRealVar(\"MH\", \"MH\", 125, 120, 130 )\nMH.setConstant(True)\n\n# Signal peak width\nsigma = ROOT.RooRealVar(\"sigma_ggH_Tag0\", \"sigma_ggH_Tag0\", 2, 1, 5)\n\n# Define the Gaussian with mean=MH and width=sigma\nmodel = ROOT.RooGaussian( \"model_ggH_Tag0\", \"model_ggH_Tag0\", mass, MH, sigma ) \n\n# Fit Gaussian to MC events and plot\nmodel.fitTo(mc,ROOT.RooFit.SumW2Error(True))\n\ncan = ROOT.TCanvas()\nplot = mass.frame()\nmc.plotOn(plot)\nmodel.plotOn( plot, ROOT.RooFit.LineColor(2) )\nplot.Draw()\ncan.Update()\ncan.Draw()\ncan.SaveAs(\"part1_signal_model_v0.png\")\n

It looks like a good fit!

  • Do you understand the output from the fitTo command (i.e. the minimization)? From now on we will add the option ROOT.RooFit.PrintLevel(-1) when fitting the models to suppress the minimizer output.

But what if the mean of the model does not correspond directly to the Higgs boson mass, i.e. there are some reconstruction effects? Let's instead define the mean of the model as:

\\[\\mu = m_H + \\delta\\]

and we can fit for \\(\\delta\\) in the model construction. For this we introduce a RooFormulaVar.

dMH = ROOT.RooRealVar(\"dMH_ggH_Tag0\", \"dMH_ggH_Tag0\", 0, -1, 1 )\nmean = ROOT.RooFormulaVar(\"mean_ggH_Tag0\", \"mean_ggH_Tag0\", \"(@0+@1)\", ROOT.RooArgList(MH,dMH))\nmodel = ROOT.RooGaussian( \"model_ggH_Tag0\", \"model_ggH_Tag0\", mass, mean, sigma )\n\n# Fit the new model with a variable mean\nmodel.fitTo(mc,ROOT.RooFit.SumW2Error(True),ROOT.RooFit.PrintLevel(-1))\n\n# Model is parametric in MH. Let's show this by plotting for different values of MH\ncan = ROOT.TCanvas()\nplot = mass.frame()\nMH.setVal(120)\nmodel.plotOn( plot, ROOT.RooFit.LineColor(2) )\nMH.setVal(125)\nmodel.plotOn( plot, ROOT.RooFit.LineColor(3) )\nMH.setVal(130)\nmodel.plotOn( plot, ROOT.RooFit.LineColor(4) )\nplot.Draw()\ncan.Update()\ncan.SaveAs(\"part1_signal_model_v1.png\")\n

Let's now save the model inside a RooWorkspace. Combine will load this model when performing the fits. Crucially, we need to freeze the fit parameters of the signal model, otherwise they will be freely floating in the final results extraction.

  • This choice of setting the shape parameters to constant means we believe our MC will perfectly model the Higgs boson events in data. Is this the case? How could we account for the MC mis-modelling in the fit? (See part 3).
MH.setVal(125)\ndMH.setConstant(True)\nsigma.setConstant(True)\n\nf_out = ROOT.TFile(\"workspace_sig.root\", \"RECREATE\")\nw_sig = ROOT.RooWorkspace(\"workspace_sig\",\"workspace_sig\")\ngetattr(w_sig, \"import\")(model)\nw_sig.Print()\nw_sig.Write()\nf_out.Close()\n

We have successfully constructed a parametric model to fit the shape of the signal peak. But we also need to know the yield/normalisation of the ggH signal process. In the SM, the ggH event yield in Tag0 is equal to:

\\[ N = \\sigma_{ggH} \\cdot \\mathcal{B}^{\\gamma\\gamma} \\cdot \\epsilon \\cdot \\mathcal{L}\\]

Where \\(\\sigma_{ggH}\\) is the SM ggH cross section, \\(\\mathcal{B}^{\\gamma\\gamma}\\) is the SM branching fraction of the Higgs boson to two photons, \\(\\epsilon\\) is the efficiency factor and corresponds to the fraction of the total ggH events landing in the Tag0 analysis category. Finally \\(\\mathcal{L}\\) is the integrated luminosity.

In this example, the ggH MC events are normalised before any selection is performed to \\(\\sigma_{ggH} \\cdot \\mathcal{B}^{\\gamma\\gamma}\\), taking the values from the LHCHWG twiki. Note this does not include the lumi scaling, which may be different to what you have in your own analyses! We can then calculate the efficiency factor, \\(\\epsilon\\), by taking the sum of weights in the MC dataset and dividing through by \\(\\sigma_{ggH} \\cdot \\mathcal{B}^{\\gamma\\gamma}\\). This will tell us what fraction of ggH events land in Tag0.

# Define SM cross section and branching fraction values\nxs_ggH = 48.58 #in [pb]\nbr_gamgam = 2.7e-3\n\n# Calculate the efficiency and print output\nsumw = mc.sumEntries()\neff = sumw/(xs_ggH*br_gamgam)\nprint(\"Efficiency of ggH events landing in Tag0 is: %.2f%%\"%(eff*100))\n\n# Calculate the total yield (assuming full Run 2 lumi) and print output\nlumi = 138000\nN = xs_ggH*br_gamgam*eff*lumi\nprint(\"For 138fb^-1, total normalisation of signal is: N = xs * br * eff * lumi = %.2f events\"%N)\n

Gives the output:

Efficiency of ggH events landing in Tag0 is: 1.00%\nFor 138fb^-1, total normalisation of signal is: N = xs * br * eff * lumi = 181.01 events\n

So we find 1% of all ggH events enter Tag0. And the total expected yield of ggH events in Tag0 (with lumi scaling) is 181.01. Let's make a note of this for later!

"},{"location":"tutorial2023/parametric_exercise/#background-modelling","title":"Background modelling","text":"

In the H \\(\\rightarrow\\gamma\\gamma\\) analysis we construct the background model directly from data. To avoid biasing our background estimate, we remove the signal region from the model construction and fit the mass sidebands. Let's begin by loading the data TTree and converting to a RooDataSet. We will then plot the mass sidebands.

f = ROOT.TFile(\"data_part1.root\",\"r\")\nt = f.Get(\"data_Tag0\")\n\n# Convert TTree to a RooDataSet\ndata = ROOT.RooDataSet(\"data_Tag0\", \"data_Tag0\", t, ROOT.RooArgSet(mass), \"\", \"weight\")\n\n# Define mass sideband ranges on the mass variable: 100-115 and 135-180\nn_bins = 80\nbinning = ROOT.RooFit.Binning(n_bins,100,180)\nmass.setRange(\"loSB\", 100, 115 )\nmass.setRange(\"hiSB\", 135, 180 )\nmass.setRange(\"full\", 100, 180 )\nfit_range = \"loSB,hiSB\"\n\n# Plot the data in the mass sidebands\ncan = ROOT.TCanvas()\nplot = mass.frame()\ndata.plotOn( plot, ROOT.RooFit.CutRange(fit_range), binning )\nplot.Draw()\ncan.Update()\ncan.Draw()\ncan.SaveAs(\"part1_data_sidebands.png\")\n

By eye, it looks like an exponential function would fit the data sidebands well. Let's construct the background model using a RooExponential and fit the data sidebands:

alpha = ROOT.RooRealVar(\"alpha\", \"alpha\", -0.05, -0.2, 0 )\nmodel_bkg = ROOT.RooExponential(\"model_bkg_Tag0\", \"model_bkg_Tag0\", mass, alpha )\n\n# Fit model to data sidebands\nmodel_bkg.fitTo( data, ROOT.RooFit.Range(fit_range),  ROOT.RooFit.PrintLevel(-1))\n\n# Let's plot the model fit to the data\ncan = ROOT.TCanvas()\nplot = mass.frame()\n# We have to be careful with the normalisation as we only fit over sidebands\n# First do an invisible plot of the full data set\ndata.plotOn( plot, binning, ROOT.RooFit.MarkerColor(0), ROOT.RooFit.LineColor(0) )\nmodel_bkg.plotOn( plot, ROOT.RooFit.NormRange(fit_range), ROOT.RooFit.Range(\"full\"), ROOT.RooFit.LineColor(2))\ndata.plotOn( plot, ROOT.RooFit.CutRange(fit_range), binning )\nplot.Draw()\ncan.Update()\ncan.Draw()\ncan.SaveAs(\"part1_bkg_model.png\")\n

As the background model is extracted from data, we want to introduce a freely floating normalisation term. We use the total number of data events (including in the signal region) as the initial prefit value of this normalisation object, i.e. assuming no signal in the data. The syntax to name this normalisation object is {model}_norm, which will then be picked up automatically by combine. Note we also allow the shape parameter to float in the final fit to data (by not setting it to constant).

norm = ROOT.RooRealVar(\"model_bkg_Tag0_norm\", \"Number of background events in Tag0\", data.numEntries(), 0, 3*data.numEntries() )\nalpha.setConstant(False)\n

Let's then save the background model, the normalisation object, and the data distribution to a new RooWorkspace:

f_out = ROOT.TFile(\"workspace_bkg.root\", \"RECREATE\")\nw_bkg = ROOT.RooWorkspace(\"workspace_bkg\",\"workspace_bkg\")\ngetattr(w_bkg, \"import\")(data)\ngetattr(w_bkg, \"import\")(norm)\ngetattr(w_bkg, \"import\")(model_bkg)\nw_bkg.Print()\nw_bkg.Write()\nf_out.Close()\n
"},{"location":"tutorial2023/parametric_exercise/#datacard","title":"Datacard","text":"

The model workspaces have now been constructed. But before we can run any fits in combine we need to build the so-called datacard. This is a text file which defines the different processes entering the fit and their expected yields, and maps these processes to the corresponding (parametric) models. We also store information on the systematic uncertainties in the datacard (see part 3). Given the low complexity of this example, the datacard is reasonably short. The datacard for this section is titled datacard_part1.txt. Take some time to understand the different lines. In particular, the values for the process normalisations:

  • Where does the signal (ggH) normalisation come from?
  • Why do we use a value of 1.0 for the background model normalisation in this analysis?
# Datacard example for combine tutorial 2023 (part 1)\n---------------------------------------------\nimax 1\njmax 1\nkmax *\n---------------------------------------------\n\nshapes      ggH          Tag0      workspace_sig.root      workspace_sig:model_ggH_Tag0\nshapes      bkg_mass     Tag0      workspace_bkg.root      workspace_bkg:model_bkg_Tag0\nshapes      data_obs     Tag0      workspace_bkg.root      workspace_bkg:data_Tag0\n\n---------------------------------------------\nbin             Tag0\nobservation     -1\n---------------------------------------------\nbin             Tag0         Tag0\nprocess         ggH          bkg_mass\nprocess         0            1\nrate            181.01       1.0\n---------------------------------------------\n

To compile the datacard we run the following command, using a value of the Higgs mass of 125.0:

text2workspace.py datacard_part1.txt -m 125\n
  • This compiles the datacard into a RooWorkspace, effectively building the likelihood function. Try opening the compiled workspace (root datacard_part1.root) and printing the contents.
w->Print()\n
  • Do you understand what all the different objects are? What does the variable r correspond to? Try (verbose) printing with:
w->var(\"r\")->Print(\"v\")\n
"},{"location":"tutorial2023/parametric_exercise/#extension-signal-normalisation-object","title":"Extension: signal normalisation object","text":"

In the example above, the signal model normalisation is input by hand in the datacard. We can instead define the signal normalisation components in the model in a similar fashion to the background model normalisation object. Let's build the cross section (ggH), branching fraction (H->gamgam), and efficiency variables. It's important to set these terms to be constant for the final fit to data:

xs_ggH = ROOT.RooRealVar(\"xs_ggH\", \"Cross section of ggH in [pb]\", 48.58 )\nbr_gamgam = ROOT.RooRealVar(\"BR_gamgam\", \"Branching ratio of Higgs to gamma gamma\", 0.0027 )\neff_ggH_Tag0 = ROOT.RooRealVar(\"eff_ggH_Tag0\", \"Efficiency for ggH events to land in Tag0\", eff )\n\nxs_ggH.setConstant(True)\nbr_gamgam.setConstant(True)\neff_ggH_Tag0.setConstant(True)\n

The normalisation component is then defined as the product of these three variables:

norm_sig = ROOT.RooProduct(\"model_ggH_Tag0_norm\", \"Normalisation term for ggH in Tag 0\", ROOT.RooArgList(xs_ggH,br_gamgam,eff_ggH_Tag0))\n

Again the syntax {model}_norm has been used so that combine will automatically assign this object as the normalisation for the model (model_ggH_Tag0). Firstly we need to save a new version of the signal model workspace with the normalisation term included.

f_out = ROOT.TFile(\"workspace_sig_with_norm.root\", \"RECREATE\")\nw_sig = ROOT.RooWorkspace(\"workspace_sig\",\"workspace_sig\")\ngetattr(w_sig, \"import\")(model)\ngetattr(w_sig, \"import\")(norm_sig)\nw_sig.Print()\nw_sig.Write()\nf_out.Close()\n

We then need to modify the datacard to account for this normalisation term. Importantly, the {model}_norm term in our updated signal model workspace does not contain the integrated luminosity. Therefore, the rate term in the datacard must be set equal to the integrated luminosity in [pb^-1] (as the cross section was defined in [pb]). The total normalisation for the signal model is then the product of the {model}_norm and the rate value.
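
Schematically, the lines that change in the updated datacard might look like the following (a sketch only; the shapes line points at the new workspace and the signal rate is set to the integrated luminosity; see datacard_part1_with_norm.txt for the actual file):

# Changed lines only (illustrative): point the signal shape at the new workspace and set rate = integrated luminosity in pb^-1\nshapes      ggH          Tag0      workspace_sig_with_norm.root      workspace_sig:model_ggH_Tag0\nrate            138000       1.0\n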

  • You can find the example datacard here: datacard_part1_with_norm.txt, with the signal normalisation object included. Check that it compiles successfully using text2workspace.py. If so, try printing out the contents of the workspace. Can you see the normalisation component?
"},{"location":"tutorial2023/parametric_exercise/#extension-unbinned-vs-binned","title":"Extension: unbinned vs binned","text":"

In a parametric analysis, the fit can be performed using a binned or unbinned likelihood function. The consequences of binned vs unbinned likelihoods were discussed in the morning session. In combine, we can simply toggle between binned and unbinned fits by changing how the data set is stored in the workspace. In the example above, the data was saved as a RooDataSet. This means that an unbinned maximum likelihood function would be used.

To switch to a binned maximum likelihood fit, we need to store the data set in the workspace as a RooDataHist. Let's first load the data as a RooDataSet as before:

f = ROOT.TFile(\"data_part1.root\",\"r\")\nt = f.Get(\"data_Tag0\")\n\n# Convert TTree to a RooDataSet\ndata = ROOT.RooDataSet(\"data_Tag0\", \"data_Tag0\", t, ROOT.RooArgSet(mass, weight), \"\", \"weight\")\n

We then need to set the number of bins in the observable and convert the data to a RooDataHist. In this example we will use 320 bins over the full mass range (0.25 GeV per bin). It is important that the binning is sufficiently granular so that we do not lose information in the data by switching to a binned likelihood fit. When fitting a signal peak over a background we want the bin width to be sufficiently smaller than the signal model mass resolution (here, 0.25 GeV bins compared to a signal width of roughly 2 GeV).

# Set bin number for mass variables\nmass.setBins(320)\ndata_hist = ROOT.RooDataHist(\"data_hist_Tag0\", \"data_hist_Tag0\", mass, data)\n\n# Save the background model with the RooDataHist instead\nf_out = ROOT.TFile(\"workspace_bkg_binned.root\", \"RECREATE\")\nw_bkg = ROOT.RooWorkspace(\"workspace_bkg\",\"workspace_bkg\")\ngetattr(w_bkg, \"import\")(data_hist)\ngetattr(w_bkg, \"import\")(norm)\ngetattr(w_bkg, \"import\")(model_bkg)\nw_bkg.Print()\nw_bkg.Write()\nf_out.Close()\n
"},{"location":"tutorial2023/parametric_exercise/#part-2-simple-fits","title":"Part 2: Simple fits","text":"

Now that the parametric models have been constructed and the datacard has been compiled, we are ready to start using combine for running fits. In CMS analyses we begin by blinding ourselves to the data in the signal region, and looking only at the expected results based on toy datasets (asimov or pseudo-experiments). In this exercise, we will look straight away at the observed results. Note, the python commands in this section are taken from simple_fits.py.

To run a simple best-fit for the signal strength, r, fixing the Higgs mass to 125 GeV, you can run the command in the terminal:

combine -M MultiDimFit datacard_part1_with_norm.root -m 125 --freezeParameters MH --saveWorkspace -n .bestfit\n

We obtain a best-fit signal strength of r = 1.548, i.e. the observed signal yield is 1.548 times the SM prediction.
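
To read this value yourself, you can open the output file and inspect the first entry of the limit tree, where MultiDimFit stores the best-fit value of r (a minimal sketch; the file name follows the -n .bestfit convention used above):

import ROOT\n\nf = ROOT.TFile(\"higgsCombine.bestfit.MultiDimFit.mH125.root\")\nt = f.Get(\"limit\")\nt.GetEntry(0)\nprint(\"Best-fit signal strength: r = %.3f\"%t.r)\n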

The option --saveWorkspace stores a snapshot of the postfit workspace in the output file (higgsCombine.bestfit.MultiDimFit.mH125.root). We can load the postfit workspace and look at how the values of all the fit parameters change (compare the clean and MultiDimFit parameter snapshots):

import ROOT\n\nf = ROOT.TFile(\"higgsCombine.bestfit.MultiDimFit.mH125.root\")\nw = f.Get(\"w\")\nw.Print(\"v\")\n

We can even plot the postfit signal-plus-background model using the workspace snapshot:

n_bins = 80\nbinning = ROOT.RooFit.Binning(n_bins,100,180)\n\ncan = ROOT.TCanvas()\nplot = w.var(\"CMS_hgg_mass\").frame()\nw.data(\"data_obs\").plotOn( plot, binning )\n\n# Load the S+B model\nsb_model = w.pdf(\"model_s\").getPdf(\"Tag0\")\n\n# Prefit\nsb_model.plotOn( plot, ROOT.RooFit.LineColor(2), ROOT.RooFit.Name(\"prefit\") )\n\n# Postfit\nw.loadSnapshot(\"MultiDimFit\")\nsb_model.plotOn( plot, ROOT.RooFit.LineColor(4), ROOT.RooFit.Name(\"postfit\") )\nr_bestfit = w.var(\"r\").getVal()\n\nplot.Draw()\n\nleg = ROOT.TLegend(0.55,0.6,0.85,0.85)\nleg.AddEntry(\"prefit\", \"Prefit S+B model (r=1.00)\", \"L\")\nleg.AddEntry(\"postfit\", \"Postfit S+B model (r=%.2f)\"%r_bestfit, \"L\")\nleg.Draw(\"Same\")\n\ncan.Update()\ncan.SaveAs(\"part2_sb_model.png\")\n

"},{"location":"tutorial2023/parametric_exercise/#confidence-intervals","title":"Confidence intervals","text":"

We not only want to find the best-fit value of the signal strength, r, but also the confidence intervals. The singles algorithm will find the 68% CL intervals:

combine -M MultiDimFit datacard_part1_with_norm.root -m 125 --freezeParameters MH -n .singles --algo singles\n

To perform a likelihood scan (i.e. calculate 2NLL at fixed values of the signal strength, profiling the other parameters), we use the grid algorithm. We can control the number of points in the scan using the --points option. Also, it is important to set a suitable range for the signal strength parameter. The singles algorithm has shown us that the 1 stdev interval on r is around +/-0.2.

  • Use these intervals to define a suitable range for the scan, and change lo,hi in the following options accordingly: --setParameterRanges r=lo,hi.
combine -M MultiDimFit datacard_part1_with_norm.root -m 125 --freezeParameters MH -n .scan --algo grid --points 20 --setParameterRanges r=lo,hi\n

We can use the plot1DScan.py script from combineTools to plot the likelihood scan:

plot1DScan.py higgsCombine.scan.MultiDimFit.mH125.root -o part2_scan\n

  • Do you understand what the plot is showing? What information about the signal strength parameter can be inferred from the plot?
"},{"location":"tutorial2023/parametric_exercise/#extension-expected-fits","title":"Extension: expected fits","text":"

To run expected fits we simply add -t N to the combine command. For N>0, this will generate N random toys from the model and fit each one independently. For N=-1, this will generate an asimov toy in which all statistical fluctuations from the model are suppressed.

You can use the --expectSignal 1 option to set the signal strength parameter to 1 when generating the toy. Alternatively, --expectSignal 0 will generate a toy from the background-only model. For multiple parameter models you can set the initial values when generating the toy(s) using the --setParameters option of combine. For example, if you want to throw a toy where the Higgs mass is at 124 GeV and the background slope parameter alpha is equal to -0.05, you would add --setParameters MH=124.0,alpha=-0.05.
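
For example, a best fit to an Asimov dataset generated with r=1 could be run as follows (a sketch, simply combining the options above with the earlier MultiDimFit command):

combine -M MultiDimFit datacard_part1_with_norm.root -m 125 --freezeParameters MH -t -1 --expectSignal 1 -n .bestfit.asimov\n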

  • Try running the asimov likelihood scan for r=1 and r=0, and plotting them using the plot1DScan.py script.
"},{"location":"tutorial2023/parametric_exercise/#extension-goodness-of-fit-tests","title":"Extension: goodness-of-fit tests","text":"

The goodness-of-fit tests available in combine are only well-defined for binned maximum likelihood fits. Therefore, to perform a goodness-of-fit test with a parametric datacard, make sure to save the data object as a RooDataHist, as in workspace_bkg_binned.root.

  • Try editing the datacard_part1_with_norm.txt file to pick up the correct binned workspace file, and the RooDataHist. The goodness-of-fit method requires at least one nuisance parameter in the model to run successfully. Append the following line to the end of the datacard:
lumi_13TeV      lnN          1.01         -\n
  • Does the datacard compile with the text2workspace.py command?

We use the GoodnessOfFit method in combine to evaluate how compatible the observed data are with the model pdf. There are three types of GoF algorithm within combine; this example will use the saturated algorithm. You can find more information about the other algorithms here.

Firstly, we want to calculate the value of the test statistic for the observed data:

combine -M GoodnessOfFit datacard_part1_binned.root --algo saturated -m 125 --freezeParameters MH -n .goodnessOfFit_data\n

Now let's calculate the test statistic value for many toys thrown from the model:

combine -M GoodnessOfFit datacard_part1_binned.root --algo saturated -m 125 --freezeParameters MH -n .goodnessOfFit_toys -t 1000\n

To make a plot of the GoF test-statistic distribution you can run the following commands, which first collect the values of the test statistic into a json file, and then produce a plot from the json file:

combineTool.py -M CollectGoodnessOfFit --input higgsCombine.goodnessOfFit_data.GoodnessOfFit.mH125.root higgsCombine.goodnessOfFit_toys.GoodnessOfFit.mH125.123456.root -m 125.0 -o gof.json\n\nplotGof.py gof.json --statistic saturated --mass 125.0 -o part2_gof\n

  • What does the plot tell us? Does the model fit the data well? You can refer back to the discussion here
"},{"location":"tutorial2023/parametric_exercise/#part-3-systematic-uncertainties","title":"Part 3: Systematic uncertainties","text":"

In this section, we will learn how to add systematic uncertainties to a parametric fit analysis. The python commands are taken from the systematics.py script.

For uncertainties which only affect the process normalisation, we can simply implement these as lnN uncertainties in the datacard. The file mc_part3.root contains the systematic-varied trees i.e. Monte-Carlo events where some systematic uncertainty source {photonID,JEC,scale,smear} has been varied up and down by \\(1\\sigma\\).

import ROOT\n\nf = ROOT.TFile(\"mc_part3.root\")\nf.ls()\n

Gives the output:

TFile**     mc_part3.root   \n TFile*     mc_part3.root   \n  KEY: TTree    ggH_Tag0;1  ggH_Tag0\n  KEY: TTree    ggH_Tag0_photonIDUp01Sigma;1    ggH_Tag0_photonIDUp01Sigma\n  KEY: TTree    ggH_Tag0_photonIDDown01Sigma;1  ggH_Tag0_photonIDDown01Sigma\n  KEY: TTree    ggH_Tag0_scaleUp01Sigma;1   ggH_Tag0_scaleUp01Sigma\n  KEY: TTree    ggH_Tag0_scaleDown01Sigma;1 ggH_Tag0_scaleDown01Sigma\n  KEY: TTree    ggH_Tag0_smearUp01Sigma;1   ggH_Tag0_smearUp01Sigma\n  KEY: TTree    ggH_Tag0_smearDown01Sigma;1 ggH_Tag0_smearDown01Sigma\n  KEY: TTree    ggH_Tag0_JECUp01Sigma;1 ggH_Tag0_JECUp01Sigma\n  KEY: TTree    ggH_Tag0_JECDown01Sigma;1   ggH_Tag0_JECDown01Sigma\n

Let's first load the systematic-varied trees as RooDataSets and store them in a python dictionary, mc:

# Define mass and weight variables\nmass = ROOT.RooRealVar(\"CMS_hgg_mass\", \"CMS_hgg_mass\", 125, 100, 180)\nweight = ROOT.RooRealVar(\"weight\",\"weight\",0,0,1)\n\nmc = {}\n\n# Load the nominal dataset\nt = f.Get(\"ggH_Tag0\")\nmc['nominal'] = ROOT.RooDataSet(\"ggH_Tag0\",\"ggH_Tag0\", t, ROOT.RooArgSet(mass,weight), \"\", \"weight\" )\n\n# Load the systematic-varied datasets\nfor syst in ['JEC','photonID','scale','smear']:\n    for direction in ['Up','Down']:\n        key = \"%s%s01Sigma\"%(syst,direction)\n        name = \"ggH_Tag0_%s\"%(key)\n        t = f.Get(name)\n        mc[key] = ROOT.RooDataSet(name, name, t, ROOT.RooArgSet(mass,weight), \"\", \"weight\" )\n

The jet energy scale (JEC) and photon identification (photonID) uncertainties do not affect the shape of the \(m_{\gamma\gamma}\) distribution, i.e. they only affect the signal yield estimate. We can calculate their impact by comparing the sum of weights to that of the nominal dataset. Note, the photonID uncertainty changes the weight of the events in the tree, whereas the JEC varied trees contain a different set of events, generated by shifting the jet energy scale in the simulation. In either case, the method for calculating the yield variations is the same:

for syst in ['JEC','photonID']:\n    for direction in ['Up','Down']:\n        yield_variation = mc['%s%s01Sigma'%(syst,direction)].sumEntries()/mc['nominal'].sumEntries()\n        print(\"Systematic varied yield (%s,%s): %.3f\"%(syst,direction,yield_variation))\n
Systematic varied yield (JEC,Up): 1.056\nSystematic varied yield (JEC,Down): 0.951\nSystematic varied yield (photonID,Up): 1.050\nSystematic varied yield (photonID,Down): 0.950\n

We can write these yield variations in the datacard with the lines:

CMS_scale_j           lnN      0.951/1.056      -\nCMS_hgg_phoIdMva      lnN      1.05             -   \n
  • Why is the photonID uncertainty expressed as one number, whereas the JEC uncertainty is defined by two?

Note in this analysis there are no systematic uncertainties affecting the background estimate (- in the datacard), as the background model has been derived directly from data.

"},{"location":"tutorial2023/parametric_exercise/#parametric-shape-uncertainties","title":"Parametric shape uncertainties","text":"

What about systematic uncertainties which affect the shape of the mass distribution?

In a parametric analysis, we need to build the dependence directly into the model parameters. The example uncertainty sources in this tutorial are the photon energy scale and smearing uncertainties. From the names alone we can expect that the scale uncertainty will affect the mean of the signal Gaussian, and the smear uncertainty will impact the resolution (sigma). Let's first take a look at the scaleUp01Sigma dataset:

# Build the model to fit the systematic-varied datasets\nmean = ROOT.RooRealVar(\"mean\", \"mean\", 125, 124, 126)\nsigma = ROOT.RooRealVar(\"sigma\", \"sigma\", 2, 1.5, 2.5)\ngaus = ROOT.RooGaussian(\"model\", \"model\", mass, mean, sigma)\n\n# Run the fits twice (second time from the best-fit of first run) to obtain more reliable results\ngaus.fitTo(mc['scaleUp01Sigma'], ROOT.RooFit.SumW2Error(True),ROOT.RooFit.PrintLevel(-1))\ngaus.fitTo(mc['scaleUp01Sigma'], ROOT.RooFit.SumW2Error(True),ROOT.RooFit.PrintLevel(-1))\nprint(\"Mean = %.3f +- %.3f GeV, Sigma = %.3f +- %.3f GeV\"%(mean.getVal(),mean.getError(),sigma.getVal(),sigma.getError()) )\n

Gives the output:

Mean = 125.370 +- 0.009 GeV, Sigma = 2.011 +- 0.006 GeV\n

Now let's compare the values to the nominal fit for all systematic-varied trees. We observe a significant variation in the mean for the scale uncertainty, and a significant variation in sigma for the smear uncertainty.

# First fit the nominal dataset\ngaus.fitTo(mc['nominal'], ROOT.RooFit.SumW2Error(True), ROOT.RooFit.PrintLevel(-1) )\ngaus.fitTo(mc['nominal'], ROOT.RooFit.SumW2Error(True), ROOT.RooFit.PrintLevel(-1) )\n# Save the mean and sigma values and errors to python dicts\nmean_values, sigma_values = {}, {}\nmean_values['nominal'] = [mean.getVal(),mean.getError()]\nsigma_values['nominal'] = [sigma.getVal(),sigma.getError()]\n\n# Next for the systematic varied datasets\nfor syst in ['scale','smear']:\n    for direction in ['Up','Down']:\n        key = \"%s%s01Sigma\"%(syst,direction)\n        gaus.fitTo(mc[key] , ROOT.RooFit.SumW2Error(True),  ROOT.RooFit.PrintLevel(-1))\n        gaus.fitTo(mc[key], ROOT.RooFit.SumW2Error(True), ROOT.RooFit.PrintLevel(-1))\n        mean_values[key] = [mean.getVal(), mean.getError()]\n        sigma_values[key] = [sigma.getVal(), sigma.getError()]\n\n# Print the variations in mean and sigma\nfor key in mean_values.keys():\n    print(\"%s: mean = %.3f +- %.3f GeV, sigma = %.3f +- %.3f GeV\"%(key,mean_values[key][0],mean_values[key][1],sigma_values[key][0],sigma_values[key][1]))\n

Prints the output:

nominal: mean = 125.001 +- 0.009 GeV, sigma = 1.996 +- 0.006 GeV\nscaleUp01Sigma: mean = 125.370 +- 0.009 GeV, sigma = 2.011 +- 0.006 GeV\nscaleDown01Sigma: mean = 124.609 +- 0.009 GeV, sigma = 2.005 +- 0.006 GeV\nsmearUp01Sigma: mean = 125.005 +- 0.009 GeV, sigma = 2.097 +- 0.007 GeV\nsmearDown01Sigma: mean = 125.007 +- 0.009 GeV, sigma = 1.912 +- 0.006 GeV\n

The values tell us that the scale uncertainty (at \\(\\pm 1 \\sigma\\)) varies the signal peak mean by around 0.3%, and the smear uncertainty (at \\(\\pm 1 \\sigma\\)) varies the signal width (sigma) by around 4.5% (average of up and down variations).
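
These percentages can be computed directly from the dictionaries filled in the previous block, for example via the symmetrised average of the up and down variations (an illustrative continuation of the code above):

# Symmetrised relative variations with respect to the nominal fit values\nscale_shift = 0.5*( abs(mean_values['scaleUp01Sigma'][0]-mean_values['nominal'][0]) + abs(mean_values['scaleDown01Sigma'][0]-mean_values['nominal'][0]) )/mean_values['nominal'][0]\nsmear_shift = 0.5*( abs(sigma_values['smearUp01Sigma'][0]-sigma_values['nominal'][0]) + abs(sigma_values['smearDown01Sigma'][0]-sigma_values['nominal'][0]) )/sigma_values['nominal'][0]\nprint(\"Relative shift in mean (scale): %.2f%%, relative shift in sigma (smear): %.2f%%\"%(scale_shift*100,smear_shift*100))\n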

Now we need to bake these effects into the parametric signal model. The mean of the Gaussian was previously defined as:

\\[ \\mu = m_H + \\delta\\]

We introduce the nuisance parameter nuisance_scale = \\(\\eta\\) to account for a shift in the signal peak mean using:

\\[ \\mu = (m_H + \\delta) \\cdot (1+0.003\\eta)\\]

At \\(\\eta = +1 (-1)\\) the signal peak mean will shift up (down) by 0.3%. To build this into the RooFit signal model we simply define a new parameter, \\(\\eta\\), and update the definition of the mean formula variable:

# Building the workspace with systematic variations\nMH = ROOT.RooRealVar(\"MH\", \"MH\", 125, 120, 130 )\nMH.setConstant(True)\n\n# Define formula for mean of Gaussian\ndMH = ROOT.RooRealVar(\"dMH_ggH_Tag0\", \"dMH_ggH_Tag0\", 0, -5, 5 )\neta = ROOT.RooRealVar(\"nuisance_scale\", \"nuisance_scale\", 0, -5, 5)\neta.setConstant(True)\nmean_formula = ROOT.RooFormulaVar(\"mean_ggH_Tag0\", \"mean_ggH_Tag0\", \"(@0+@1)*(1+0.003*@2)\", ROOT.RooArgList(MH,dMH,eta))\n
  • Why do we set the nuisance parameter to constant at this stage?

Similarly for the width, introducing a nuisance parameter, \(\chi\):

\\[ \\sigma = \\sigma \\cdot (1+0.045\\chi)\\]
sigma = ROOT.RooRealVar(\"sigma_ggH_Tag0_nominal\", \"sigma_ggH_Tag0_nominal\", 2, 1, 5)\nchi = ROOT.RooRealVar(\"nuisance_smear\", \"nuisance_smear\", 0, -5, 5)\nchi.setConstant(True)\nsigma_formula = ROOT.RooFormulaVar(\"sigma_ggH_Tag0\", \"sigma_ggH_Tag0\", \"@0*(1+0.045*@1)\", ROOT.RooArgList(sigma,chi))\n

Let's now fit the new model to the signal Monte-Carlo dataset, build the normalisation object and save the workspace.

# Define Gaussian\nmodel = ROOT.RooGaussian( \"model_ggH_Tag0\", \"model_ggH_Tag0\", mass, mean_formula, sigma_formula )\n\n# Fit model to MC\nmodel.fitTo( mc['nominal'], ROOT.RooFit.SumW2Error(True), ROOT.RooFit.PrintLevel(-1) )\n\n# Build signal model normalisation object\nxs_ggH = ROOT.RooRealVar(\"xs_ggH\", \"Cross section of ggH in [pb]\", 48.58 )\nbr_gamgam = ROOT.RooRealVar(\"BR_gamgam\", \"Branching ratio of Higgs to gamma gamma\", 0.0027 )\neff = mc['nominal'].sumEntries()/(xs_ggH.getVal()*br_gamgam.getVal())\neff_ggH_Tag0 = ROOT.RooRealVar(\"eff_ggH_Tag0\", \"Efficiency for ggH events to land in Tag0\", eff )\n# Set values to be constant\nxs_ggH.setConstant(True)\nbr_gamgam.setConstant(True)\neff_ggH_Tag0.setConstant(True)\n# Define normalisation component as product of these three variables\nnorm_sig = ROOT.RooProduct(\"model_ggH_Tag0_norm\", \"Normalisation term for ggH in Tag 0\", ROOT.RooArgList(xs_ggH,br_gamgam,eff_ggH_Tag0))\n\n# Set shape parameters of model to be constant (i.e. fixed in fit to data)\ndMH.setConstant(True)\nsigma.setConstant(True)\n\n# Build new signal model workspace with signal normalisation term. \nf_out = ROOT.TFile(\"workspace_sig_with_syst.root\", \"RECREATE\")\nw_sig = ROOT.RooWorkspace(\"workspace_sig\",\"workspace_sig\")\ngetattr(w_sig, \"import\")(model)\ngetattr(w_sig, \"import\")(norm_sig)\nw_sig.Print()\nw_sig.Write()\nf_out.Close()\n

The final step is to add the parametric uncertainties as Gaussian-constrained nuisance parameters into the datacard. The syntax means the Gaussian constraint term in the likelihood function will have a mean of 0 and a width of 1.

nuisance_scale        param    0.0    1.0\nnuisance_smear        param    0.0    1.0\n
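
Schematically, each of these param lines multiplies the likelihood by a unit Gaussian constraint on the corresponding nuisance parameter, e.g. for the scale nuisance:

\[ \mathcal{L} \rightarrow \mathcal{L} \cdot \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\eta^{2}}{2}\right) \]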
  • Try adding these lines to datacard_part1_with_norm.txt, along with the lines for the JEC and photonID yield uncertainties above, and compiling with the text2workspace command. Open the workspace and look at its contents. You will need to change the signal process workspace file name in the datacard to point to the new workspace (workspace_sig_with_syst.root).
  • Can you see the new objects in the compiled datacard that have been created for the systematic uncertainties? What do they correspond to?

We can now run a fit with the systematic uncertainties included. The option --saveSpecifiedNuis can be called to save the postfit nuisance parameter values in the combine output limit tree.

combine -M MultiDimFit datacard_part1_with_norm.root -m 125 --freezeParameters MH --saveWorkspace -n .bestfit.with_syst --saveSpecifiedNuis CMS_scale_j,CMS_hgg_phoIdMva,nuisance_scale,nuisance_smear\n
  • What do the postfit values of the nuisances tell us here? You can check them by opening the output file (root higgsCombine.bestfit.with_syst.MultiDimFit.mH125.root) and running limit->Show(0).
  • Try plotting the postfit mass distribution (as detailed in part 2). Do you notice any difference?
"},{"location":"tutorial2023/parametric_exercise/#uncertainty-breakdown","title":"Uncertainty breakdown","text":"

A more complete datacard with additional nuisance parameters is stored in datacard_part3.txt. We will use this datacard for the rest of part 3. Open the text file and have a look at the contents.

The following line has been appended to the end of the datacard to define the set of theory nuisance parameters. This will come in handy when calculating the uncertainty breakdown.

theory group = BR_hgg QCDscale_ggH pdf_Higgs_ggH alphaS_ggH UnderlyingEvent PartonShower\n

Compile the datacard and run an observed MultiDimFit likelihood scan over the signal strength, r:

text2workspace.py datacard_part3.txt -m 125\n\ncombine -M MultiDimFit datacard_part3.root -m 125 --freezeParameters MH -n .scan.with_syst --algo grid --points 20 --setParameterRanges r=0.5,2.5\n

Our aim is to break down the total uncertainty into the systematic and statistical components. To get the statistical-uncertainty-only scan it should be as simple as freezing the nuisance parameters in the fit... right?

Try it by adding ,allConstrainedNuisances to the --freezeParameters option. This will freeze all (constrained) nuisance parameters in the fit. You can also feed in regular expressions with wildcards using rgx{.*}. For instance to freeze only the nuisance_scale and nuisance_smear you could run with --freezeParameters MH,rgx{nuisance_.*}.

combine -M MultiDimFit datacard_part3.root -m 125 --freezeParameters MH,allConstrainedNuisances -n .scan.with_syst.statonly --algo grid --points 20 --setParameterRanges r=0.5,2.5\n

You can plot the two likelihood scans on the same axis with the command:

plot1DScan.py higgsCombine.scan.with_syst.MultiDimFit.mH125.root --main-label \"With systematics\" --main-color 1 --others higgsCombine.scan.with_syst.statonly.MultiDimFit.mH125.root:\"Stat-only\":2 -o part3_scan_v0\n

  • Can you spot the problem?

The nuisance parameters introduced into the model have pulled the best-fit signal strength point! Therefore we cannot simply subtract the uncertainties in quadrature to get an estimate for the systematic/statistical uncertainty breakdown.

The correct approach is to freeze the nuisance parameters to their respective best-fit values in the stat-only scan. We can do this by first saving a postfit workspace with all nuisance parameters profiled in the fit. Then we load the postfit snapshot values of the nuisance parameters (with the option --snapshotName MultiDimFit) from the combine output of the previous step, and then freeze the nuisance parameters for the stat-only scan.

combine -M MultiDimFit datacard_part3.root -m 125 --freezeParameters MH -n .bestfit.with_syst --setParameterRanges r=0.5,2.5 --saveWorkspace\n\ncombine -M MultiDimFit higgsCombine.bestfit.with_syst.MultiDimFit.mH125.root -m 125 --freezeParameters MH,allConstrainedNuisances -n .scan.with_syst.statonly_correct --algo grid --points 20 --setParameterRanges r=0.5,2.5 --snapshotName MultiDimFit\n

Adding the option --breakdown syst,stat to the plot1DScan.py command will automatically calculate the uncertainty breakdown for you.

plot1DScan.py higgsCombine.scan.with_syst.MultiDimFit.mH125.root --main-label \"With systematics\" --main-color 1 --others higgsCombine.scan.with_syst.statonly_correct.MultiDimFit.mH125.root:\"Stat-only\":2 -o part3_scan_v1 --breakdown syst,stat\n

We can also freeze groups of nuisance parameters defined in the datacard with the option --freezeNuisanceGroups. Let's run a scan freezing only the theory uncertainties (using the nuisance group we defined in the datacard):

combine -M MultiDimFit higgsCombine.bestfit.with_syst.MultiDimFit.mH125.root -m 125 --freezeParameters MH --freezeNuisanceGroups theory -n .scan.with_syst.freezeTheory --algo grid --points 20 --setParameterRanges r=0.5,2.5 --snapshotName MultiDimFit\n

To break down the total uncertainty into the theory, experimental and statistical components, we can then use:

plot1DScan.py higgsCombine.scan.with_syst.MultiDimFit.mH125.root --main-label Total --main-color 1 --others higgsCombine.scan.with_syst.freezeTheory.MultiDimFit.mH125.root:\"Freeze theory\":4 higgsCombine.scan.with_syst.statonly_correct.MultiDimFit.mH125.root:\"Stat-only\":2 -o part3_scan_v2 --breakdown theory,exp,stat\n

These methods are not limited to this particular grouping of systematics. We can use the above procedure to assess the impact of any nuisance parameter(s) on the signal strength confidence interval.

  • Try and calculate the contribution to the total uncertainty from the luminosity estimate using this approach.
"},{"location":"tutorial2023/parametric_exercise/#impacts","title":"Impacts","text":"

It is often useful/required to check the impacts of the nuisance parameters (NP) on the parameter of interest, r. The impact of a NP is defined as the shift \\(\\Delta r\\) induced as the NP, \\(\\theta\\), is fixed to its \\(\\pm1\\sigma\\) values, with all other parameters profiled as normal. More information can be found in the combine documentation via this link.
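
Written out, for a nuisance parameter \(\theta\) with postfit value \(\hat{\theta}\) and uncertainty \(\sigma_{\theta}\), the impacts are the shifts:

\[ \Delta r^{\pm} = \hat{r}(\theta = \hat{\theta} \pm \sigma_{\theta}) - \hat{r} \]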

Let's calculate the impacts for our analysis. We can use the combineTool.py from the CombineHarvester package to automate the scripts. The impacts are calculated in a few stages:

1) Do an initial fit for the parameter of interest, adding the --robustFit 1 option:

combineTool.py -M Impacts -d datacard_part3.root -m 125 --freezeParameters MH -n .impacts --setParameterRanges r=0.5,2.5 --doInitialFit --robustFit 1\n
  • What does the option --robustFit 1 do?

2) Next perform a similar scan for each NP with the --doFits option. This may take a few minutes:

combineTool.py -M Impacts -d datacard_part3.root -m 125 --freezeParameters MH -n .impacts --setParameterRanges r=0.5,2.5 --doFits --robustFit 1\n

3) Collect the outputs from the previous step and write the results to a json file:

combineTool.py -M Impacts -d datacard_part3.root -m 125 --freezeParameters MH -n .impacts --setParameterRanges r=0.5,2.5 -o impacts_part3.json\n

4) Produce a plot summarising the nuisance parameter values and impacts:

plotImpacts.py -i impacts_part3.json -o impacts_part3\n

There is a lot of information in these plots, which can be of invaluable use to analysers in understanding the fit. Do you understand everything that the plot is showing?

  • Which NP has the highest impact on the signal strength measurement?
  • Which NP is pulled the most in the fit to data? What does this information imply about the signal model mean in relation to the data?
  • Which NP is the most constrained in the fit to the data? What does it mean for a nuisance parameter to be constrained?
  • Try adding the option --summary to the impacts plotting command. This is a nice new feature in combine!
"},{"location":"tutorial2023/parametric_exercise/#part-4-toy-generation-and-bias-studies","title":"Part 4: Toy generation and bias studies","text":"

With combine we can generate toy datasets from the compiled datacard workspace. Please read this section in the combine manual before proceeding.

An interesting use case of toy generation is when performing bias studies. In the Higgs to two photon (Hgg) analysis, the background is fit with some functional form. However (due to the complexities of QCD) the exact form of this function is unknown. Therefore, we need to understand how our choice of background function may impact the fitted signal strength. This is performed using a bias study, which will indicate how much potential bias is present given a certain choice of functional form.

In the classical bias studies we begin by building a set of workspaces which correspond to different background function choices. In addition to the RooExponential constructed in Section 1, let's also try a (4th order) RooChebychev polynomial and a simple power law function to fit the background \\(m_{\\gamma\\gamma}\\) distribution.

The script used to fit the different functions and build the workspaces is construct_models_bias_study_part4.py. Take some time to look at the script and understand what the code is doing. In particular notice how we have saved the data as a RooDataHist in the workspace. This means we are now performing binned maximum likelihood fits (this is useful for part 4 to speed up fitting the many toys). If the binning is sufficiently granular, then there will be no noticeable difference in the results to the unbinned likelihood fits. Run the script with:

python3  construct_models_bias_study_part4.py\n

The outputs are a set of workspaces which correspond to different choices of background model functions, and a plot showing fits of the different functions to the data mass sidebands.

The datacards for the different background model functions are saved as datacard_part4_{pdf}.txt where pdf = {exp,poly,pow}. Have a look inside the .txt files and understand what changes have been made to pick up the different functions. Compile the datacards with:

for pdf in {exp,poly,pow}; do text2workspace.py datacard_part4_${pdf}.txt -m 125; done\n
"},{"location":"tutorial2023/parametric_exercise/#bias-studies","title":"Bias studies","text":"

For the bias studies we want to generate (\"throw\") toy datasets with some choice of background function and fit back with another. The toys are thrown with a known value of the signal strength (r=1 in this example), which we will call \\(r_{truth}\\). The fitted value of r is defined as \\(r_{fit}\\), with some uncertainty \\(\\sigma_{fit}\\). A pull value, \\(P\\), is calculated for each toy dataset according to,

\\[ P = (r_{truth}-r_{fit})/\\sigma_{fit}\\]

By repeating the process for many toys we can build up a pull distribution. If there is no bias present then we would expect to obtain a normal distribution centred at 0, with a standard deviation of 1. Let's calculate the bias for our analysis.
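
Schematically, building and fitting the pull distribution amounts to the following (the plot_bias_pull.py script used later does this from the combine output files for you; the r_fit and sigma_fit values here are just placeholders):

import ROOT\n\nr_truth = 1.0\n# Illustrative inputs: fitted values and uncertainties, one entry per toy\nr_fit = [1.12, 0.94, 1.21]\nsigma_fit = [0.21, 0.20, 0.22]\n\nh_pull = ROOT.TH1F(\"h_pull\", \"h_pull\", 80, -4, 4)\nfor rf, sf in zip(r_fit, sigma_fit):\n    h_pull.Fill( (r_truth-rf)/sf )\nh_pull.Fit(\"gaus\") # the fitted mean of the Gaussian gives the potential bias\n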

Firstly, we generate N=1000 toys from each of the background function choices and save them in a ROOT file. For this we use the GenerateOnly method of combine. We will inject signal in the toys by setting r=1 using the --expectSignal 1 option.

  • If time allows, repeat the bias studies with --expectSignal 0. This will inform us of the potential bias in the signal strength measurement given that there is no true signal.

The following commands show the example of throwing 1000 toys from the exponential function, and then fitting back with the 4th-order Chebychev polynomial. We use the singles algorithm to obtain a value for \\(r_{fit}\\) and \\(\\sigma_{fit}\\) simultaneously.

combine -M GenerateOnly datacard_part4_exp.root -m 125 --freezeParameters MH -t 1000 -n .generate_exp --expectSignal 1 --saveToys\n\ncombine -M MultiDimFit datacard_part4_poly.root -m 125 --freezeParameters MH -t 1000 -n .bias_truth_exp_fit_poly --expectSignal 1 --toysFile higgsCombine.generate_exp.GenerateOnly.mH125.123456.root --algo singles\n

The script plot_bias_pull.py will plot the pull distribution and fit a Gaussian to it:

python3 plot_bias_pull.py\n

The potential bias is defined as the (fitted) mean of the pull distribution.

  • What is our bias value? Have we generated enough toys to be confident of the bias value? You could try generating more toys if not.
  • What threshold do we use to define \"acceptable\" bias?

From the pull definition, we see that the bias value is defined relative to the total uncertainty in the signal strength (the denominator, \(\sigma_{fit}\)). Some analyses use 0.14 as the threshold because a bias below this value would change the total uncertainty (when added in quadrature) by less than 1% (see equation below). Other analyses use 0.2, as this will change the total uncertainty by less than 2%. We should define the threshold before performing the bias study.

\\[ \\sqrt{ 1^2 + 0.14^2} = 1.0098 \\]
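
Similarly, the looser threshold of 0.2 corresponds to a change in the total uncertainty of just under 2%:

\[ \sqrt{ 1^2 + 0.2^2} = 1.0198 \]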
  • How does our bias value compare to the thresholds? If the bias is outside the acceptable region, we should account for this using a spurious signal method (see advanced exercises TBA).
  • Repeat the bias study for each possible truth and fitted background function combinations. Do the bias values induced by the choice of background function merit adding a spurious signal component into the fit?
  • What would you expect the bias value to be for a background function that does not fit the data well? Should we be worried about such functions? What test could we use to reject such functions from the study beforehand?
"},{"location":"tutorial2023/parametric_exercise/#part-5-discrete-profiling","title":"Part 5: Discrete-profiling","text":"

If multiple pdfs exist to fit some distribution, we can store all pdfs in a single workspace by using a RooMultiPdf object. The script construct_models_multipdf_part5.py shows how to store the exponential, (4th order) Chebychev polynomial and the power law function from the previous section in a RooMultiPdf object. This requires a RooCategory index, which controls which pdf is active at any one time. Look at the contents of the script and then run with:

python3 construct_models_multipdf_part5.py\n
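
A minimal sketch of the kind of construction performed in that script is shown below, assuming three background pdfs (model_exp, model_poly, model_pow) have already been built, and that the RooMultiPdf class is made available by loading the combine library (the exact details may differ from the actual script):

import ROOT\nROOT.gSystem.Load(\"libHiggsAnalysisCombinedLimit\") # assumption: standard combine setup which provides RooMultiPdf\n\n# Category index controlling which pdf is active at any one time\ncat = ROOT.RooCategory(\"pdfindex_Tag0\", \"Index of the active pdf in Tag0\")\n\n# Store the candidate background pdfs (built earlier in the script, names illustrative) in the multipdf\npdfs = ROOT.RooArgList(model_exp, model_poly, model_pow)\nmultipdf = ROOT.RooMultiPdf(\"multipdf_Tag0\", \"MultiPdf for Tag0\", cat, pdfs)\nmultipdf.setCorrectionFactor(0.5) # discrete-profiling penalty term, see below\n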

The file datacard_part5.txt will load the multipdf as the background model. Notice the line at the end of the datacard (see below). This tells combine about the RooCategory index.

pdfindex_Tag0         discrete\n

Compile the datacard with:

text2workspace.py datacard_part5.txt -m 125\n

The RooMultiPdf is a handy object for performing bias studies as all functions can be stored in a single workspace. You can then set which function is used for generating the toys with the --setParameters pdfindex_Tag0=i option, and which function is used for fitting with --setParameters pdfindex_Tag0=j --freezeParameters pdfindex_Tag0 options.
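
For example, to throw toys from the exponential (index 0) and fit them back with the Chebychev polynomial (index 1), the part 4 commands could be adapted as follows (a sketch; the names and the default seed in the toy file name follow the earlier examples):

combine -M GenerateOnly datacard_part5.root -m 125 --freezeParameters MH -t 1000 -n .generate_multipdf_exp --expectSignal 1 --saveToys --setParameters pdfindex_Tag0=0\n\ncombine -M MultiDimFit datacard_part5.root -m 125 --freezeParameters MH,pdfindex_Tag0 --setParameters pdfindex_Tag0=1 -t 1000 -n .bias_multipdf_truth_exp_fit_poly --expectSignal 1 --toysFile higgsCombine.generate_multipdf_exp.GenerateOnly.mH125.123456.root --algo singles\n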

  • It would be a useful exercise to repeat the bias studies from part 4 but using the RooMultiPdf workspace. What happens when you do not freeze the index in the fitting step?

But simpler bias studies are not the only benefit of using the RooMultiPdf! It also allows us to apply the discrete profiling method in our analysis. In this method, the index labelling which pdf is active (a discrete nuisance parameter) is left floating in the fit, and will be profiled by looping through all the possible index values and finding the pdf which gives the best fit. In this manner, we are able to account for the uncertainty in the choice of the background function.

Note, by default, the multipdf will tell combine to add 0.5 to the NLL for each parameter in the pdf. This is known as the penalty term (or correction factor) for the discrete profiling method. You can toggle this term when building the workspace with the command multipdf.setCorrectionFactor(0.5). You may need to change the value of this term to obtain an acceptable bias in your fit!

Let's run a likelihood scan using the compiled datacard with the RooMultiPdf:

combine -M MultiDimFit datacard_part5.root -m 125 --freezeParameters MH -n .scan.multidimfit --algo grid --points 20 --cminDefaultMinimizerStrategy 0 --saveSpecifiedIndex pdfindex_Tag0 --setParameterRanges r=0.5,2.5\n

The option --cminDefaultMinimizerStrategy 0 is required to prevent HESSE being called as this cannot handle discrete nuisance parameters. HESSE is the full calculation of the second derivative matrix (Hessian) of the likelihood using finite difference methods.

The option --saveSpecifiedIndex pdfindex_Tag0 saves the value of the index at each point in the likelihood scan. Let's have a look at how the index value changes as a function of the signal strength. You can make the following plot by running:

python3 plot_pdfindex.py\n

By floating the discrete nuisance parameter pdfindex_Tag0, at each point in the likelihood scan the pdfs will be iterated over and the one which gives the max likelihood (lowest 2NLL) including the correction factor will be used. The plot above shows that the pdfindex_Tag0=0 (exponential) is chosen for the majority of r values, but this switches to pdfindex_Tag0=1 (Chebychev polynomial) at the lower edge of the r range. We can see the impact on the likelihood scan by fixing the pdf to the exponential:

combine -M MultiDimFit datacard_part5.root -m 125 --freezeParameters MH,pdfindex_Tag0 --setParameters pdfindex_Tag0=0 -n .scan.multidimfit.fix_exp --algo grid --points 20 --cminDefaultMinimizerStrategy 0 --saveSpecifiedIndex pdfindex_Tag0 --setParameterRanges r=0.5,2.5\n

Plotting the two scans on the same axis:

plot1DScan.py higgsCombine.scan.multidimfit.MultiDimFit.mH125.root --main-label \"Pdf choice floating\" --main-color 1 --others higgsCombine.scan.multidimfit.fix_exp.MultiDimFit.mH125.root:\"Pdf fixed to exponential\":2 -o part5_scan --y-cut 35 --y-max 35\n

The impact on the likelihood scan is evident at the lower edge, where the scan in which the index is floating flattens out. In this example, neither the \(1\sigma\) nor the \(2\sigma\) interval is affected. But this is not always the case! Ultimately, this method allows us to account for the uncertainty in the choice of background function in the signal strength measurement.

Coming back to the bias studies: do you now understand what you are testing if you do not freeze the index in the fitting stage? In this case you are fitting the toys back with the discrete profiling method. This is the standard approach for the bias studies when we use the discrete-profiling method in an analysis.

There are a number of options which can be added to the combine command to improve the performance when using discrete nuisance parameters. These are detailed at the end of this section in the combine manual.

"},{"location":"tutorial2023/parametric_exercise/#part-6-multi-signal-model","title":"Part 6: Multi-signal model","text":"

In reality, there are multiple Higgs boson processes which contribute to the total signal model, not only ggH. This section will explain how we can add an additional signal process (VBF) into the fit. Following this, we will add a second analysis category (Tag1), which has a higher purity of VBF events. To put this in context, the selection for Tag1 may require two jets with a large pseudorapidity separation and high invariant mass, which are typical properties of the VBF topology. By including this additional category with a different relative yield of VBF to ggH production, we are able to simultaneously constrain the rate of the two production modes.

In the SM, the VBF process has a cross section which is roughly 10 times smaller than the ggH cross section. This explains why we need to use certain features of the event to boost the purity of VBF events. The LO Feynman diagram for VBF production is shown below.

"},{"location":"tutorial2023/parametric_exercise/#building-the-models","title":"Building the models","text":"

Firstly, let's build the necessary inputs for this section using construct_models_part6.py. This script uses everything we have learnt in the previous sections: 1) Signal models (Gaussians) are built separately for each process (ggH and VBF) in each analysis category (Tag0 and Tag1). This uses separate TTrees for each contribution in the mc_part6.root file. The mean and width of the Gaussians include the effect of the parametric shape uncertainties, nuisance_scale and nuisance_smear. Each signal model is normalised according to the following equation, where \(\epsilon_{ij}\) labels the fraction of process, \(i\) (=ggH,VBF), landing in analysis category, \(j\) (=Tag0,Tag1), and \(\mathcal{L}\) is the integrated luminosity (defined in the datacard).

\\[ N_{ij} = \\sigma_i \\cdot \\mathcal{B}^{\\gamma\\gamma} \\cdot \\epsilon_{ij} \\cdot \\mathcal{L}\\]

2) A background model is constructed for each analysis category by fitting the mass sidebands in data. The input data is stored in the data_part6.root file. The models are RooMultiPdfs which contain an exponential, a 4th-order Chebychev polynomial and a power law function. The shape parameters and normalisation terms of the background models are freely floating in the final fit.

  • Have a look through the construct_models_part6.py script and try to understand all parts of the model construction. When you are happy, go ahead and construct the models with:
python3 construct_models_part6.py\n

The datacards for the two analysis categories are saved separately as datacard_part6_Tag0.txt and datacard_part6_Tag1.txt.

  • Do you understand the changes made to include multiple signal processes in the datacard? What value in the process line is used to label VBF as a signal?
  • Try compiling the individual datacards. What are the prefit ggH and VBF yields in each analysis category? You can find these by opening the workspace and printing the contents.
  • Run the best fits and plot the prefit and postfit S+B models along with the data (see code in part 2). How does the absolute number of data events in Tag1 compare to Tag0? What about the signal-to-background ratio, S/B?

In order to combine the two categories into a single datacard, we make use of the combineCards.py script:

combineCards.py datacard_part6_Tag0.txt datacard_part6_Tag1.txt > datacard_part6_combined.txt\n
"},{"location":"tutorial2023/parametric_exercise/#running-the-fits","title":"Running the fits","text":"

If we use the default text2workspace command on the combined datacard, then this will introduce a single signal strength modifier, which modifies the rate of all signal processes (ggH and VBF) by the same factor.

  • Try compiling the combined datacard and running a likelihood scan. Does the sensitivity to the global signal strength improve by adding the additional analysis category \"Tag1\"?

If we want to measure the independent rates of both processes simultaneously, then we need to introduce a separate signal strength for ggH and VBF. To do this we use the multiSignalModel physics model in combine by adding the following options to the text2workspace command:

text2workspace.py datacard_part6_combined.txt -m 125 -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel --PO \"map=.*/ggH:r_ggH[1,0,2]\" --PO \"map=.*/VBF:r_VBF[1,0,3]\" -o datacard_part6_combined_multiSignalModel.root\n

The syntax for the parameter-to-process mapping is map=category/process:POI[default,min,max]. We have used the wildcard .* to tell combine that the POI (parameter of interest) should scale all cases of that process, regardless of the analysis category. The output of this command tells us what is scaled by the two signal strengths:

Will scale  ch1/ggH  by  r_ggH\nWill scale  ch1/VBF  by  r_VBF\nWill scale  ch1/bkg_mass  by  1\nWill scale  ch2/ggH  by  r_ggH\nWill scale  ch2/VBF  by  r_VBF\nWill scale  ch2/bkg_mass  by  1\nWill scale  ch1/ggH  by  r_ggH\nWill scale  ch1/VBF  by  r_VBF\nWill scale  ch1/bkg_mass  by  1\nWill scale  ch2/ggH  by  r_ggH\nWill scale  ch2/VBF  by  r_VBF\nWill scale  ch2/bkg_mass  by  1\n

Exactly what we require!

To run a 1D \"profiled\" likelihood scan for ggH we use the following command:

combine -M MultiDimFit datacard_part6_combined_multiSignalModel.root -m 125 --freezeParameters MH -n .scan.part6_multiSignalModel_ggH --algo grid --points 20 --cminDefaultMinimizerStrategy 0 --saveInactivePOI 1 -P r_ggH --floatOtherPOIs 1\n
  • \"Profiled\" here means we are profiling over the other parameter of interest, r_VBF in the fit. In other words, we are treating r_VBF as an additional nuisance parameter. The option --saveInactivePOI 1 stores the value of r_VBF in the combine output. Take a look at the fit output. Does the value of r_VBF depend on r_ggH? Are the two parameters of interest correlated? Remember, to look at the contents of the TTree you can use limit->Show(i), where i is an integer labelling the point in the likelihood scan.
  • Run the profiled scan for the VBF signal strength. Plot the r_ggH and r_VBF likelihood scans using the plot1DScan.py script. You will need to change some of the input options, in particular the --POI option. You can list the full set of options by running:
plot1DScan.py --help\n
"},{"location":"tutorial2023/parametric_exercise/#two-dimensional-likelihood-scan","title":"Two-dimensional likelihood scan","text":"

We can also run the fit at fixed points in (r_ggH,r_VBF) space. By using a sufficient number of points, we are able to map out the 2D likelihood surface. Let's change the ranges of the parameters of interest to match what we have found in the profiled scans:

combine -M MultiDimFit datacard_part6_combined_multiSignalModel.root -m 125 --freezeParameters MH -n .scan2D.part6_multiSignalModel --algo grid --points 800 --cminDefaultMinimizerStrategy 0 -P r_ggH -P r_VBF --setParameterRanges r_ggH=0.5,2.5:r_VBF=-1,2\n

To plot the output you can use the plot_2D_scan.py script:

python3 plot_2D_scan.py\n

This script interpolates the 2NLL value between the points evaluated in the scan so that the plot shows a smooth likelihood surface. You may find that, in some cases, the number of scanned points and the interpolation parameters need to be tuned to get a sensible-looking surface. This basically depends on how complicated the likelihood surface is.

  • The plot shows that the data is in agreement with the SM within the \(2\sigma\) CL. Here, the \(1\sigma\) and \(2\sigma\) confidence interval contours correspond to 2NLL values of 2.3 and 5.99, respectively. Do you understand why this is the case? Think about Wilks' theorem.
  • Does the plot show any correlation between the ggH and VBF signal strengths? Are the two positively or negatively correlated? Does this make sense for this pair of parameters given the analysis setup? Try repeating the 2D likelihood scan using the \"Tag0\" only datacard. How does the correlation behaviour change?
  • How can we read off the \"profiled\" 1D likelihood scan constraints from this plot?
"},{"location":"tutorial2023/parametric_exercise/#correlations-between-parameters","title":"Correlations between parameters","text":"

For template-based analyses we can use the FitDiagnostics method in combine to extract the covariance matrix for the fit parameters. Unfortunately, this method is not compatible when using discrete nuisance parameters (RooMultiPdf). Instead, we can use the robustHesse method to find the Hessian matrix by finite difference methods. The matrix is then inverted to get the covariance. Subsequently, we can use the covariance to extract the correlations between fit parameters.

combine -M MultiDimFit datacard_part6_combined_multiSignalModel.root -m 125 --freezeParameters MH -n .robustHesse.part6_multiSignalModel --cminDefaultMinimizerStrategy 0 -P r_ggH -P r_VBF --setParameterRanges r_ggH=0.5,2.5:r_VBF=-1,2 --robustHesse 1 --robustHesseSave 1 --saveFitResult\n

The output file robustHesse.robustHesse.part6_multiSignalModel.root stores the correlation matrix (h_correlation). This contains the correlations between all parameters including the nuisances. So if we are interested in the correlation between r_ggH and r_VBF, we first need to find which bin corresponds to these parameters:

root robustHesse.robustHesse.part6_multiSignalModel.root\n\nroot [1] h_correlation->GetXaxis()->GetBinLabel(19)\n(const char *) \"r_VBF\"\nroot [2] h_correlation->GetYaxis()->GetBinLabel(20)\n(const char *) \"r_ggH\"\nroot [3] h_correlation->GetBinContent(19,20)\n(double) -0.19822058\n
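
Since the bin numbers depend on the exact set of parameters in the fit, it can be more robust to look the bins up by label, for example with a short (illustrative) Python snippet:

import ROOT\n\nf = ROOT.TFile(\"robustHesse.robustHesse.part6_multiSignalModel.root\")\nh = f.Get(\"h_correlation\")\n\n# Map parameter labels to axis bin indices (x and y share the same ordering)\nlabels = { h.GetXaxis().GetBinLabel(i) : i for i in range(1, h.GetNbinsX()+1) }\nprint(\"Correlation(r_ggH,r_VBF) = %.3f\"%h.GetBinContent(labels[\"r_VBF\"], labels[\"r_ggH\"]))\n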
  • The two parameters of interest have a correlation coefficient of -0.198. This means the two parameters are somewhat anti-correlated. Does this match what we see in the 2D likelihood scan?
"},{"location":"tutorial2023/parametric_exercise/#impacts_1","title":"Impacts","text":"

We extract the impacts for each parameter of interest using the following commands:

combineTool.py -M Impacts -d datacard_part6_combined_multiSignalModel.root -m 125 --freezeParameters MH -n .impacts_part6_multiSignal --robustFit 1 --cminDefaultMinimizerStrategy 0 -P r_ggH -P r_VBF --doInitialFit\n\ncombineTool.py -M Impacts -d datacard_part6_combined_multiSignalModel.root -m 125 --freezeParameters MH -n .impacts_part6_multiSignal --robustFit 1 --cminDefaultMinimizerStrategy 0 -P r_ggH -P r_VBF --doFits\n\ncombineTool.py -M Impacts -d datacard_part6_combined_multiSignalModel.root -m 125 --freezeParameters MH -n .impacts_part6_multiSignal --robustFit 1 --cminDefaultMinimizerStrategy 0 -P r_ggH -P r_VBF -o impacts_part6.json\n\nplotImpacts.py -i impacts_part6.json -o impacts_part6_r_ggH --POI r_ggH\nplotImpacts.py -i impacts_part6.json -o impacts_part6_r_VBF --POI r_VBF\n
  • Look at the output PDF files. How does the ranking of the nuisance parameters change for the different signal strengths?
"},{"location":"tutorial2023/parametric_exercise/#advanced-exercises-to-be-added","title":"Advanced exercises (to be added)","text":"

The combine experts will include additional exercises here in due course. These will include:

  • Convolution of model pdfs: RooAddPdf
  • Application of the spurious signal method
  • Advanced physics models including parametrised signal strengths e.g. SMEFT
  • Mass fits
  • Two-dimensional parametric models
"},{"location":"tutorial2023_unfolding/unfolding_exercise/","title":"Likelihood Based Unfolding Exercise in Combine","text":""},{"location":"tutorial2023_unfolding/unfolding_exercise/#getting-started","title":"Getting started","text":"

To get started, you should have a working setup of Combine and CombineHarvester. This setup can be done following any of the installation instructions.

After setting up CMSSW, you can access the working directory for this tutorial which contains all of the inputs and scripts needed to run the unfolding fitting exercise:

cd $CMSSW_BASE/src/HiggsAnalysis/CombinedLimit/data/tutorials/tutorial_unfolding_2023/\n
"},{"location":"tutorial2023_unfolding/unfolding_exercise/#exercise-outline","title":"Exercise outline","text":"

The hands-on exercise is split into seven parts:

1) \"Simple\" Unfolding Experiment

2) Producing the Migration matrix from the datacards

3) Advanced Unfolding with more detector-level information and control regions

4) Extracting the expected intervals

5) Producing Impacts for multiple POIs

6) Unfold to the generator-level quantities

7) Extracting POI correlations from the FitDiagnostics output

Throughout the tutorial there are a number of questions and exercises for you to complete. These are shown in the boxes like this one.

Note that some additional information on unfolding in Combine is available here; this also includes some information on regularization, which is not discussed in this tutorial.

"},{"location":"tutorial2023_unfolding/unfolding_exercise/#analysis-overview","title":"Analysis overview","text":"

In this tutorial we will look at the cross section measurement of one of the SM Higgs processes, VH, in the \(H\to b\bar{b}\) (VHbb) final state.

The measurement is performed within the Simplified Template Cross Section (STXS) framework, which provides the predictions in bins of the generator-level quantities \(p_{T}(V)\) and the number of additional jets. The maximum likelihood based unfolding is performed to measure the cross section in the generator-level bins defined by the STXS scheme. At the detector level we define appropriate categories to match the STXS bins as closely as possible, so that there is a good correspondence between the detector-level observable and the underlying generator-level quantity we are interested in.

Note that for this STXS measurement, as well as measuring the cross section as a function of the \(p_{T}\) of the vector boson, the measurement includes some information on the number of additional jets and is performed over multiple different production processes. However, it is common to focus on a single distribution (e.g. \(p_{T}\)) for a single process (e.g. \(t\bar{t}\)).

In this tutorial we will focus on ZH production, with the Z boson decaying to charged leptons, and the Higgs boson reconstructed as a resolved \(b\bar{b}\) pair. We will also use only a part of the Run 2 categories, so we will not achieve the same sensitivity as the full analysis. Note that the ggZH and ZH production modes are combined in the fit, since it is not possible to resolve them at this stage of the analysis. The STXS categories are defined independently of the Higgs decay channel, to streamline the combinations of the cross section measurements.

In the first part of the tutorial, we will setup a relatively simple unfolding, where there is a single detector-level bin for every generator-level bin we are trying to measure. We will then perform a blind analysis using this setup to see the expected sensitivity.

In this simple version of the analysis, we use a series of datacards, one for each detector-level bin, implemented as a counting experiment. We then combine the datacards for the full measurement. It is also possible to implement the same analysis as a single datacard, passing a histogram with each of the detector-level bins. Either method can be used, depending on which is more practical for the analysis being considered.

In the second part of the tutorial we will perform the same measurement with a more advanced setup, making use of differential distributions per generator-level bin we are trying to measure, as well as control regions. By providing this additional information to the fit, we are able to achieve a better and more robust unfolding result. After checking the expected sensitivity, we will take a look at the impacts and pulls of the nuisance parameters. Then we will unblind and look at the results of the measurement, produce generator-level plots and provide the correlation matrix for our measured observables.

"},{"location":"tutorial2023_unfolding/unfolding_exercise/#simplified-unfolding","title":"Simplified unfolding","text":"

When determining the detector-level binning for any differential analysis the main goal is to choose a binning that distinguishes the contributions from the various generator-level bins well. In the simplest case this can be done with a cut-based approach, i.e. applying the same binning to the detector-level observables as is applied to the generator-level quantities being measured. In this case, that means binning in \\(p_{T}(Z)\\) and \\(n_{\\text{add. jets}}\\). Due to the good lepton \\(p_{T}\\) resolution we can follow the original STXS scheme quite closely with the detector-level selection, with one exception: it is not possible to access the very low transverse momentum bin \\(p_{T}(Z)<75\\) GeV.

In the counting/regions directory you can find the datacards with five detector-level categories, each targeting a corresponding generator-level bin. Below you can find an example of the datacard for the detector-level bin with \\(p_{T}(Z)>400\\) GeV.

imax    1 number of bins\njmax    9 number of processes minus 1\nkmax    * number of nuisance parameters\n--------------------------------------------------------------------------------\n--------------------------------------------------------------------------------\nbin          vhbb_Zmm_gt400_13TeV\nobservation  12.0\n--------------------------------------------------------------------------------\nbin                                   vhbb_Zmm_gt400_13TeV   vhbb_Zmm_gt400_13TeV vhbb_Zmm_gt400_13TeV   vhbb_Zmm_gt400_13TeV     vhbb_Zmm_gt400_13TeV vhbb_Zmm_gt400_13TeV vhbb_Zmm_gt400_13TeV vhbb_Zmm_gt400_13TeV vhbb_Zmm_gt400_13TeV vhbb_Zmm_gt400_13TeV\nprocess                               ggZH_lep_PTV_GT400_hbb ZH_lep_PTV_GT400_hbb ZH_lep_PTV_250_400_hbb ggZH_lep_PTV_250_400_hbb Zj1b            Zj0b_c          Zj0b_udsg       VVLF            Zj2b            VVHF\nprocess                               -3                     -2                   -1                     0                        1               2               3               4               5               6\nrate                                  0.0907733              0.668303             0.026293               0.00434588               3.78735         2.58885         4.09457         0.413716        7.02731         0.642605\n--------------------------------------------------------------------------------\n\n

You can see the contributions from various background processes, namely Z+jets, \\(t\\bar{t}\\) and single top, as well as the signal processes (ggZH and ZH) corresponding to the STXS scheme discussed above. Note that for each generator-level bin being measured, we assign a different process in combine. This is so that the signal strengths for each of their contributions can float independently in the measurement. Also note that, due to migrations, each detector-level bin will receive contributions from multiple generator-level bins.

One of the most important stages in the analysis design is to make sure that the detector-level categories are well-chosen to target the corresponding generator-level processes.

To explicitly check the correspondence between detector- and generator-level, one can plot the contributions of each of the generator-level bins in all of the detector-level bins. You can use the script provided in the tutorial git-lab page. This script uses CombineHarvester to loop over the detector-level bins and get the rate at which each of the signal processes (generator-level bins) contributes to that detector-level bin; this is then used to plot the migration matrix.

python scripts/get_migration_matrix.py counting/combined_ratesOnly.txt\n\n

The migration matrix shows the generator-level bins on the x-axis and the corresponding detector-level bins on the y-axis. The entries are normalized such that all contributions for a given generator-level bin sum up to 1. With this convention, the numbers in each bin represent the probability that an event from a given generator-level bin is reconstructed in a given detector-level bin, if it is reconstructed at all within the considered bins.
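
Explicitly, using illustrative notation where \\(N_{\\mathrm{det},\\mathrm{gen}}\\) is the rate with which generator-level bin \\(\\mathrm{gen}\\) contributes to detector-level bin \\(\\mathrm{det}\\) (these symbols are not used by the script itself), each entry of the migration matrix is

\\[ M_{\\mathrm{det},\\mathrm{gen}} = \\frac{N_{\\mathrm{det},\\mathrm{gen}}}{\\sum_{\\mathrm{det}'} N_{\\mathrm{det}',\\mathrm{gen}}} \\]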

Now that we have checked the response matrix, we can attempt the maximum likelihood unfolding. We can use the multiSignalModel physics model available in Combine, which assigns a parameter of interest poi to a process p within a bin b using the syntax --PO 'map=b/p:poi[init, min, max]' to linearly scale the normalisation of this process under the parameter of interest (POI) variations. To create the workspace we can run the following command:

text2workspace.py -m 125  counting/combined_ratesOnly.txt -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel  --PO verbose --PO 'map=.*/.*ZH_lep_PTV_75_150_hbb:r_zh_75_150[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_150_250_0J_hbb:r_zh_150_250noj[1,-5,5]'  --PO 'map=.*/.*ZH_lep_PTV_150_250_GE1J_hbb:r_zh_150_250wj[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_250_400_hbb:r_zh_250_400[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_GT400_hbb:r_zh_gt400[1,-5,5]' -o ws_counting.root\n

In the example given above a signal POI is assigned to each generator-level bin independent of detector-level bin. This allows the measurement to take into account migrations.

To extract the measurement let's run the initial fit first using the MultiDimFit method implemented in Combine to extract the best-fit values and uncertainties on all floating parameters:

combineTool.py -M MultiDimFit --datacard ws_counting.root --setParameters r_zh_250_400=1,r_zh_150_250noj=1,r_zh_75_150=1,r_zh_150_250wj=1,r_zh_gt400=1 --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400 -t -1 \n

With the option -t -1 we set Combine to fit the Asimov dataset instead of the actual data. The option --setParameters <param>=<value> sets the initial value of the parameter named <param>. --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400 sets the POIs to the given comma-separated list, instead of the default r.

While the uncertainties on the parameters of interest (POIs) can be extracted in multiple ways, the most robust way is to run likelihood scans for the POI corresponding to each generator-level bin; this allows you to spot discontinuities in the likelihood shape in case of problems with the fit or the model.

combineTool.py -M MultiDimFit --datacard ws_counting.root -t -1 --setParameters r_zh_250_400=1,r_zh_150_250noj=1,r_zh_75_150=1,r_zh_150_250wj=1,r_zh_gt400=1 --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400 --algo=grid --points=100 -P r_zh_75_150 --floatOtherPOIs=1 -n scan_r_zh_75_150\n\n

Now we can plot the likelihood scan and extract the expected intervals.

python scripts/plot1DScan.py higgsCombinescan_r_zh_75_150.MultiDimFit.mH120.root -o r_zh_75_150 --POI r_zh_75_150\n
  • Repeat for all POIs, for example with a loop like the one sketched below.
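
Since the scan and plotting commands only differ by the POI name, they can be wrapped in a small shell loop. The following is only a sketch, reusing exactly the options shown above:

# scan one POI at a time, profiling the others, then plot the resulting likelihood scan\nfor p in r_zh_75_150 r_zh_150_250noj r_zh_150_250wj r_zh_250_400 r_zh_gt400\ndo\n    combineTool.py -M MultiDimFit --datacard ws_counting.root -t -1 --setParameters r_zh_250_400=1,r_zh_150_250noj=1,r_zh_75_150=1,r_zh_150_250wj=1,r_zh_gt400=1 --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400 --algo=grid --points=100 -P ${p} --floatOtherPOIs=1 -n scan_${p}\n    python scripts/plot1DScan.py higgsCombinescan_${p}.MultiDimFit.mH120.root -o ${p} --POI ${p}\ndone\n
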
"},{"location":"tutorial2023_unfolding/unfolding_exercise/#shape-analysis-with-control-regions","title":"Shape analysis with control regions","text":"

One of the advantages of the maximum likelihood unfolding is the flexibility to choose the analysis observable and include more information on the event kinematics, consequently improving the analysis sensitivity. This analysis benefits from the shape information of the DNN output trained to differentiate the VH(bb) signal from the SM backgrounds.

The datacards for this part of the exercise are located in full_model_datacards/, where you can find a separate datacard for each region within the full_model_datacards/regions directory, as well as a combined datacard full_model_datacards/comb_full_model.txt. In this case, each of the detector-level bins used in the unfolding above is now split into multiple bins according to the DNN output score. This provides extra discrimination power to separate the signal from the background and improve the measurement.

As you will find, the datacards also contain several background processes. To control them properly we will also add regions enriched in the respective backgrounds. Then we can define a common set of rate parameters for signal and control regions to scale the rates or other parameters affecting their shape.

For the shape datacards one has to specify the mapping of histograms to channels/processes as described below:

shapes [process] [channel] [file] [nominal] [systematics_templates]\n
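
For illustration, a typical shapes line using the standard $PROCESS, $CHANNEL and $SYSTEMATIC wildcards could look like the following (the file and histogram naming here is hypothetical, not taken from the tutorial datacards):

shapes * vhbb_Zmm_gt400_13TeV shapes_zmm.root $CHANNEL/$PROCESS $CHANNEL/$PROCESS_$SYSTEMATIC\n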

Then the shape nuisance parameters can be defined in the systematics block in the datacard. More details can be found in Combine documentation pages.

In many CMS analyses there are hundreds of nuisance parameters corresponding to various sources of systematic uncertainty.

When we unfold to the generator-level quantities we should remove the nuisances affecting the rate of the generator-level bins, i.e. when measuring a given cross-section such as \\(\\sigma_{\\textrm{gen1}}\\), the nuisance parameters should not change the value of that parameter itself; they should only change the relationship between that parameter and the observations. This means that, for example, effects of renormalization and factorization scales on the generator-level cross section within each bin need to be removed. Only their effects on the detector-level distribution through changes of shape within each bin as well as acceptances and efficiencies should be considered.

For this analysis, that means removing the lnN nuisance parameters THU_ZH_mig* and THU_ZH_inc, and keeping only the acceptance shape uncertainties THU_ZH_acc and THU_ggZH_acc, which do not scale the inclusive cross sections by construction. In this analysis the normalisation effects in the THU_ZH_acc and THU_ggZH_acc templates were already removed from the shape histograms. Removing the normalization effects can be achieved by removing the corresponding nuisance parameters from the datacard, by freezing them with the option --freezeParameters par_name1,par_name2, or by creating a group, following the syntax given below, at the end of the combined datacard and freezing the parameters with the --freezeNuisanceGroups group_name option.

[group_name] group = uncertainty_1 uncertainty_2 ... uncertainty_N\n
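
For example, the rate-affecting theory uncertainties discussed above could be collected into a group such as the following (the individual THU_ZH_mig* parameter names here are placeholders; use the names that actually appear in the combined datacard):

xsec_theory group = THU_ZH_inc THU_ZH_mig01 THU_ZH_mig12\n

This group can then be frozen in the fit with --freezeNuisanceGroups xsec_theory.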

Now we can create the workspace using the same multiSignalModel as before:

text2workspace.py -m 125  full_model_datacards/comb_full_model.txt -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel  --PO verbose --PO 'map=.*/.*ZH_lep_PTV_75_150_hbb:r_zh_75_150[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_150_250_0J_hbb:r_zh_150_250noj[1,-5,5]'  --PO 'map=.*/.*ZH_lep_PTV_150_250_GE1J_hbb:r_zh_150_250wj[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_250_400_hbb:r_zh_250_400[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_GT400_hbb:r_zh_gt400[1,-5,5]' --for-fits --no-wrappers --X-pack-asympows --optimize-simpdf-constraints=cms --use-histsum -o ws_full.root\n

As you might have noticed, we are using a few extra options, --for-fits --no-wrappers --X-pack-asympows --optimize-simpdf-constraints=cms --use-histsum, to create the workspace. They are needed to construct a more optimised pdf using the CMSHistSum class implemented in Combine, which significantly lowers the memory consumption.

  • Following the instructions given earlier, create the workspace and run the initial fit with -t -1.

Since this time the datacards include shape uncertainties, as well as additional categories to improve the background description, the fit might take much longer; however, we can submit jobs to a batch system using combineTool.py and have the results ready to look at in a few minutes.

combineTool.py -M MultiDimFit -d ws_full.root --setParameters r_zh_250_400=1,r_zh_150_250noj=1,r_zh_75_150=1,r_zh_150_250wj=1,r_zh_gt400=1 --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400  -t -1 --X-rtd FAST_VERTICAL_MORPH --algo=grid --points=50 --floatOtherPOIs=1 -n .scans_blinded --job-mode condor --task-name scans_zh  --split-points 1 --generate P:n::r_zh_gt400,r_zh_gt400:r_zh_250_400,r_zh_250_400:r_zh_150_250wj,r_zh_150_250wj:r_zh_150_250noj,r_zh_150_250noj:r_zh_75_150,r_zh_75_150\n

The option --X-rtd FAST_VERTICAL_MORPH is added here and for all combineTool.py -M MultiDimFit ... to speed up the minimisation.

The job submission is handled by CombineHarvester; the combination of options --job-mode condor --task-name scans_zh --split-points 1 --generate P:n::r_zh_gt400,r_zh_gt400:r_zh_250_400,r_zh_250_400:r_zh_150_250wj,r_zh_150_250wj:r_zh_150_250noj,r_zh_150_250noj:r_zh_75_150,r_zh_75_150 will submit the jobs to HTCondor for each POI. The --generate option is used to automatically generate jobs attaching the options -P <POI> -n <name> with each of the pairs of values <POI>,<name> specified between the colons. You can add the --dry-run option to create the submission files first and check them, and then submit the jobs with condor_submit condor_scans_zh.sub.

If you are running the tutorial from a cluster where HTCondor is not available you can also submit the jobs to a Slurm system; just change --job-mode condor to --job-mode slurm.

After all jobs are completed we can combine the files for each POI:

for p in r_zh_75_150 r_zh_150_250noj r_zh_150_250wj r_zh_250_400 r_zh_gt400\ndo\n    hadd -k -f scan_${p}_blinded.root higgsCombine.scans_blinded.${p}.POINTS.*.MultiDimFit.mH120.root\ndone\n

And finally plot the likelihood scans

python scripts/plot1DScan.py scan_r_zh_75_150_blinded.root  -o scan_r_zh_75_150_blinded --POI r_zh_75_150 --json summary_zh_stxs_blinded.json\n

"},{"location":"tutorial2023_unfolding/unfolding_exercise/#impacts","title":"Impacts","text":"

One of the important tests before we move to the unblinding stage is to check the impacts of the nuisance parameters on each POI. For this we can run combineTool.py with the -M Impacts method. We start with the initial fit, which should take about 20 minutes (a good time to have a coffee break!):

combineTool.py -M Impacts -d ws_full.root -m 125 --robustFit 1 --doInitialFit --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400 --X-rtd FAST_VERTICAL_MORPH\n

Note that it is important to add the option --redefineSignalPOIs [list of parameters] in order to produce the impacts for all POIs we defined when the workspace was created with the multiSignalModel.

After the initial fit is completed we can perform the likelihood scans for each nuisance parameter. We will submit the jobs to HTCondor to speed up the process.

combineTool.py -M Impacts -d ws_full.root -m 125 --robustFit 1 --doFits --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400 --job-mode condor --task-name impacts_zh --X-rtd FAST_VERTICAL_MORPH \n

Now we can combine the results into the .json format and use it to produce the impact plots.

combineTool.py -M Impacts -d ws_full.root -m 125 --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400 --output impacts.json \n\nplotImpacts.py -i impacts.json -o impacts_r_zh_75_150 --POI r_zh_75_150\n

  • Do you observe differences in the impact plots for different POIs? Do these differences make sense to you?

"},{"location":"tutorial2023_unfolding/unfolding_exercise/#unfolded-measurements","title":"Unfolded measurements","text":"

Now that we have studied the nuisance parameter impacts for each POI, we can finally perform the measurement. Note that for the purposes of the tutorial, we are skipping further checks and validation that you should do on your analysis, namely the goodness-of-fit test and the post-fit plots of the folded (detector-level) observables. Both of these checks were detailed in the previous exercises, which you can find under the following link.

At this stage we'll run the MultiDimFit again scanning each POI to calculate the intervals, but this time we'll remove the -t -1 option to extract the unblinded results.

Also, since we want to unfold the measurements to the generator-level observables, i.e. extract the cross sections, we remove the theoretical uncertainties affecting the rates of the signal processes. We can do this by freezing them with --freezeNuisanceGroups <group_name>, using the group_name you assigned earlier in the tutorial.
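
For example, the unblinded scan for a single POI could be run as sketched below, assuming the group was called xsec_theory (replace with your own group_name); as in the blinded case, merge or rename the per-point outputs to scan_r_zh_75_150.root before plotting:

combineTool.py -M MultiDimFit -d ws_full.root --setParameters r_zh_250_400=1,r_zh_150_250noj=1,r_zh_75_150=1,r_zh_150_250wj=1,r_zh_gt400=1 --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400 --freezeNuisanceGroups xsec_theory --algo=grid --points=50 -P r_zh_75_150 --floatOtherPOIs=1 -n .scan_r_zh_75_150 --X-rtd FAST_VERTICAL_MORPH\n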

Now plot the scans and collect the measurements in the json file summary_zh_stxs.json.

python scripts/plot1DScan.py scan_r_zh_75_150.root -o r_zh_75_150 --POI r_zh_75_150 --json summary_zh_stxs.json  \n

Repeat the same command for other POIs to fill the summary_zh_stxs.json, which can then be used to make the cross section plot by multiplying the standard model cross sections by the signal strengths' best-fit values as shown below.

python scripts/make_XSplot.py summary_zh_stxs.json\n

"},{"location":"tutorial2023_unfolding/unfolding_exercise/#poi-correlations","title":"POI correlations","text":"

In addition to the cross-section measurements it is very important to publish covariance or correlation information for the measured cross sections. This allows the measurement to be properly interpreted or reused in combined fits.

The correlation matrix or covariance matrix can be extracted from the results after the fit. Here we can use the FitDiagnostics or MultiDimFit method.

combineTool.py -M FitDiagnostics --datacard ws_full.root --setParameters r_zh_250_400=1,r_zh_150_250noj=1,r_zh_75_150=1,r_zh_150_250wj=1,r_zh_gt400=1 --redefineSignalPOIs r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400  --robustHesse 1 -n .full_model --X-rtd FAST_VERTICAL_MORPH\n

Then the RooFitResult, containing the correlation matrix, can be found in the fitDiagnostics.full_model.root file under the name fit_s. The script plotCorrelations_pois.py from the exercise git-lab repository can help to plot the correlation matrix.

python scripts/plotCorrelations_pois.py -i fitDiagnostics.full_model.root:fit_s -p r_zh_75_150,r_zh_150_250noj,r_zh_150_250wj,r_zh_250_400,r_zh_gt400\n\n

"},{"location":"what_combine_does/fitting_concepts/","title":"Likelihood based fitting","text":"

\"Fitting\" simply means estimating some parameters of a model (or really a set of models) based on data. Likelihood-based fitting does this through the likelihood function.

In frequentist frameworks, this typically means doing maximum likelihood estimation. In bayesian frameworks, usually posterior distributions of the parameters are calculated from the likelihood.

"},{"location":"what_combine_does/fitting_concepts/#fitting-frameworks","title":"Fitting Frameworks","text":"

Likelihood fits typically either follow a frequentist framework of maximum likelihood estimation, or a bayesian framework of updating estimates to find posterior distributions given the data.

"},{"location":"what_combine_does/fitting_concepts/#maximum-likelihood-fits","title":"Maximum Likelihood fits","text":"

A maximum likelihood fit means finding the values of the model parameters \\((\\vec{\\mu}, \\vec{\\nu})\\) which maximize the likelihood, \\(\\mathcal{L}(\\vec{\\mu},\\vec{\\nu})\\). The values which maximize the likelihood are the parameter estimates, denoted with a \"hat\" (\\(\\hat{}\\)):

\\[(\\vec{\\hat{\\mu}}, \\vec{\\hat{\\nu}}) \\equiv \\underset{\\vec{\\mu},\\vec{\\nu}}{\\operatorname{argmax}} \\mathcal{L}(\\vec{\\mu}, \\vec{\\nu})\\]

These values provide point estimates for the parameter values.

Because the likelihood is equal to the probability of observing the data given the model, the maximum likelihood estimate finds the parameter values for which the data is most probable.

"},{"location":"what_combine_does/fitting_concepts/#bayesian-posterior-calculation","title":"Bayesian Posterior Calculation","text":"

In a bayesian framework, the likelihood represents the probability of observing the data given the model and some prior probability distribution over the model parameters.

The prior probability of the parameters, \\(\\pi(\\vec{\\Phi})\\), is updated based on the data to provide the posterior distribution

\\[ p(\\vec{\\Phi};\\mathrm{data}) = \\frac{ p(\\mathrm{data};\\vec{\\Phi}) \\pi(\\vec{\\Phi}) }{\\int p(\\mathrm{data};\\vec{\\Phi}') \\pi(\\vec{\\Phi}') \\mathrm{d}\\vec{\\Phi}' } = \\frac{ \\mathcal{L}(\\vec{\\Phi}) \\pi(\\vec{\\Phi}) }{ \\int \\mathcal{L}(\\vec{\\Phi}') \\pi(\\vec{\\Phi}') \\mathrm{d}\\vec{\\Phi}' }\\]

The posterior distribution \\(p(\\vec{\\Phi};\\mathrm{data})\\) defines the updated belief about the parameters \\(\\vec{\\Phi}\\).

"},{"location":"what_combine_does/fitting_concepts/#methods-for-considering-subsets-of-models","title":"Methods for considering subsets of models","text":"

Often, one is interested in some particular aspect of a model. This may be for example information related to the parameters of interest, but not the nuisance parameters. In this case, one needs a method for specifying precisely what is meant by a model considering only those parameters of interest.

There are several methods for considering sub-models, each of which has its own interpretation and use cases.

"},{"location":"what_combine_does/fitting_concepts/#conditioning","title":"Conditioning","text":"

Conditional Sub-models can be made by simply restricting the values of some parameters. The conditional likelihood of the parameters \\(\\vec{\\mu}\\) conditioned on particular values of the parameters \\(\\vec{\\nu}\\) is:

\\[ \\mathcal{L}(\\vec{\\mu},\\vec{\\nu}) \\xrightarrow{\\mathrm{conditioned\\ on\\ } \\vec{\\nu} = \\vec{\\nu}_0} \\mathcal{L}(\\vec{\\mu}) = \\mathcal{L}(\\vec{\\mu},\\vec{\\nu}_0) \\]"},{"location":"what_combine_does/fitting_concepts/#profiling","title":"Profiling","text":"

The profiled likelihood \\(\\mathcal{L}(\\vec{\\mu})\\) is defined from the full likelihood, \\(\\mathcal{L}(\\vec{\\mu},\\vec{\\nu})\\), such that for every point \\(\\vec{\\mu}\\) it is equal to the full likelihood at \\(\\vec{\\mu}\\) maximized over \\(\\vec{\\nu}\\).

\\[ \\mathcal{L}(\\vec{\\mu},\\vec{\\nu}) \\xrightarrow{\\mathrm{profiling\\ } \\vec{\\nu}} \\mathcal{L}({\\vec{\\mu}}) = \\max_{\\vec{\\nu}} \\mathcal{L}(\\vec{\\mu},\\vec{\\nu})\\]

In some sense, the profiled likelihood is the best estimate of the likelihood at every point \\(\\vec{\\mu}\\); it is sometimes also denoted with a double hat notation \\(\\mathcal{L}(\\vec{\\mu},\\vec{\\hat{\\hat{\\nu}}}(\\vec{\\mu}))\\).

"},{"location":"what_combine_does/fitting_concepts/#marginalization","title":"Marginalization","text":"

Marginalization is a procedure for producing a probability distribution \\(p(\\vec{\\mu};\\mathrm{data})\\) for a set of parameters \\(\\vec{\\mu}\\), which are only a subset of the parameters in the full distribution \\(p(\\vec{\\mu},\\vec{\\nu};\\mathrm{data})\\). The marginal probability density \\(p(\\vec{\\mu})\\) is defined such that for every point \\(\\vec{\\mu}\\) it is equal to the probability at \\(\\vec{\\mu}\\) integrated over \\(\\vec{\\nu}\\).

\\[ p(\\vec{\\mu},\\vec{\\nu}) \\xrightarrow{\\mathrm{marginalizing\\ } \\vec{\\nu}} p({\\vec{\\mu}}) = \\int p(\\vec{\\mu},\\vec{\\nu}) \\mathrm{d}\\vec{\\nu}\\]

The marginalized probability \\(p(\\vec{\\mu})\\) is the probability for the parameter values \\(\\vec{\\mu}\\) taking into account all possible values of \\(\\vec{\\nu}\\).

Marginalized likelihoods can also be defined, by their relationship to the probability distributions.

"},{"location":"what_combine_does/fitting_concepts/#parameter-uncertainties","title":"Parameter Uncertainties","text":"

Parameter uncertainties describe regions of parameter values which are considered reasonable parameter values, rather than single estimates. These can be defined either in terms of frequentist confidence regions or bayesian credibility regions.

In both cases the region is defined by a confidence or credibility level \\(CL\\), which quantifies the meaning of the region. For frequentist confidence regions, the confidence level \\(CL\\) describes how often the confidence region will contain the true parameter values if the model is a sufficiently accurate approximation of the truth. For bayesian credibility regions, the credibility level \\(CL\\) describes the bayesian probability that the true parameter value is in that region under the given model.

The confidence or credibility regions are described by a set of points \\(\\{ \\vec{\\mu} \\}_{\\mathrm{CL}}\\) which meet some criteria. In most situations of interest, the credibility region or confidence region for a single parameter, \\(\\mu\\), is effectively described by an interval:

\\[ \\{ \\mu \\}_{\\mathrm{CL}} = [ \\mu^{-}_{\\mathrm{CL}}, \\mu^{+}_{\\mathrm{CL}} ] \\]

Typically indicated as:

\\[ \\mu = X^{+\\mathrm{up}}_{-\\mathrm{down}} \\]

or, if symmetric intervals are used:

\\[ \\mu = X \\pm \\mathrm{unc.} \\]"},{"location":"what_combine_does/fitting_concepts/#frequentist-confidence-regions","title":"Frequentist Confidence Regions","text":"

Frequentist confidence regions are random variables of the observed data. These are very often the construction used to define the uncertainties reported on a parameter.

If the same experiment is repeated multiple times, different data will be observed each time and a different confidence set \\(\\{ \\vec{\\mu}\\}_{\\mathrm{CL}}^{i}\\) will be found for each experiment. If the data are generated by the model with some set of values \\(\\vec{\\mu}_{\\mathrm{gen}}\\), then the fraction of the regions \\(\\{ \\vec{\\mu}\\}_{\\mathrm{CL}}^{i}\\) which contain the values \\(\\vec{\\mu}_{\\mathrm{gen}}\\) will be equal to the confidence level \\({\\mathrm{CL}}\\). The fraction of intervals which contain the generating parameter value is referred to as the \"coverage\".

From first principles, the intervals can be constructed using the Neyman construction.

In practice, the likelihood can be used to construct confidence regions for a set of parameters \\(\\vec{\\mu}\\) by using the profile likelihood ratio:

\\[ \\Lambda \\equiv \\frac{\\mathcal{L}(\\vec{\\mu},\\vec{\\hat{\\nu}}(\\vec{\\mu}))}{\\mathcal{L}(\\vec{\\hat{\\mu}},\\vec{\\hat{\\nu}})} \\]

i.e. the ratio of the profile likelihood at the point \\(\\vec{\\mu}\\) to the maximum likelihood. For technical reasons, the negative logarithm of this quantity is typically used in practice.

Each point \\(\\vec{\\mu}\\) can be tested to see if it is in the confidence region, by checking the value of the likelihood ratio at that point and comparing it to the expected distribution if that point were the true generating value of the data.

\\[ \\{ \\vec{\\mu} \\}_{\\mathrm{CL}} = \\{ \\vec{\\mu} : -\\log(\\Lambda) \\lt \\gamma_{\\mathrm{CL}}(\\vec{\\mu}) \\} \\]

The cutoff value \\(\\gamma_{\\mathrm{CL}}\\) must be chosen to match this desired coverage of the confidence set.

Under some conditions, the value of \\(\\gamma_{\\mathrm{CL}}\\) is known analytically for any desired confidence level, and is independent of \\(\\vec{\\mu}\\), which greatly simplifies estimating confidence regions.
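
For example, when Wilks' theorem holds for a single parameter of interest, \\(2(-\\log\\Lambda)\\) follows a \\(\\chi^{2}\\) distribution with one degree of freedom, so the threshold is simply a quantile of that distribution:

\\[ \\gamma_{\\mathrm{CL}} = \\tfrac{1}{2} F^{-1}_{\\chi^{2}_{1}}(\\mathrm{CL}), \\qquad \\gamma_{68.3\\%} \\approx 0.5, \\qquad \\gamma_{95\\%} \\approx 1.92 \\]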

Constructing Frequentist Confidence Regions in Practice

When a single fit is performed by some numerical minimization program and parameter values are reported along with some uncertainty values, they are usually reported as frequentist intervals. The MINUIT minimizer which evaluates likelihood functions has two methods for estimating parameter uncertainties.

These two methods are the most commonly used methods for estimating confidence regions in a fit; they are the minos method and the hessian method. In both cases, Wilks' theorem is assumed to hold at all points in parameter space, such that \\(\\gamma_{\\mathrm{CL}}\\) is independent of \\(\\vec{\\mu}\\).

When \\(\\gamma_{\\mathrm{CL}}\\) is independent of \\(\\vec{\\mu}\\) the problem simplifies to finding the boundaries where \\(-\\log(\\Lambda) = \\gamma_{\\mathrm{CL}}\\). This boundary point is referred to as the \"crossing\", i.e. where \\(-\\log(\\Lambda)\\) crosses the threshold value.

"},{"location":"what_combine_does/fitting_concepts/#the-minos-method-for-estimating-confidence-regions","title":"The Minos method for estimating confidence regions","text":"

In the minos method, once the best fit point \\(\\vec{\\hat{\\mu}}\\) is determined, the confidence region for any parameter \\(\\mu_i\\) can be found by moving away from its best fit value \\(\\hat{\\mu}_i\\). At each value of \\(\\mu_i\\), the other parameters are profiled, and \\(-\\log{\\Lambda}\\) is calculated.

Following this procedure, \\(\\mu_i\\) is searched for the boundary of the confidence regions, where \\(-\\log{\\Lambda} = \\gamma_{\\mathrm{CL}}\\).

The search is performed in both directions, away from the best fit value of the parameter and the two crossings are taken as the borders of the confidence region.

This procedure has to be followed separately for each parameter \\(\\mu_i\\) for which a confidence interval is calculated.

"},{"location":"what_combine_does/fitting_concepts/#the-hessian-method-for-estimating-confidence-regions","title":"The Hessian method for estimating confidence regions","text":"

The Hessian method relies on the second derivatives (i.e. the hessian) of the likelihood at the best fit point.

By assuming that the shape of the likelihood function is well described by its second-order approximation, the values at which \\(-\\log(\\Lambda) = \\gamma_{\\mathrm{CL}}\\) can be calculated analytically, without the need for a search:

\\[ \\mu_i^{\\mathrm{crossing}} - \\hat{\\mu}_i \\propto \\left(\\left.\\frac{\\partial^2{(-\\log\\mathcal{L})}}{\\partial\\mu_i^2}\\right|_{\\vec{\\hat{\\mu}}}\\right)^{-1/2} \\]

By computing and then inverting the full hessian matrix, all individual confidence regions and the full covariance matrix are determined. By construction, this method always reports symmetric confidence intervals, as it assumes that the likelihood is well described by a second order expansion.
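
Concretely, in this second-order (gaussian) approximation the covariance matrix is the inverse of the hessian of the negative log-likelihood evaluated at the best fit point, and the symmetric one-standard-deviation uncertainty on each parameter is read from its diagonal:

\\[ V = H^{-1}, \\qquad H_{ij} = \\left.\\frac{\\partial^{2}(-\\log\\mathcal{L})}{\\partial\\mu_{i}\\partial\\mu_{j}}\\right|_{\\vec{\\hat{\\mu}}}, \\qquad \\sigma_{\\mu_{i}} = \\sqrt{V_{ii}} \\]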

"},{"location":"what_combine_does/fitting_concepts/#bayesian-credibility-regions","title":"Bayesian Credibility Regions","text":"

Often the full posterior probability distribution is summarized in terms of some credible region which contains some specified portion of the posterior probability of the parameter.

\\[ \\{ \\vec{\\mu} \\}_{\\mathrm{CL}} = \\{ \\vec{\\mu} : \\vec{\\mu} \\in \\Omega, \\int_{\\Omega} p(\\vec{\\mu}';\\mathrm{data}) \\mathrm{d}\\vec{\\mu}' = \\mathrm{CL} \\}\\]

The credible region represents a region in which the bayesian probability of the parameter being in that region is equal to the chosen Credibility Level.

"},{"location":"what_combine_does/introduction/","title":"Introduction And Capabilities","text":"

Combine is a tool for making statistical analyses based on a model of expected observations and a dataset. Example statistical analyses are claiming discovery of a new particle or process, setting limits on the existence of new physics, and measuring cross sections.

The package has no physics-specific knowledge; it is completely agnostic to the interpretation of the analysis being performed, but its usage and development are based around common cases in High Energy Physics. This documentation is a description of what combine does and how you can use it to run your analyses.

Roughly, combine does three things:

  1. Helps you to build a statistical model of expected observations;
  2. Runs statistical tests on the model and observed data;
  3. Provides tools for validating, inspecting, and understanding the model and the statistical tests.

Combine can be used for analyses in HEP ranging from simple counting experiments to unfolded measurements, new physics searches, combinations of measurements, and EFT fits.

"},{"location":"what_combine_does/introduction/#model-building","title":"Model Building","text":"

Combine provides a powerful, human-readable, and lightweight interface for building likelihood models for both binned and unbinned data. The likelihood definition allows the user to define many processes which contribute to the observation, as well as multiple channels which may be fit simultaneously.

Furthermore, combine provides a powerful and intuitive interface for combining models, as it was originally developed for combinations of Higgs boson analyses at the CMS experiment.

The interface simplifies many common tasks, while providing many options for customization. Common nuisance parameter types are defined for easy use, while user-defined functions can also be provided. Input histograms defining the model can be provided in ROOT format, or in other tabular formats compatible with pandas.

Custom physics models can be defined in python which determine how the parameters of interest alter the model, and a number of predefined models are provided by default.

A number of tools are also provided for run-time alterations of the model, allowing for straightforward comparisons of alternative models.

"},{"location":"what_combine_does/introduction/#statistical-tests","title":"Statistical Tests","text":"

Combine can be used for statistical tests in frequentist or bayesian frameworks as well as perform some hybrid frequentist-bayesian analysis tasks.

Combine implements various methods for commonly used statistical tests in high energy physics, including for discovery, limit setting, and parameter estimation. Statistical tests can be customized to use various test statistics and confidence levels, as well as providing different output formats.

A number of asymptotic methods, relying on Wilks' theorem and valid in appropriate conditions, are implemented for fast evaluation. Generation of pseudo-data from the model can also be performed, and tests are implemented to automatically run over empirical distributions without relying on asymptotic approximations. Pseudo-data generation and fitting over the pseudo-data can be customized in a number of ways.

"},{"location":"what_combine_does/introduction/#validation-and-inspection","title":"Validation and Inspection","text":"

Combine provides tools for inspecting the model for things like potentially problematic input templates.

Various methods are provided for inspecting the likelihood function and the performance of the fits.

Methods are provided for comparing pre-fit and post-fit results of all values, including nuisance parameters, and summaries of the results can be produced.

Plotting utilities allow the pre- and post-fit model expectations and their uncertainties to be plotted, as well as plotted summaries of debugging steps such as the nuisance parameter values and likelihood scans.

"},{"location":"what_combine_does/model_and_likelihood/","title":"Observation Models and Likelihoods","text":""},{"location":"what_combine_does/model_and_likelihood/#the-observation-model","title":"The Observation Model","text":"

The observation model, \\(\\mathcal{M}(\\vec{\\Phi})\\), defines the probability for any set of observations given specific values of the input parameters of the model \\(\\vec{\\Phi}\\). The probability for any observed data is denoted:

\\[ p_{\\mathcal{M}}(\\mathrm{data}; \\vec{\\Phi} ) \\]

where the subscript \\(\\mathcal{M}\\) is given here to remind us that these are the probabilities according to this particular model (though usually we will omit it for brevity).

Combine is designed for counting experiments, where the number of events with particular features are counted. The events can either be binned, as in histograms, or unbinned, where continuous values are stored for each event. The event counts are assumed to be of independent events, such as individual proton-proton collisions, which are not correlated with each other.

The event-count portion of the model consists of a sum over different processes. The expected observations, \\(\\vec{\\lambda}\\), are then the sum of the expected observations for each of the processes, \\(\\vec{\\lambda} =\\sum_{p} \\vec{\\lambda}_{p}\\).

The model can also be composed of multiple channels, in which case the expected observation is the set of all expected observations from the various channels \\(\\vec{\\lambda} = \\{ \\vec{\\lambda}_{c1}, \\vec{\\lambda}_{c2}, .... \\vec{\\lambda}_{cN}\\}\\).

The model can also include data and parameters related to non-count values, such as the observed luminosity or detector calibration constant. These non-count data are usually considered as auxiliary information which are used to constrain our expectations about the observed event counts.

The full model therefore defines the probability of any given observations over all the channels, given all the processes and model parameters.

Combining full models is possible by combining their channels, assuming that the channels are mutually independent.

A Simple Example

Consider performing an analysis searching for a Higgs boson by looking for events where the Higgs decays into two photons.

The event count data may be binned histograms of the number of events with two photons in different bins of the invariant mass of the photons. The expected counts would include signal contributions from processes where a Higgs boson is produced, as well as background contributions from processes where two photons are produced through other mechanisms, like radiation off a quark. The expected counts may also depend on parameters such as the energy resolution of the measured photons and the total luminosity of collisions being considered in the dataset; these can be parameterized in the model as auxiliary information.

The analysis itself might be split into multiple channels, targeting different Higgs production modes with different event selection criteria. Furthermore, the analysis may eventually be combined with other analyses, such as a measurement targeting Higgs production where the Higgs boson decays into four leptons, rather than two photons.

Combine provides the functionality for building the statistical models and combining all the channels or analyses together into one common analysis.

"},{"location":"what_combine_does/model_and_likelihood/#sets-of-observation-models","title":"Sets of Observation Models","text":"

We are typically not interested in a single model, but in a set of models, parameterized by a set of real numbers representing possible versions of the model.

Model parameters include the parameters of interest ( \\(\\vec{\\mu}\\), those being measured such as a cross section) as well as nuisance parameters (\\(\\vec{\\nu}\\)), which may not be of interest but still affect the model expectation.

Combine provides tools and interfaces for defining the model as pre-defined or user-defined functions of the input parameters. In practice, however, there are a number of most commonly used functional forms which define how the expected events depend on the model parameters. These are discussed in detail in the context of the full likelihood below.

"},{"location":"what_combine_does/model_and_likelihood/#the-likelihood","title":"The Likelihood","text":"

For any given model, \\(\\mathcal{M}(\\vec{\\Phi})\\), the likelihood defines the probability of observing a given dataset. It is numerically equal to the probability of observing the data, given the model.

\\[ \\mathcal{L}_\\mathcal{M}(\\vec{\\Phi}) = p_{\\mathcal{M}}(\\mathrm{data};\\vec{\\Phi}) \\]

Note, however that the likelihood is a function of the model parameters, not the data, which is why we distinguish it from the probability itself.

The likelihood in combine takes the general form:

\\[ \\mathcal{L} = \\mathcal{L}_{\\textrm{primary}} \\cdot \\mathcal{L}_{\\textrm{auxiliary}} \\]

Where \\(\\mathcal{L}_{\\mathrm{primary}}\\) is equal to the probability of observing the event count data for a given set of model parameters, and \\(\\mathcal{L}_{\\mathrm{auxiliary}}\\) represents some external constraints on the parameters. The constraint term may encode constraints from previous measurements (such as Jet Energy Scales) or prior beliefs about the value some parameter in the model should have.

Both \\(\\mathcal{L}_{\\mathrm{primary}}\\) and \\(\\mathcal{L}_{\\mathrm{auxiliary}}\\) can be composed of many sublikelihoods, for example for observations of different bins and constraints on different nuisance parameters.

This form is entirely general. However, as with the model itself, there are typical forms that the likelihood takes which will cover most use cases, and for which combine is primarily designed.

"},{"location":"what_combine_does/model_and_likelihood/#primary-likelihoods-for-binned-data","title":"Primary Likelihoods for binned data","text":"

For a binned likelihood, the probability of observing a certain number of counts, given a model takes on a simple form. For each bin:

\\[ \\mathcal{L}_{\\mathrm{bin}}(\\vec{\\Phi}) = \\mathrm{Poiss}(n_{\\mathrm{obs}}; n_{\\mathrm{exp}}(\\vec{\\Phi})) \\]

i.e. it is a poisson distribution with the mean given by the expected number of events in that bin. The full primary likelihood for binned data is simply the product of each of the bins' likelihoods:

\\[ \\mathcal{L}_\\mathrm{primary} = \\prod_\\mathrm{bins} \\mathcal{L}_\\mathrm{bin}. \\]

This is the underlying likelihood model used for every binned analysis. The freedom in the analysis comes in how \\(n_\\mathrm{exp}\\) depends on the model parameters, and the constraints that are placed on those parameters.
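
As a simple illustration (a toy example, not one of the likelihoods built in the tutorials), for a single counting bin with an expected signal yield \\(s\\) scaled by a signal strength \\(\\mu\\) and a fixed background \\(b\\), the primary likelihood reduces to a single Poisson term:

\\[ \\mathcal{L}_{\\mathrm{primary}}(\\mu) = \\mathrm{Poiss}(n_{\\mathrm{obs}}; \\mu s + b) = \\frac{(\\mu s + b)^{n_{\\mathrm{obs}}}}{n_{\\mathrm{obs}}!}\\, e^{-(\\mu s + b)} \\]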

"},{"location":"what_combine_does/model_and_likelihood/#primary-likelihoods-for-unbinned-data","title":"Primary Likelihoods for unbinned data","text":"

For unbinned likelihood models, a likelihood can be given to each data point. It is proportional to the probability density function at that point, \\(\\vec{x}\\). For the full set of observed data points, information about the total number of data points is also included:

\\[ \\mathcal{L}_\\mathrm{data} = \\mathrm{Poiss}(n_{\\mathrm{obs}} ; n_{\\mathrm{exp}}(\\vec{\\Phi})) \\prod_{i}^{N_{\\mathrm{obs}}} \\mathrm{pdf}(\\vec{x}_i ; \\vec{\\Phi} ) \\]

Where \\(n_{\\mathrm{obs}}\\) and \\(n_{\\mathrm{exp}}\\) are the total number of observed and expected events, respectively. This is sometimes referred to as an 'extended' likelihood, as the probability density has been 'extended' to include information about the total number of observations.

"},{"location":"what_combine_does/model_and_likelihood/#auxiliary-likelihoods","title":"Auxiliary Likelihoods","text":"

The auxiliary likelihood terms encode the probability of model nuisance parameters taking on a certain value, without regards to the primary data. In frequentist frameworks, this usually represents the result of a previous measurement (such as of the jet energy scale). We will write in a mostly frequentist framework, though combine can be used for either frequentist or bayesian analyses[^1].

[^1]: see: the first paragraphs of the PDGs statistics review for more information on these two frameworks

In this framework, each auxiliary term represents the likelihood of some parameter, \\(\\nu\\), given some previous observation \\(y\\); the quantity \\(y\\) is sometimes referred to as a \"global observable\".

\\[ \\mathcal{L}_{\\mathrm{auxiliary}}( \\nu ) = p( y ; \\nu ) \\]

In principle the form of the likelihood can be any function where the corresponding \\(p\\) is a valid probability distribution. In practice, most of the auxiliary terms are gaussian, and the definition of \\(\\nu\\) is chosen such that the central observation \\(y = 0\\) , and the width of the gaussian is one.

Note that on its own, the form of the auxiliary term is not meaningful; what is meaningful is the relationship between the auxiliary term and how the model expectation is altered by the parameter. Any co-ordinate transformation of the parameter values can be absorbed into the definition of the parameter. A reparameterization would change the mathematical form of the auxiliary term, but would also simultaneously change how the model depends on the parameter in such a way that the total likelihood is unchanged. e.g. if you define \\(\\nu = \\sigma(tt)\\) or \\(\\nu = \\sigma(tt) - \\sigma_0\\) you will change the form of the constraint term, but you will not change the overall likelihood.

"},{"location":"what_combine_does/model_and_likelihood/#likelihoods-implemented-in-combine","title":"Likelihoods implemented in Combine","text":"

Combine builds on the generic forms of the likelihood for counting experiments given above to provide specific functional forms which are commonly most useful in high energy physics, such as separating contributions between different processes.

"},{"location":"what_combine_does/model_and_likelihood/#binned-likelihoods-using-templates","title":"Binned Likelihoods using Templates","text":"

Binned likelihood models can be defined by the user by providing simple inputs such as a set of histograms and systematic uncertainties. These likelihood models are referred to as template-based because they rely heavily on histograms as templates for building the full likelihood function.

Here, we describe the details of the mathematical form of these likelihoods. As already mentioned, the likelihood can be written as a product of two parts:

\\[ \\mathcal{L} = \\mathcal{L}_\\mathrm{primary} \\cdot \\mathcal{L}_\\mathrm{auxiliary} = \\prod_{c=1}^{N_c} \\prod_{b=1}^{N_b^c} \\mathrm{Poiss}(n_{cb}; n^\\mathrm{exp}_{cb}(\\vec{\\mu},\\vec{\\nu})) \\cdot \\prod_{e=1}^{N_E} p_e(y_e ; \\nu_e) \\]

Where \\(c\\) indexes the channel, \\(b\\) indexes the histogram bin, and \\(e\\) indexes the nuisance parameter.

"},{"location":"what_combine_does/model_and_likelihood/#model-of-expected-event-counts-per-bin","title":"Model of expected event counts per bin","text":"

The generic model of the expected event count in a given bin, \\(n^\\mathrm{exp}_{cb}\\), implemented in combine for template based analyses is given by:

\\[n^\\mathrm{exp}_{cb} = \\mathrm{max}(0, \\sum_{p} M_{cp}(\\vec{\\mu})N_{cp}(\\nu_G, \\vec{\\nu}_L,\\vec{\\nu}_S,\\vec{\\nu}_{\\rho})\\omega_{cbp}(\\vec{\\nu}_S) + E_{cb}(\\vec{\\nu}_B) ) \\]

where here:

  • \\(p\\) indexes the processes contributing to the channel;
  • \\(\\nu_{G}, \\vec{\\nu}_L, \\vec{\\nu}_S, \\vec{\\nu}_{\\rho}\\) and \\(\\vec{\\nu}_B\\) are different types of nuisance parameters which modify the processes with different functional forms;
    • \\(\\nu_{G}\\) is a gamma nuisance parameter,
    • \\(\\vec{\\nu}_{L}\\) are log-normal nuisances,
    • \\(\\vec{\\nu}_{S}\\) are \"shape\" nuisances,
    • \\(\\vec{\\nu}_{\\rho}\\) are user defined rate parameters, and
    • \\(\\vec{\\nu}_{B}\\) are nuisance parameters related to the statistical uncertainties in the simulation used to build the model.
  • \\(M\\) defines the effect of the parameters of interest on the signal process;
  • \\(N\\) defines the overall normalization effect of the nuisance parameters;
  • \\(\\omega\\) defines the shape effects (i.e. bin-dependent effects) of the nuisance parameters; and
  • \\(E\\) defines the impact of statistical uncertainties from the samples used to derive the histogram templates used to build the model.
"},{"location":"what_combine_does/model_and_likelihood/#parameter-of-interest-model","title":"Parameter of Interest Model","text":"

The function \\(M\\) can take on custom functional forms, as defined by the user, but in the most common case, the parameter of interest \\(\\mu\\) simply scales the contributions from signal processes:

\\[\\label{eq:sig_param} M_{cp}(\\mu) = \\begin{cases} \\mu &\\mathrm{if\\ } p \\in \\mathrm{signal} \\\\ 1 &\\mathrm{otherwise} \\end{cases} \\]

However, combine supports many more models beyond this. As well as built-in support for models with multiple parameters of interest, combine comes with many pre-defined models which go beyond simple process normalization, which are targeted at various types of searches and measurements.

"},{"location":"what_combine_does/model_and_likelihood/#normalization-effects","title":"Normalization Effects","text":"

The overall normalization \\(N\\) is affected differently by the different types of nuisance parameters, and takes the general form

\\[N = \\prod_X \\prod_i f_X(\\vec{\\nu}_{X}^{i})\\mathrm{,}\\]

With \\(X\\) identifying a given nuisance parameter type; i.e. \\(N\\) multiplies together the morphings from each of the individual nuisance parameters from each of the nuisance types.

Normalization Parameterization Details

The full functional form of the normalization term is given by:

\\[ N_{cp} = N_{\\mathrm{0}}(\\nu_{G})\\prod_{n} {\\kappa_{n}}^{\\nu_{L,n}}\\prod_{a} {\\kappa^{\\mathrm{A}}_{a}(\\nu_{L(S)}^{a},\\kappa^{+}_{a}, \\kappa^{-}_{a})}^{\\nu_{L(S)}^{a}} \\prod_{r}F_{r}(\\nu_\\rho) \\]

where:

  • \\(N_{\\mathrm{0}}(\\nu_{G}) \\equiv \\frac{\\nu_{G}}{y_{G}}\\), is the normalization effect of a gamma uncertainty. \\(y_{G}\\) is taken as the observed number of events in some external control region and \\(\\nu_{G}\\) has a constraint pdf \\(\\mathrm{Poiss}(\\nu; y)\\)
  • \\(\\kappa_{n}^{\\nu_{L,n}}\\), are log-normal uncertainties specified by a fixed value \\(\\kappa\\);
  • \\(\\kappa^{\\mathrm{A}}_{a}(\\nu_{L(S)}^{a},\\kappa^{+}_{a}, \\kappa^{-}_{a})^{\\nu_{L(S)}^{a}}\\) are asymmetric log-normal uncertainties, in which the value of \\(\\kappa^{\\mathrm{A}}\\) depends on the nuisance parameter and two fixed values \\(\\kappa^{+}_{a}\\) and \\(\\kappa^{-}_{a}\\). The functions, \\(\\kappa^A\\), define a smooth interpolation for the asymmetric uncertainty; and
  • \\(F_{r}(\\vec{\\nu}_\\rho)\\) are user-defined functions of the user defined nuisance parameters which may have uniform or gaussian constraint terms.

The function for the asymmetric normalization modifier, \\(\\kappa^A\\) is

\\[ \\kappa^{\\mathrm{A}}(\\nu,\\kappa^{+}, \\kappa^{-}) = \\begin{cases} \\kappa^{+}, &\\mathrm{for\\,} \\nu \\geq 0.5 \\\\ \\frac{1}{\\kappa^{-}}, &\\mathrm{for\\,} \\nu \\leq -0.5 \\\\ \\exp\\left(\\frac{1}{2} \\left( (\\ln{\\kappa^{+}}-\\ln{\\kappa^{-}}) + \\frac{1}{4}(\\ln{\\kappa^{+}}+\\ln{\\kappa^{-}})I(\\nu)\\right)\\right), &\\mathrm{otherwise}\\end{cases} \\]

where \\(I(\\nu) = 48\\nu^5 - 40\\nu^3 + 15\\nu\\), which ensures \\(\\kappa^{\\mathrm{A}}\\) and its first and second derivatives are continuous for all values of \\(\\nu\\).

and the \\(\\kappa^{+}\\) and \\(\\kappa^{-}\\) are the relative normalizations of the two systematics variations; i.e.:

\\[ \\kappa^{\\pm}_{s} = \\frac{\\sum_{b}\\omega_{b}^{s,\\pm}}{\\sum_{b}\\omega_{b}^{0}}. \\]

where \\(\\omega_{b}^{s,\\pm}\\) is the bin yield as defined by the two shifted values \\(\\nu_{S} = \\nu_{S}^{\\pm}\\), and \\(\\omega_{b}^{0}\\) is the bin yield at the nominal value \\(\\nu_{S} = 0\\).

"},{"location":"what_combine_does/model_and_likelihood/#shape-morphing-effects","title":"Shape Morphing Effects","text":"

The number of events in a given bin \\(b\\), \\(\\omega_{cbp}\\), is a function of the shape parameters \\(\\vec{\\nu}_{S}\\). The shape interpolation works with the fractional yields in each bin, where the interpolation can be performed either directly on the fractional yield, or on the logarithm of the fractional yield, which is then exponentiated again.

Shape parameterization Details

In the following, the channel and process labels \\(c\\) and \\(p\\) apply to every term, and so are omitted.

The fixed nominal number of events is denoted \\(\\omega_{b}^{0}\\). For each applicable shape uncertainty \\(s\\), two additional predictions are specified, \\(\\omega_{b}^{s,+}\\) and \\(\\omega_{b}^{s,-}\\), typically corresponding to the \\(+1\\sigma\\) and \\(-1\\sigma\\) variations, respectively. These may change both the shape and normalization of the process. The two effects are separated; the shape transformation is constructed in terms of the fractional event counts in the templates via a smooth vertical interpolation, and the normalization is treated as an asymmetric log-normal uncertainty, as described above in the description of the \\(N\\) term in the likelihood.

For a given process, the shape may be interpolated either directly in terms of the fractional bin yields, \\(f_b = \\omega_b / \\sum \\omega_{b}\\) or their logarithms, \\(\\ln(f_b)\\). The transformed yield is then given as, respectively,

\\[ \\omega_{b}(\\vec{\\nu}) = \\begin{cases} \\max\\left(0, y^{0}\\left(f^{0}_{b} + \\sum_{s} F(\\nu_{s}, \\delta^{s,+}_{b}, \\delta^{s,-}_{b}, \\epsilon_{s})\\right)\\right) & \\text{(direct),}\\\\ \\max\\left(0, y^{0}\\exp\\left(\\ln(f^{0}_{b}) + \\sum_{s} F(\\nu_{s}, \\Delta^{s,+}_{b}, \\Delta^{s,-}_{b}, \\epsilon_{s})\\right) \\right) & \\text{(logarithmic)}, \\end{cases} \\]

where \\(y^{0} = \\sum \\omega_{b}^{0}\\), \\(\\delta^{\\pm}_{b} = f^{\\pm}_{b} - f^{0}_{b}\\), and \\(\\Delta^{\\pm}_{b} = \\ln\\left(\\frac{f^{\\pm}_{b}}{f^{0}_{b}}\\right)\\).

The smooth interpolating function \\(F\\), defined below, depends on a set of coefficients, \\(\\epsilon_{s}\\). These are assumed to be unity by default, but may be set to different values, for example if the \\(\\omega_{b}^{s,\\pm}\\) correspond to the \\(\\pm X\\sigma\\) variations, then \\(\\epsilon_{s} = 1/X\\) is typically set. The minimum value of \\(\\epsilon\\) over the shape uncertainties for a given process is \\(q = \\min({{\\epsilon_{s}}})\\). The function \\({F}\\) is then defined as

\\[ F(\\nu, \\delta^{+}, \\delta^{-}, \\epsilon) = \\begin{cases} \\frac{1}{2}\\nu^{'} \\left( (\\delta^{+}-\\delta^{-}) + \\frac{1}{8}(\\delta^{+}+\\delta^{-})(3\\bar{\\nu}^5 - 10\\bar{\\nu}^3 + 15\\bar{\\nu}) \\right), & \\text{for } -q < \\nu' < q; \\\\ \\nu^{'}\\delta^{+}, & \\text{for } \\nu' \\ge q;\\\\ -\\nu^{'}\\delta^{-}, & \\text{for } \\nu' \\le -q;\\\\ \\end{cases} \\]

where \\(\\nu^{'} = \\nu\\epsilon\\), \\(\\bar{\\nu} = \\nu^{'} / q\\), and the label \\(s\\) has been omitted. This function ensures the yield and its first and second derivatives are continuous for all values of \\(\\nu\\).

"},{"location":"what_combine_does/model_and_likelihood/#statistical-uncertainties-in-the-simulation-used-to-build-the-model","title":"Statistical Uncertainties in the Simulation used to build the Model","text":"

Since the histograms used in a binned shape analysis are typically created from simulated samples, the yields in each bin are also subject to statistical uncertainties on the bin yields. These are taken into account by either assigning one nuisance parameter per bin, or as many parameters as contributing processes per bin.

Model Statistical Uncertainty Details

If the uncertainty in each bin is modelled as a single nuisance parameter it takes the form:

\\[ E_{cb}(\\vec{\\mu},\\vec{\\nu},\\nu) = \\nu\\left(\\sum_{p} (e_{cpb}N_{cp}M_{cp}(\\vec{\\mu},\\vec{\\nu}))^{2}\\right)^{\\frac{1}{2}}. \\]

where \\(e_{cpb}\\) is the uncertainty in the bin content for the histogram defining process \\(p\\) in the channel \\(c\\).

Alternatively, one parameter is assigned per process, which may be modelled with either a Poisson or Gaussian constraint pdf:

\\[ E_{cb}(\\vec{\\mu},\\vec{\\nu},\\vec{\\nu}_{\\alpha},\\vec{\\nu}_{\\beta}) = \\sum_{\\alpha}^{\\text{Poisson}} \\left(\\frac{\\nu_{\\alpha}}{\\omega_{\\alpha}} - 1\\right)\\omega_{c\\alpha b}N_{c\\alpha}(\\vec{\\nu})M_{c\\alpha}(\\vec{\\mu},\\vec{\\nu}) + \\sum_{\\beta}^{\\text{Gaussian}} \\nu_{\\beta}e_{c\\beta b}N_{c\\beta}(\\vec{\\nu})M_{c\\beta}(\\vec{\\mu},\\vec{\\nu}), \\]

where the indices \\(\\alpha\\) and \\(\\beta\\) run over the Poisson- and Gaussian-constrained processes, respectively. The parameters \\(\\omega_{\\alpha}\\) represent the nominal unweighted numbers of events and are treated as the external measurements, and \\(N_{cp}\\) and \\(\\omega_{c\\alpha b}\\) are defined as above.

"},{"location":"what_combine_does/model_and_likelihood/#customizing-the-form-of-the-expected-event-counts","title":"Customizing the form of the expected event counts","text":"

Although the above likelihood defines some specific functional forms, users are also able to implement custom functional forms for \\(M\\), \\(N\\), and \\(\\omega_{cbp}\\). In practice, this makes the functional form much more general than the default forms used above.

However, some constraints do exist, such as the requirement that bin contents be positive and that the function \\(M\\) depends only on \\(\\vec{\\mu}\\), whereas \\(N\\) and \\(\\omega_{cbp}\\) depend only on \\(\\vec{\\nu}\\).

"},{"location":"what_combine_does/model_and_likelihood/#auxiliary-likelihood-terms","title":"Auxiliary Likelihood terms","text":"

The auxiliary constraint terms implemented in combine are Gaussian, Poisson or Uniform:

\\[ p_{e} \\propto \\exp{\\left(-0.5 \\left(\\frac{(\\nu_{e} - y_{e})}{\\sigma}\\right)^2 \\right)}\\mathrm{;~} \\\\ p_{e} = \\mathrm{Poiss}( \\nu_{e}; y_{e} ) \\mathrm{;\\ or~} \\\\ p_{e} \\propto \\mathrm{constant\\ (on\\ some\\ interval\\ [a,b])}. \\]

The form of the constraint term depends on the type of nuisance parameter:

  • The shape (\\(\\vec{\\nu}_{S}\\)) and log-normal (\\(\\vec{\\nu}_{L}\\)) nuisance parameters always use Gaussian constraint terms;
  • The gamma (\\(\\vec{\\nu}_{G}\\)) nuisance parameters always use Poisson constraints;
  • The rate parameters (\\(\\vec{\\nu}_{\\rho}\\)) may have either Gaussian or Uniform constraints; and
  • The model statistical uncertainties (\\(\\vec{\\nu}_{B}\\)) may use Gaussian or Poisson constraints.

While combine does not provide functionality for user-defined auxiliary pdfs, the effect of the nuisance parameters is highly customizable through the form of the dependence of \\(n^\\mathrm{exp}_{cb}\\) on the parameter.

"},{"location":"what_combine_does/model_and_likelihood/#overview-of-the-template-based-likelihood-model-in-combine","title":"Overview of the template-based likelihood model in Combine","text":"

An overview of the binned likelihood model built by combine is given below. Note that \\(M_{cp}\\) can be chosen by the user from a set of predefined models, or defined by the user themselves.

"},{"location":"what_combine_does/model_and_likelihood/#parametric-likelihoods-in-combine","title":"Parametric Likelihoods in Combine","text":"

As with the template-based likelihood, the parametric likelihood implemented in combine supports multiple processes and multiple channels. Unlike the template likelihoods, the parametric likelihoods are defined using custom probability density functions, which are functions of continuous observables rather than discrete, binned counts. Because the pdfs are functions of a continuous variable, the likelihood can be evaluated over unbinned data; they can also still be used for analyses of binned data.

The unbinned model implemented in combine is given by:

\\[ \\mathcal{L} = \\mathcal{L}_\\mathrm{primary} \\cdot \\mathcal{L}_\\mathrm{auxiliary} = \\\\ \\left(\\prod_c \\mathrm{Poiss}(n_{c,\\mathrm{tot}}^{\\mathrm{obs}} ; n_{c,\\mathrm{tot}}^{\\mathrm{exp}}(\\vec{\\mu},\\vec{\\nu})) \\prod_{i}^{n_c^{\\mathrm{obs}}} \\sum_p f_{cp}^{\\mathrm{exp}} \\mathrm{pdf}_{cp}(\\vec{x}_i ; \\vec{\\mu}, \\vec{\\nu} ) \\right) \\cdot \\prod_e p_e( y_e ; \\nu_e) \\]

where \\(c\\) indexes the channel, \\(p\\) indexes the process, and \\(e\\) indexes the nuisance parameter.

  • \\(n_{c,\\mathrm{tot}}\\) is the total number of expected events in channel \\(c\\);
  • \\(\\mathrm{pdf}_{cp}\\) are user-defined probability density functions, which may take on the form of any valid probability density; and
  • \\(f_{cp}^{\\mathrm{exp}}\\) is the fraction of the total events in channel \\(c\\) from process \\(p\\), \\(f_{cp} = \\frac{n_{cp}}{\\sum_p n_{cp}}\\).

For parametric likelihoods on binned data, the data likelihood is first converted into the binned-data likelihood format before evaluation, i.e.

\\[ \\mathcal{L} = \\prod_c \\prod_b \\mathrm{Poiss}(n_{cb}^{\\mathrm{obs}}; n_{cb}^{\\mathrm{exp}}) \\prod_e p_e( y_e ; \\nu_e) \\]

where \\(n^\\mathrm{exp}\\) is calculated from the input pdf and normalization, based on the model parameters.
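As an illustration of this conversion, the expected count in each bin can be obtained by integrating the normalized pdf over the bin and scaling by the total expected yield. The following is a generic numpy/scipy sketch with made-up numbers, not the actual Combine implementation:

import numpy as np
from scipy.stats import norm

# Hypothetical single-process example: a Gaussian pdf in an observable x with a
# total expected yield of 50 events, evaluated on a uniform binning of x.
bin_edges = np.linspace(100.0, 180.0, 17)
n_tot_exp = 50.0
cdf = norm(loc=125.0, scale=2.0).cdf
n_exp_per_bin = n_tot_exp * (cdf(bin_edges[1:]) - cdf(bin_edges[:-1]))
print(n_exp_per_bin.sum())  # close to 50, up to the small tails outside the binned range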

"},{"location":"what_combine_does/model_and_likelihood/#model-of-expected-event-counts","title":"Model of expected event counts","text":"

The total number of expected events is modelled as:

\\[n_{c,\\mathrm{tot}}^\\mathrm{exp} = \\mathrm{max}(0, \\sum_{p} n^{cp}_0 M_{cp}(\\vec{\\mu})N_{cp}(\\nu_{G},\\vec{\\nu}_L,\\vec{\\nu}_{\\rho})) \\]

where \\(n^{cp}_0\\) is a default normalization for the process and, as in the binned case, \\(\\nu_G\\), \\(\\vec{\\nu}_L\\), and \\(\\vec{\\nu}_{\\rho}\\) are the different types of nuisance parameters which modify the process normalizations with different functional forms.

Details of Process Normalization

As in the template-based case, the different types of nuisance parameters affecting the process normalizations are:

  • \\(\\nu_{G}\\) is a gamma nuisance, with linear normalization effects and a Poisson constraint term.
  • \\(\\vec{\\nu}_{L}\\) are log-normal nuisances, with log-normal normalization effects and Gaussian constraint terms.
  • \\(\\vec{\\nu}_{\\rho}\\) are user-defined rate parameters, with user-defined normalization effects and Gaussian or uniform constraint terms.
  • \\(N\\) defines the overall normalization effect of the nuisance parameters;

and \\(N\\) is defined as in the template-based case, except that there are no \\(\\vec{\\nu}_S\\) uncertainties.

\\[ N_{cp} = N_{\\mathrm{0}}(\\nu_{G})\\prod_{n} {\\kappa_{n}}^{\\nu_{L,n}}\\prod_{a} {\\kappa^{\\mathrm{A}}_{a}(\\nu_{L}^{a},\\kappa^{+}_{a}, \\kappa^{-}_{a})}^{\\nu_{L}^{a}} \\prod_{r}F_{r}(\\nu_\\rho) \\]

The function \\(F_{r}\\) is any user-defined mathematical expression. The functions \\(\\kappa(\\nu,\\kappa^+,\\kappa^-)\\) are defined to create smooth asymmetric log-normal uncertainties. The details of the interpolations which are used are found in the section on normalization effects in the binned model.
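As a rough sketch of the symmetric part of this product (the asymmetric \\(\\kappa^{\\mathrm{A}}\\) interpolation and the gamma term \\(N_{0}(\\nu_{G})\\) are omitted for brevity, and all names and numbers below are illustrative):

def normalization_factor(kappas, nus, rate_param_factors=()):
    # Product of symmetric log-normal factors kappa**nu, times any user-defined
    # rate-parameter functions F_r (passed here already evaluated).
    n = 1.0
    for kappa, nu in zip(kappas, nus):
        n *= kappa ** nu
    for f in rate_param_factors:
        n *= f
    return n

# e.g. a 5% and a 20% log-normal uncertainty, pulled to +1 sigma and -0.5 sigma
print(normalization_factor([1.05, 1.20], [1.0, -0.5]))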

"},{"location":"what_combine_does/model_and_likelihood/#parameter-of-interest-model_1","title":"Parameter of Interest Model","text":"

As in the template-based case, the parameter of interest model, \\(M_{cp}(\\vec{\\mu})\\), can take on different forms defined by the user. The default model is one where \\(\\vec{\\mu}\\) simply scales the signal processes' normalizations.

"},{"location":"what_combine_does/model_and_likelihood/#shape-morphing-effects_1","title":"Shape Morphing Effects","text":"

The user may define any number of nuisance parameters which morph the shape of the pdf according to functional forms defined by the user. These nuisance parameters are included as \\(\\vec{\\nu}_\\rho\\) uncertainties, which may have Gaussian or uniform constraints, and include user-defined process normalization effects.

"},{"location":"what_combine_does/model_and_likelihood/#combining-template-based-and-parametric-likelihoods","title":"Combining template-based and parametric Likelihoods","text":"

While we presented the likelihoods for the template-based and parametric models separately, they can also be combined into a single likelihood by treating them as separate channels. When combining the models, the data likelihoods of the binned and unbinned channels are multiplied.

\\[ \\mathcal{L}_{\\mathrm{combined}} = \\mathcal{L}_{\\mathrm{primary}} \\cdot \\mathcal{L}_\\mathrm{auxiliary} = \\left(\\prod_{c_\\mathrm{template}} \\mathcal{L}_{\\mathrm{primary}}^{c_\\mathrm{template}}\\right) \\left(\\prod_{c_\\mathrm{parametric}} \\mathcal{L}_{\\mathrm{primary}}^{c_\\mathrm{parametric}}\\right)\\cdot \\mathcal{L}_{\\mathrm{auxiliary}} \\]"},{"location":"what_combine_does/model_and_likelihood/#references-and-external-literature","title":"References and External Literature","text":"
  • See the Particle Data Group's Review of Statistics for various fundamental concepts used here.
  • The Particle Data Group's Review of Probability also has definitions of commonly used distributions, some of which are used here.
"},{"location":"what_combine_does/statistical_tests/","title":"Statistical Tests","text":"

Combine is a likelihood-based statistical tool, meaning that it uses the likelihood function to define statistical tests.

Combine provides a number of customization options for each test; as always, it is up to the user to choose an appropriate test and options.

"},{"location":"what_combine_does/statistical_tests/#general-framework","title":"General Framework","text":""},{"location":"what_combine_does/statistical_tests/#statistical-tests_1","title":"Statistical tests","text":"

Combine implements a number of different customizable statistical tests. These tests can be used for purposes such as determining the significance of some new physics model over the standard model, setting limits, estimating parameters, and checking goodness of fit.

These tests are all performed on a given model (null hypothesis), and often require additional specification of an alternative model. The statistical test then typically requires defining some \"test statistic\", \\(t\\), which is simply any real-valued function of the observed data:

\\[ t(\\mathrm{data}) \\in \\mathbb{R} \\]

For example, in a simple coin-flipping experiment, the number of heads could be used as the test statistic.

The distribution of the test statistic should be estimated under the null hypothesis (and the alternative hypothesis, if applicable). The value of the test statistic computed on the actual observed data, \\(t_{\\mathrm{obs}}\\), is then compared with its expected distribution under the relevant hypotheses.

This comparison, which depends on the test in question, defines the results of the test, which may be simple binary results (e.g. this model point is rejected at a given confidence level), or continuous (e.g. defining the degree to which the data are considered surprising, given the model). Often, as either a final result or as an intermediate step, the p-value of the observed test statistic under a given hypothesis is calculated.

How p-values are calculated

The distribution of the test statistic, \\(t\\) under some model hypothesis \\(\\mathcal{M}\\) is:

\\[t \\stackrel{\\mathcal{M}}{\\sim} D_{\\mathcal{M}}\\]

The observed value of the test statistic is \\(t_{\\mathrm{obs}}\\). The p-value of the observed result gives the probability of having observed a test statistic at least as extreme as the actual observation. For example, this may be:

\\[p = \\int_{t_{\\mathrm{min}}}^{t_\\mathrm{obs}} D_{\\mathcal{M}} \\mathrm{d}t\\]

In some cases, the bounds of the integral may be modified, such as \\(( t_{\\mathrm{obs}}, t_{\\mathrm{max}} )\\) or \\((-t_{\\mathrm{obs}}, t_{\\mathrm{obs}} )\\), depending on the details of the test being performed; specifically, on whether an observed value in the right tail, left tail, or either tail of the distribution in question is considered unexpected.

The p-values using the left tail and the right tail are related to each other via \\(p_{\\mathrm{left}} = 1 - p_{\\mathrm{right}}\\).
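As a simple illustration, a right-tail p-value can be estimated from toy values of the test statistic by counting the fraction of toys at least as extreme as the observation (a generic sketch, not Combine code; the numbers are made up):

import numpy as np

def right_tail_p_value(t_obs, t_toys):
    # Fraction of toy test-statistic values at least as large as the observed one.
    t_toys = np.asarray(t_toys, dtype=float)
    return np.count_nonzero(t_toys >= t_obs) / len(t_toys)

rng = np.random.default_rng(0)
toys = rng.chisquare(df=1, size=10000)  # stand-in for toy test-statistic values
print(right_tail_p_value(3.84, toys))   # roughly 0.05 for a chi-squared(1) distribution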

"},{"location":"what_combine_does/statistical_tests/#test-statistics","title":"Test Statistics","text":"

The test statistic can be any real-valued function of the data. While in principle many valid test statistics can be used, the choice of test statistic is very important, as it influences the power of the statistical test.

By associating a single real value with every observation, the test statistic allows us to recast the question \"how likely was this observation?\" in the form of a quantitative question about the value of the test statistic. Ideally a good test statistic should return different values for likely outcomes as compared to unlikely outcomes and the expected distributions under the null and alternate hypotheses should be well-separated.

In many situations, extremely useful test statistics, sometimes optimal ones for particular tasks, can be constructed from the likelihood function itself:

\\[ t(\\mathrm{data}) = f(\\mathcal{L}) \\]

Even for a given statistical test, several likelihood-based test-statistics may be suitable, and for some tests combine implements multiple test-statistics from which the user can choose.

"},{"location":"what_combine_does/statistical_tests/#tests-with-likelihood-ratio-test-statistics","title":"Tests with Likelihood Ratio Test Statistics","text":"

The likelihood function itself often forms a good basis for building test statistics.

Typically the absolute value of the likelihood itself is not very meaningful as it depends on many fixed aspects we are usually not interested in on their own, like the size of the parameter space and the number of observations. However, quantities such as the ratio of the likelihood at two different points in parameter space are very informative about the relative merits of those two models.

"},{"location":"what_combine_does/statistical_tests/#the-likelihood-ratio-and-likelihood-ratio-based-test-statistics","title":"The likelihood ratio and likelihood ratio based test statistics","text":"

A very useful test statistic is the likelihood ratio of two models:

\\[ \\Lambda \\equiv \\frac{\\mathcal{L}_{\\mathcal{M}}}{\\mathcal{L}_{\\mathcal{M}'}} \\]

For technical and convenience reasons, often the negative logarithm of the likelihood ratio is used:

\\[t \\propto -\\log(\\Lambda) = \\log(\\mathcal{L}_{\\mathcal{M}'}) - \\log(\\mathcal{L}_{\\mathcal{M}})\\]

Different proportionality constants are most convenient in different circumstances. The negative sign is used by convention, since the ratios are usually constructed so that the larger likelihood value is in the denominator. This way, \\(t\\) is positive, and larger values of \\(t\\) represent larger differences between the likelihoods of the two models.
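For concreteness, a hypothetical single-bin counting experiment gives a minimal example of such a negative log likelihood ratio (all numbers are invented for illustration):

import math

def nll_poisson(n_obs, n_exp):
    # Negative log of a Poisson likelihood, dropping the log(n_obs!) term,
    # which cancels in any likelihood ratio.
    return n_exp - n_obs * math.log(n_exp)

n_obs, b, s = 12, 7.0, 5.0
# -log(Lambda) with the background-only model in the numerator and the
# better-fitting signal-plus-background model in the denominator
t = nll_poisson(n_obs, b) - nll_poisson(n_obs, s + b)
print(t)  # ~1.47: positive, since the data favour the signal-plus-background model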

"},{"location":"what_combine_does/statistical_tests/#sets-of-test-statistics","title":"Sets of test statistics","text":"

If the parameters of both likelihoods in the ratio are fixed to a single value, then that defines a single test statistic. Often, however, we are interested in testing \"sets\" of models, parameterized by some set of values \\((\\vec{\\mu}, \\vec{\\nu})\\).

This is important in limit setting for example, where we perform statistical tests to exclude entire ranges of the parameter space.

In these cases, the likelihood ratio (or a function of it) can be used to define a set of test statistics parameterized by the model parameters. For example, a very useful set of test statistics is:

\\[ t_{\\vec{\\mu}} \\propto -\\log\\left(\\frac{\\mathcal{L}(\\vec{\\mu})}{\\mathcal{L}(\\vec{\\hat{\\mu}})}\\right) \\]

Here the likelihood parameters in the denominator are fixed to their maximum likelihood values, while the parameter \\(\\vec{\\mu}\\) indexing the test statistic appears in the numerator of the likelihood ratio.

When calculating the p-values for these statistical tests, the p-values are calculated at each point in parameter space using the test statistic for that point. In other words, the observed and expected distributions of the test statistics are computed separately at each parameter point \\(\\vec{\\mu}\\) being considered.

"},{"location":"what_combine_does/statistical_tests/#expected-distributions-of-likelihood-ratio-test-statistics","title":"Expected distributions of likelihood ratio test statistics","text":"

Under appropriate conditions, the distribution of \\(t_{\\vec{\\mu}}\\) can be approximated analytically, via Wilks' theorem or other extensions of that work. Then, the p-value of the observed test statistic can be calculated from the known form of the expected distribution. This is also true for a number of the other test statistics derived from the likelihood ratio, where asymptotic approximations have been derived.

Combine provides asymptotic methods for limit setting, significance tests, and computing confidence intervals, which make use of these approximations for fast calculations.

In the general case, however, the distribution of the test statistic is not known, and it must be estimated. Typically it is estimated by generating many sets of pseudo-data from the model and using the empirical distribution of the test statistic.

Combine also provides methods for limit setting, significance tests, and computing confidence intervals which use pseudodata generation to estimate the expected test-statistic distributions, and therefore don't depend on the asymptotic approximation. Methods are also provided for generating pseudodata without running a particular test, which can be saved and used for estimating expected distributions.

"},{"location":"what_combine_does/statistical_tests/#parameter-estimation-using-the-likelihood-ratio","title":"Parameter Estimation using the likelihood ratio","text":"

A common use case for likelihood ratios is estimating the values of some parameters, such as the parameters of interest, \\(\\vec{\\mu}\\). The point estimate for the parameters is simply the maximum likelihood estimate, but the likelihood ratio can be used for estimating the uncertainty as a confidence region.

A confidence region for the parameters \\(\\vec{\\mu}\\) can be defined by using an appropriate test statistic. Typically, we use the profile likelihood ratio:

\\[ t_{\\vec{\\mu}} \\propto -\\log\\left(\\frac{\\mathcal{L}(\\vec{\\mu},\\vec{\\hat{\\nu}}(\\vec{\\mu}))}{\\mathcal{L}(\\vec{\\hat{\\mu}},\\vec{\\hat{\\nu}})}\\right) \\]

Here the likelihood in the numerator is the value of the likelihood at a point \\(\\vec{\\mu}\\), profiled over \\(\\vec{\\nu}\\), and the likelihood in the denominator is evaluated at the best-fit point.

Then the confidence region can be defined as the region where the p-value of the observed test-statistic is less than the confidence level:

\\[ \\{ \\vec{\\mu}_{\\mathrm{CL}} \\} = \\{ \\vec{\\mu} : p_{\\vec{\\mu}} \\lt \\mathrm{CL} \\}.\\]

This construction will satisfy the frequentist coverage property that the confidence region contains the parameter values used to generate the data in \\(\\mathrm{CL}\\) fraction of cases.

In many cases, Wilks' theorem can be used to calculate the p-value, and the criterion on \\(p_{\\vec{\\mu}}\\) can be converted directly into a criterion on \\(t_{\\vec{\\mu}}\\) itself, \\(t_{\\vec{\\mu}} \\lt \\gamma_{\\mathrm{CL}}\\), where \\(\\gamma_{\\mathrm{CL}}\\) is a known function of the confidence level which depends on the parameter space being considered.
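For example, when Wilks' theorem applies and the test statistic is normalized as \\(t_{\\vec{\\mu}} = -2\\log\\Lambda\\), \\(\\gamma_{\\mathrm{CL}}\\) is a quantile of a \\(\\chi^{2}\\) distribution with as many degrees of freedom as there are parameters of interest. A short scipy sketch (illustrative only):

from scipy.stats import chi2

def wilks_threshold(cl, n_poi):
    # Threshold gamma_CL on t = -2 log(profile likelihood ratio), assuming
    # Wilks' theorem holds (chi-squared with n_poi degrees of freedom).
    return chi2.ppf(cl, df=n_poi)

print(wilks_threshold(0.683, 1))  # ~1.0, the familiar one-parameter 68% interval
print(wilks_threshold(0.95, 1))   # ~3.84
print(wilks_threshold(0.95, 2))   # ~5.99 for a two-parameter confidence region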

"},{"location":"what_combine_does/statistical_tests/#discoveries-using-the-likelihood-ratio","title":"Discoveries using the likelihood ratio","text":"

A common method for claiming discovery is based on a likelihood ratio test by showing that the new physics model has a \"significantly\" larger likelihood than the standard model.

This could be done by using the standard profile likelihood ratio test statistic:

\\[ t_{\\mathrm{NP}} = -2\\log\\left(\\frac{\\mathcal{L}(\\mu_{\\mathrm{NP}} = 0, \\vec{\\hat{\\nu}}(\\mu_{\\mathrm{NP}} = 0))}{\\mathcal{L}(\\hat{\\mu}_{\\mathrm{NP}},\\vec{\\hat{\\nu}})}\\right) \\]

Here \\(\\mu_{\\mathrm{NP}}\\) represents the strength of some new physics process, such as the cross section for production of a new particle. However, this would also allow for claiming \"discovery\" in cases where the best fit value is negative, i.e. \\(\\hat{\\mu} \\lt 0\\), which in particle physics often corresponds to an unphysical model, such as a negative cross section. In order to avoid such a situation, we typically use a modified test statistic:

\\[ q_{0} = \\begin{cases} 0 & \\hat{\\mu} \\lt 0 \\\\ -2\\log\\left(\\frac{\\mathcal{L}(\\mu_{\\mathrm{NP}} = 0)}{\\mathcal{L}(\\hat{\\mu}_{\\mathrm{NP}})}\\right) & \\hat{\\mu} \\geq 0 \\end{cases} \\]

which excludes the possibility of claiming discovery when the best fit value of \\(\\mu\\) is negative.

As with the likelihood ratio test statistic, \\(t\\), defined above, under suitable conditions, analytic expressions for the distribution of \\(q_0\\) are known.

Once the value \\(q_{0}(\\mathrm{data})\\) is calculated, it can be compared to the expected distribution of \\(q_{0}\\) under the standard model hypothesis to calculate the p-value. If the p-value is below some threshold, discovery is often claimed. In high-energy physics the standard threshold is \\(\\sim 5\\times10^{-7}\\).
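As an illustration (again a toy single-bin counting example with a known background, not Combine itself), \\(q_{0}\\) can be computed as follows; in the asymptotic regime the significance is approximately \\(\\sqrt{q_{0}}\\):

import math

def nll_poisson(n_obs, n_exp):
    return n_exp - n_obs * math.log(n_exp)

def q0_counting(n_obs, b):
    # Single-bin counting experiment with known background b; the best-fit
    # signal yield is n_obs - b, so the denominator likelihood has n_exp = n_obs.
    if n_obs <= b:
        return 0.0  # best-fit signal would be negative, so q0 is set to 0
    return 2.0 * (nll_poisson(n_obs, b) - nll_poisson(n_obs, n_obs))

q = q0_counting(25, 10.0)
print(q, math.sqrt(q))  # q0 ~ 15.8, i.e. a significance of roughly 4 sigma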

"},{"location":"what_combine_does/statistical_tests/#limit-setting-using-the-likelihood-ratio","title":"Limit Setting using the likelihood ratio","text":"

Various test statistics built from likelihood ratios can be used for limit setting, i.e. excluding some parameter values.

One could set limits on a parameter \\(\\mu\\) by finding the values of \\(\\mu\\) that are outside the confidence regions defined above by using the likelihood ratio test statistic:

\\[ t_{\\mu} = -2\\log\\left(\\frac{\\mathcal{L}(\\mu)}{\\mathcal{L}(\\hat{\\mu})}\\right) \\]

However, this could \"exclude\" \\(\\mu = 0\\) or small values of \\(\\mu\\) at a typical limit setting confidence level, such as 95%, while still not claiming a discovery. This is considered undesirable, and often we only want to set upper limits on the value of \\(\\mu\\), rather than excluding any possible set of parameters outside our chosen confidence interval.

This can be done using a modified test statistic:

\\[ \\tilde{t}_{\\mu} = -2\\log\\left(\\frac{\\mathcal{L}(\\mu)}{\\mathcal{L}(\\min(\\mu,\\hat{\\mu}))}\\right) = \\begin{cases} -2\\log\\left(\\frac{\\mathcal{L}(\\mu)}{\\mathcal{L}(\\hat{\\mu})}\\right)& \\hat{\\mu} \\lt \\mu \\\\ 0 & \\mu \\leq \\hat{\\mu} \\end{cases} \\]

However, this can also have undesirable properties when the best fit value, \\(\\hat{\\mu}\\), is less than 0. In that case, we may set limits below 0. In order to avoid these situations, another modified test statistic can be used:

\\[ \\tilde{q}_{\\mu} = \\begin{cases} -2\\log\\left(\\frac{\\mathcal{L}(\\mu)}{\\mathcal{L}(\\mu = 0)}\\right)& \\hat{\\mu} \\lt 0 \\\\ -2\\log\\left(\\frac{\\mathcal{L}(\\mu)}{\\mathcal{L}(\\hat{\\mu})}\\right)& 0 \\lt \\hat{\\mu} \\lt \\mu \\\\ 0& \\mu \\lt \\hat{\\mu} \\end{cases} \\]

This test statistic also has a known distribution under appropriate conditions, or it can be estimated from pseudo-experiments. One can then set a limit at a given confidence level, \\(\\mathrm{CL}\\), by finding the value of \\(\\mu\\) for which \\(p_{\\mu} \\equiv p(t_{\\mu}(\\mathrm{data});\\mathcal{M}_{\\mu}) = 1 - \\mathrm{CL}\\). Larger values of \\(\\mu\\) will have smaller p-values and are considered excluded at the given confidence level.

However, this procedure is rarely used; in almost every case we use a modified procedure based on the \\(\\mathrm{CL}_{s}\\) criterion, explained below.

"},{"location":"what_combine_does/statistical_tests/#the-cls-criterion","title":"The CLs criterion","text":"

Regardless of which of these test statistics is used, the standard test-methodology has some undesirable properties for limit setting.

Even for an experiment with almost no sensitivity to new physics, 5% of the time the experiment is performed the experimenter is expected to find \\(p_{\\mu} \\lt 0.05\\) for small values of \\(\\mu\\), and to set limits on parameter values to which the experiment is not sensitive!

In order to avoid such situations, the \\(\\mathrm{CL}_{s}\\) criterion was developed, as explained in these two papers. Rather than requiring \\(p_{\\mu} \\lt (1-\\mathrm{CL})\\) to exclude \\(\\mu\\), as would be done in the general framework described above, the \\(\\mathrm{CL}_{s}\\) criterion requires:

\\[ \\frac{p_{\\mu}}{1-p_{b}} \\lt (1-\\mathrm{CL}) \\]

Where \\(p_{\\mu}\\) is the usual probability of observing the observed value of the test statistic under the signal + background model with signal strength \\(\\mu\\), and \\(p_{b}\\) is the p-value for the background-only hypothesis, with the p-value defined using the opposite tail from the definition of \\(p_{\\mu}\\).

Using the \\(\\mathrm{CL}_{s}\\) criterion fixes the issue of setting limits much stricter than the experimental sensitivity, because for values of \\(\\mu\\) to which the experiment is not sensitive the distribution of the test statistic under the signal hypothesis is nearly the same as under the background hypothesis. Therefore, given the use of opposite tails in the p-value definition, \\(p_{\\mu} \\approx 1-p_{b}\\), and the ratio approaches 1.

Note that this means that a limit set using the \\(\\mathrm{CL}_{s}\\) criterion at a given \\(\\mathrm{CL}\\) will exclude the true parameter value \\(\\mu\\) with a frequency less than the nominal rate of \\(1-\\mathrm{CL}\\). The actual frequency at which it is excluded depends on the sensitivity of the experiment to that parameter value.
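A minimal sketch of how the \\(\\mathrm{CL}_{s}\\) ratio could be formed from toy distributions of a test statistic is given below (purely illustrative; in practice Combine performs such toy-based calculations itself, for example in the HybridNew method):

import numpy as np

def cls_value(q_obs, q_toys_sb, q_toys_b):
    # CLs = p_mu / (1 - p_b), with p_mu the right-tail p-value under the
    # signal-plus-background hypothesis and p_b defined with the opposite tail
    # under the background-only hypothesis.
    q_toys_sb = np.asarray(q_toys_sb, dtype=float)
    q_toys_b = np.asarray(q_toys_b, dtype=float)
    p_mu = np.count_nonzero(q_toys_sb >= q_obs) / len(q_toys_sb)
    one_minus_p_b = np.count_nonzero(q_toys_b >= q_obs) / len(q_toys_b)
    if one_minus_p_b == 0:
        return 0.0  # no background toy is as extreme; more toys would be needed
    return p_mu / one_minus_p_b

# A value of mu is excluded at confidence level CL if cls_value(...) < 1 - CL.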

"},{"location":"what_combine_does/statistical_tests/#goodness-of-fit-tests-using-the-likelihood-ratio","title":"Goodness of fit tests using the likelihood ratio","text":"

The likelihood ratio can also be used as a measure of goodness of fit, i.e. a measure of how well the model describes the observed binned data.

A standard likelihood-based measure of the goodness of fit is determined by using the log likelihood ratio with the likelihood in the denominator coming from the saturated model.

\\[ t_{\\mathrm{saturated}} \\propto -\\log\\left(\\frac{\\mathcal{L}_{\\mathcal{M}}}{\\mathcal{L}_{\\mathcal{M}_\\mathrm{saturated}}}\\right) \\]

Here \\(\\mathcal{M}\\) is whatever model one is testing the goodness of fit for, and the saturated model is a model for which the prediction matches the observed value in every bin. Typically, the saturated model would be one in which there are as many free parameters as bins.

This ratio therefore compares how well the actual data are fit by the model with a hypothetical optimal fit.

Unfortunately, the distribution of \\(t_{\\mathrm{saturated}}\\) is usually not known a priori and has to be estimated by generating pseudodata from the model \\(\\mathcal{L}\\) and calculating the empirical distribution of the statistic.

Once the distribution is determined, a p-value for the statistic can be derived which indicates the probability of observing data with that quality of fit given the model, and therefore serves as a measure of the goodness of fit.
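For Poisson-distributed bin counts, twice the negative logarithm of this ratio takes a simple closed form (often referred to as the Baker-Cousins chi-square); a small illustrative sketch with invented bin contents:

import numpy as np

def saturated_gof(n_obs, n_exp):
    # Twice the log likelihood ratio between the saturated model (prediction equal
    # to the observation in every bin) and the model prediction, for Poisson bins.
    n_obs = np.asarray(n_obs, dtype=float)
    n_exp = np.asarray(n_exp, dtype=float)
    safe = np.where(n_obs > 0, n_obs, 1.0)  # the n*log(n) term vanishes for empty bins
    return 2.0 * np.sum(n_exp - n_obs + n_obs * np.log(safe / n_exp))

print(saturated_gof([12, 5, 0, 8], [10.0, 6.0, 1.0, 7.5]))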

"},{"location":"what_combine_does/statistical_tests/#channel-compatibility-test-using-the-likelihood-ratio","title":"Channel Compatibility test using the likelihood ratio","text":"

When performing an analysis across many different channels (for example, different Higgs decay modes), it is often interesting to check the level of compatibility of the various channels.

Combine implements a channel compatibility test by considering a model, \\(\\mathcal{M}_{\\mathrm{c-independent}}\\), in which the signal is independent in every channel. As a test statistic, this test uses the likelihood ratio between the best-fit value of the nominal model and the model with an independent signal strength for each channel:

\\[ t = -\\log\\left(\\frac{\\mathcal{L}_{\\mathcal{M}}(\\vec{\\hat{\\mu}},\\vec{\\hat{\\nu}})}{\\mathcal{L}_{\\mathcal{M}_{\\mathrm{c-indep}}}(\\vec{\\hat{\\mu}}_{c1}, \\vec{\\hat{\\mu}}_{c2}, ..., \\vec{\\hat{\\nu}})}\\right) \\]

The distribution of the test statistic is not known a priori, and needs to be calculated by generating pseudo-data samples.

"},{"location":"what_combine_does/statistical_tests/#other-statistical-tests","title":"Other Statistical Tests","text":"

While combine is a likelihood-based statistical framework, it does not require that all statistical tests use the likelihood ratio.

"},{"location":"what_combine_does/statistical_tests/#other-goodness-of-fit-tests","title":"Other Goodness of Fit Tests","text":"

As well as the saturated goodness of fit test, defined above, combine implements Kolmogorov-Smirnov and Anderson-Darling goodness of fit tests.

For the Kolmogorov-Smirnov (KS) test, the test statistic is the maximum absolute difference between the cumulative distribution functions of the data and the model:

\\[ D = \\max_{x} | F_{\\mathcal{M}}(x) - F_{\\mathrm{data}}(x) | \\]

where \\(F(x)\\) is the cumulative distribution function (i.e. the cumulative sum) of the model or the data evaluated at the point \\(x\\).
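For binned distributions, these cumulative distribution functions reduce to normalized cumulative sums over the bins; the following is a small illustrative sketch of the KS statistic in that form (names and numbers are made up):

import numpy as np

def ks_statistic_binned(n_obs, n_exp):
    # Maximum absolute difference between the normalized cumulative sums
    # of the observed and expected bin contents.
    n_obs = np.asarray(n_obs, dtype=float)
    n_exp = np.asarray(n_exp, dtype=float)
    f_data = np.cumsum(n_obs) / n_obs.sum()
    f_model = np.cumsum(n_exp) / n_exp.sum()
    return np.max(np.abs(f_model - f_data))

print(ks_statistic_binned([12, 5, 3, 8], [10.0, 6.0, 4.0, 7.5]))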

For the Anderson-Darling (AD) test, the test statistic is based on the integral of the square of the difference between the two cumulative distribution functions. The square difference is modified by a weighting function which gives more importance to differences in the tails:

\\[ A^2 = \\int_{x_{\\mathrm{min}}}^{x_{\\mathrm{max}}} \\frac{ (F_{\\mathcal{M}}(x) - F_{\\mathrm{data}}(x))^2}{ F_\\mathcal{M}(x) (1 - F_{\\mathcal{M}}(x)) } \\mathrm{d}F_\\mathcal{M}(x) \\]

Notably, both the Anderson-Darling and Kolmogorov-Smirnov tests rely on the cumulative distribution. Because the ordering of the different channels of a model is not well defined, the tests themselves are not unambiguously defined over multiple channels.

"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index d0f1820ec35..ee236b03733 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ