# Domains
Data plays a crucial role in training any ML model: it is what the model learns from in order to solve a given task, so its handling requires care. There are two main considerations when collecting and preparing data for an ML task:
- The data set should be relevant to the problem and represent its underlying structure without containing potential biases and irrelevant deviations (e.g. MC simulation artefacts).
- The data set should be properly preprocessed so that the training step goes smoothly.
This section covers data from a general domain perspective. The following sections take a more granular look at features and at the construction of model inputs.
## Coverage
To begin with, one needs to bear in mind that the training data should be as close as possible to the data expected in the analysis. Put more formally:
The domains of the training data set (used to train the model) and the inference data set (used to make final predictions) should not sizeably diverge.
**Examples**
- In most cases the model is trained on MC simulated data and later applied to real data to produce predictions, which are then passed on to the statistical inference step. MC simulation isn't perfect, so there are always differences between the simulation and data domains. As a result, the model can learn simulation artefacts, arising e.g. from detector response mismodelling, and its performance on data may be at best suboptimal and at worst meaningless.
- Consider a model trained to predict the energy of a hadron given its energy deposits in the calorimeter (represented e.g. in the form of an image or a graph). The training data consist of showers initiated by particle-gun particles with discrete energies (e.g. 1 GeV, 10 GeV, 20 GeV, etc.). In real-world settings, however, the model will be applied to showers produced by particles with an underlying continuous energy spectrum. Although ML models are known for their capability to interpolate, without appropriate tests the model's performance in the parts of the energy spectrum outside of its training domain is not a priori clear. A simple coverage check is sketched below.
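A quick way to diagnose such a divergence is to compare feature distributions between the training and inference sets before deploying the model. Below is a minimal sketch using a two-sample Kolmogorov-Smirnov test; the energy arrays are hypothetical stand-ins for real inputs, mimicking the particle-gun example above.

```python
# Minimal sketch of a domain-coverage check: compare the distribution of a
# feature between the training set and the set the model is applied to.
# The arrays below are illustrative stand-ins for real inputs.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)

# Discrete particle-gun energies (training) vs a continuous spectrum (inference)
train_energy = rng.choice([1.0, 10.0, 20.0, 50.0], size=10_000)  # GeV
inference_energy = rng.exponential(scale=15.0, size=10_000)      # GeV

# A small p-value signals that the two domains differ and the model
# may be asked to extrapolate outside its training domain.
statistic, p_value = ks_2samp(train_energy, inference_energy)
print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.2e}")
```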
## Solution
Building a model that is entirely robust to domain shift is particularly difficult, so there is no general framework yet for recovering from discrepancies between the training and inference domains altogether. However, research in this direction is ongoing, and several methods that recover from specific deviations have already been proposed.
It is a widely known practice to introduce scale factor (SF) corrections to account for possible discrepancies between data and MC simulation. Effectively, this means that the model is probed on a part of the domain it wasn't trained on (data) and then corrected for any differences by using a meaningful set of observables to derive SFs. One particularly promising approach to remedy the data/MC domain difference is to use adversarial techniques to fully leverage the multidimensionality of the problem, as described in the DeepSF note.
Another solution is to incorporate domain adaptation methods into the ML pipeline, which essentially guide the model to be invariant and robust towards domain shift. In HEP in particular, the Learning to Pivot with Adversarial Networks paper was one of the pioneering works to investigate how a pile-up dependency can be mitigated; the approach can also be readily extended to building a model robust to domain shift[^1]. A minimal sketch of such an adversarial setup follows.
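Below is a minimal sketch (not the exact setup of the cited works) of adversarial domain adaptation with a gradient-reversal layer in PyTorch: an adversary tries to predict the domain from the learned features, and the reversed gradient pushes the feature extractor towards domain invariance. All layer sizes and the `lam` coefficient are illustrative.

```python
# Minimal sketch of adversarial domain adaptation via gradient reversal.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

feature_net = nn.Sequential(nn.Linear(10, 64), nn.ReLU())  # shared features
label_head = nn.Linear(64, 1)    # main task: e.g. signal vs background
domain_head = nn.Linear(64, 1)   # adversary: e.g. MC vs data
bce = nn.BCEWithLogitsLoss()

def training_step(x, y_label, y_domain, lam=1.0):
    feats = feature_net(x)
    label_loss = bce(label_head(feats), y_label)
    # The adversary sees the features through the gradient-reversal layer,
    # so minimising the combined loss makes the features domain-invariant.
    domain_loss = bce(domain_head(GradReverse.apply(feats, lam)), y_domain)
    return label_loss + domain_loss
```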
Last but not least, Bayesian neural networks have the great advantage of providing an uncertainty estimate along with each prediction. If these uncertainties are significantly larger for some samples, this may indicate that they come from a domain beyond the training one (so-called out-of-distribution samples). Such a post hoc analysis of prediction uncertainties can, for example, point to inconsistencies in, or incompleteness of, the MC simulation or the data-driven methods of background estimation.
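As a lightweight illustration, the sketch below uses Monte Carlo dropout as a cheap stand-in for a full Bayesian neural network: dropout is kept active at inference time, and the spread of repeated predictions serves as an uncertainty proxy for flagging potential out-of-distribution samples. The architecture and the flagging threshold are illustrative.

```python
# Minimal sketch of uncertainty-based out-of-distribution flagging
# using Monte Carlo dropout as a stand-in for a Bayesian neural network.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)

def predict_with_uncertainty(x, n_samples=50):
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)

x = torch.randn(100, 10)  # illustrative inputs
mean, std = predict_with_uncertainty(x)
# Samples with unusually large predictive spread are candidates for lying
# outside the training domain and deserve a closer look.
suspicious = std.squeeze() > std.squeeze().quantile(0.95)
print(f"{suspicious.sum().item()} events flagged as potentially out-of-domain")
```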
## Population
Furthermore, analyses nowadays search for very rare processes and are therefore interested in sparsely populated regions of phase space. Even if the domain of interest is covered by the training data set, it may not be sufficiently covered in terms of the number of training samples populating those regions. This makes the model's behaviour on events falling into such regions unpredictable, because the model could not learn to generalise there due to a lack of data to learn from. Therefore,
It is important to make sure that the phase space of interest is well-represented in the training data set.
**Example**
This is what is often called in HEP jargon "little statistics in the tails": too few events populate the tails of the corresponding distribution, e.g. the high-pt region. This can be important because the topology of events changes as one enters the high-pt area of phase space (aka the boosted regime), and the model should be able to capture this change in the event signature. However, it may fail to do so because there is little data to learn from compared to the low-pt region. A simple per-bin population check is sketched below.
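A simple diagnostic is to count training events per bin of the relevant variable before training. The sketch below does this for a hypothetical falling pt spectrum; the binning and the sparseness threshold are illustrative.

```python
# Minimal sketch of a population check: count training events per pt bin
# to spot sparsely populated regions before training.
import numpy as np

rng = np.random.default_rng(seed=0)
train_pt = rng.exponential(scale=50.0, size=100_000)  # GeV, falling spectrum

bins = np.array([0, 50, 100, 200, 400, 800, 1600])    # GeV, illustrative
counts, _ = np.histogram(train_pt, bins=bins)
for lo, hi, n in zip(bins[:-1], bins[1:], counts):
    flag = "  <-- sparsely populated" if n < 100 else ""
    print(f"pt in [{lo:5.0f}, {hi:5.0f}) GeV: {n:7d} events{flag}")
```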
## Solution
Clearly, a way out in that case is to provide enough training data to cover those regions (also ensuring that the model has enough capacity to embrace diverse and complex topologies).
Another solution is to communicate the importance of specific topologies to the model, which can be done for example by upweighting those events' contribution to the loss function, as sketched below.
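Here is a minimal sketch of such upweighting in PyTorch: per-event losses are kept unreduced, and events in a (hypothetical) boosted region receive a larger weight. The threshold and weight values are illustrative, not a recommendation.

```python
# Minimal sketch of upweighting rare topologies in the loss function.
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss(reduction="none")  # keep per-event losses

def weighted_loss(logits, targets, pt, boosted_threshold=400.0, boost_weight=5.0):
    per_event = criterion(logits, targets)
    # Events in the sparsely populated high-pt region get a larger weight.
    weights = torch.where(pt > boosted_threshold,
                          torch.full_like(pt, boost_weight),
                          torch.ones_like(pt))
    # Normalise by the sum of weights so the loss scale stays comparable.
    return (weights * per_event).sum() / weights.sum()

# Illustrative usage
logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
pt = torch.tensor([30., 450., 80., 600., 20., 15., 900., 50.])  # GeV
print(weighted_loss(logits, targets, pt))
```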
Lastly, it might be worth training several models, each targeting its own region, instead of a single general-purpose one (e.g. a low-pt tagger plus a boosted/merged-topology tagger). Effectively, factorising the phase space into regions disentangles the problem of separating them within a single model and delegates it to an ensemble of dedicated models, each targeting its specific region; see the sketch below.
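A minimal sketch of this factorisation: two dedicated taggers with a simple router that dispatches each event to the model trained on its pt region. The models and the threshold are illustrative placeholders.

```python
# Minimal sketch of an ensemble of region-specific models.
import torch
import torch.nn as nn

resolved_tagger = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
boosted_tagger = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

def ensemble_predict(x, pt, boosted_threshold=400.0):
    """Route each event to the tagger trained on its pt region."""
    out = torch.empty(x.shape[0], 1)
    boosted = pt > boosted_threshold
    with torch.no_grad():
        out[boosted] = boosted_tagger(x[boosted])
        out[~boosted] = resolved_tagger(x[~boosted])
    return out

# Illustrative usage
x = torch.randn(6, 10)
pt = torch.tensor([30., 450., 80., 600., 20., 900.])  # GeV
print(ensemble_predict(x, pt))
```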
[^1]: From that paper on, the HEP community started to explore the related topic of model decorrelation, i.e. how to build a model that is invariant to a particular variable or property of the data. For a more detailed overview, please refer to Section 2 of this paper.