TMVA version 4.2.0 is included in ROOT release 5.34/11.
The (main) changes with respect to TMVA-v4.1.2 / ROOT 5.30 are listed below.
Factory:
NormMode: as called in PrepareTrainingAndTestTree
The default has changed and is now "EqualNumEvents" (with a corrected meaning).
Previously, NumEvents and EqualNumEvents (by mistake/miscommunication)
took into account training+test events; they now correctly
normalise only the training events. (The reason for these
normalisations is to make it easy to force the
effective (weighted) number of training events used for signal (class
0) to equal the number of training events in the background (the sum of all
remaining classes in multiclass mode).)
NumEvents:
- the weighted number of events is scaled, independently for signal and
background, such that the sum of weights equals the number of events given in the
Factory::PrepareTrainingAndTestTree("",nTrain_Signal=3000,nTrain_Background=6000) call.
This example call hence results in 2x more background
events than signal events in the training, no matter what the
individual event weights were. (Watch out: if you specify
nTrain_Signal=0,nTrain_Background=0, the ratio will follow
the total numbers of MC events in the signal and background samples
respectively, which could be very different from the usually good
ratio of having about the same weighted number of signal and background
events in the training. In that case it is better to use EqualNumEvents.)
EqualNumEvents:
- for the signal events, the same is done as for NumEvents.
The background events, however, are reweighted such that their sum of weights
equals that of the signal events. As a result, the training sees the same
effective (weighted) number of signal and background events.
Transformations=I is the default again in the Factory (this defines which
variable distribution plots are added to the TMVA output file and
can hence be displayed via the TMVAGui)
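The two NormMode settings described above can be illustrated with a small standalone sketch. It is not TMVA code: the function name normalise_weights and its arguments are hypothetical, and it assumes non-zero nTrain values are given.

```python
def normalise_weights(sig_w, bkg_w, n_sig, n_bkg, mode="EqualNumEvents"):
    """Rescale per-event training weights as described for NormMode.

    Hypothetical helper for illustration; not part of the TMVA API.
    """
    sum_sig = sum(sig_w)
    sum_bkg = sum(bkg_w)
    # Signal: the sum of weights is scaled to the requested number of events.
    sig_scale = n_sig / sum_sig
    if mode == "NumEvents":
        # Background is scaled independently to its own requested number.
        bkg_scale = n_bkg / sum_bkg
    elif mode == "EqualNumEvents":
        # Background sum of weights is forced to equal that of the signal.
        bkg_scale = n_sig / sum_bkg
    else:
        raise ValueError("unknown mode: " + mode)
    return [w * sig_scale for w in sig_w], [w * bkg_scale for w in bkg_w]


sig, bkg = normalise_weights([0.5] * 4, [2.0] * 3, 3000, 6000, mode="NumEvents")
print(sum(sig), sum(bkg))
```

With mode="EqualNumEvents" the same call gives equal sums of weights for signal and background, whatever the individual event weights were.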
Boosted Decision Trees:
Some changes to the training options:
nEventsMin (deprecated) please replace by → MinNodeSize
The option nEventsMin, which specified the minimum number of training events
in a leaf node as an absolute number, has been replaced by "MinNodeSize",
which is given as a percentage of the training sample. This makes the training
options less dependent on the actual size of the training sample.
NNodesMax (deprecated) please replace by → MaxDepth
GradBaggingFraction and UseNTrainEvents replaced by BaggedSampleFraction
- they both meant the same thing and are now
deprecated → use BaggedSampleFraction instead
UseBaggedGrad replaced by UseBaggedBoost
- this way, the use of a bagged sample in GradBoost or AdaBoost has the same option name
- the 'random subsample' in GradBoost has also been replaced by a properly resampled bootstrap sample (sampling with replacement)
UseWeightedTrees → removed
- it was default anyway and the only reasonable choice there is
PruneBeforeBoost → removed
- it has been mostly a debug/trial option
NegWeightTreatment=IgnoreNegWeights → replaced by NegWeightTreatment=IgnoreNegWeightsInTraining
- Unfortunately, the default "IgnoreNegWeights" of the BDT option "NegWeightTreatment"
collided with a global option and had to be replaced.
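When migrating an old option string, the absolute nEventsMin value has to be converted into a percentage of the training sample. A minimal sketch of that arithmetic, with hypothetical helper names and assuming the number of training events is known:

```python
def min_node_size_percent(n_events_min, n_training_events):
    """Convert an absolute minimum leaf size into a percentage of the sample."""
    if n_training_events <= 0:
        raise ValueError("need a positive number of training events")
    return 100.0 * n_events_min / n_training_events


def min_node_size_option(n_events_min, n_training_events):
    """Format the value as a MinNodeSize option fragment (illustrative helper)."""
    value = min_node_size_percent(n_events_min, n_training_events)
    return f"MinNodeSize={value:g}%"


print(min_node_size_option(150, 3000))
```

For example, nEventsMin=150 with 3000 training events corresponds to a node size of 5% of the sample.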
Other changes to the training:
Regardless of the NormMode set in the TMVA::Factory, the BDT training will always start with reweighting the background such that its 'sum of weights' equals that of the signal. An imbalance here previously only resulted in many misclassified events, causing the same re-weighting to be done effectively only in the first boosting step.
The options MethodWeightType... have been removed; the weight type is now defined by the boost method
(these were trial options, but for clarity it is much better to stick
to the "standard" ones, i.e. log(alpha) for AdaBoost etc.)
Up to now, the first classifier was trained with
the full sample. I think, however, that it should also be trained on a bagged
sample (particularly if smaller sample sizes for the bagged
samples were demanded), so this has been changed accordingly.
The Toolkit for Multivariate Analysis (TMVA) provides
a ROOT-integrated machine learning environment
for the processing and parallel evaluation of multivariate
classification and regression techniques. TMVA is specifically designed for the needs of
high-energy physics (HEP) applications, but should not be restricted to these.
The package includes:
TMVA consists of object-oriented implementations in C++ for each of these
multivariate methods and provides training, testing and performance evaluation
algorithms and visualization scripts. The MVA training and testing is
performed with the use of user-supplied data sets in form of ROOT trees or text
files, where each event can have an individual weight. The true event classification
or target value (for regression problems)
in these data sets must be known. Preselection requirements and transformations
can be applied on this data. TMVA supports the use of variable combinations and
formulas.
TMVA works in transparent factory mode to guarantee an unbiased performance
comparison between the algorithms: all MVA methods see the same training and
test data, and are evaluated following the same prescriptions within the same
execution job. A Factory class organises the interaction between the user
and the TMVA analysis steps. It performs preanalysis and preprocessing of the training
data to assess basic properties of the discriminating variables used as
input to the methods. The correlation coefficients of the input variables
are calculated and displayed, and a preliminary ranking is derived (which is later
superseded by method-specific variable rankings). The variables
can be linearly transformed (individually for each classifier) into a
non-correlated variable space or projected upon their principal components.
For performance comparison, the analysis job prints tabulated results
for some benchmark measures. Smooth efficiency versus background
rejection curves are stored in a ROOT output file, together with other graphical
evaluation information. These results can be displayed using ROOT
macros, which are conveniently executed via graphical user interfaces (one each
for classification and regression) that come with the TMVA distribution.
The TMVA training job runs alternatively as a ROOT script, as a standalone executable,
where libTMVA.so is linked as a shared library, or as a python script via the
PyROOT interface. Each classifier trained in one of these applications writes
its configuration and training results in result ("weight") files,
which consist of text and (optionally) ROOT files.
An easy-to-use Reader class is provided, which reads and interprets the
weight files (interfaced by the corresponding classifiers), and which can
be included in any C++ executable, ROOT macro or python analysis job.
For standalone use of the trained classifiers, TMVA also generates lightweight
C++ response classes, which contain the encoded information from the
weight files so that these are not required anymore. These classes do not
depend on TMVA or ROOT, nor on any other external library.
We have put emphasis on the clarity and functionality of the Factory and
Reader interfaces to the user applications. All MVA methods run with reasonable
default configurations, so that for standard applications that do not require particular
tuning, the user script for a full TMVA analysis will hardly exceed a few lines
of code. For individual optimisation the user can (and should) customize the
classifiers via configuration strings.
Please report any problems and/or suggestions for improvements to the
authors.
If you depend on a newer version, the TMVA source code to build a new shared library can be
downloaded from sourceforge.net. The "View SVN" link
gives you a snapshot of the current SVN trunk in ROOT.
The package contains classes (.h/.cxx extensions), ROOT scripts (.C extensions),
executables (.cxx extensions), and
a Makefile. It is best to first run the example macro
test/TMVAClassification.C (for classification), and
test/TMVARegression.C (for regression)
to check that everything works as it should. These macros should also serve as
templates for your own applications.
The relevant analysis steps through the TMVA factory are the following:
Start.
Create the factory (with some bookkeeping arguments).
Define the input samples (signal and background for classification,
a single input sample for regression): these can be separate ROOT Trees,
a single ROOT Tree with a type identifier for classification, or separate ASCII files.
Add the input variable names to be used
to train the MVAs to the factory.
Prepare the training and test ROOT Trees using the numbers of events
given in the arguments. Precuts can be applied at this step.
Book the MVA methods. The first argument to the factory is the instance
name of the method. It is used to tag the evaluation plots and weight files
for this method. The second argument is the unique method type (enum).
The third argument consists of an option string,
which individually configures each method. Detailed information
can be found in the corresponding C++ class implementations.
Propagate the methods through the training, testing and evaluation
phases.
End.
Once the interesting MVA methods have been identified, they can be included
into the data analysis. How to do this latter step is shown in the
example:
test/TMVAClassificationApplication.C (classification) and
test/TMVARegressionApplication.C (regression).
Compilation under Linux should be straightforward by just typing "make"
in the main TMVA directory. Note that the ROOT
environment needs to be properly set up: $ROOTSYS
should be set, $ROOTSYS/lib and
$ROOTSYS/bin should appear in
$PATH, and $ROOTSYS/lib
in $LD_LIBRARY_PATH.
Before running the macros in the "test" directory, source the "setup.[c]sh" script.
Optimal cuts maximise the signal efficiency at given background
efficiency. Other optimisation criteria, such as maximising the
signal significance-squared, S^2/(S+B),
with S and B being the signal and background yields,
then correspond to a particular point in the optimised
background-rejection versus signal-efficiency curve. This working point
requires the knowledge of the expected yields, which is not the
case in general. Note also that for rare signals, Poissonian statistics
should be used, which modifies the significance criterion.
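As an illustration of how a significance criterion selects one point on the efficiency curve, the sketch below scans a list of (signal efficiency, background efficiency) pairs and picks the one maximising S^2/(S+B). The function name, the input format and the example numbers are assumptions made for this example; the expected yields must be supplied by the user.

```python
def best_working_point(roc, n_sig_expected, n_bkg_expected):
    """Return (S^2/(S+B), sig_eff, bkg_eff) for the best point on the curve.

    roc is a list of (signal efficiency, background efficiency) pairs.
    """
    best = None
    for eff_s, eff_b in roc:
        s = eff_s * n_sig_expected
        b = eff_b * n_bkg_expected
        if s + b == 0:
            continue  # no events selected at all, criterion undefined
        z2 = s * s / (s + b)
        if best is None or z2 > best[0]:
            best = (z2, eff_s, eff_b)
    return best


roc = [(0.2, 0.001), (0.5, 0.01), (0.8, 0.1), (1.0, 1.0)]
print(best_working_point(roc, 100, 1000))
print(best_working_point(roc, 1000, 1000))
```

The two calls select different working points from the same curve, which shows why the expected yields have to be known before this criterion can be applied.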
If linear input correlations are present, it may be useful to use decorrelated
input variables before optimising cuts (option ":D").
Technically, the cut optimisation is achieved in TMVA by three
optional methods:
Monte Carlo generation (option: MC).
Fitting using a Genetic Algorithm (option: GA).
Fitting using Simulated Annealing (option: SA - still in testing phase).
Attempts using MINUIT (Simplex or Migrad) have not shown satisfactory
results, as the fits often fail because of convergence at local minima.
For most of the examples we tested, GA performed best.
The rectangular cut of a volume in the variable space is performed
using a binary tree to sort the training events. This provides
a significant reduction in computing time.
The method of maximum likelihood is among the most straightforward
multivariate analyser approaches.
We define the likelihood ratio, R, for an event
by the ratio of the signal to the signal plus background
likelihoods.
The individual likelihoods are products of the corresponding probability
densities of the discriminating input variables used.
In practice, TMVA uses polynomial splines fitted to histograms, or
unbinned Gaussian kernel density estimators, to estimate the probability
density functions (PDF) obtained from the distributions of the
training variables.
Likelihood responses are often strongly peaked at 0/1. The booking option
"TransformOutput" zooms into these peaks (with no change in the performance)
using an inverse sigmoid transformation.
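The likelihood ratio and the output transformation can be sketched as follows. For brevity the sketch assumes Gaussian PDFs instead of spline or kernel estimates, and the scale parameter tau of the inverse sigmoid is an assumed value, not taken from this text.

```python
import math


def gauss_pdf(x, mean, sigma):
    """Gaussian density, used here as a stand-in for the estimated PDFs."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def likelihood_ratio(event, sig_pdfs, bkg_pdfs):
    """R = L_S / (L_S + L_B), each L being a product of one-dimensional PDFs."""
    l_sig = 1.0
    l_bkg = 1.0
    for x, pdf_s, pdf_b in zip(event, sig_pdfs, bkg_pdfs):
        l_sig *= pdf_s(x)
        l_bkg *= pdf_b(x)
    return l_sig / (l_sig + l_bkg)


def inverse_sigmoid(r, tau=15.0):
    """Stretch the peaks near 0 and 1; requires 0 < r < 1 (tau is assumed)."""
    return -math.log(1.0 / r - 1.0) / tau


sig_pdfs = [lambda x: gauss_pdf(x, 1.0, 1.0)] * 2
bkg_pdfs = [lambda x: gauss_pdf(x, -1.0, 1.0)] * 2
r = likelihood_ratio((1.0, 1.0), sig_pdfs, bkg_pdfs)
print(r, inverse_sigmoid(r))
```

Because the transformation is monotonic, the ordering of events, and therefore the performance, is unchanged.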
This is a generalization of the above Likelihood methods to Nvar
dimensions, where Nvar is the number of input variables
used in the MVA. If the multidimensional probability density functions
(PDFs) for signal and background were known, this method would contain the entire
physical information, and would therefore be optimal. Usually, kernel estimation
methods are used to approximate the PDFs using the events from the
training sample.
A very simple probability density estimator (PDE) has been suggested
in hep-ex/0211019. The
PDE for a given test event is obtained from counting the (normalized)
number of signal and background (training) events that occur in the
"vicinity" of the test event. The volume that describes "vicinity" is
user-defined. A search
method based on binary-trees is used to effectively reduce the
selection time for the range search. Three different volume definitions
are optional:
MinMax:
the volume is defined in each dimension with respect
to the full variable range found in the training sample.
RMS:
the volume is defined in each dimension with respect
to the RMS estimated from the training sample.
Adaptive:
a volume element is defined in each dimension with
respect to the RMS estimated from the training sample. The overall
scale of the volume element is then determined for each event so
that the total number of events confined in the volume lies within
a user-defined range.
The adaptive range search is used by default. The option "UseKernelEstimate"
allows the user to weight the events found within the adaptive volume
by a multidimensional Gaussian function according to their distance to
the test event.
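The counting idea can be sketched with a fixed box volume. This is a simplified illustration: it uses a brute-force loop instead of the binary-tree range search, the names are hypothetical, and combining the normalised counts as n_S/(n_S+n_B) is an assumption of this sketch.

```python
def pders_response(test_event, signal, background, half_widths):
    """Count training events inside a box centred on the test event."""

    def in_volume(event):
        return all(abs(x - x0) <= h
                   for x, x0, h in zip(event, test_event, half_widths))

    # Counts are normalised to the size of each training sample.
    n_sig = sum(1 for ev in signal if in_volume(ev)) / len(signal)
    n_bkg = sum(1 for ev in background if in_volume(ev)) / len(background)
    if n_sig + n_bkg == 0:
        return None  # empty volume: no estimate possible
    return n_sig / (n_sig + n_bkg)


signal = [(1.0, 1.0), (1.2, 0.9), (3.0, 3.0)]
background = [(-1.0, -1.0), (0.9, 1.1), (-2.0, 0.0), (-1.5, -0.5)]
print(pders_response((1.0, 1.0), signal, background, (0.5, 0.5)))
```

An adaptive variant would rescale half_widths per test event until the number of events found lies in the requested range.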
Similar to PDERS, the k-nearest neighbour method compares
an observed (test) event to reference events from a training data set.
However, unlike PDERS, which in its original form uses a fixed-sized multidimensional volume
surrounding the test event, and in its augmented form resizes the volume as a function of
the local data density, the k-NN algorithm is intrinsically adaptive. It searches for a
fixed number of adjacent events, which then define a volume for the metric used. The k-NN
method has best performance when the boundary that separates signal and background
events has irregular features that cannot be easily approximated by parametric learning
methods.
The k-NN algorithm uses a kd-tree structure to sort the training events,
which significantly improves the performance. The TMVA implementation of the k-NN method is
fast enough to allow classification and regression for large data sets. In particular, it is faster
than the adaptive PDERS method.
Note that the k-NN method is not appropriate for problems where the number of input
variables exceeds about 10. In general, the larger the training set, the more the
algorithm probes small-scale features that distinguish signal and background events.
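A minimal sketch of the k-NN idea, assuming a Euclidean metric and a brute-force sort instead of a kd-tree (the function name and data layout are illustrative only):

```python
import math


def knn_response(test_event, training, k):
    """Fraction of signal events among the k nearest training events.

    training is a list of (features, is_signal) pairs.
    """
    by_distance = sorted(training,
                         key=lambda item: math.dist(item[0], test_event))
    nearest = by_distance[:k]
    return sum(1 for _, is_signal in nearest if is_signal) / len(nearest)


training = [((0.0, 0.0), True), ((0.1, 0.0), True), ((1.0, 1.0), False),
            ((0.0, 0.2), False), ((2.0, 2.0), False)]
print(knn_response((0.0, 0.0), training, 3))
```

The volume is never specified explicitly: it is implied by the distance to the k-th neighbour, which is what makes the method intrinsically adaptive.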
In the method of Fisher discriminants event selection is performed
in a transformed variable space with zero linear correlations, by
distinguishing the mean values of the signal and background
distributions.
The linear discriminant analysis determines an axis in the (correlated)
hyperspace of the input variables
such that, when projecting the output classes (signal and background)
upon this axis, they are pushed as far as possible away from each other,
while events of the same class are confined in a close vicinity.
The linearity property of this method is reflected in the metric with
which "far apart" and "close vicinity" are determined: the covariance
matrix of the discriminant variable space.
The classification of the events in signal and background classes
relies on the following characteristics (only): overall sample means
for each input variable, class-specific sample means,
and total covariance matrix. The covariance matrix
can be decomposed into the sum of a within-class and
a between-class matrix. They describe
the dispersion of events relative to the means of their own class (within-class
matrix), and relative to the overall sample means (between-class matrix).
The Fisher coefficients are then given by the product
of the difference vector of signal and background sample means and the
inverse within-class matrix.
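For two input variables the Fisher coefficients can be computed by hand. The sketch below follows the description above (inverse within-class matrix times the difference of the class means) but omits any overall normalisation factor and assumes equal event weights; the function names are illustrative.

```python
def mean_vector(events):
    n = len(events)
    return [sum(ev[i] for ev in events) / n for i in range(2)]


def scatter_matrix(events, mean):
    """Sum of outer products of deviations from the class mean (2x2)."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    for ev in events:
        d = [ev[0] - mean[0], ev[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                m[i][j] += d[i] * d[j]
    return m


def fisher_coefficients(signal, background):
    mean_s = mean_vector(signal)
    mean_b = mean_vector(background)
    scat_s = scatter_matrix(signal, mean_s)
    scat_b = scatter_matrix(background, mean_b)
    n = len(signal) + len(background)
    # Within-class matrix: dispersion of events around their own class mean.
    w = [[(scat_s[i][j] + scat_b[i][j]) / n for j in range(2)] for i in range(2)]
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    inv = [[w[1][1] / det, -w[0][1] / det], [-w[1][0] / det, w[0][0] / det]]
    diff = [mean_s[0] - mean_b[0], mean_s[1] - mean_b[1]]
    return [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]


signal = [(1.0, 0.0), (3.0, 0.0), (2.0, 1.0), (2.0, -1.0)]
background = [(-1.0, 0.0), (-3.0, 0.0), (-2.0, 1.0), (-2.0, -1.0)]
print(fisher_coefficients(signal, background))
```

In this toy sample the classes differ only in the first variable, so only the first coefficient is non-zero.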
This MVA approach is used by the DØ collaboration (FNAL) for the
purpose of electron identification (see, e.g.,
hep-ex/9507007).
As it is implemented in TMVA, it is usually equivalent
to the Fisher-Mahalanobis discriminant, and it has only been added for
the purpose of completeness.
Two χ^2 estimators are computed for an event, one each
for signal and background, using the estimates for the means and
covariance matrices obtained from the training sample.
TMVA then uses as normalised analyser for event i the ratio:
(χ_S^2(i) − χ_B^2(i)) / (χ_S^2(i) + χ_B^2(i)).
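A sketch of this estimator for two input variables, assuming the class means and covariance matrices have already been estimated from the training sample. The sign convention follows the ratio exactly as written above; the function names are illustrative.

```python
def chi2(event, mean, cov):
    """(x - mean)^T C^-1 (x - mean) for a 2x2 covariance matrix C."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    d = [event[0] - mean[0], event[1] - mean[1]]
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))


def hmatrix_ratio(event, sig_mean, sig_cov, bkg_mean, bkg_cov):
    """Normalised ratio of the two chi-squared estimators."""
    chi2_s = chi2(event, sig_mean, sig_cov)
    chi2_b = chi2(event, bkg_mean, bkg_cov)
    # Undefined if both chi-squared values vanish (identical class means).
    return (chi2_s - chi2_b) / (chi2_s + chi2_b)


unit = [[1.0, 0.0], [0.0, 1.0]]
print(hmatrix_ratio((1.0, 1.0), (1.0, 1.0), unit, (-1.0, -1.0), unit))
```

With this convention an event sitting on the signal mean yields -1, one on the background mean yields +1.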
The common goal of all TMVA discriminators is to determine an optimal separating
function in the multivariate space represented by the input variables. The Fisher
discriminant solves this analytically for the linear case, while artificial neural
networks, support vector machines or boosted decision trees provide nonlinear
approximations with, in principle, arbitrary precision if enough training
statistics is available and the chosen architecture is flexible enough.
The function discriminant analysis (FDA) provides an intermediate solution to the
problem with the aim to solve relatively simple or partially nonlinear problems.
The user provides the desired function with adjustable parameters via the configuration
option string, and FDA fits the parameters to it, requiring the signal (background)
function value to be as close as possible to 1 (0). Its advantage over the more
involved and automatic nonlinear discriminators is the simplicity and transparency
of the discrimination expression. A shortcoming is that FDA will underperform for
involved problems with complicated, phase space dependent nonlinear correlations.
The FDA performance depends on the complexity and fidelity of the user-defined
discriminator function. As a general rule, it should be able to reproduce the
discrimination power of any linear discriminant analysis. To reach into the nonlinear
domain, it is useful to inspect the correlation profiles of the input variables, and
add quadratic and higher polynomial terms between variables as necessary. Comparison
with more involved nonlinear MVA methods can be used as a guide.
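The fitting idea can be illustrated with the simplest possible discriminator function, f(x) = a0 + a1*x in one variable. The sketch interprets "as close as possible" as a least-squares criterion and solves it in closed form; both choices are assumptions made for this example and stand in for the general fitters used with arbitrary user functions.

```python
def fit_linear_fda(signal, background):
    """Least-squares fit of f(x) = a0 + a1*x with targets 1 (signal), 0 (background)."""
    xs = list(signal) + list(background)
    ys = [1.0] * len(signal) + [0.0] * len(background)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1


a0, a1 = fit_linear_fda([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
print(a0, a1, a0 + a1 * 2.0, a0 + a1 * -2.0)
```

Adding quadratic or higher terms to f would extend the same scheme into the nonlinear domain, at the price of a numerical fit.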
Three different ANN implementations are used in TMVA: the
TMlpANN,
implemented in ROOT; the
Clermont-Ferrand ANN (CFMlpANN),
which has been translated from FORTRAN; and a new ANN (MLP), which is
very similar to the ROOT ANN, but can be trained significantly faster. All ANNs
belong to the class of Multilayer Perceptrons (MLP), which are feed-forward
networks according to the following propagation schema:
The input layer contains as many neurons as input variables used in the MVA.
The output layer contains a single neuron for the signal weight.
In between the input and output layers are a variable number
of k hidden layers with arbitrary numbers of neurons. (While the
structure of the input and output layers is determined by the problem, the
hidden layers can be configured by the user through the option string
of the method booking.)
As indicated in the sketch, all neuron inputs to a layer are linear
combinations of the neuron output of the previous layer. The transfer
from input to output within a neuron is performed by means of an "activation
function". In general, the activation function of a neuron can be
zero (deactivated), one (linear), or non-linear. The above example uses
a sigmoid activation function. The transfer function of the output layer
is usually linear. As a consequence, an ANN without a hidden layer should
give the same discrimination power as a linear discriminant analysis (Fisher).
In the case of one hidden layer, the ANN computes a linear combination of
sigmoids.
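The forward propagation for one hidden layer can be written out in a few lines. The sketch assumes sigmoid hidden neurons, a linear output neuron, and explicit bias terms (the biases are an assumption of this example); weights are passed in by hand rather than trained.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mlp_response(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer of sigmoid neurons followed by a linear output neuron."""
    hidden = []
    for weights, bias in zip(hidden_weights, hidden_biases):
        # Each neuron input is a linear combination of the previous layer.
        neuron_input = bias + sum(w * x for w, x in zip(weights, inputs))
        hidden.append(sigmoid(neuron_input))
    # Linear output: a linear combination of sigmoids.
    return output_bias + sum(w * h for w, h in zip(output_weights, hidden))


print(mlp_response((0.3, -1.2),
                   [[1.0, 0.5], [-0.7, 2.0]], [0.0, 0.1],
                   [1.5, -0.8], 0.2))
```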
The trained MLP architecture can be plotted with the macro macro/network.C.
Boosted decision trees have been successfully used in High
Energy Physics analysis for example by the MiniBooNE experiment
(Yang-Roe-Zhu, physics/0508045).
In Boosted Decision Trees, the selection is done on a majority vote
on the result of several decision trees, which are all derived from the
same training sample by supplying different event weights during the
training.
Decision trees: successive decision nodes are used to categorize the
events out of the sample as either signal or background. Each node
uses only a single discriminating variable to decide if the event is
signal-like ("goes right") or background-like ("goes left"). This forms a
tree-like structure with "baskets" at the end (leaf nodes), and an
event is classified as either signal or background according to whether
the basket where it ends up has been classified signal or background during
the training. Training of a decision tree is the process of defining the
"cut criteria" for each node. The training starts with the root
node. Here one takes the full training event sample and selects the
variable and corresponding cut value that gives the best separation
between signal and background at this stage. Using this cut criterion,
the sample is then divided into two subsamples, a signal-like (right)
and a background-like (left) sample. Two new nodes are then created,
one for each of the two subsamples, and they are constructed using the
same mechanism as described for the root node. The division is stopped once
a certain node has reached either a minimum number of events, or a
minimum or maximum signal purity. These leaf nodes are then called
"signal" or "background" depending on whether they contain more signal or
background events from the training sample.
Boosting: the idea behind boosting is that signal events from the
training sample that end up in a background node (and vice versa) are
given a larger weight than events that are in the correct leaf node.
This results in a re-weighted training event sample, with which
a new decision tree can then be developed. The boosting can be applied several
times (typically 100-500 times) and one ends up with a set of
decision trees (a forest).
Bagging: in this particular variant of the Boosted Decision Trees the boosting is
not done on the basis of previous training results, but by a simple
stochastic re-sampling of the initial training event sample.
Analysis: applying an individual decision tree to a test event
results in a classification of the event as either signal or background.
For the boosted decision tree selection, an event is successively
subjected to the whole set of decision trees and, depending on
how often it is classified as signal, a "likelihood" estimator is constructed
for the event being signal or background. The value of this estimator is the
one which is then used to select the events from an event sample, and
the cut value on this estimator defines the efficiency and purity of
the selection.
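The boosting loop can be sketched with the smallest possible trees: single-node "stumps" on one variable. This is a simplified AdaBoost-style illustration, not the TMVA implementation; the tree weight log((1-err)/err) and all names are assumptions of this sketch.

```python
import math


def stump_predict(x, cut, sign):
    # Events above the cut "go right"; sign=+1 means right is signal-like.
    return sign if x > cut else -sign


def train_stump(xs, labels, weights):
    """Pick the cut and orientation with the smallest weighted error."""
    best = None
    for cut in sorted(set(xs)):
        for sign in (+1, -1):
            err = sum(w for x, y, w in zip(xs, labels, weights)
                      if stump_predict(x, cut, sign) != y)
            if best is None or err < best[0]:
                best = (err, cut, sign)
    return best


def adaboost(xs, labels, n_trees):
    weights = [1.0 / len(xs)] * len(xs)
    forest = []
    for _ in range(n_trees):
        err, cut, sign = train_stump(xs, labels, weights)
        if err <= 0.0:  # perfect separation, nothing left to boost
            forest.append((1.0, cut, sign))
            break
        if err >= 0.5:  # no better than guessing
            break
        boost = (1.0 - err) / err
        forest.append((math.log(boost), cut, sign))
        # Misclassified events get a larger weight for the next tree.
        weights = [w * boost if stump_predict(x, cut, sign) != y else w
                   for x, y, w in zip(xs, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return forest


def classify(forest, x):
    """Weighted vote of all trees: +1 is signal, -1 is background."""
    score = sum(alpha * stump_predict(x, cut, sign) for alpha, cut, sign in forest)
    return 1 if score > 0 else -1


xs = [1, 2, 3, 4, 5, 6]
labels = [1, 1, -1, -1, 1, 1]  # background sits between two signal regions
forest = adaboost(xs, labels, 3)
print([classify(forest, x) for x in xs])
```

No single cut can separate this toy sample, but the weighted vote of three boosted stumps classifies all six events correctly.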
where is the vector of observables. In this case the base learners
consist of so called rules. A single rule is essentially a function of a
series of cuts. A rule applied on a given event is non-zero only if all cuts in
the product are satisfied. In such a case the rule returns 1.
RuleFit consists of two main steps:
1. Rules generation
2. Fit of the rules to the training data, i.e.,
find the optimum coefficients in (1).
An efficient way of generating rules is to use decision trees. Each node (except
for the root node) produces one rule. The rule is defined by the series of cuts
required to reach the node starting from the root. In the current implementation,
the forest is generated through boosting (default) or by training each tree on
random subsamples.
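A rule as described above can be sketched as a function that returns 1 only if every cut is satisfied. The representation of a cut as (variable index, lower bound, upper bound) and the linear combination with an offset are assumptions of this example.

```python
def make_rule(cuts):
    """Build a rule from cuts given as (variable_index, low, high).

    Use None for an open lower or upper bound.
    """
    def rule(event):
        for index, low, high in cuts:
            value = event[index]
            if low is not None and value < low:
                return 0
            if high is not None and value > high:
                return 0
        return 1  # all cuts in the product are satisfied
    return rule


def rule_ensemble_response(event, offset, rules, coefficients):
    """Assumed model form: offset plus a weighted sum of rule outputs."""
    return offset + sum(c * r(event) for c, r in zip(coefficients, rules))


rule_a = make_rule([(0, 1.0, None), (1, None, 0.5)])  # x0 >= 1 and x1 <= 0.5
rule_b = make_rule([(0, None, 0.0)])                  # x0 <= 0
print(rule_ensemble_response((2.0, 0.0), -0.5, [rule_a, rule_b], [1.2, -0.7]))
```

In a tree-based generator, the cuts of a rule would be those met on the path from the root to the node that defines the rule.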
In the early 1960s the linear support vector method
was developed to construct
separating hyperplanes for pattern recognition problems.
It took 30 years before the method was generalised to nonlinear separating
functions and to estimate real-valued functions
(regression). At that moment it became a general purpose algorithm
performing classification and regression tasks which can compete with neural networks
and probability density estimators. Typical applications of SVMs include text
categorisation, character recognition, bioinformatics and face detection.
The main idea of the SVM approach is to build a hyperplane that separates signal and
background vectors (events) using only a minimal subset of all
training vectors (support vectors). The position of the hyperplane is
obtained by maximizing the margin between it and the support vectors.
The extension to nonlinear SVMs is performed by
mapping the input vectors onto a higher dimensional feature space in which signal
and background events can be separated by a linear procedure using an optimally
separating hyperplane. The use of kernel functions thereby eliminates the explicit
transformation to the feature space and simplifies the computation.
The implemented SVM algorithm performs the classification task using optionally linear,
polynomial, Gaussian or sigmoidal kernel functions. The Gaussian kernel allows one to apply
any discriminating shape in the input space.
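How a kernel replaces the explicit mapping can be sketched with a Gaussian kernel and a hand-set decision function. The kernel parameterisation exp(-d^2/(2*sigma^2)), the standard form of the decision function, and the toy support vectors are assumptions of this example; no training is performed.

```python
import math


def gaussian_kernel(u, v, sigma):
    """Gaussian kernel of width sigma between two input vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2.0 * sigma * sigma))


def svm_decision(x, support_vectors, alphas, labels, bias, sigma):
    """Sum over support vectors only; the sign gives the predicted class."""
    return bias + sum(a * y * gaussian_kernel(sv, x, sigma)
                      for sv, a, y in zip(support_vectors, alphas, labels))


support_vectors = [(1.0, 0.0), (-1.0, 0.0)]
alphas = [1.0, 1.0]
labels = [+1, -1]  # +1 signal, -1 background
print(svm_decision((0.8, 0.3), support_vectors, alphas, labels, 0.0, 1.0))
```

Only kernel evaluations in the input space are needed; the higher-dimensional feature space never appears explicitly.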
An important task is to properly choose the kernel parameters (the width in case of the
Gaussian kernel) and a cost parameter. They must be optimized experimentally by
the user; the advised method is to run SVM several times with different sets of parameters,
performing a grid scan, and to choose the optimal configuration.
The SVM training time scales quadratically with the number of vectors in the training
event sample.