New reference page for configuration options.
The page is automatically generated for each new release. Next to each classifier
there are also information links with hints for improving the classifier performance
(click on the "i" button). Many thanks to Zhiyi Liu (Fraser U) for suggesting this.
Bug fixes:
Calculation of "Separation": fixed bin-shift and normalisation bugs. Thanks to
Dag Gillberg (Fraser U) for spotting these.
Fixed problem in "SetSignal(Background)WeightExpression": signal (background) weight
expressions not existing in the background (signal) tree led to an abort of the tree
reading ("Bad numerical expression"). Thanks to Alfio Rizzo (Brussels) for pointing
this out.
Fixed problem when specifying train and test tree explicitly. Some code was forgotten
in the background part, creating incompatibilities. Thanks to Zhiyi Liu (Fraser U)
for reporting this.
Apr 16, 2008 (TMVA-v3.9.2)
Bug fixes:
Fixed problem introduced in 3.9.0 and 3.9.1: preprocessing cuts could not be applied on a
variable that was not declared via AddVariable (thanks to Daniel Stricker,
Karlsruhe U., for spotting this).
Corrected configurable random seed in GeneticAlgorithm (thanks to David Gonzalez
Maline, CERN, for pointing this out).
Fixed Cuts (optimisation) method: the event with the smallest value was not included
in the search for the optimal cut (thanks to Dimitris Varouchas, LAL-Orsay, for helping
us detect the problem).
Apr 8, 2008 (TMVA-v3.9.1)
Bug fix:
Added missing link to libMLP shared library to Makefile; also fixed compilation issue with ROOT 5.08.
Apr 3, 2008 (TMVA-v3.9.0)
New Simulated Annealing (SA) algorithm for global minimisation in the presence
of local minima (optionally used in cut optimisation (MethodCuts) and the Function
Discriminant (MethodFDA)). The SA algorithm features two approaches: one starts
at minimal temperature (i.e., from within a local minimum) and slowly increases it;
the other starts at high temperature and slowly decreases it into a minimum.
Code developed and written by Kamil Bartlomiej Kraszewski, Maciej Kruk and
Krzysztof Danielowski from IFJ and AGH/UJ, Krakow, Poland.
Plugin capability: custom multivariate classifiers can now be plugged into
the TMVA framework to benefit from TMVA's analysis and performance comparison
tools. The user needs to derive the custom class from TMVA::MethodBase and
implement the (few) virtual methods required by the TMVA::IMethod interface.
The classifier can then be directly called via ROOT's plugin mechanism. An
example for this is given in TMVA/macros/TMVAnalysis.C. Many thanks to Daniel
Martschei and Thomas Kuhr (Karlsruhe U.) for suggesting and implementing this
feature.
Preselection cuts now work on arrays. The previously used TEventLists (event-wise
pass/fail only) were replaced by TTreeFormulas (sensitive to array position).
Thanks to Arnaud Robert (LPNHE) for his contributions.
Framework/dataset preparation: Signal and background trees can now be assigned
individually to training and test purposes. This is achieved by setting the third
parameter of the Factory::AddSignalTree/AddBackgroundTree() methods to "Train" or
"Test" (const string). The only restriction is that either none or all signal
(background) trees need to be specified with that option. It is possible to mix
the two modes, for instance one can assign individual training and test trees
for signal, but not for background.
For increased flexibility, users can also directly input signal and background,
training and test events to TMVA, instead of letting TMVA interpret user-given
trees. Note that exactly one of the two approaches must be chosen (no mixing). The
syntax of the new calls is described in the macros/TMVAnalysis.C test macro.
The user runs the event loop, copies the input variables of each event
into a std::vector, and "adds" them to TMVA using the dedicated calls:
factory->AddSignalTrainingEvent( vars, signalWeight );
(and replacing "Signal" by "Background", and "Training" by "Test").
After the event loop, everything continues as in the standard method.
Cut optimisation: added physical limits to min/max cuts if smart option is used.
TMlpANN: fixed crash with ROOT>=5.17 when using a large number of test events;
also corrected a bias in the cross validation: previously the test events were
used for validation, which led to an overestimated performance evaluation in case
of a small number of degrees of freedom; the training tree is now split into two
parts for training and validation, with a configurable ValidationFraction.
Extended options to TMultilayerPerceptron learning methods.
BDT: removed hard-coded weight file name; now, paths and names of weight files are
written as TObjStrings into ROOT target file, and retrieved for plotting;
available weight files (corresponding to target used) can be chosen from
pop-up GUI.
Changes in handling negative weights in BDT algorithm. Events with negative
weights now get their weight reduced (*= 1/boostweight) rather than increased
(*= boostweight) as the other events do. Otherwise these events tend to receive
increasingly stronger boosts, because their effects on the separation gain
are as if background events were selected as signal and vice versa (hence
the events tend to be "wanted" in signal nodes, but are boosted as if they
were misclassified). In addition, the separation indices are protected
against negative S or S+B, returning 0.5 (no separation at all) if that
occurs.
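The weight update described above can be sketched as follows. This is only an illustration: the function name boost_event_weight and the flag misclassified are hypothetical, and the sketch assumes that only misclassified events are re-weighted.

```python
def boost_event_weight(weight, boost_weight, misclassified):
    """Return the event weight after one boosting step (illustrative only)."""
    if not misclassified:
        # Assumption: correctly classified events keep their weight.
        return weight
    if weight < 0:
        # Negative-weight events are reduced in magnitude instead of boosted.
        return weight / boost_weight
    return weight * boost_weight
```

With a boost weight of 4, a misclassified event of weight 2 moves to 8, while a misclassified event of weight -2 moves to -0.5.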
In addition there is a new BDT option to ignore events with negative event
weights for the training. This option could be used as a cross check of a
"worst case" solution for Monte Carlo samples with negative weights. Note that
the results of the testing phase still include these events and are hence
objective.
Added randomised trees: similar to the "Random Forests" technique of Leo Breiman
and Adele Cutler, it uses the "bagging" algorithm and bases the determination of
the best node-split during the training on a random subset of variables only,
which is individually chosen for each split.
Move to TRandom2 for the "bagging" algorithm and throw random weights according
to Poisson statistics. (This way the random weights are closer to a resampling
with replacement algorithm.)
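A minimal sketch of Poisson-distributed bagging weights, written with the Python standard library rather than TRandom2. The function names and the choice of a Poisson mean of 1 (the expected multiplicity of an event when N events are resampled with replacement from N) are assumptions made for illustration.

```python
import math
import random

def poisson(rng, mean=1.0):
    """Draw one Poisson-distributed integer (Knuth's multiplication method)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def bagging_weights(n_events, seed=0):
    """One integer weight per event; a weight of 0 drops the event from this tree."""
    rng = random.Random(seed)
    return [poisson(rng) for _ in range(n_events)]
```

Each tree of the forest would be trained with a fresh set of such weights, so that every tree sees a slightly different sample.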
GUI: New macro (and GUI button) for Parallel Coordinate plotting.
Python (PyROOT): added example for reader application: TMVA/python/TMVApplication.py
Bug fixes:
Corrected inconsistency in MethodCuts: the signal efficiency written out into
the weight file no longer corresponds to the center of the bin within which the
background rejection is maximised (as before) but to the lower (left) edge of it.
This is because the cut optimisation algorithm determines the best background
rejection for all signal efficiencies falling into a bin. Since the best background
rejection is in general obtained for the lowest possible signal efficiency, the
reference signal efficiency is the lowest value in the bin.
Fixes in input-variable and MVA plotting: under/over-flow numbers given on plots
were not properly normalised; the maximum histogram ranges have been increased
to avoid cut-offs. Thanks to Andreas Wenger, Zuerich, for pointing these out.
The Toolkit for Multivariate Analysis (TMVA) provides
a ROOT-integrated machine learning environment
for the processing and parallel evaluation of sophisticated multivariate
classification techniques. TMVA is specifically designed for the needs of
high-energy physics (HEP) applications, but is not restricted to these.
The package includes:
TMVA consists of object-oriented implementations in C++ for each of these
discrimination techniques and provides training, testing and performance evaluation
algorithms and visualization scripts. The classifier training and testing is
performed with the use of user-supplied data sets in form of ROOT trees or text
files, where each event can have an individual weight. The true event classification
in these data sets must be known. Preselection requirements and transformations
can be applied on this data. TMVA supports the use of variable combinations and
formulas.
TMVA works in transparent factory mode to guarantee an unbiased performance
comparison between the classifiers: all classifiers see the same training and
test data, and are evaluated following the same prescriptions within the same
execution job. A Factory class organises the interaction between the user
and the TMVA analysis steps. It performs preanalysis and preprocessing of the training
data to assess basic properties of the discriminating variables used as
input to the classifiers. The linear correlation coefficients of the input variables
are calculated and displayed, and a preliminary ranking is derived (which is later
superseded by classifier-specific variable rankings). The variables
can be linearly transformed (individually for each classifier) into a
non-correlated variable space or projected upon their principal components.
To compare the signal-efficiency and background-rejection
performance of the classifiers, the analysis job prints tabulated results
for some benchmark values, besides other criteria such as a measure of the separation
and the maximum signal significance. Smooth efficiency versus background
rejection curves are stored in a ROOT output file, together with other graphical
evaluation information. These results can be displayed using ROOT
macros, which are conveniently executed via a graphical user interface that
comes with the TMVA distribution.
The TMVA training job runs either as a ROOT script, as a standalone executable
in which libTMVA.so is linked as a shared library, or as a Python script via the
PyROOT interface. Each classifier trained in one of these applications writes
its configuration and training results to result ("weight") files,
which consist of text and (optionally) ROOT files.
An easy-to-use Reader class is provided, which reads and interprets the
weight files (interfaced by the corresponding classifiers), and which can
be included in any C++ executable, ROOT macro or python analysis job.
For standalone use of the trained classifiers, TMVA also generates lightweight
C++ response classes, which contain the encoded information from the
weight files so that these are no longer required. These classes do not
depend on TMVA or ROOT, nor on any other external library.
We have put emphasis on the clarity and functionality of the Factory and
Reader interfaces to the user applications. All classifiers run with reasonable
default configurations, so that for standard applications that do not require particular
tuning, the user script for a full TMVA analysis will hardly exceed a few lines
of code. For individual optimisation the user can (and should) customize the
classifiers via configuration strings.
Please report any problems and/or suggestions for improvements to the
authors.
The TMVA source code can be
downloaded from sourceforge.net
as a compressed tar file.
It can also be checked out via the CVS anonymous access (in one line):
cvs -z3 -d:pserver:anonymous@tmva.cvs.sourceforge.net:/cvsroot/tmva co -r V03-09-04 -P TMVA
Viewcvs
gives you a snapshot of the current CVS HEAD. For more information on
CVS at sourceforge, click here.
Follow this link for a brief tutorial on
"HowTo" get started.
The package contains classes (.h/.cxx extensions), ROOT scripts (.C extensions),
executables (.cxx extensions), and
a Makefile. It is best to first run the example macro
macros/TMVAnalysis.C
to check that everything works as it should. This macro should also serve as
a template for your own application.
The relevant analysis steps through the TMVA factory are the following:
Start.
Create the factory (with some bookkeeping arguments).
Define the signal and background samples: these can be separate ROOT Trees,
a single ROOT Tree with a type identifier, or separate ASCII files.
Add the input variable names to be used
to train the MVAs to the factory.
Prepare the training and test ROOT Trees using the numbers of events
given in the arguments. Precuts can be applied at this step.
Book the MVA methods. The first argument to the factory is the instance
name of the method. It is used to tag the evaluation plots and weight files
for this method. The second argument is the unique method type (enum).
The third argument consists of an option string,
which individually configures each method. Detailed information
can be found in the corresponding C++ class implementations.
Propagate the methods through the training, testing and evaluation
phases.
End.
Once the interesting MVA methods have been identified, they can be included
into the data analysis. How to do this latter step is shown in the
example: macros/TMVApplication.C.
Compilation under Linux should be straightforward, once the TMVA/setup.(c)sh
script has been sourced. (Note that the ROOT
environment needs to be properly set up: $ROOTSYS
should be set, $ROOTSYS/lib and
$ROOTSYS/bin should appear in
$PATH, and $ROOTSYS/lib
in $LD_LIBRARY_PATH).
The best MVA to be used in an analysis strongly depends on the particular
application. The evaluation factory provides various numerical benchmark
results to directly assess the performance of the MVA training on an
independent test sample. These are:
The signal efficiency at three representative background efficiencies
(which is 1−rejection).
The significance of an MVA estimator, defined by the difference
between the MVA mean values for signal and background, divided by the
quadratic sum of their root mean squares.
The separation of an MVA x, defined by the integral
½∫(S(x) − B(x))²/(S(x) + B(x)) dx, where
S(x) and B(x) are the (normalised) signal and background distributions.
The separation is zero for identical signal and background MVA shapes,
and it is one for shapes with no overlap.
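The separation and significance benchmarks can be illustrated with a short standard-library sketch. The function names are hypothetical; the sketch assumes binned MVA distributions with identical binning for the separation, and it reads "quadratic sum of the root mean squares" as the square root of the summed variances.

```python
import math

def separation(sig_hist, bkg_hist):
    """Binned estimate of 1/2 * integral (S - B)^2 / (S + B) dx.

    Both histograms are normalised to unit sum here, so the bin width
    cancels out of the integral.
    """
    s_total, b_total = sum(sig_hist), sum(bkg_hist)
    result = 0.0
    for s_raw, b_raw in zip(sig_hist, bkg_hist):
        s, b = s_raw / s_total, b_raw / b_total
        if s + b > 0:
            result += 0.5 * (s - b) ** 2 / (s + b)
    return result

def significance(sig_values, bkg_values):
    """|mean_S - mean_B| / sqrt(rms_S^2 + rms_B^2), RMS taken about the mean."""
    def mean_and_variance(values):
        mean = sum(values) / len(values)
        return mean, sum((v - mean) ** 2 for v in values) / len(values)
    mean_s, var_s = mean_and_variance(sig_values)
    mean_b, var_b = mean_and_variance(bkg_values)
    return abs(mean_s - mean_b) / math.sqrt(var_s + var_b)
```

Identical histograms give a separation of 0, and histograms with no common populated bin give 1, as stated above.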
The MVA output also prints the linear correlation coefficients between the input
variables, for signal and background, which can be useful to eliminate variables that
are too strongly correlated.
Once the TMVAnalysis.C macro has terminated execution, a GUI
(TMVAGui.C) will pop up that allows one to easily execute ROOT
scripts for the visualisation of the training, testing and evaluation results.
All ROOT scripts live in the subdirectory macros.
The GUI or scripts can also be directly executed from the command line
by typing
root -l ../macros/script.C\(<arguments>\)
All plots drawn can be saved as eps/png/gif files in the subdirectory "plots".
Optimal cuts maximise the signal efficiency at given background
efficiency. Other optimisation criteria, such as maximising the
squared signal significance, S²/(S+B),
with S and B being the signal and background yields,
then correspond to a particular point on the optimised
background-rejection versus signal-efficiency curve. Finding this working point
requires knowledge of the expected yields, which are in general not
known. Note also that for rare signals, Poissonian statistics
should be used, which modifies the significance criterion.
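To make the dependence on the expected yields concrete, the following sketch picks the point on an efficiency curve that maximises S²/(S+B). The function name, the curve representation as (signal efficiency, background efficiency) pairs and the example yields are all assumptions for illustration.

```python
def best_working_point(curve, n_signal, n_background):
    """Return the (eff_S, eff_B) pair that maximises S^2 / (S + B)."""
    def figure_of_merit(point):
        eff_s, eff_b = point
        s, b = eff_s * n_signal, eff_b * n_background
        return s * s / (s + b) if s + b > 0 else 0.0
    return max(curve, key=figure_of_merit)

# Hypothetical efficiency curve and yields
curve = [(0.2, 0.001), (0.5, 0.01), (0.9, 0.3)]
tight = best_working_point(curve, n_signal=100, n_background=10000)
loose = best_working_point(curve, n_signal=100, n_background=100)
```

With a large expected background the tighter point (0.5, 0.01) wins, whereas with a small background the loosest point (0.9, 0.3) is preferred, so the same curve yields different working points.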
If linear input correlations are present, it may be useful to use decorrelated
input variables before optimising cuts (option ":D").
Technically, the cut optimisation is achieved in TMVA by three
optional methods:
Monte Carlo generation (option: MC).
Fitting using a Genetic Algorithm (option: GA).
Fitting using Simulated Annealing (option: SA - still in testing phase).
Attempts using MINUIT (Simplex or Migrad) have not shown satisfactory
results, as the fits often fail by converging to local minima.
In most of the examples we tested, GA performed best.
The rectangular cut of a volume in the variable space is performed
using a binary tree to sort the training events. This provides
a significant reduction in computing time.
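The Monte Carlo option can be pictured as a random scan over cut windows. The sketch below is a simplified one-variable illustration under stated assumptions: the function names are hypothetical, the efficiencies are computed by brute force rather than with a binary tree, and a single background-efficiency limit is scanned instead of a full efficiency curve.

```python
import random

def efficiency(events, low, high):
    """Fraction of events passing the cut window low <= x <= high."""
    return sum(1 for x in events if low <= x <= high) / len(events)

def mc_cut_scan(signal, background, max_bkg_eff, n_trials=2000, seed=0):
    """Sample random cut windows and keep the one with the best signal
    efficiency among those not exceeding the allowed background efficiency.
    Returns (eff_S, low, high), or None if no sampled window qualifies."""
    rng = random.Random(seed)
    lo, hi = min(signal + background), max(signal + background)
    best = None
    for _ in range(n_trials):
        low, high = sorted((rng.uniform(lo, hi), rng.uniform(lo, hi)))
        if efficiency(background, low, high) > max_bkg_eff:
            continue
        eff_s = efficiency(signal, low, high)
        if best is None or eff_s > best[0]:
            best = (eff_s, low, high)
    return best
```

Repeating the scan for a series of background-efficiency limits would trace out the efficiency curve discussed above.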
The method of maximum likelihood is among the most straightforward
multivariate analyser approaches.
We define the likelihood ratio, R, for an event
by the ratio of the signal to the signal plus background
likelihoods.
The individual likelihoods are products of the corresponding probability
densities of the discriminating input variables used.
In practice, TMVA uses polynomial splines fitted to histograms, or
unbinned Gaussian kernel density estimators, to estimate the probability
density functions (PDF) obtained from the distributions of the
training variables.
Likelihood responses are often strongly peaked at 0/1. The booking option
"TransformOutput" zooms into these peaks (with no change in the performance)
using an inverse sigmoid transformation.
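A small sketch of the likelihood ratio and of an inverse sigmoid transformation of its output. The function names are hypothetical, the PDF values are assumed to be already evaluated for the event, and the scale parameter tau and its default value are assumptions for illustration.

```python
import math

def likelihood_ratio(sig_pdf_values, bkg_pdf_values):
    """R = L_S / (L_S + L_B), each L being the product of per-variable PDF values."""
    l_sig = math.prod(sig_pdf_values)
    l_bkg = math.prod(bkg_pdf_values)
    return l_sig / (l_sig + l_bkg)

def inverse_sigmoid(ratio, tau=15.0):
    """Map R in (0, 1) onto the real axis, stretching the peaks near 0 and 1."""
    return -math.log(1.0 / ratio - 1.0) / tau
```

The transformation is monotonic, so the ordering of events, and hence the performance, is unchanged; R = 0.5 is mapped to 0.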
This is a generalization of the above Likelihood method to Nvar
dimensions, where Nvar is the number of input variables
used in the MVA. If the multidimensional probability density functions
(PDFs) for signal and background were known, this method would contain the entire
physical information, and would therefore be optimal. Usually, kernel estimation
methods are used to approximate the PDFs using the events from the
training sample.
A very simple probability density estimator (PDE) has been suggested
in hep-ex/0211019. The
PDE for a given test event is obtained from counting the (normalized)
number of signal and background (training) events that occur in the
"vicinity" of the test event. The volume that describes "vicinity" is
user-defined. A search
method based on binary-trees is used to effectively reduce the
selection time for the range search. Three different volume definitions
are available:
MinMax:
the volume is defined in each dimension with respect
to the full variable range found in the training sample.
RMS:
the volume is defined in each dimension with respect
to the RMS estimated from the training sample.
Adaptive:
a volume element is defined in each dimension with
respect to the RMS estimated from the training sample. The overall
scale of the volume element is then determined for each event so
that the total number of events confined in the volume lies within
a user-defined range.
The adaptive range search is used by default. The option "UseKernelEstimate"
allows the user to weight the events found within the adaptive volume
by a multidimensional Gaussian function according to their distance to
the test event.
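The counting step can be sketched as follows for a fixed box volume. This is an illustration only: the function name, the brute-force loop (instead of the binary-tree range search), the response definition n_S/(n_S + n_B) and the value returned for an empty volume are assumptions.

```python
def pders_response(test_event, signal, background, half_widths):
    """Normalised signal fraction inside a box centred on the test event."""
    def fraction_inside(sample):
        inside = sum(
            1 for event in sample
            if all(abs(x - t) <= h
                   for x, t, h in zip(event, test_event, half_widths))
        )
        return inside / len(sample)
    n_sig = fraction_inside(signal)
    n_bkg = fraction_inside(background)
    if n_sig + n_bkg == 0:
        return 0.5  # assumption: an empty volume carries no information
    return n_sig / (n_sig + n_bkg)
```

The adaptive variant would rescale half_widths per test event until the number of events found lies within the requested range.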
Similar to PDERS, the k-nearest neighbour method compares
an observed (test) event to reference events from a training data set.
However, unlike PDERS, which in its original form uses a fixed-sized multidimensional volume
surrounding the test event, and in its augmented form resizes the volume as a function of
the local data density, the k-NN algorithm is intrinsically adaptive. It searches for a
fixed number of adjacent events, which then define a volume for the metric used. The k-NN
classifier has best performance when the boundary that separates signal and background
events has irregular features that cannot be easily approximated by parametric learning
methods.
The k-NN algorithm uses a kd-tree structure for sorting the training events,
which significantly improves the performance. The TMVA implementation of the k-NN method is
fast enough to allow classification of large data sets. In particular, it is faster
than the adaptive PDERS method.
Note that the k-NN method is not appropriate for problems where the number of input
variables exceeds about 10. In general, the larger the training set, the more the
algorithm probes small-scale features that distinguish signal and background events.
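A minimal sketch of the k-NN idea. It uses a brute-force neighbour search in place of the kd-tree, assumes a Euclidean metric, and takes the signal fraction among the k neighbours as the response; the function name and these choices are illustrative assumptions.

```python
def knn_response(test_event, signal, background, k=3):
    """Fraction of signal events among the k nearest training events."""
    labelled = [(e, 1) for e in signal] + [(e, 0) for e in background]

    def squared_distance(event):
        return sum((x - t) ** 2 for x, t in zip(event, test_event))

    nearest = sorted(labelled, key=lambda item: squared_distance(item[0]))[:k]
    return sum(label for _, label in nearest) / k
```

Because the k neighbours define the volume, the probed region shrinks automatically where the training data are dense.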
In the method of Fisher discriminants event selection is performed
in a transformed variable space with zero linear correlations, by
distinguishing the mean values of the signal and background
distributions.
The linear discriminant analysis determines an axis in the (correlated)
hyperspace of the input variables
such that, when projecting the output classes (signal and background)
upon this axis, they are pushed as far as possible away from each other,
while events of the same class are confined in a close vicinity.
The linearity property of this method is reflected in the metric with
which "far apart" and "close vicinity" are determined: the covariance
matrix of the discriminant variable space.
The classification of the events in signal and background classes
relies on the following characteristics (only): overall sample means
for each input variable, class-specific sample means,
and total covariance matrix. The covariance matrix
can be decomposed into the sum of a within-class and
a between-class matrix. They describe
the dispersion of events relative to the means of their own class (within-class
matrix), and relative to the overall sample means (between-class matrix).
The Fisher coefficients are then given by the product
of the difference vector of signal and background sample means and the
inverse within-class matrix.
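For two input variables the Fisher coefficients can be written out explicitly. The sketch below is illustrative: the function names are hypothetical, the within-class matrix is normalised by the total number of events (an assumption), and any overall normalisation factor or offset of the discriminant is omitted.

```python
def mean_vector(sample):
    n = len(sample)
    return [sum(e[i] for e in sample) / n for i in range(2)]

def scatter(sample, mean):
    """2x2 sum of outer products of the deviations from the class mean."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    for e in sample:
        d = [e[0] - mean[0], e[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                m[i][j] += d[i] * d[j]
    return m

def fisher_coefficients(signal, background):
    """Inverse within-class matrix times (mean_S - mean_B), two variables."""
    mu_s, mu_b = mean_vector(signal), mean_vector(background)
    s_s, s_b = scatter(signal, mu_s), scatter(background, mu_b)
    n = len(signal) + len(background)
    w = [[(s_s[i][j] + s_b[i][j]) / n for j in range(2)] for i in range(2)]
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    inv = [[w[1][1] / det, -w[0][1] / det],
           [-w[1][0] / det, w[0][0] / det]]
    diff = [mu_s[0] - mu_b[0], mu_s[1] - mu_b[1]]
    return [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]
```

If the two classes differ only in the mean of the first variable, only the first coefficient is non-zero, so the discriminating axis points along that variable.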
This MVA approach is used by the DØ collaboration (FNAL) for the
purpose of electron identification (see, e.g.,
hep-ex/9507007).
As it is implemented in TMVA, it is usually equivalent
to the Fisher-Mahalanobis discriminant, and it has only been added for
the purpose of completeness.
Two χ² estimators are computed for each event, one
for signal and one for background, using the estimates of the means and
covariance matrices obtained from the training sample.
TMVA then uses as normalised analyser for event i the ratio:
(χ²_S(i) − χ²_B(i)) / (χ²_S(i) + χ²_B(i)).
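The ratio can be illustrated with a short sketch. For brevity it assumes uncorrelated variables, so that each covariance matrix is diagonal and its inverse reduces to a division by the variance; the full method uses the complete covariance matrices. The function names are hypothetical.

```python
def chi2(event, means, variances):
    """Chi-squared of an event for one class, assuming a diagonal covariance."""
    return sum((x - m) ** 2 / v for x, m, v in zip(event, means, variances))

def chi2_ratio(event, sig_means, sig_vars, bkg_means, bkg_vars):
    """(chi2_S - chi2_B) / (chi2_S + chi2_B) for one event."""
    chi2_s = chi2(event, sig_means, sig_vars)
    chi2_b = chi2(event, bkg_means, bkg_vars)
    return (chi2_s - chi2_b) / (chi2_s + chi2_b)
```

With the sign convention of the ratio as written above, an event sitting exactly at the signal means gives -1, one at the background means gives +1, and an event equally far from both gives 0.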
The common goal of all TMVA discriminators is to determine an optimal separating
function in the multivariate space represented by the input variables. The Fisher
discriminant solves this analytically for the linear case, while artificial neural
networks, support vector machines or boosted decision trees provide nonlinear
approximations with -- in principle -- arbitrary precision if enough training
statistics is available and the chosen architecture is flexible enough.
The function discriminant analysis (FDA) provides an intermediate solution to the
problem with the aim to solve relatively simple or partially nonlinear problems.
The user provides the desired function with adjustable parameters via the configuration
option string, and FDA fits the parameters to it, requiring the signal (background)
function value to be as close as possible to 1 (0). Its advantage over the more
involved and automatic nonlinear discriminators is the simplicity and transparency
of the discrimination expression. A shortcoming is that FDA will underperform for
involved problems with complicated, phase space dependent nonlinear correlations.
The FDA performance depends on the complexity and fidelity of the user-defined
discriminator function. As a general rule, it should be able to reproduce the
discrimination power of any linear discriminant analysis. To reach into the nonlinear
domain, it is useful to inspect the correlation profiles of the input variables, and
add quadratic and higher polynomial terms between variables as necessary. Comparison
with more involved nonlinear classifiers can be used as a guide.
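As a toy illustration of the fitting idea, the sketch below fits the simplest possible discriminant function, f(x) = p0 + p1*x in one variable, to target values 1 for signal and 0 for background. It uses the closed-form least-squares solution, which is a stand-in chosen for brevity and not the fitting machinery of FDA; the function name is hypothetical.

```python
def fit_linear_discriminant(signal, background):
    """Least-squares fit of f(x) = p0 + p1*x to targets 1 (signal), 0 (background)."""
    xs = list(signal) + list(background)
    ys = [1.0] * len(signal) + [0.0] * len(background)
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    p1 = sxy / sxx
    return mean_y - p1 * mean_x, p1
```

A user-defined function with quadratic or higher terms would be fitted in the same spirit, but then requires a numerical minimiser.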
Three different ANN implementations are available in TMVA: the
TMlpANN, implemented in ROOT; the
Clermont-Ferrand ANN (CFMlpANN),
which has been translated from FORTRAN; and a new ANN (MLP), which is
very similar to the ROOT ANN, but can be trained significantly faster. All ANNs
belong to the class of Multilayer Perceptrons (MLP), which are feed-forward
networks according to the following propagation schema:
The input layer contains as many neurons as input variables used in the MVA.
The output layer contains a single neuron for the signal weight.
In between the input and output layers are a variable number
of k hidden layers with arbitrary numbers of neurons. (While the
structure of the input and output layers is determined by the problem, the
hidden layers can be configured by the user through the option string
of the method booking.)
As indicated in the sketch, all neuron inputs to a layer are linear
combinations of the neuron output of the previous layer. The transfer
from input to output within a neuron is performed by means of an "activation
function". In general, the activation function of a neuron can be
zero (deactivated), one (linear), or non-linear. The above example uses
a sigmoid activation function. The transfer function of the output layer
is usually linear. As a consequence, an ANN without a hidden layer should
give the same discrimination power as a linear discriminant analysis (Fisher).
In the case of one hidden layer, the ANN computes a linear combination of
sigmoid functions.
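The forward propagation through a network with one hidden layer can be sketched as follows. The function names, the explicit bias terms and the weight layout are assumptions for illustration; the hidden neurons use a sigmoid activation and the output neuron is linear, as described above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_response(inputs, hidden_weights, hidden_biases,
                 output_weights, output_bias):
    """One hidden layer with sigmoid activation, linear output neuron."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
```

The returned value is exactly a linear combination of sigmoid functions of the inputs; training consists of adjusting the weights and biases.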
The trained MLP architecture can be plotted with the macro macro/network.C.
Boosted decision trees have been successfully used in high-energy
physics analyses, for example by the MiniBooNE experiment
(Yang-Roe-Zhu, physics/0508045).
In Boosted Decision Trees, the selection is done by a majority vote
on the results of several decision trees, which are all derived from the
same training sample by supplying different event weights during the
training.
Decision trees: successive decision nodes are used to categorize the
events of the sample as either signal or background. Each node
uses only a single discriminating variable to decide if the event is
signal-like ("goes right") or background-like ("goes left"). This forms a
tree-like structure with "baskets" at the end (leaf nodes), and an
event is classified as either signal or background according to whether
the basket in which it ends up has been classified as signal or background during
the training. Training a decision tree is the process of defining the
"cut criteria" for each node. The training starts with the root
node. Here one takes the full training event sample and selects the
variable and corresponding cut value that give the best separation
between signal and background at this stage. Using this cut criterion,
the sample is then divided into two subsamples, a signal-like (right)
and a background-like (left) sample. Two new nodes are then created,
one for each of the two subsamples, and they are constructed using the
same mechanism as described for the root node. The division is stopped once
a node has reached either a minimum number of events, or a
minimum or maximum signal purity. These leaf nodes are then called
"signal" or "background" according to whether they contain more signal or more
background events from the training sample.
Boosting: the idea behind boosting is that signal events from the
training sample that end up in a background node (and vice versa) are
given a larger weight than events that are in the correct leaf node.
This results in a re-weighted training event sample, with which
a new decision tree can then be developed. The boosting can be applied several
times (typically 100-500 times), and one ends up with a set of
decision trees (a forest).
Bagging: in this particular variant of the Boosted Decision Trees the boosting is
not done on the basis of previous training results, but by a simple
stochastic re-sampling of the initial training event sample.
Analysis: applying an individual decision tree to a test event
results in a classification of the event as either signal or background.
For the boosted decision tree selection, an event is successively
subjected to the whole set of decision trees and, depending on
how often it is classified as signal, a "likelihood" estimator is constructed
for the event being signal or background. The value of this estimator is
then used to select the events from an event sample, and
the cut value on this estimator defines the efficiency and purity of
the selection.
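The way a forest turns individual signal/background decisions into a continuous estimator can be sketched with one-node trees. The function names are hypothetical, and the unweighted vote fraction is an assumption chosen for simplicity; trees may also enter the vote with individual weights.

```python
def make_stump(variable, cut):
    """One-node tree: signal-like ("goes right") if event[variable] > cut."""
    return lambda event: event[variable] > cut

def forest_response(event, trees):
    """Fraction of trees that classify the event as signal."""
    return sum(1 for tree in trees if tree(event)) / len(trees)

# Hypothetical forest of three one-node trees
forest = [make_stump(0, 1.0), make_stump(1, 0.5), make_stump(0, 2.0)]
```

Cutting on the returned fraction then defines the efficiency and purity of the selection.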
where is the vector of observables. In this case the base learners
consist of so-called rules. A single rule is essentially a function of a
series of cuts. A rule applied on a given event is non-zero only if all cuts in
the product are satisfied. In such a case the rule returns 1.
RuleFit consists of two main steps:
1. Rules generation
2. Fit of the rules to the training data, i.e.,
find the optimum coefficients in (1).
An efficient way of generating rules is to use decision trees. Each node (except
for the root node) produces one rule. The rule is defined by the series of cuts
required to reach the node starting from the root. In the current implementation,
the forest is generated through boosting (default) or by training each tree on
random subsamples.
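A rule and a rule ensemble can be sketched as follows. The function names and the cut representation are hypothetical, and the ensemble is assumed to be a linear combination of rules with an offset, which is an assumption about the form of the model.

```python
def make_rule(cuts):
    """cuts: list of (variable_index, low, high) requirements."""
    def rule(event):
        # The rule returns 1 only if every cut is satisfied, else 0.
        return 1 if all(low <= event[i] < high for i, low, high in cuts) else 0
    return rule

def rule_ensemble(event, offset, coefficients, rules):
    """Assumed linear model: offset + sum of coefficient * rule(event)."""
    return offset + sum(a * r(event) for a, r in zip(coefficients, rules))
```

In a decision tree, the cuts of a rule are those met on the path from the root to the node that defines the rule.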
In the early 1960s the linear support vector method
was developed to construct
separating hyperplanes for pattern recognition problems.
It took 30 years before the method was generalised to nonlinear separating
functions and to estimate real-valued functions
(regression). At that moment it became a general purpose algorithm
performing classification and regression tasks which can compete with neural networks
and probability density estimators. Typical applications of SVMs include text
categorisation, character recognition, bioinformatics and face detection.
The main idea of the SVM approach is to build a hyperplane that separates signal and
background vectors (events) using only a minimal subset of all
training vectors (support vectors). The position of the hyperplane is
obtained by maximizing the margin between it and the support vectors.
The extension to nonlinear SVMs is performed by
mapping the input vectors onto a higher dimensional feature space in which signal
and background events can be separated by a linear procedure using an optimally
separating hyperplane. The use of kernel functions thereby eliminates the explicit
transformation to the feature space and simplifies the computation.
The implemented SVM algorithm performs the classification task using optionally linear,
polynomial, Gaussian or sigmoidal kernel functions. The Gaussian kernel allows one to apply
any discriminating shape in the input space.
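As an illustration of a kernel function, a Gaussian kernel can be written as below. The function name and the parameterisation exp(-|x - y|² / (2σ²)) are assumptions; other conventions for the width parameter exist.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """exp(-|x - y|^2 / (2 sigma^2)) for two input vectors."""
    dist2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-dist2 / (2.0 * sigma ** 2))
```

The kernel equals 1 for identical vectors and falls off with their distance; the width sigma controls how local the resulting decision boundary can become.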
An important task is to properly choose the kernel parameters (the width in the case of the
Gaussian kernel) and a cost parameter. They must be optimized experimentally by
the user; the advised method is to run SVM several times with different sets of parameters,
performing a grid scan, and to choose the optimal configuration.
The SVM training time scales quadratically with the number of vectors in the training
event sample.