TMVA Home

NormMode: as called in PrepareTrainingAndTestTree
The default has changed and is now "EqualNumEvents" (with corrected meaning). Previously, NumEvents and EqualNumEvents took training plus test events into account (by mistake/miscommunication); they now correctly normalise the training events only. The purpose of these normalisations is to make it easy to force the effective (weighted) number of training events used for signal (class 0) to equal that used for background (the sum of all remaining classes in multiclass mode).

NumEvents: the weighted number of events is scaled, independently for signal and background, such that the sum of weights equals the number of events given in the Factory::PrepareTrainingAndTestTree("",nTrain_Signal=3000,nTrain_Background=6000) call. This example call hence results in twice as many (weighted) background events as signal events in the training, no matter what the individual event weights have been. Watch out: if you specify nTrain_Signal=0,nTrain_Background=0, the ratio follows the total numbers of MC events in the signal and background samples, which can be very different from the usually desirable ratio of about the same weighted number of signal and background events in the training. In that case it is better to use EqualNumEvents.
EqualNumEvents: for the signal events, the same is done as for NumEvents. The background events, however, are reweighted such that their sum of weights equals that of the signal events. The training thus sees the same effective (weighted) number of signal and background events.
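The two normalisation modes can be illustrated with a minimal sketch in plain Python. The function name `normalise_weights` and its arguments are hypothetical and not part of the TMVA API; the sketch also assumes non-zero requested event numbers, so it does not cover the nTrain_Signal=0,nTrain_Background=0 case.

```python
def normalise_weights(sig_w, bkg_w, n_sig, n_bkg, mode="EqualNumEvents"):
    """Rescale per-event training weights (hypothetical helper, not TMVA code)."""
    if mode not in ("NumEvents", "EqualNumEvents"):
        raise ValueError("unknown mode: " + mode)
    # NumEvents: each class is scaled to the requested number of events.
    target_s = float(n_sig)
    target_b = float(n_bkg)
    if mode == "EqualNumEvents":
        # Background is scaled to the same sum of weights as the signal.
        target_b = target_s
    scale_s = target_s / sum(sig_w)
    scale_b = target_b / sum(bkg_w)
    return [w * scale_s for w in sig_w], [w * scale_b for w in bkg_w]


if __name__ == "__main__":
    sig, bkg = normalise_weights([0.5, 1.5, 1.0], [2.0, 2.0], 3000, 6000)
    print(sum(sig), sum(bkg))
```

In "NumEvents" mode the same call would leave the background with a sum of weights of 6000, i.e. twice the signal.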

Transformations=I is the default again in the Factory (this defines which variable distribution plots are added to the TMVA output file - and hence displayable via the TMVAGui)

Boosted Decision Trees:

Some changes to the training options:

nEventsMin (deprecated) please replace by → MinNodeSize: The option nEventsMin, which specified the minimum number of training events in a leaf node as an absolute number, has been replaced by "MinNodeSize", which is given as a percentage of the training sample. This way the training options become less dependent on the actual size of the training sample.
NNodesMax (deprecated) please replace by → MaxDepth
GradBaggingFraction and UseNTrainEvents replaced by BaggedSampleFraction: - they both meant the same thing and are now deprecated → use BaggedSampleFraction instead
UseBaggedGrad replaced by UseBaggedBoost: - this way, the use of a bagged sample in GradBoost and in AdaBoost has the same option name; - the 'random subsample' in GradBoost has also been replaced by a properly resampled bootstrap sample (sampling with replacement)
UseWeightedTrees → removed: - it was the default anyway and the only reasonable choice
PruneBeforeBoost → removed: - it has been mostly a debug/trial option
NegWeightTreatment=IgnoreNegWeights → replaced by NegWeightTreatment=IgnoreNegWeightsInTraining: - Unfortunately the default value "IgnoreNegWeights" of the BDT option "NegWeightTreatment" collided with a global option and had to be replaced.
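When migrating an old nEventsMin setting, the equivalent MinNodeSize value follows from simple arithmetic. The sketch below is only an illustration: the helper names are hypothetical, the percentage is assumed to refer to the total number of training events, and the "MinNodeSize=<value>%" spelling of the option is an assumption that should be checked against the Users Guide.

```python
def min_node_size_percent(n_events_min, n_train):
    """Express an absolute leaf-size requirement as a percentage of the sample."""
    if n_train <= 0:
        raise ValueError("n_train must be positive")
    return 100.0 * n_events_min / n_train


def min_node_size_option(n_events_min, n_train):
    # The option spelling below is an assumption, not taken from this page.
    value = min_node_size_percent(n_events_min, n_train)
    return f"MinNodeSize={value:g}%"


if __name__ == "__main__":
    # 150 events out of 6000 training events
    print(min_node_size_option(150, 6000))
```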

Other changes to the training:

Regardless of the NormMode set in the TMVA::Factory, the BDT training now always starts by reweighting the background such that its 'sum of weights' equals that of the signal. Previously, an imbalance here merely resulted in many misclassified events, so that the same reweighting was effectively done in the first boosting step.

MethodBoost:

Some cleanup: removed the strange experimental boosting options (HighEdgeGaus, HighEdgeCoPara, ...).

Removed the MethodWeightType options; the weight type is now defined by the boost method. These were trial options, and for clarity it is much better to stick to the "standard" ones (i.e. log(alpha) for AdaBoost etc.).

Up to now, the first classifier was trained with the full sample. I think, however, that it should also use a bagged sample (particularly if smaller sample sizes for the bagged samples were requested), and this has been changed accordingly.

TMVA Executive Summary

TMVA is integrated into ROOT and includes the following multivariate methods:

Rectangular cut optimisation
Projective likelihood estimation (PDE approach)
Multidimensional probability density estimation (PDE - range-search approach and PDE-Foam)
Multidimensional k-nearest neighbour method
Linear discriminant analysis (H-Matrix, Fisher and linear (LD) discriminants)
Function discriminant analysis (FDA)
Artificial neural networks (three different MLP implementations)
Boosted/Bagged decision trees
Predictive learning via rule ensembles (RuleFit)
Support Vector Machine (SVM)

TMVA works in transparent factory mode to guarantee an unbiased performance comparison between the algorithms: all MVA methods see the same training and test data, and are evaluated following the same prescriptions within the same execution job. A Factory class organises the interaction between the user and the TMVA analysis steps. It performs preanalysis and preprocessing of the training data to assess basic properties of the discriminating variables used as input to the methods. The correlation coefficients of the input variables are calculated and displayed, and a preliminary ranking is derived (which is later superseded by method-specific variable rankings). The variables can be linearly transformed (individually for each classifier) into a non-correlated variable space or projected upon their principal components. For performance comparison, the analysis job prints tabulated results for some benchmark measures. Smooth efficiency versus background rejection curves are stored in a ROOT output file, together with other graphical evaluation information. These results can be displayed using ROOT macros, which are conveniently executed via graphical user interfaces (one each for classification and regression) that come with the TMVA distribution.

The TMVA training job runs alternatively as a ROOT script, as a standalone executable, where libTMVA.so is linked as a shared library, or as a Python script via the PyROOT interface. Each classifier trained in one of these applications writes its configuration and training results to result ("weight") files, which consist of text and (optionally) ROOT files.

An easy-to-use Reader class is provided, which reads and interprets the weight files (interfaced by the corresponding classifiers), and which can be included in any C++ executable, ROOT macro or python analysis job.

For standalone use of the trained classifiers, TMVA also generates lightweight C++ response classes, which contain the encoded information from the weight files so that these are not required anymore. These classes depend neither on TMVA or ROOT nor on any other external library.

We have put emphasis on the clarity and functionality of the Factory and Reader interfaces to the user applications. All MVA methods run with reasonable default configurations, so that for standard applications that do not require particular tuning, the user script for a full TMVA analysis will hardly exceed a few lines of code. For individual optimisation the user can (and should) customize the classifiers via configuration strings.

Please report any problems and/or suggestions for improvements to the authors

The Code

TMVA comes with your local ROOT distribution.

If you depend on a newer version, the TMVA source code to build a new shared library can be downloaded from sourceforge.net.
View SVN gives you a snapshot of the current SVN trunk in ROOT.

Follow this link for a brief tutorial on "HowTo" get started.

Example training macros: test/TMVAClassification.C (classification) and test/TMVARegression.C (regression).

Start.

Create the factory (with some bookkeeping arguments).

Define the input samples (signal and background for classification, a single input sample for regression): these can be separate ROOT Trees, a single ROOT Tree with a type identifier for classification, or separate ASCII files.

Add the input variable names to be used to train the MVAs to the factory.

Prepare the training and test ROOT Trees using the numbers of events given in the arguments. Precuts can be applied at this step.

Book the MVA methods. The first argument to the factory is the instance name of the method. It is used to tag the evaluation plots and weight files for this method. The second argument is the unique method type (enum). The third argument consists of an option string, which individually configures each method. Detailed information can be found in the corresponding C++ class implementations.

Propagate the methods through the training, testing and evaluation phases.

End.

Example application macros: test/TMVAClassificationApplication.C (classification) and test/TMVARegressionApplication.C (regression).

Environment: $ROOTSYS must point to the ROOT installation, with $ROOTSYS/bin in $PATH and $ROOTSYS/lib in $LD_LIBRARY_PATH.

Documentation on the MVA Techniques

Users Guide

Rectangular Cut Optimisation

Optimal cuts maximise the signal efficiency at a given background efficiency. Other optimisation criteria, such as maximising the signal significance-squared, S²/(S+B), with S and B being the signal and background yields, correspond to a particular point on the optimised background-rejection versus signal-efficiency curve. Finding this working point requires knowledge of the expected yields, which is in general not available. Note also that for rare signals, Poissonian statistics should be used, which modifies the significance criterion.
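As an illustration of how the expected yields select a working point, the following sketch scans a list of (signal efficiency, background efficiency) points and picks the one that maximises S²/(S+B). The function name and the input format are hypothetical; the numbers in the usage example are invented for demonstration only.

```python
def best_working_point(curve, n_sig, n_bkg):
    """Return ((eff_s, eff_b), S^2/(S+B)) for the best point on the curve.

    curve: list of (signal efficiency, background efficiency) pairs.
    n_sig, n_bkg: expected signal and background yields before cuts.
    """
    best = None
    for eff_s, eff_b in curve:
        s = eff_s * n_sig
        b = eff_b * n_bkg
        if s + b == 0:
            continue  # significance is undefined for an empty selection
        value = s * s / (s + b)
        if best is None or value > best[1]:
            best = ((eff_s, eff_b), value)
    return best


if __name__ == "__main__":
    curve = [(0.9, 0.5), (0.7, 0.1), (0.4, 0.01)]
    print(best_working_point(curve, n_sig=100, n_bkg=1000))
    print(best_working_point(curve, n_sig=1000, n_bkg=1000))
```

With the same efficiency curve, the two calls select different points, which is why the working point cannot be fixed without knowing the yields.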

If linear input correlations are present, it may be useful to use decorrelated input variables before optimising cuts (option ":D"). Technically, the cut optimisation is achieved in TMVA by one of three methods:

Monte Carlo generation (option: MC).
Fitting using a Genetic Algorithm (option: GA).
Fitting using Simulated Annealing (option: SA - still in testing phase).

Attempts using MINUIT (Simplex or Migrad) have not shown satisfactory results, as the fits often fail by converging to local minima. For most examples we tested, GA performed best.

The selection of the events inside a rectangular volume in the variable space uses a binary tree to sort the training events. This provides a significant reduction in computing time.

Projective Likelihood (PDE Approach)

The method of maximum likelihood is among the most straightforward multivariate analyser approaches. We define the likelihood ratio, R, for an event by the ratio of the signal to the signal plus background likelihoods. The individual likelihoods are products of the corresponding probability densities of the discriminating input variables used. In practice, TMVA uses polynomial splines fitted to histograms, or unbinned Gaussian kernel density estimators, to estimate the probability density functions (PDF) obtained from the distributions of the training variables.

Likelihood responses are often strongly peaked at 0/1. The booking option "TransformOutput" zooms into these peaks (with no change in the performance) using an inverse sigmoid transformation.
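A minimal sketch of the likelihood ratio and of an inverse sigmoid transformation is given below. It uses analytic Gaussian PDFs instead of splines or kernel estimates, and the form -ln(1/R - 1)/tau with tau = 15 is an assumption made for illustration; all function names are hypothetical.

```python
import math


def gaussian_pdf(mean, sigma):
    """Return a one-dimensional Gaussian density (stand-in for a fitted PDF)."""
    norm = sigma * math.sqrt(2.0 * math.pi)
    return lambda x: math.exp(-0.5 * ((x - mean) / sigma) ** 2) / norm


def likelihood_ratio(event, pdfs_sig, pdfs_bkg):
    """R = L_S / (L_S + L_B), each L being a product of per-variable PDFs."""
    l_sig = math.prod(pdf(x) for pdf, x in zip(pdfs_sig, event))
    l_bkg = math.prod(pdf(x) for pdf, x in zip(pdfs_bkg, event))
    return l_sig / (l_sig + l_bkg)


def inverse_sigmoid(r, tau=15.0):
    """Stretch the peaks at 0 and 1; requires 0 < r < 1 (tau is an assumption)."""
    return -math.log(1.0 / r - 1.0) / tau


if __name__ == "__main__":
    sig = [gaussian_pdf(1.0, 1.0), gaussian_pdf(1.0, 1.0)]
    bkg = [gaussian_pdf(-1.0, 1.0), gaussian_pdf(-1.0, 1.0)]
    r = likelihood_ratio((1.0, 1.0), sig, bkg)
    print(r, inverse_sigmoid(r))
```

Because the transformation is monotonic, it changes the shape of the response distribution but not the ordering of events, which is why the performance is unchanged.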

Multidimensional Probability Density Estimator Range-Search (PDERS)

This is a generalization of the above Likelihood methods to N_var dimensions, where N_var is the number of input variables used in the MVA. If the multidimensional probability density functions (PDFs) for signal and background were known, this method would contain the entire physical information and would therefore be optimal. Usually, kernel estimation methods are used to approximate the PDFs using the events from the training sample.

A very simple probability density estimator (PDE) has been suggested in hep-ex/0211019. The PDE for a given test event is obtained by counting the (normalized) number of signal and background (training) events that occur in the "vicinity" of the test event. The volume that describes "vicinity" is user-defined. A search method based on binary trees is used to effectively reduce the selection time for the range search. Three different volume definitions are available:

MinMax: the volume is defined in each dimension with respect to the full variable range found in the training sample.
RMS: the volume is defined in each dimension with respect to the RMS estimated from the training sample.
Adaptive: a volume element is defined in each dimension with respect to the RMS estimated from the training sample. The overall scale of the volume element is then determined for each event so that the total number of events confined in the volume is within a user-defined range.

The adaptive range search is used by default. The option "UseKernelEstimate" allows the user to weight the events found within the adaptive volume by a multidimensional Gaussian function according to their distance to the test event.
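The counting idea can be sketched with a brute-force search over a fixed box. This is only an illustration: it does not use a binary tree, the box half-widths are supplied by hand rather than derived from MinMax, RMS or the adaptive scheme, and combining the normalized counts as n_S/(n_S + n_B) is an assumption. The function name is hypothetical.

```python
def pders_response(test_event, signal, background, half_width):
    """Fraction of (normalized) signal counts in a box around the test event."""

    def in_box(event):
        return all(abs(x - c) <= h
                   for x, c, h in zip(event, test_event, half_width))

    # Counts are normalized to the size of each training sample.
    n_sig = sum(1 for ev in signal if in_box(ev)) / len(signal)
    n_bkg = sum(1 for ev in background if in_box(ev)) / len(background)
    if n_sig + n_bkg == 0:
        return None  # no training events in the vicinity
    return n_sig / (n_sig + n_bkg)


if __name__ == "__main__":
    signal = [(0.0, 0.0), (0.1, 0.1), (2.0, 2.0)]
    background = [(0.05, 0.0), (3.0, 3.0), (4.0, 4.0)]
    print(pders_response((0.0, 0.0), signal, background, (0.5, 0.5)))
```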

Multidimensional k-Nearest Neighbour (k-NN) method

Similar to PDERS, the k-nearest neighbour method compares an observed (test) event to reference events from a training data set. However, unlike PDERS, which in its original form uses a fixed-sized multidimensional volume surrounding the test event, and in its augmented form resizes the volume as a function of the local data density, the k-NN algorithm is intrinsically adaptive. It searches for a fixed number of adjacent events, which then define a volume for the metric used. The k-NN method performs best when the boundary that separates signal and background events has irregular features that cannot be easily approximated by parametric learning methods.

The k-NN algorithm uses a kd-tree structure to sort the training events, which significantly improves the performance. The TMVA implementation of the k-NN method is fast enough to allow classification and regression for large data sets. In particular, it is faster than the adaptive PDERS method.

Note that the k-NN method is not appropriate for problems where the number of input variables exceeds about 10. In general, the larger the training set, the more the algorithm probes small-scale features that distinguish signal and background events.
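The basic k-NN classification step can be sketched as follows. This is a brute-force illustration with a Euclidean metric and unweighted events; it does not use a kd-tree, and the function name and the choice of the signal fraction among the neighbours as response are assumptions.

```python
import math


def knn_response(test_event, training, k):
    """Fraction of signal events among the k nearest training events.

    training: list of (coordinates, is_signal) pairs.
    """
    ranked = sorted(training, key=lambda item: math.dist(item[0], test_event))
    nearest = ranked[:k]
    return sum(1 for _, is_signal in nearest if is_signal) / len(nearest)


if __name__ == "__main__":
    training = [
        ((0.0, 0.0), True),
        ((0.1, 0.0), True),
        ((0.2, 0.1), False),
        ((5.0, 5.0), False),
        ((6.0, 5.0), False),
    ]
    print(knn_response((0.0, 0.0), training, k=3))
```

Unlike the fixed box of a range search, the volume here is set implicitly by the distance to the k-th neighbour, which is what makes the method intrinsically adaptive.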

Fisher and Mahalanobis Discriminants

In the method of Fisher discriminants event selection is performed in a transformed variable space with zero linear correlations, by distinguishing the mean values of the signal and background distributions.

The linear discriminant analysis determines an axis in the (correlated) hyperspace of the input variables such that, when projecting the output classes (signal and background) upon this axis, they are pushed as far as possible away from each other, while events of a same class are confined in a close vicinity. The linearity property of this method is reflected in the metric with which "far apart" and "close vicinity" are determined: the covariance matrix of the discriminant variable space.

The classification of the events in signal and background classes relies on the following characteristics (only): overall sample means for each input variable, class-specific sample means, and total covariance matrix. The covariance matrix can be decomposed into the sum of a within-class and a between-class matrix. They describe the dispersion of events relative to the means of their own class (within-class matrix), and relative to the overall sample means (between-class matrix). The Fisher coefficients are then given by the product of the difference vector of signal and background sample means and the inverse within-class matrix.
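The last sentence can be made concrete with a small sketch for two input variables. It computes the within-class matrix from unweighted events and multiplies its inverse with the difference of the class means; any overall normalisation factor or offset is omitted, and the function names are hypothetical.

```python
def class_mean(events):
    n = len(events)
    return [sum(ev[i] for ev in events) / n for i in range(2)]


def within_class_matrix(signal, background):
    """Sum over both classes of the dispersion around the class's own mean."""
    w = [[0.0, 0.0], [0.0, 0.0]]
    for events in (signal, background):
        m = class_mean(events)
        for ev in events:
            d = [ev[0] - m[0], ev[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    w[i][j] += d[i] * d[j]
    return w


def fisher_coefficients(signal, background):
    """Inverse within-class matrix times (mean_S - mean_B), unnormalized."""
    w = within_class_matrix(signal, background)
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    inv = [[w[1][1] / det, -w[0][1] / det],
           [-w[1][0] / det, w[0][0] / det]]
    ms, mb = class_mean(signal), class_mean(background)
    diff = [ms[0] - mb[0], ms[1] - mb[1]]
    return [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]


if __name__ == "__main__":
    signal = [(1, 0), (3, 0), (2, 1), (2, -1)]
    background = [(-1, 0), (-3, 0), (-2, 1), (-2, -1)]
    print(fisher_coefficients(signal, background))
```

In this toy sample the classes differ only in the first variable, so the second coefficient vanishes.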

H-Matrix (χ²) Estimator

This MVA approach is used by the DØ collaboration (FNAL) for the purpose of electron identification (see, e.g., hep-ex/9507007). As it is implemented in TMVA, it is usually equivalent to the Fisher-Mahalanobis discriminant, and it has only been added for the purpose of completeness. Two χ² estimators are computed for an event, one for signal and one for background, using the estimates for the means and covariance matrices obtained from the training sample. TMVA then uses as normalised analyser for event i the ratio: (χ_S²(i) − χ_B²(i)) / (χ_S²(i) + χ_B²(i)).

Function Discriminant Analysis (FDA)

The common goal of all TMVA discriminators is to determine an optimal separating function in the multivariate space represented by the input variables. The Fisher discriminant solves this analytically for the linear case, while artificial neural networks, support vector machines or boosted decision trees provide nonlinear approximations with -- in principle -- arbitrary precision if enough training statistics is available and the chosen architecture is flexible enough.

The function discriminant analysis (FDA) provides an intermediate solution to the problem with the aim to solve relatively simple or partially nonlinear problems. The user provides the desired function with adjustable parameters via the configuration option string, and FDA fits the parameters to it, requiring the signal (background) function value to be as close as possible to 1 (0). Its advantage over the more involved and automatic nonlinear discriminators is the simplicity and transparency of the discrimination expression. A shortcoming is that FDA will underperform for involved problems with complicated, phase space dependent nonlinear correlations.

The FDA performance depends on the complexity and fidelity of the user-defined discriminator function. As a general rule, it should be able to reproduce the discrimination power of any linear discriminant analysis. To reach into the nonlinear domain, it is useful to inspect the correlation profiles of the input variables, and add quadratic and higher polynomial terms between variables as necessary. Comparison with more involved nonlinear MVA methods can be used as a guide.

Artificial Neural Networks (Non-Linear Discriminant Analysis)

Three different ANN implementations are used in TMVA: the TMlpANN, implemented in ROOT; the Clermont-Ferrand ANN (CFMlpANN), which has been translated from FORTRAN; and a new ANN (MLP), which is very similar to the ROOT ANN but can be trained significantly faster. All ANNs belong to the class of Multilayer Perceptrons (MLP), which are feed-forward networks according to the following propagation schema:
[Figure: Schema for artificial neural network]

The input layer contains as many neurons as input variables used in the MVA. The output layer contains a single neuron for the signal weight. In between the input and output layers are a variable number of k hidden layers with arbitrary numbers of neurons. (While the structure of the input and output layers is determined by the problem, the hidden layers can be configured by the user through the option string of the method booking.)

As indicated in the sketch, all neuron inputs to a layer are linear combinations of the neuron outputs of the previous layer. The transfer from input to output within a neuron is performed by means of an "activation function". In general, the activation function of a neuron can be zero (deactivated), one (linear), or non-linear. The above example uses a sigmoid activation function. The transfer function of the output layer is usually linear. As a consequence, an ANN without a hidden layer should give the same discrimination power as a linear discriminant analysis (Fisher). In the case of one hidden layer, the ANN computes a linear combination of sigmoids.
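The propagation through a network with one hidden layer can be sketched in a few lines. The weights are passed in by hand (no training is shown), the bias terms are an assumption of this sketch, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases, activation):
    """Each neuron applies its activation to a linear combination of inputs."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def mlp_response(inputs, hidden_w, hidden_b, out_w, out_b):
    """One sigmoid hidden layer followed by a single linear output neuron."""
    hidden = layer(inputs, hidden_w, hidden_b, sigmoid)
    return layer(hidden, [out_w], [out_b], lambda x: x)[0]


if __name__ == "__main__":
    response = mlp_response(
        inputs=[1.0, 2.0],
        hidden_w=[[0.5, -0.5], [1.0, 1.0]],
        hidden_b=[0.0, -1.0],
        out_w=[1.0, -1.0],
        out_b=0.0,
    )
    print(response)
```

The output is, as stated above, a linear combination of sigmoids of linear combinations of the inputs.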

The trained MLP architecture can be plotted with the macro macro/network.C. Click here for an example plot.

Boosted Decision Trees

Boosted decision trees have been successfully used in High Energy Physics analysis for example by the MiniBooNE experiment (Yang-Roe-Zhu, physics/0508045). In Boosted Decision Trees, the selection is done on a majority vote on the result of several decision trees, which are all derived from the same training sample by supplying different event weights during the training.

Decision trees: successive decision nodes are used to categorize the events of the sample as either signal or background. Each node uses only a single discriminating variable to decide if the event is signal-like ("goes right") or background-like ("goes left"). This forms a tree-like structure with "baskets" at the end (leaf nodes), and an event is classified as either signal or background according to whether the basket where it ends up has been classified as signal or background during the training. Training a decision tree is the process of defining the "cut criteria" for each node. The training starts with the root node. Here one takes the full training event sample and selects the variable and corresponding cut value that give the best separation between signal and background at this stage. Using this cut criterion, the sample is then divided into two subsamples, a signal-like (right) and a background-like (left) sample. Two new nodes are then created, one for each of the two subsamples, and they are constructed using the same mechanism as described for the root node. The division is stopped once a node has reached either a minimum number of events, or a minimum or maximum signal purity. These leaf nodes are then called "signal" or "background" according to whether they contain more signal or more background events from the training sample.
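The choice of the cut at a single node can be sketched as a scan over candidate cut values for one variable. The text does not name the separation measure; the Gini index p(1-p) used here is an assumption, and the function names and event format are hypothetical.

```python
def gini(sig, bkg):
    """Gini index p(1-p) of a node with weighted signal and background sums."""
    total = sig + bkg
    if total == 0:
        return 0.0
    purity = sig / total
    return purity * (1.0 - purity)


def best_cut(events, var):
    """Scan cuts on one variable; return (cut value, separation gain).

    events: list of (values, is_signal, weight) tuples.
    Events with values[var] above the cut go to the right (signal-like) node.
    """
    values = sorted({ev[0][var] for ev in events})
    tot_s = sum(w for _, s, w in events if s)
    tot_b = sum(w for _, s, w in events if not s)
    parent = (tot_s + tot_b) * gini(tot_s, tot_b)
    best = (None, 0.0)
    for low, high in zip(values, values[1:]):
        cut = 0.5 * (low + high)  # midpoint between neighbouring values
        right_s = sum(w for v, s, w in events if s and v[var] > cut)
        right_b = sum(w for v, s, w in events if not s and v[var] > cut)
        left_s, left_b = tot_s - right_s, tot_b - right_b
        gain = (parent
                - (right_s + right_b) * gini(right_s, right_b)
                - (left_s + left_b) * gini(left_s, left_b))
        if gain > best[1]:
            best = (cut, gain)
    return best


if __name__ == "__main__":
    events = [((0.1,), False, 1.0), ((0.2,), False, 1.0),
              ((0.8,), True, 1.0), ((0.9,), True, 1.0)]
    print(best_cut(events, var=0))
```

A full tree would repeat this scan over all variables at each node and recurse into the two subsamples until a stopping criterion is met.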

Boosting: the idea behind boosting is that signal events from the training sample that end up in a background node (and vice versa) are given a larger weight than events that are in the correct leaf node. This results in a re-weighted training event sample, with which a new decision tree can then be developed. The boosting can be applied several times (typically 100-500 times) and one ends up with a set of decision trees (a forest).
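One re-weighting step can be sketched as follows. The boost factor (1 - err)/err applied to misclassified events is the usual AdaBoost choice and is an assumption here, as is the renormalisation to the original sum of weights; the function name is hypothetical.

```python
def adaboost_reweight(weights, misclassified):
    """Boost the weights of misclassified events; return (new weights, alpha).

    weights: current event weights.
    misclassified: list of booleans, True if the tree got the event wrong.
    """
    total = sum(weights)
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong) / total
    if not 0.0 < err < 0.5:
        raise ValueError("weighted error rate must be between 0 and 0.5")
    alpha = (1.0 - err) / err
    boosted = [w * alpha if wrong else w
               for w, wrong in zip(weights, misclassified)]
    # Renormalize so that the sum of weights stays constant.
    scale = total / sum(boosted)
    return [w * scale for w in boosted], alpha


if __name__ == "__main__":
    new_weights, alpha = adaboost_reweight(
        [1.0, 1.0, 1.0, 1.0], [True, False, False, False])
    print(alpha, new_weights)
```

With this choice the misclassified events carry half of the total weight after the step, so the next tree concentrates on them.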

Bagging: In this particular variant of the Boosted Decision Trees the boosting is not done on the basis of previous training results, but by a simple stochastic re-sampling of the initial training event sample.

Analysis: applying an individual decision tree to a test event results in a classification of the event as either signal or background. For the boosted decision tree selection, an event is successively subjected to the whole set of decision trees and depending on how often it is classified as signal, a "likelihood" estimator is constructed for the event being signal or background. The value of this estimator is the one which is then used to select the events from an event sample, and the cut value on this estimator defines the efficiency and purity of the selection.

Predictive Learning via Rule Ensembles (RuleFit)

This is a TMVA implementation of Friedman-Popescu's RuleFit method. The discriminator is a linear combination of base learners $f_m(\vec{x})$:

$\displaystyle F(\vec{x}) = a_0 + \sum_{m=1}^M a_m f_m(\vec{x}) \qquad (1)$

where $\vec{x}$ is the vector of observables. In this case the base learners consist of so-called rules. A single rule is essentially a function of a series of cuts. A rule applied to a given event is non-zero only if all cuts in the product are satisfied. In such a case the rule returns 1.

RuleFit consists of two main steps:

(1)

An efficient way of generating rules is to use decision trees. Each node (except for the root node) produces one rule. The rule is defined by the series of cuts required to reach the node starting from the root. In the current implementation, the forest is generated through boosting (default) or by training each tree on random subsamples.
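Equation (1) with rules as base learners can be sketched as follows. Each rule is represented as a list of interval cuts on single variables; this representation, the function names, and the coefficient values in the usage example are hypothetical, and neither the rule generation from trees nor the fit of the coefficients is shown.

```python
def make_rule(cuts):
    """Build a rule from cuts given as (variable index, lower, upper).

    The rule returns 1 if the event passes all cuts and 0 otherwise.
    """
    def rule(x):
        passed = all(lower <= x[i] < upper for i, lower, upper in cuts)
        return 1.0 if passed else 0.0
    return rule


def rulefit_response(x, offset, coefficients, rules):
    """F(x) = offset + sum over m of coefficient_m * rule_m(x)."""
    return offset + sum(a * rule(x) for a, rule in zip(coefficients, rules))


if __name__ == "__main__":
    inf = float("inf")
    rules = [
        make_rule([(0, 0.0, 1.0)]),                 # one cut on variable 0
        make_rule([(0, 0.0, 1.0), (1, 2.0, inf)]),  # adds a cut on variable 1
    ]
    print(rulefit_response((0.5, 3.0), 0.1, [0.5, -0.2], rules))
```

The second rule extends the first by one more cut, mirroring how the path to a deeper tree node adds cuts to the path of its parent.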

Support Vector Machine (SVM)

In the early 1960s the linear support vector method was developed to construct separating hyperplanes for pattern recognition problems. It took 30 years before the method was generalised to nonlinear separating functions and to estimate real-valued functions (regression). At that moment it became a general purpose algorithm performing classification and regression tasks which can compete with neural networks and probability density estimators. Typical applications of SVMs include text categorisation, character recognition, bioinformatics and face detection.

The main idea of the SVM approach is to build a hyperplane that separates signal and background vectors (events) using only a minimal subset of all training vectors (support vectors). The position of the hyperplane is obtained by maximizing the margin between it and the support vectors. The extension to nonlinear SVMs is performed by mapping the input vectors onto a higher dimensional feature space in which signal and background events can be separated by a linear procedure using an optimally separating hyperplane. The use of kernel functions eliminates thereby the explicit transformation to the feature space and simplifies the computation.

The implemented SVM algorithm performs the classification task using optionally linear, polynomial, Gaussian or sigmoidal kernel functions. The Gaussian kernel allows one to apply any discriminating shape in the input space. An important task is to properly choose the kernel parameters (the width in the case of the Gaussian kernel) and a cost parameter. They must be optimized experimentally by the user; the advised method is to run the SVM several times with different sets of parameters, performing a grid scan, and to choose the optimal configuration. The SVM training time scales quadratically with the number of vectors in the training event sample.
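The role of the kernel width can be seen in a minimal sketch of the decision function for a Gaussian kernel. The kernel parametrisation exp(-|x-y|²/(2σ²)) is the textbook form and an assumption about the convention used; the support vectors and their multipliers are supplied by hand (no training is shown), and the function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma):
    """K(x, y) = exp(-|x - y|^2 / (2 sigma^2)); sigma is the kernel width."""
    dist2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-dist2 / (2.0 * sigma ** 2))


def svm_decision(x, support_vectors, sigma, bias=0.0):
    """Sum over support vectors of alpha * label * K(sv, x), plus a bias.

    support_vectors: list of (coordinates, label, alpha) with label +1 or -1.
    """
    return sum(alpha * label * gaussian_kernel(sv, x, sigma)
               for sv, label, alpha in support_vectors) + bias


if __name__ == "__main__":
    svs = [((1.0, 1.0), +1, 1.0), ((-1.0, -1.0), -1, 1.0)]
    for sigma in (0.5, 1.0, 2.0):  # a one-dimensional scan over the width
        print(sigma, svm_decision((0.5, 0.8), svs, sigma))
```

Only the kernel is ever evaluated, so the mapping to the higher-dimensional feature space never has to be computed explicitly.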

Credits

Contributed to TMVA have: Andreas Hoecker (CERN, Switzerland), Jörg Stelzer (CERN, Switzerland), Peter Speckmayer (CERN, Switzerland), Jan Therhaag (Universität Bonn, Germany), Eckhard von Toerne (Universität Bonn, Germany), Helge Voss (MPI für Kernphysik Heidelberg, Germany), Moritz Backes (Geneva University, Switzerland), Tancredi Carli (CERN, Switzerland), Asen Christov (Universität Freiburg, Germany), Or Cohen (CERN, Switzerland and Weizmann, Israel), Krzysztof Danielowski (IFJ and AGH/UJ, Krakow, Poland), Dominik Dannheim (CERN, Switzerland), Sophie Henrot-Versille (LAL Orsay, France), Matthew Jachowski (Stanford University, USA), Kamil Kraszewski (IFJ and AGH/UJ, Krakow, Poland), Attila Krasznahorkay Jr. (CERN, Switzerland, and Manchester U., UK), Maciej Kruk (IFJ and AGH/UJ, Krakow, Poland), Yair Mahalalel (Tel Aviv University, Israel), Rustem Ospanov (University of Texas, USA), Xavier Prudent (LAPP Annecy, France), Arnaud Robert (LPNHE Paris, France), Doug Schouten (S. Fraser U., Canada), Fredrik Tegenfeldt (Iowa University, USA, until Aug 2007), Alexander Voigt (CERN, Switzerland), Kai Voss (University of Victoria, Canada), Marcin Wolter (IFJ PAN Krakow, Poland), Andrzej Zemla (IFJ PAN Krakow, Poland). We thank the user community for the large number of valuable comments that help to constantly improve the functionality of TMVA (please see also acknowledgments in TMVA Users Guide).

Other successful projects pioneered the idea of parallel processing of MVA-based classification and evaluation in High-Energy Physics (HEP), and TMVA conceptually benefited from these developments. Here we would like to particularly mention the Cornelius package developed by the Tagging Group of the BABAR Collaboration (see Refs. [BABAR Physics Book] and [S. Versillé, PhD thesis]).

Many similar combined multivariate classification ("data mining") efforts exist. Within the HEP community, the package StatPatternRecognition developed by I. Narsky is frequently used. Outside of HEP, this Wikipedia page can be used as a starting point for web queries.

Redistribution and use of TMVA in source and binary forms, with or without modification, are permitted according to the terms listed in the BSD license.