TMVA Toolkit for Multivariate Data Analysis with ROOT

TMVA News, Sep 25, 2013 (TMVA-v4.2.0)

    TMVA version 4.2.0 is included in ROOT release 5.34/11.
    The (main) changes with respect to TMVA-v4.1.2 / ROOT 5.30 are listed below.


    • NormMode: as called in PrepareTrainingAndTestTree
      The default has changed and is now "EqualNumEvents" (with fixed meaning). Previously, NumEvents and EqualNumEvents (by mistake/miscommunication) took into account training+test events; they now correctly normalise only the training events. The reason for these normalisations is to make it easy to force the effective (weighted) number of training events used for Signal (class 0) to equal the number of training events in the Background (the sum of all remaining classes in multiclass mode).
      - NumEvents: the weighted number of events is scaled, independently for signal and background, such that the sum of weights equals the number of events given in the Factory::PrepareTrainingAndTestTree("",nTrain_Signal=3000,nTrain_Background=6000) call. This example call hence ends up with 2x more background events than signal events in the training, no matter what the individual event weights were. Watch out: if you specify nTrain_Signal=0,nTrain_Background=0, the ratio follows the total numbers of MC events in the signal and background samples, which can be very different from the usually good ratio of about the same weighted number of signal and background events in the training. In that case it is better to use EqualNumEvents.
      - EqualNumEvents: for the signal events, the same is done as for NumEvents. The background events, however, are reweighted such that their sum of weights equals that of the signal events. This results in the same effective (weighted) number of signal and background events being seen in the training.
    • Transformations=I is the default again in the Factory (this defines which variable distribution plots are added to the TMVA output file, and hence displayable via the TMVAGui)
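
    The two NormMode settings described above can be illustrated with a minimal sketch. It is not the TMVA implementation: the function names and the toy weights are assumptions made for illustration, and only the scaling rule stated in the text is reproduced.

```python
def normalise_weights(weights, target_sum):
    """Scale event weights so that they sum to target_sum."""
    total = sum(weights)
    return [w * target_sum / total for w in weights]


def apply_norm_mode(sig_weights, bkg_weights, mode):
    """Return rescaled (signal, background) training weights for a NormMode."""
    # In both modes the signal weights are scaled to sum to the number of signal events.
    sig = normalise_weights(sig_weights, len(sig_weights))
    if mode == "NumEvents":
        # Background is scaled independently to its own number of events.
        bkg = normalise_weights(bkg_weights, len(bkg_weights))
    elif mode == "EqualNumEvents":
        # Background is scaled so that its sum of weights equals that of the signal.
        bkg = normalise_weights(bkg_weights, sum(sig))
    else:
        raise ValueError("unknown mode: " + mode)
    return sig, bkg


# Toy input: 3000 signal and 6000 background training events with arbitrary weights.
sig_w = [0.5] * 3000
bkg_w = [2.0] * 6000
sig_num, bkg_num = apply_norm_mode(sig_w, bkg_w, "NumEvents")
sig_eq, bkg_eq = apply_norm_mode(sig_w, bkg_w, "EqualNumEvents")
```

    With NumEvents the background keeps twice the effective weight of the signal; with EqualNumEvents both classes enter the training with the same sum of weights.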

    Boosted Decision Trees:

    • Some changes to the training options:
      nEventsMin (deprecated) please replace by → MinNodeSize
      - the option nEventsMin, which specified the minimum number of training events in a leaf node as an absolute number, has been replaced by "MinNodeSize", which is given as a percentage of the training sample. This makes the training options less dependent on the actual size of the training sample
      NNodesMax (deprecated) please replace by → MaxDepth
      GradBaggingFraction and UseNTrainEvents replaced by BaggedSampleFraction
      - they both meant the same thing and are now deprecated → use BaggedSampleFraction instead
      UseBaggedGrad replaced by UseBaggedBoost
      - this way, the use of a bagged sample has the same option name in GradBoost and AdaBoost
      - the 'random subsample' in GradBoost has also been replaced by a properly resampled bootstrap sample (with replacement)
      UseWeightedTrees → removed
      - it was the default anyway and the only reasonable choice
      PruneBeforeBoost → removed
      - it was mostly a debug/trial option
      NegWeightTreatment=IgnoreNegWeights → replaced by NegWeightTreatment=IgnoreNegWeightsInTraining
      - unfortunately, the default "IgnoreNegWeights" of the BDT option "NegWeightTreatment" collided with a global option and had to be replaced
    • Other changes to the training:
    • Regardless of the NormMode set in the TMVA::Factory, the BDT training will always start with reweighting the background such that its 'sum of weights' equals that of the signal. An imbalance here previously only resulted in many misclassified events, causing the same re-weighting to be done effectively only in the first boosting step.
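
    When migrating an old option string, an absolute nEventsMin value has to be turned into a percentage for MinNodeSize. The following is a small sketch of that arithmetic only; the helper name is hypothetical, and it assumes that the percentage refers to the total number of training events.

```python
def min_node_size_percent(n_events_min, n_training_events):
    """Convert an absolute minimum leaf size into a percentage of the training sample."""
    if n_training_events <= 0:
        raise ValueError("the training sample must contain events")
    return 100.0 * n_events_min / n_training_events


# Example: an old setting of 150 events per leaf with 3000 training events.
percent = min_node_size_percent(150, 3000)
option = "MinNodeSize={:g}%".format(percent)
```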


    • some cleanup (removed the strange experimental boosting options HighEdgeGaus, HighEdgeCoPara ..... )
    • removed the options MethodWeightType... The weight type is now defined by the Boost Method (these were trial options, and for clarity it is much better to stick to the "standard" ones, i.e. log(alpha) for AdaBoost etc.)
    • up to now, the first classifier was trained with the full sample. I think, however, that it should also be trained with a bagged sample (particularly if smaller sample sizes for the bagged samples were demanded), and this has been changed accordingly

TMVA Executive Summary

    The Toolkit for Multivariate Analysis (TMVA) provides a ROOT-integrated machine learning environment for the processing and parallel evaluation of multivariate classification and regression techniques. TMVA is specifically designed for the needs of high-energy physics (HEP) applications, but should not be restricted to these. The package includes the multivariate methods introduced under "Documentation on the MVA Techniques" below.

    TMVA consists of object-oriented implementations in C++ for each of these multivariate methods and provides training, testing and performance evaluation algorithms and visualization scripts. The MVA training and testing is performed with the use of user-supplied data sets in form of ROOT trees or text files, where each event can have an individual weight. The true event classification or target value (for regression problems) in these data sets must be known. Preselection requirements and transformations can be applied on this data. TMVA supports the use of variable combinations and formulas.

    TMVA works in transparent factory mode to guarantee an unbiased performance comparison between the algorithms: all MVA methods see the same training and test data, and are evaluated following the same prescriptions within the same execution job. A Factory class organises the interaction between the user and the TMVA analysis steps. It performs preanalysis and preprocessing of the training data to assess basic properties of the discriminating variables used as input to the methods. The correlation coefficients of the input variables are calculated and displayed, and a preliminary ranking is derived (which is later superseded by method-specific variable rankings). The variables can be linearly transformed (individually for each classifier) into a non-correlated variable space or projected upon their principal components. For performance comparison, the analysis job prints tabulated results for some benchmark measures. Smooth efficiency versus background rejection curves are stored in a ROOT output file, together with other graphical evaluation information. These results can be displayed using ROOT macros, which are conveniently executed via graphical user interfaces (one each for classification and regression) that come with the TMVA distribution.

    The TMVA training job runs alternatively as a ROOT script, as a standalone executable, where TMVA is linked as a shared library, or as a python script via the PyROOT interface. Each classifier trained in one of these applications writes its configuration and training results in result ("weight") files, which consist of text and (optionally) ROOT files.

    An easy-to-use Reader class is provided, which reads and interprets the weight files (interfaced by the corresponding classifiers), and which can be included in any C++ executable, ROOT macro or python analysis job.

    For standalone use of the trained classifiers, TMVA also generates lightweight C++ response classes, which contain the encoded information from the weight files so that these are not required anymore. These classes do not depend on TMVA or ROOT, nor on any other external library.

    We have put emphasis on the clarity and functionality of the Factory and Reader interfaces to the user applications. All MVA methods run with reasonable default configurations, so that for standard applications that do not require particular tuning, the user script for a full TMVA analysis will hardly exceed a few lines of code. For individual optimisation the user can (and should) customize the classifiers via configuration strings.

    Please report any problems and/or suggestions for improvements to the authors.

The Code

    TMVA comes with your local ROOT distribution.

    If you depend on a newer version, the TMVA source code to build a new shared library can be downloaded from
    View SVN gives you a snapshot of the current SVN trunk in ROOT.

    Follow this link for a brief tutorial on "HowTo" get started.

    The package contains classes (.h/.cxx extensions), ROOT scripts (.C extensions), executables (.cxx extensions), and a Makefile. It is best to first run the example macro test/TMVAClassification.C (for classification), and test/TMVARegression.C (for regression) to check that everything works as it should. These macros should also serve as templates for your own applications. The relevant analysis steps through the TMVA factory are the following:

    1. Start.
    2. Create the factory (with some bookkeeping arguments).
    3. Define the input samples (signal and background for classification, a single input sample for regression): these can be separate ROOT Trees, a single ROOT Tree with a type identifier for classification, or separate ASCII files.
    4. Add the input variable names to be used to train the MVAs to the factory.
    5. Prepare the training and test ROOT Trees using the numbers of events given in the arguments. Precuts can be applied at this step.
    6. Book the MVA methods. The first argument to the factory is the instance name of the method. It is used to tag the evaluation plots and weight files for this method. The second argument is the unique method type (enum). The third argument consists of an option string, which individually configures each method. Detailed information can be found in the corresponding C++ class implementations.
    7. Propagate the methods through the training, testing and evaluation phases.
    8. End.

    Once the interesting MVA methods have been identified, they can be included into the data analysis. How to do this latter step is shown in the example: test/TMVAClassificationApplication.C (classification) and test/TMVARegressionApplication.C (regression).

    Compilation under Linux should be straightforward by just typing "make" in the main TMVA directory. Note that the ROOT environment needs to be properly set up: $ROOTSYS should be set, $ROOTSYS/lib and $ROOTSYS/bin should appear in $PATH, and $ROOTSYS/lib in $LD_LIBRARY_PATH.

    Before running the macros in the "test" directory, source the "setup.[c]sh" script.

Documentation on the MVA Techniques

    A brief introduction to the TMVA methods is given below. More details are available in the Users Guide.

    Rectangular Cut Optimisation

      Optimal cuts maximise the signal efficiency at given background efficiency. Other optimisation criteria, such as maximising the signal significance squared, S²/(S+B), with S and B being the signal and background yields, then correspond to a particular point in the optimised background-rejection versus signal-efficiency curve. This working point requires the knowledge of the expected yields, which is not the case in general. Note also that for rare signals, Poissonian statistics should be used, which modifies the significance criterion.
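
      As an illustration of why the working point depends on the expected yields, the sketch below scans a few points of an efficiency curve and picks the one that maximises S²/(S+B). The curve points and yields are invented toy numbers, and the function name is hypothetical.

```python
def best_working_point(curve, n_sig, n_bkg):
    """Return (figure of merit, signal eff., background eff.) maximising S^2/(S+B).

    curve is a list of (signal efficiency, background efficiency) points;
    n_sig and n_bkg are the expected yields before the cuts.
    """
    best = None
    for eff_s, eff_b in curve:
        s = eff_s * n_sig
        b = eff_b * n_bkg
        if s + b == 0:
            continue  # nothing selected, the figure of merit is undefined
        fom = s * s / (s + b)
        if best is None or fom > best[0]:
            best = (fom, eff_s, eff_b)
    return best


# Toy efficiency curve: tighter cuts reduce both efficiencies.
curve = [(0.9, 0.5), (0.7, 0.1), (0.5, 0.02), (0.3, 0.005)]
rare_signal = best_working_point(curve, n_sig=100, n_bkg=10000)
equal_yields = best_working_point(curve, n_sig=1000, n_bkg=1000)
```

      With a rare signal the tightest cut is preferred, while for equal yields a looser point on the same curve wins.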

      If linear input correlations are present, it may be useful to use decorrelated input variables before optimising cuts (option ":D"). Technically, the cut optimisation is achieved in TMVA by three optional methods:

      • Monte Carlo generation (option: MC).
      • Fitting using a Genetic Algorithm (option: GA).
      • Fitting using Simulated Annealing (option: SA - still in testing phase).

      Attempts using MINUIT (Simplex or Migrad) have not shown satisfactory results, as the fits often fail because of convergence at local minima. For most of the examples we tested, GA performed best.

      The rectangular cut of a volume in the variable space is performed using a binary tree to sort the training events. This provides a significant reduction in computing time.

    Projective Likelihood (PDE Approach)

      The method of maximum likelihood is among the most straightforward multivariate analyser approaches. We define the likelihood ratio, R, for an event by the ratio of the signal to the signal plus background likelihoods. The individual likelihoods are products of the corresponding probability densities of the discriminating input variables used. In practice, TMVA uses polynomial splines fitted to histograms, or unbinned Gaussian kernel density estimators, to estimate the probability density functions (PDF) obtained from the distributions of the training variables.
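
      A minimal sketch of the likelihood ratio R follows. It assumes analytic Gaussian PDFs as stand-ins for the spline or kernel estimates that TMVA derives from the training sample; all names and parameter values are illustrative.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Normal probability density, used here as a stand-in for a fitted PDF."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def likelihood(event, pdfs):
    """Product of the one-dimensional densities, one per input variable."""
    result = 1.0
    for x, pdf in zip(event, pdfs):
        result *= pdf(x)
    return result


def likelihood_ratio(event, sig_pdfs, bkg_pdfs):
    """R = L_S / (L_S + L_B) for one event."""
    l_sig = likelihood(event, sig_pdfs)
    l_bkg = likelihood(event, bkg_pdfs)
    return l_sig / (l_sig + l_bkg)


# Two input variables; signal and background PDFs are mirrored around zero.
sig_pdfs = [lambda x: gaussian_pdf(x, 1.0, 1.0), lambda x: gaussian_pdf(x, 0.5, 1.0)]
bkg_pdfs = [lambda x: gaussian_pdf(x, -1.0, 1.0), lambda x: gaussian_pdf(x, -0.5, 1.0)]

r_signal_like = likelihood_ratio((1.0, 0.5), sig_pdfs, bkg_pdfs)
r_ambiguous = likelihood_ratio((0.0, 0.0), sig_pdfs, bkg_pdfs)
```

      Because the likelihoods are plain products, correlations between the input variables are ignored in this approach.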

      Likelihood responses are often strongly peaked at 0/1. The booking option "TransformOutput" zooms into these peaks (with no change in the performance) using an inverse sigmoid transformation.
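
      For orientation, one common form of an inverse sigmoid is written out here. It is given as an assumption, not as the exact expression used by TMVA; the scale parameter $\tau$ is introduced for illustration and its value is not stated on this page:
      $\displaystyle R \rightarrow R' = -\frac{1}{\tau}\,\ln\left(R^{-1} - 1\right)$
      This maps R = 1/2 to R' = 0 and stretches the regions near R = 0 and R = 1 towards large negative and positive values. Since the mapping is monotonic, the ordering of events, and hence the performance, is unchanged.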

    Multidimensional Probability Density Estimator Range-Search (PDERS)

      This is a generalization of the above Likelihood methods to Nvar dimensions, where Nvar is the number of input variables used in the MVA. If the multidimensional probability density functions (PDFs) for signal and background were known, this method would contain the entire physical information, and would therefore be optimal. Usually, kernel estimation methods are used to approximate the PDFs using the events from the training sample.

      A very simple probability density estimator (PDE) has been suggested in hep-ex/0211019. The PDE for a given test event is obtained from counting the (normalized) number of signal and background (training) events that occur in the "vicinity" of the test event. The volume that describes "vicinity" is user-defined. A search method based on binary trees is used to reduce the selection time for the range search. Three different volume definitions are available:

      • MinMax: the volume is defined in each dimension with respect to the full variable range found in the training sample.
      • RMS: the volume is defined in each dimension with respect to the RMS estimated from the training sample.
      • Adaptive: a volume element is defined in each dimension with respect to the RMS estimated from the training sample. The overall scale of the volume element is then determined for each event so that the total number of events confined in the volume lies within a user-defined range.

      The adaptive range search is used by default. The option "UseKernelEstimate" allows the user to weight the events found within the adaptive volume by a multidimensional Gaussian function according to their distance to the test event.
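
      The counting step can be sketched as follows. This toy version uses a fixed box and a linear scan over the training events instead of the binary-tree search; the function name, the box half-widths and the return value for an empty box are assumptions made for illustration.

```python
def pde_range_search(test_event, signal, background, half_widths):
    """Signal probability estimate from event counts in a box around test_event."""
    def in_box(event):
        return all(abs(x - t) <= h for x, t, h in zip(event, test_event, half_widths))

    # Normalised counts: fraction of each training sample found inside the box.
    n_sig = sum(1 for ev in signal if in_box(ev)) / len(signal)
    n_bkg = sum(1 for ev in background if in_box(ev)) / len(background)
    if n_sig + n_bkg == 0:
        return 0.5  # empty box: no information, returned as undecided (a choice made here)
    return n_sig / (n_sig + n_bkg)


signal = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.0)]
background = [(-1.0, -1.0), (-0.8, -1.2), (1.1, 0.9), (-3.0, -3.0)]

response = pde_range_search((1.0, 1.0), signal, background, half_widths=(0.5, 0.5))
```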

    Multidimensional k-Nearest Neighbour (k-NN) method

      Similar to PDERS, the k-nearest neighbour method compares an observed (test) event to reference events from a training data set. However, unlike PDERS, which in its original form uses a fixed-sized multidimensional volume surrounding the test event, and in its augmented form resizes the volume as a function of the local data density, the k-NN algorithm is intrinsically adaptive. It searches for a fixed number of adjacent events, which then define a volume for the metric used. The k-NN method has best performance when the boundary that separates signal and background events has irregular features that cannot be easily approximated by parametric learning methods.

      The k-NN algorithm uses a kd-tree structure for the sorting of the training events, which significantly improves the performance. The TMVA implementation of the k-NN method is fast enough to allow classification and regression for large data sets. In particular, it is faster than the adaptive PDERS method.

      Note that the k-NN method is not appropriate for problems where the number of input variables exceeds about 10. In general, the larger the training set, the more the algorithm probes small-scale features that distinguish signal and background events.
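
      A brute-force sketch of the k-NN response is given below. It sorts all training events by Euclidean distance instead of using a kd-tree, and it assumes unweighted events and an unscaled metric; the names are illustrative only.

```python
def knn_response(test_event, training, k):
    """Fraction of signal events among the k training events nearest to test_event.

    training is a list of (features, is_signal) pairs.
    """
    def squared_distance(features):
        return sum((a - b) ** 2 for a, b in zip(features, test_event))

    # Brute force: sort all training events by distance and keep the k closest.
    nearest = sorted(training, key=lambda item: squared_distance(item[0]))[:k]
    return sum(1 for _, is_signal in nearest if is_signal) / len(nearest)


training = [
    ((1.0, 1.0), True), ((1.5, 1.0), True), ((1.0, 1.6), True), ((4.0, 4.0), True),
    ((-1.0, -1.0), False), ((-1.5, -1.0), False), ((1.3, 1.3), False), ((-4.0, -4.0), False),
]

response = knn_response((1.0, 1.0), training, k=3)
```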

    Fisher and Mahalanobis Discriminants

      In the method of Fisher discriminants event selection is performed in a transformed variable space with zero linear correlations, by distinguishing the mean values of the signal and background distributions.

      The linear discriminant analysis determines an axis in the (correlated) hyperspace of the input variables such that, when projecting the output classes (signal and background) upon this axis, they are pushed as far as possible away from each other, while events of a same class are confined in a close vicinity. The linearity property of this method is reflected in the metric with which "far apart" and "close vicinity" are determined: the covariance matrix of the discriminant variable space.

      The classification of the events in signal and background classes relies on the following characteristics (only): overall sample means for each input variable, class-specific sample means, and total covariance matrix. The covariance matrix can be decomposed into the sum of a within-class and a between-class matrix. They describe the dispersion of events relative to the means of their own class (within-class matrix), and relative to the overall sample means (between-class matrix). The Fisher coefficients are then given by the product of the difference vector of signal and background sample means and the inverse within-class matrix.
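
      The last sentence can be sketched for two input variables as follows. The within-class matrix is taken here as the unnormalised sum of the two class scatter matrices, so the coefficients agree with a properly normalised version only up to an overall factor; the function names are illustrative.

```python
def mean_vector(events):
    """Sample mean of each of the two input variables."""
    n = len(events)
    return [sum(e[i] for e in events) / n for i in range(2)]


def scatter_matrix(events, mean):
    """Sum over events of the outer product of deviations from the class mean."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    for e in events:
        d = [e[0] - mean[0], e[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                m[i][j] += d[i] * d[j]
    return m


def fisher_coefficients(signal, background):
    """Inverse within-class matrix times (signal mean - background mean)."""
    mean_s = mean_vector(signal)
    mean_b = mean_vector(background)
    s_s = scatter_matrix(signal, mean_s)
    s_b = scatter_matrix(background, mean_b)
    w = [[s_s[i][j] + s_b[i][j] for j in range(2)] for i in range(2)]
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    inv = [[w[1][1] / det, -w[0][1] / det], [-w[1][0] / det, w[0][0] / det]]
    diff = [mean_s[0] - mean_b[0], mean_s[1] - mean_b[1]]
    return [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]


signal = [(2.0, 1.0), (3.0, 2.0), (4.0, 1.0), (3.0, 0.0)]
background = [(0.0, 1.0), (1.0, 2.0), (2.0, 1.0), (1.0, 0.0)]
coefficients = fisher_coefficients(signal, background)
```

      In this toy sample the classes differ only in the first variable, so only the first coefficient is non-zero.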

    H-Matrix (χ²) Estimator

      This MVA approach is used by the DØ collaboration (FNAL) for the purpose of electron identification (see, e.g., hep-ex/9507007). As it is implemented in TMVA, it is usually equivalent to the Fisher-Mahalanobis discriminant, and it has only been added for the purpose of completeness. Two χ² estimators are computed for an event, one for signal and one for background, using the estimates for the means and covariance matrices obtained from the training sample. TMVA then uses as normalised analyser for event i the ratio $(\chi_S^2(i) - \chi_B^2(i)) / (\chi_S^2(i) + \chi_B^2(i))$.

    Function Discriminant Analysis (FDA)

      The common goal of all TMVA discriminators is to determine an optimal separating function in the multivariate space represented by the input variables. The Fisher discriminant solves this analytically for the linear case, while artificial neural networks, support vector machines or boosted decision trees provide nonlinear approximations with -- in principle -- arbitrary precision if enough training statistics is available and the chosen architecture is flexible enough.

      The function discriminant analysis (FDA) provides an intermediate solution to the problem with the aim to solve relatively simple or partially nonlinear problems. The user provides the desired function with adjustable parameters via the configuration option string, and FDA fits the parameters to it, requiring the signal (background) function value to be as close as possible to 1 (0). Its advantage over the more involved and automatic nonlinear discriminators is the simplicity and transparency of the discrimination expression. A shortcoming is that FDA will underperform for involved problems with complicated, phase space dependent nonlinear correlations.
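
      The fitting idea can be sketched with the simplest possible case: a user function f(x) = p0 + p1·x of one variable, fitted by least squares to the targets 1 (signal) and 0 (background). The closed-form solution used here is a stand-in chosen for brevity and is not meant to represent the fitters used by TMVA; the names are illustrative.

```python
def fit_linear_discriminant(signal, background):
    """Least-squares fit of f(x) = p0 + p1*x with target 1 for signal, 0 for background."""
    xs = list(signal) + list(background)
    targets = [1.0] * len(signal) + [0.0] * len(background)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(targets) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, targets))
    p1 = s_xt / s_xx
    p0 = mean_t - p1 * mean_x
    return p0, p1


signal = [1.0, 2.0, 3.0]
background = [-1.0, -2.0, -3.0]
p0, p1 = fit_linear_discriminant(signal, background)


def discriminant(x):
    """Fitted function value: close to 1 is signal-like, close to 0 background-like."""
    return p0 + p1 * x
```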

      The FDA performance depends on the complexity and fidelity of the user-defined discriminator function. As a general rule, it should be able to reproduce the discrimination power of any linear discriminant analysis. To reach into the nonlinear domain, it is useful to inspect the correlation profiles of the input variables, and add quadratic and higher polynomial terms between variables as necessary. Comparison with more involved nonlinear MVA methods can be used as a guide.

    Artificial Neural Networks (Non-Linear Discriminant Analysis)

      Three different ANN implementations are available in TMVA: the TMlpANN, implemented in ROOT; the Clermont-Ferrand ANN (CFMlpANN), which has been translated from FORTRAN; and a new ANN (MLP), which is very similar to the ROOT ANN, but can be trained significantly faster. All ANNs belong to the class of Multilayer Perceptrons (MLP), which are feed-forward networks according to the following propagation schema:
      [Figure: schema for artificial neural network]
      The input layer contains as many neurons as input variables used in the MVA. The output layer contains a single neuron for the signal weight. In between the input and output layers are a variable number of k hidden layers with arbitrary numbers of neurons. (While the structure of the input and output layers is determined by the problem, the hidden layers can be configured by the user through the option string of the method booking.)

      As indicated in the sketch, all neuron inputs to a layer are linear combinations of the neuron outputs of the previous layer. The transfer from input to output within a neuron is performed by means of an "activation function". In general, the activation function of a neuron can be zero (deactivated), one (linear), or non-linear. The above example uses a sigmoid activation function. The transfer function of the output layer is usually linear. As a consequence, an ANN without a hidden layer should give the same discrimination power as a linear discriminant analysis (Fisher). In the case of one hidden layer, the ANN computes a linear combination of sigmoid functions.
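
      The one-hidden-layer case can be written out as a short forward pass: sigmoid hidden neurons followed by a linear output neuron. The weights below are arbitrary toy values, and the bias terms and function names are assumptions of this sketch.

```python
import math


def sigmoid(x):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_response(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: one hidden layer of sigmoid neurons, linear output neuron."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # The output is a linear combination of the sigmoid outputs.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Two inputs, two hidden neurons, arbitrary weights.
response = mlp_response(
    inputs=(0.0, 0.0),
    hidden_weights=[[1.0, -1.0], [0.5, 0.5]],
    hidden_biases=[0.0, 0.0],
    output_weights=[1.0, -1.0],
    output_bias=0.2,
)
```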

      The trained MLP architecture can be plotted with the macro macro/network.C.

    Boosted Decision Trees

      Boosted decision trees have been successfully used in High Energy Physics analysis for example by the MiniBooNE experiment (Yang-Roe-Zhu, physics/0508045). In Boosted Decision Trees, the selection is done on a majority vote on the result of several decision trees, which are all derived from the same training sample by supplying different event weights during the training.

      Decision trees: successive decision nodes are used to categorize the events out of the sample as either signal or background. Each node uses only a single discriminating variable to decide if the event is signal-like ("goes right") or background-like ("goes left"). This forms a tree-like structure with "baskets" at the end (leaf nodes), and an event is classified as either signal or background according to whether the basket where it ends up has been classified signal or background during the training. Training of a decision tree is the process of defining the "cut criteria" for each node. The training starts with the root node. Here one takes the full training event sample and selects the variable and corresponding cut value that gives the best separation between signal and background at this stage. Using this cut criterion, the sample is then divided into two subsamples, a signal-like (right) and a background-like (left) sample. A new node is then created for each of the two subsamples, and each is constructed using the same mechanism as described for the root node. The division is stopped once a certain node has reached either a minimum number of events, or a minimum or maximum signal purity. These leaf nodes are then called "signal" or "background" depending on whether they contain more signal or more background events from the training sample.
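
      The choice of variable and cut value at a single node can be sketched as below. The separation criterion is assumed to be the Gini index p(1-p), with candidate cuts placed midway between neighbouring values; events are unweighted and all names are illustrative.

```python
def gini(n_sig, n_bkg):
    """Gini index p*(1-p), with p the signal purity of the node."""
    n = n_sig + n_bkg
    if n == 0:
        return 0.0
    p = n_sig / n
    return p * (1.0 - p)


def count_classes(events):
    """Return (number of signal events, number of background events)."""
    n_sig = sum(1 for _, is_signal in events if is_signal)
    return n_sig, len(events) - n_sig


def best_split(events):
    """Return (gain, variable index, cut value) of the best single-variable cut.

    events is a list of (features, is_signal) pairs.
    """
    parent = len(events) * gini(*count_classes(events))
    best = None
    for var in range(len(events[0][0])):
        values = sorted(set(features[var] for features, _ in events))
        for low, high in zip(values, values[1:]):
            cut = 0.5 * (low + high)
            left = [e for e in events if e[0][var] < cut]
            right = [e for e in events if e[0][var] >= cut]
            gain = (parent
                    - len(left) * gini(*count_classes(left))
                    - len(right) * gini(*count_classes(right)))
            if best is None or gain > best[0]:
                best = (gain, var, cut)
    return best


events = [((0.1, 5.0), False), ((0.4, 1.0), False), ((0.6, 4.0), True), ((0.9, 2.0), True)]
gain, variable, cut = best_split(events)
```

      Applying the same function recursively to the left and right subsamples, until a stopping criterion is met, grows the full tree.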

      Boosting: the idea behind boosting is that signal events from the training sample that end up in a background node (and vice versa) are given a larger weight than events that are in the correct leaf node. This results in a re-weighted training event sample, with which a new decision tree can then be developed. The boosting can be applied several times (typically 100-500 times) and one ends up with a set of decision trees (a forest).
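
      One re-weighting step can be sketched in the style of AdaBoost. The boost factor (1-err)/err and the renormalisation to the original sum of weights are assumptions of this sketch and may differ in detail from the options chosen in TMVA; the function name is illustrative.

```python
def adaboost_reweight(weights, misclassified):
    """Boost the weights of misclassified events and renormalise.

    misclassified is a list of booleans, one per event.
    Returns (new weights, boost factor alpha).
    """
    total = sum(weights)
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong) / total
    if not 0.0 < err < 0.5:
        raise ValueError("the weighted error rate must lie strictly between 0 and 0.5")
    alpha = (1.0 - err) / err
    boosted = [w * alpha if wrong else w for w, wrong in zip(weights, misclassified)]
    # Keep the sum of weights constant from one boosting step to the next.
    scale = total / sum(boosted)
    return [w * scale for w in boosted], alpha


weights = [1.0, 1.0, 1.0, 1.0]
misclassified = [True, False, False, False]
new_weights, alpha = adaboost_reweight(weights, misclassified)
```

      After the step, the single misclassified event carries half of the total weight, so the next tree concentrates on it.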

      Bagging: in this particular variant of the Boosted Decision Trees, the boosting is not done on the basis of previous training results, but by a simple stochastic re-sampling of the initial training event sample.
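
      The re-sampling can be sketched as a bootstrap draw with replacement. The parameter name fraction is hypothetical and only loosely mirrors the idea of a bagged sample fraction; the seed is there to make the toy example reproducible.

```python
import random


def bootstrap_sample(events, fraction=1.0, seed=None):
    """Draw a bagged sample with replacement from the training events."""
    rng = random.Random(seed)
    n_draw = max(1, int(round(fraction * len(events))))
    # Each draw is independent, so an event can appear several times or not at all.
    return [rng.choice(events) for _ in range(n_draw)]


events = list(range(10))
bagged = bootstrap_sample(events, fraction=0.6, seed=42)
```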

      Analysis: applying an individual decision tree to a test event results in a classification of the event as either signal or background. For the boosted decision tree selection, an event is successively subjected to the whole set of decision trees and depending on how often it is classified as signal, a "likelihood" estimator is constructed for the event being signal or background. The value of this estimator is the one which is then used to select the events from an event sample, and the cut value on this estimator defines the efficiency and purity of the selection.

    Predictive Learning via Rule Ensembles (RuleFit)

      This is a TMVA implementation of Friedman-Popescu's RuleFit method. The discriminator is a linear combination of base learners $f_m(\vec{x})$:
      $\displaystyle F(\vec{x}) = a_0 + \sum_{m=1}^M a_m f_m(\vec{x})$ (1)
      where $\vec{x}$ is the vector of observables. In this case the base learners consist of so-called rules. A single rule is essentially a product of a series of cuts. A rule applied on a given event is non-zero only if all cuts in the product are satisfied. In such a case the rule returns 1.
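
      Equation (1) and the definition of a rule can be sketched as follows. Each cut is represented here as a (variable index, lower bound, upper bound) triple, which is a representation chosen for this example; the coefficients are arbitrary toy values and not the result of a fit.

```python
import math


def make_rule(cuts):
    """Build a rule from a list of (variable index, lower bound, upper bound) cuts."""
    def rule(x):
        # The rule returns 1 only if every cut is satisfied, otherwise 0.
        return 1.0 if all(low <= x[i] < high for i, low, high in cuts) else 0.0
    return rule


def rulefit_response(x, offset, coefficients, rules):
    """F(x) = a_0 + sum_m a_m * f_m(x), following equation (1)."""
    return offset + sum(a * rule(x) for a, rule in zip(coefficients, rules))


rules = [
    make_rule([(0, 0.5, math.inf)]),
    make_rule([(0, 0.5, math.inf), (1, -math.inf, 2.0)]),
]
coefficients = [1.0, 0.5]
offset = -0.5

both_rules_fire = rulefit_response((1.0, 1.0), offset, coefficients, rules)
first_rule_only = rulefit_response((1.0, 3.0), offset, coefficients, rules)
no_rule_fires = rulefit_response((0.0, 0.0), offset, coefficients, rules)
```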

      RuleFit consists of two main steps:

        1. Rules generation
        2. Fit of the rules to the training data, i.e., find the optimum coefficients in (1).

      An efficient way of generating rules is to use decision trees. Each node (except for the root node) produces one rule. The rule is defined by the series of cuts required to reach the node starting from the root. In the current implementation, the forest is generated through boosting (default) or by training each tree on random subsamples.

    Support Vector Machine (SVM)

      In the early 1960s the linear support vector method was developed to construct separating hyperplanes for pattern recognition problems. It took 30 years before the method was generalised to nonlinear separating functions and to estimate real-valued functions (regression). At that moment it became a general purpose algorithm performing classification and regression tasks which can compete with neural networks and probability density estimators. Typical applications of SVMs include text categorisation, character recognition, bioinformatics and face detection.

      The main idea of the SVM approach is to build a hyperplane that separates signal and background vectors (events) using only a minimal subset of all training vectors (support vectors). The position of the hyperplane is obtained by maximizing the margin between it and the support vectors. The extension to nonlinear SVMs is performed by mapping the input vectors onto a higher dimensional feature space in which signal and background events can be separated by a linear procedure using an optimally separating hyperplane. The use of kernel functions eliminates thereby the explicit transformation to the feature space and simplifies the computation.
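
      The role of the kernel function can be illustrated with a small numerical check. A homogeneous polynomial kernel of degree two in two dimensions is used here because its feature map can be written down explicitly; this choice is an example and says nothing about the kernels configured in TMVA.

```python
import math


def polynomial_kernel(x, y):
    """Degree-2 polynomial kernel evaluated directly in the two-dimensional input space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def feature_map(x):
    """Explicit map into the three-dimensional feature space of the same kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    """Scalar product of two vectors."""
    return sum(p * q for p, q in zip(a, b))


x = (1.0, 2.0)
y = (3.0, 0.5)
via_kernel = polynomial_kernel(x, y)
via_feature_space = dot(feature_map(x), feature_map(y))
```

      Both values agree, so the scalar product in the feature space is obtained without ever constructing the mapped vectors.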

      The implemented SVM algorithm performs the classification task using optionally linear, polynomial, Gaussian or sigmoidal kernel functions. The Gaussian kernel allows one to apply any discriminating shape in the input space. An important task is to properly choose the kernel parameters (the width in case of the Gaussian kernel) and a cost parameter. They must be optimized experimentally by the user; the advised method is to run SVM several times with different sets of parameters to perform a grid scan and to choose the optimal configuration. The SVM training time scales quadratically with the number of vectors in the training event sample.


Copyright © (2005-2010): Andreas Hoecker (CERN), Peter Speckmayer (CERN), Jörg Stelzer (CERN), Jan Therhaag (U Bonn, Germany), Eckhard von Toerne (U Bonn, Germany), Helge Voss (MPI-KP Heidelberg)

Redistribution and use of TMVA in source and binary forms, with or without modification, are permitted according to the terms listed in the BSD license.
