Direct inference with XGBoost¶

General¶

XGBoost has been available in CMSSW since at least CMSSW_9_2_4 (cmssw#19377).

In the CMSSW environment, XGBoost can be used via its Python API.

For the UL era, different versions are available for different SCRAM_ARCH values:

For slc7_amd64_gcc700 and above, ver.0.80 is available.
For slc7_amd64_gcc900 and above, ver.1.3.3 is available.
Please note that different major versions behave differently (see the Caveat section).

Existing Examples¶

There are some existing good examples of using XGBoost under CMSSW, as listed below:

Official sample for testing the integration of the XGBoost library with CMSSW.
Useful code created by Dr. Huilin Qu for inference with an existing trained model.
C/C++ interface for inference with an existing trained model.

We provide examples for both the C/C++ interface and the Python interface of XGBoost in the CMSSW environment.

Example: Classification of points from joint-Gaussian distribution.¶

In this specific example, you will use XGBoost to classify data points generated from two 8-dimensional joint-Gaussian distributions.

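| Feature index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| μ₁ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| μ₂ | 0 | 1.9 | 3.2 | 4.5 | 4.8 | 6.1 | 8.1 | 11 |
| σ₁ = σ₂ = σ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| \|μ₁ - μ₂\| / σ | 1 | 0.1 | 0.2 | 0.5 | 0.2 | 0.1 | 1.1 | 3 |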

The generated data points are stored in Train_data.csv (10000 points from each distribution) and Test_data.csv (1000 points from each distribution).
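A dataset of this kind could be generated along the following lines. This is a minimal sketch, not the script that produced Train_data.csv and Test_data.csv. It assumes that the eight features are independent (diagonal covariance) and that the two distributions are labelled 0 and 1 in the `Type` column; the function names `make_dataset` and `write_csv` and the seeds are hypothetical.

```python
import csv
import random

MU_1 = [1, 2, 3, 4, 5, 6, 7, 8]
MU_2 = [0, 1.9, 3.2, 4.5, 4.8, 6.1, 8.1, 11]
SIGMA = 1.0  # same width for every feature in both distributions


def make_dataset(n_per_class, seed=0):
    """Return rows of 8 Gaussian features followed by a class label (0 or 1)."""
    rng = random.Random(seed)
    rows = []
    for label, means in enumerate((MU_1, MU_2)):
        for _ in range(n_per_class):
            # Assumption: features are drawn independently of each other
            features = [rng.gauss(mu, SIGMA) for mu in means]
            rows.append(features + [label])
    rng.shuffle(rows)  # mix the two classes
    return rows


def write_csv(path, rows):
    """Write the rows with the header 0,...,7,Type."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([str(i) for i in range(8)] + ["Type"])
        writer.writerows(rows)


if __name__ == "__main__":
    write_csv("Train_data.csv", make_dataset(10000, seed=1))
    write_csv("Test_data.csv", make_dataset(1000, seed=2))
```

The column names `0` to `7` and `Type` match those used by the training script in the next section.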

Preparing Model¶

An XGBoost model can be trained outside of CMSSW. We provide a Python script for illustration.

# importing necessary modules
import numpy as np
import pandas as pd
from xgboost import XGBClassifier # Or XGBRegressor for regression tasks
import matplotlib.pyplot as plt

# specify parameters via a dictionary
param = {'n_estimators':50}
xgb = XGBClassifier(**param) # unpack the dictionary into keyword arguments

# using the pandas.DataFrame data format; other available formats are XGBoost's DMatrix and numpy.ndarray

train_data = pd.read_csv("path/to/the/data") # The training dataset is code/XGBoost/Train_data.csv

train_Variable = train_data[['0', '1', '2', '3', '4', '5', '6', '7']] # double brackets select a list of columns
train_Score = train_data['Type'] # Score should be an integer: 0, 1, (2 and larger for multiclass)

test_data = pd.read_csv("path/to/the/data") # The testing dataset is code/XGBoost/Test_data.csv

test_Variable = test_data[['0', '1', '2', '3', '4', '5', '6', '7']]
test_Score = test_data['Type']

# Now the data are well prepared and named as train_Variable, train_Score and test_Variable, test_Score.

xgb.fit(train_Variable, train_Score) # Training

xgb.predict(test_Variable) # Outputs are integer class labels

xgb.predict_proba(test_Variable) # Output scores, output structure: [prob for 0, prob for 1,...]

xgb.save_model("/Path/To/Where/You/Want/ModelName.model") # Saving model

The saved model ModelName.model can then be loaded through both the Python and the C/C++ API. Please use the same XGBoost major version for training and inference (see Caveat).

When training with data from different datasets, proper treatment of weights is necessary for better model performance. Please refer to the Official Recommendation for more details.
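As a rough illustration of what such a weight treatment can look like (not necessarily the procedure described in the official recommendation), the sketch below rescales per-event weights so that every class carries the same total weight. The helper name `balance_class_weights` and the example numbers are hypothetical.

```python
from collections import defaultdict


def balance_class_weights(labels, weights=None):
    """Rescale weights so that the weights of each class sum to 1.0."""
    if weights is None:
        weights = [1.0] * len(labels)
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    return [weight / totals[label] for label, weight in zip(labels, weights)]


labels = [0, 0, 0, 1]
event_weights = [1.0, 1.0, 2.0, 0.5]
balanced = balance_class_weights(labels, event_weights)
print(balanced)  # [0.25, 0.25, 0.5, 1.0]
```

The resulting list can be handed to the training step through the `sample_weight` argument of `fit`, for example `xgb.fit(train_Variable, train_Score, sample_weight=balanced)`.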

C/C++ Usage with CMSSW¶

To use a saved XGBoost model from C/C++ code, it is convenient to use XGBoost's official C API. A simple example is provided below.

Module setup¶

There is no official CMSSW interface for XGBoost, although its library is available in the cvmfs area of CMSSW. We therefore have to use the raw c_api and set up the library manually.

To run XGBoost's c_api within the CMSSW framework, start from the following standard setup.

export SCRAM_ARCH="slc7_amd64_gcc700" # To use the higher version, please switch to slc7_amd64_gcc900
export CMSSW_VERSION="CMSSW_X_Y_Z"

source /cvmfs/cms.cern.ch/cmsset_default.sh

cmsrel "$CMSSW_VERSION"
cd "$CMSSW_VERSION/src"

cmsenv
scram b

The additional step is to add the corresponding xml file(s) to $CMSSW_BASE/config/toolbox/$SCRAM_ARCH/tools/selected/ to set up XGBoost.

For lower version (<1), add two xml files as below.

xgboost.xml

 <tool name="xgboost" version="0.80">
 <lib name="xgboost"/>
 <client>
    <environment name="LIBDIR" default="/cvmfs/cms.cern.ch/$SCRAM_ARCH/external/py2-xgboost/0.80-ikaegh/lib/python2.7/site-packages/xgboost/lib"/>
    <environment name="INCLUDE" default="/cvmfs/cms.cern.ch/$SCRAM_ARCH/external/py2-xgboost/0.80-ikaegh/lib/python2.7/site-packages/xgboost/include/"/>
  </client>
  <runtime name="ROOT_INCLUDE_PATH" value="$INCLUDE" type="path"/>
  <runtime name="PATH" value="$INCLUDE" type="path"/>
  <use name="rabit"/>
</tool>

rabit.xml

 <tool name="rabit" version="0.80">
   <client>
     <environment name="INCLUDE" default="/cvmfs/cms.cern.ch/$SCRAM_ARCH/external/py2-xgboost/0.80-ikaegh/lib/python2.7/site-packages/xgboost/rabit/include/"/>
   </client>
   <runtime name="ROOT_INCLUDE_PATH" value="$INCLUDE" type="path"/>
   <runtime name="PATH" value="$INCLUDE" type="path"/>  
 </tool>

Please note that the path in cvmfs is not fixed. One can list all available versions in the py2-xgboost directory and choose one to use.
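One way to list the installed versions is sketched below. The helper `list_tool_versions` is hypothetical, and the directory layout is simply taken from the paths in the xml files above; on a machine where cvmfs is not mounted the function returns an empty list.

```python
import os


def list_tool_versions(scram_arch, tool="py2-xgboost", root="/cvmfs/cms.cern.ch"):
    """Return the version directories found under <root>/<arch>/external/<tool>."""
    tool_dir = os.path.join(root, scram_arch, "external", tool)
    if not os.path.isdir(tool_dir):
        return []  # cvmfs not mounted, or tool not installed for this architecture
    return sorted(
        entry for entry in os.listdir(tool_dir)
        if os.path.isdir(os.path.join(tool_dir, entry))
    )


if __name__ == "__main__":
    arch = os.environ.get("SCRAM_ARCH", "slc7_amd64_gcc700")
    for version in list_tool_versions(arch):
        print(version)
```

Passing `tool="xgboost"` lists the versions used for the higher version setup in the same way.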

For higher version (>=1), add one xml file.

xgboost.xml

<tool name="xgboost" version="1.3.3">
  <lib name="xgboost"/>
  <client>
    <environment name="LIBDIR" default="/cvmfs/cms.cern.ch/$SCRAM_ARCH/external/xgboost/1.3.3/lib64"/>
    <environment name="INCLUDE" default="/cvmfs/cms.cern.ch/$SCRAM_ARCH/external/xgboost/1.3.3/include/"/>
  </client>
  <runtime name="ROOT_INCLUDE_PATH" value="$INCLUDE" type="path"/>
  <runtime name="PATH" value="$INCLUDE" type="path"/>  
</tool>

Here too, one is free to choose any of the versions available inside the xgboost directory.

After adding xml file(s), the following commands should be executed for setting up.
1. For lower version (<1), use
```
scram setup rabit
scram setup xgboost
```
2. For higher version (>=1), use
```
scram setup xgboost
```
For using XGBoost as a plugin of CMSSW, it is necessary to add
```
<use name="xgboost"/>
<flags EDM_PLUGIN="1"/>
```
in your plugins/BuildFile.xml. If you are using the interface inside the src/ or interface/ directory of your module, make sure to create a global BuildFile.xml file next to these directories, containing (at least):
```
<use name="xgboost"/>
<export>
  <lib   name="1"/>
</export>
```
The libxgboost.so library may be too large for a cmsRun job to load. Please pre-load it with the following command:
```
export LD_PRELOAD=$CMSSW_BASE/external/$SCRAM_ARCH/lib/libxgboost.so
```

Basic Usage of C API¶

To load a model and run inference with XGBoost's c_api, the following objects need to be constructed:

Files to include
```
#include <xgboost/c_api.h> 
```

BoosterHandle: worker of XGBoost

// Declare Object
BoosterHandle booster_;
// Allocate memory in C style
XGBoosterCreate(NULL,0,&booster_);
// Load Model
XGBoosterLoadModel(booster_,model_path.c_str()); // second argument should be a const char *.

DMatrixHandle: handle to dmatrix, the data format of XGBoost

float TestData[2000][8]; // Suppose 2000 data points, each data point has 8 dimensions
// Assign data to the "TestData" 2d array ... 
// Declare object
DMatrixHandle data_;
// Allocate memory and use external float array to initialize
XGDMatrixCreateFromMat((float *)TestData,2000,8,-1,&data_); // The first argument takes in float * namely 1d float array only, 2nd & 3rd: shape of input, 4th: value to replace missing ones
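XGDMatrixCreateFromMat reads the data as one contiguous block, with all features of the first data point followed by all features of the second, and so on. That is why a 2D C array can be cast to `float *` directly. The Python sketch below is illustrative only and spells out this layout and its index arithmetic; `flatten_row_major` is a hypothetical helper.

```python
from array import array


def flatten_row_major(rows):
    """Flatten a list of equal-length rows into a single-precision 1D buffer."""
    n_col = len(rows[0])
    if any(len(row) != n_col for row in rows):
        raise ValueError("all rows must have the same number of features")
    flat = array("f")  # "f" is the C float type, which is what DMatrix expects
    for row in rows:
        flat.extend(row)
    return flat, len(rows), n_col


points = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
flat, n_row, n_col = flatten_row_major(points)

# Element (i, j) of the 2D data sits at index i * n_col + j of the buffer.
i, j = 1, 2
print(flat[i * n_col + j])  # 6.0
print(list(flat))  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```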

XGBoosterPredict: function for inference

bst_ulong out_len; // bst_ulong is a typedef of unsigned long
const float *f; // array to store predictions
XGBoosterPredict(booster_,data_,0,0,&out_len,&f);// lower version API
// XGBoosterPredict(booster_,data_,0,0,0,&out_len,&f);// higher version API
/*
lower version (ver.<1) API
XGB_DLL int XGBoosterPredict(
BoosterHandle   handle,
DMatrixHandle   dmat,
int     option_mask, // 0 for normal output, namely reporting scores
unsigned ntree_limit, // how many trees for prediction, set to 0 means no limit
bst_ulong *     out_len,
const float **  out_result
)

higher version (ver.>=1) API
XGB_DLL int XGBoosterPredict(
BoosterHandle   handle,
DMatrixHandle   dmat,
int     option_mask, // 0 for normal output, namely reporting scores
unsigned ntree_limit, // how many trees for prediction, set to 0 means no limit
int     training, // 0 for prediction
bst_ulong *     out_len,
const float **  out_result
)
*/
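The predictions come back as one flat array of out_len floats. For a binary classifier such as the one trained above there is one score per data point. For a multi-class model that outputs class probabilities, the array holds the scores of all classes for the first data point, then those of the second, and so on. The sketch below shows how such a flat buffer maps back onto data points; it is a Python illustration of the layout with a hypothetical helper `reshape_scores`, not part of the C API.

```python
def reshape_scores(flat_scores, n_rows):
    """Group a flat prediction buffer into one list of scores per data point."""
    if n_rows <= 0 or len(flat_scores) % n_rows != 0:
        raise ValueError("out_len must be a multiple of the number of rows")
    per_row = len(flat_scores) // n_rows
    return [list(flat_scores[i * per_row:(i + 1) * per_row]) for i in range(n_rows)]


# Binary case: out_len == n_rows, one score per data point
binary = reshape_scores([0.9, 0.2, 0.6], 3)
print(binary)  # [[0.9], [0.2], [0.6]]

# Three-class case: out_len == 3 * n_rows
multi = reshape_scores([0.7, 0.2, 0.1, 0.1, 0.3, 0.6], 2)
print(multi)  # [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
```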

Full Example¶


The example assumes the following directory structure:

MySubsystem/MyModule/
│
├── plugins/
│   ├── XGBoostExample.cc
│   └── BuildFile.xml
│
├── python/
│   └── xgboost_cfg.py
│
├── toolbox/ (storing necessary xml(s) to be copied to toolbox/ of $CMSSW_BASE)
│   └── xgboost.xml
│   └── rabit.xml (lower version only)
│
└── data/
    └── Test_data.csv
    └── lowVer.model / highVer.model

Please also note that, to run inference event by event, XGBoosterPredict should be called in analyze rather than in beginJob.

The six files are listed below in this order:

plugins/XGBoostExample.cc for lower version XGBoost
plugins/BuildFile.xml for lower version XGBoost
python/xgboost_cfg.py for lower version XGBoost
plugins/XGBoostExample.cc for higher version XGBoost
plugins/BuildFile.xml for higher version XGBoost
python/xgboost_cfg.py for higher version XGBoost

plugins/XGBoostExample.cc for lower version XGBoost

// -*- C++ -*-
//
// Package:    XGB_Example/XGBoostExample
// Class:      XGBoostExample
//
/**\class XGBoostExample XGBoostExample.cc XGB_Example/XGBoostExample/plugins/XGBoostExample.cc

 Description: [one line class summary]

 Implementation:
     [Notes on implementation]
*/
//
// Original Author:  Qian Sitian
//         Created:  Sat, 19 Jun 2021 08:38:51 GMT
//
//


// system include files
#include <memory>

// user include files
#include "FWCore/Framework/interface/Frameworkfwd.h"
#include "FWCore/Framework/interface/one/EDAnalyzer.h"

#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"

#include "FWCore/ParameterSet/interface/ParameterSet.h"
 #include "FWCore/Utilities/interface/InputTag.h"
 #include "DataFormats/TrackReco/interface/Track.h"
 #include "DataFormats/TrackReco/interface/TrackFwd.h"

#include <xgboost/c_api.h>
#include <vector>
#include <tuple>
#include <string>
#include <iostream>
#include <fstream>
#include <sstream>

using namespace std;

vector<vector<double>> readinCSV(const char* name){
    auto fin = ifstream(name);
    vector<vector<double>> floatVec;
    string strFloat;
    float fNum;
    int counter = 0;
    getline(fin,strFloat);
    while(getline(fin,strFloat))
    {
        std::stringstream  linestream(strFloat);
        floatVec.push_back(std::vector<double>());
        while(linestream>>fNum)
        {
            floatVec[counter].push_back(fNum);
            if (linestream.peek() == ',')
            linestream.ignore();
        }
        ++counter;
    }
    return floatVec;
}

//
// class declaration
//

// If the analyzer does not use TFileService, please remove
// the template argument to the base class so the class inherits
// from  edm::one::EDAnalyzer<>
// This will improve performance in multithreaded jobs.



class XGBoostExample : public edm::one::EDAnalyzer<>  {
   public:
      explicit XGBoostExample(const edm::ParameterSet&);
      ~XGBoostExample();

      static void fillDescriptions(edm::ConfigurationDescriptions& descriptions);


   private:
      virtual void beginJob() ;
      virtual void analyze(const edm::Event&, const edm::EventSetup&) ;
      virtual void endJob() ;

      // ----------member data ---------------------------

    std::string test_data_path;
    std::string model_path;




};

//
// constants, enums and typedefs
//

//
// static data member definitions
//

//
// constructors and destructor
//
XGBoostExample::XGBoostExample(const edm::ParameterSet& config):
test_data_path(config.getParameter<std::string>("test_data_path")),
model_path(config.getParameter<std::string>("model_path"))
{

}


XGBoostExample::~XGBoostExample()
{

   // do anything here that needs to be done at desctruction time
   // (e.g. close files, deallocate resources etc.)

}


//
// member functions
//

void
XGBoostExample::analyze(const edm::Event& iEvent, const edm::EventSetup& iSetup)
{
}


void
XGBoostExample::beginJob()
{
    // Create the booster and load the trained model
    BoosterHandle booster_;
    XGBoosterCreate(NULL,0,&booster_);
    XGBoosterLoadModel(booster_,model_path.c_str());

    // Read the test data and copy it into a contiguous float array
    vector<vector<double>> TestDataVector = readinCSV(test_data_path.c_str());
    float TestData[2000][8];
    for(unsigned i=0; (i < 2000); i++)
    {
        for(unsigned j=0; (j < 8); j++)
        {
            TestData[i][j] = TestDataVector[i][j];
        }
    }

    // Wrap the array in a DMatrix; -1 marks missing values
    DMatrixHandle data_;
    XGDMatrixCreateFromMat((float *)TestData,2000,8,-1,&data_);

    // Run the prediction (ver.<1 signature) and print the first two scores
    bst_ulong out_len=0;
    const float *f;
    auto ret=XGBoosterPredict(booster_, data_, 0,0,&out_len,&f);
    cout<<ret<<endl; // 0 on success
    for (unsigned int i=0;i<2;i++)
        std::cout <<  i << "\t"<< f[i] << std::endl;

    // Release the handles
    XGDMatrixFree(data_);
    XGBoosterFree(booster_);
}

void
XGBoostExample::endJob()
{
}

void
XGBoostExample::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
  //The following says we do not know what parameters are allowed so do no validation
  // Please change this to state exactly what you do use, even if it is no parameters
  edm::ParameterSetDescription desc;
  desc.add<std::string>("test_data_path");
  desc.add<std::string>("model_path");
  descriptions.addWithDefaultLabel(desc);

  //Specify that only 'tracks' is allowed
  //To use, remove the default given above and uncomment below
  //ParameterSetDescription desc;
  //desc.addUntracked<edm::InputTag>("tracks","ctfWithMaterialTracks");
  //descriptions.addDefault(desc);
}

//define this as a plug-in
DEFINE_FWK_MODULE(XGBoostExample);

plugins/BuildFile.xml for lower version XGBoost

<use name="FWCore/Framework"/>
<use name="FWCore/PluginManager"/>
<use name="FWCore/ParameterSet"/>
<use name="DataFormats/TrackReco"/>
<use name="xgboost"/>
<flags EDM_PLUGIN="1"/>

python/xgboost_cfg.py for lower version XGBoost

# coding: utf-8

import os

import FWCore.ParameterSet.Config as cms
from FWCore.ParameterSet.VarParsing import VarParsing

# setup minimal options
#options = VarParsing("python")
#options.setDefault("inputFiles", "root://xrootd-cms.infn.it//store/mc/RunIIFall17MiniAOD/DYJetsToLL_M-10to50_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/94X_mc2017_realistic_v10-v2/00000/9A439935-1FFF-E711-AE07-D4AE5269F5FF.root")  # noqa
#options.parseArguments()

# define the process to run
process = cms.Process("TEST")

# minimal configuration
process.load("FWCore.MessageService.MessageLogger_cfi")
process.MessageLogger.cerr.FwkReport.reportEvery = 1
process.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(1))
#process.source = cms.Source("PoolSource",
#    fileNames=cms.untracked.vstring('file:/afs/cern.ch/cms/Tutorials/TWIKI_DATA/TTJets_8TeV_53X.root'))
process.source = cms.Source("EmptySource")
# process options
process.options = cms.untracked.PSet(
    allowUnscheduled=cms.untracked.bool(True),
    wantSummary=cms.untracked.bool(True),
)

process.XGBoostExample = cms.EDAnalyzer("XGBoostExample")

# setup MyPlugin by loading the auto-generated cfi (see MyPlugin.fillDescriptions)
#process.load("XGB_Example.XGBoostExample.XGBoostExample_cfi")
process.XGBoostExample.model_path = cms.string("/Your/Path/data/lowVer.model")
process.XGBoostExample.test_data_path = cms.string("/Your/Path/data/Test_data.csv")

# define what to run in the path
process.p = cms.Path(process.XGBoostExample)

plugins/XGBoostExample.cc for higher version XGBoost

// -*- C++ -*-
//
// Package:    XGB_Example/XGBoostExample
// Class:      XGBoostExample
//
/**\class XGBoostExample XGBoostExample.cc XGB_Example/XGBoostExample/plugins/XGBoostExample.cc

 Description: [one line class summary]

 Implementation:
     [Notes on implementation]
*/
//
// Original Author:  Qian Sitian
//         Created:  Sat, 19 Jun 2021 08:38:51 GMT
//
//


// system include files
#include <memory>

// user include files
#include "FWCore/Framework/interface/Frameworkfwd.h"
#include "FWCore/Framework/interface/one/EDAnalyzer.h"

#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"

#include "FWCore/ParameterSet/interface/ParameterSet.h"
 #include "FWCore/Utilities/interface/InputTag.h"
 #include "DataFormats/TrackReco/interface/Track.h"
 #include "DataFormats/TrackReco/interface/TrackFwd.h"

#include <xgboost/c_api.h>
#include <vector>
#include <tuple>
#include <string>
#include <iostream>
#include <fstream>
#include <sstream>

using namespace std;

vector<vector<double>> readinCSV(const char* name){
    auto fin = ifstream(name);
    vector<vector<double>> floatVec;
    string strFloat;
    float fNum;
    int counter = 0;
    getline(fin,strFloat);
    while(getline(fin,strFloat))
    {
        std::stringstream  linestream(strFloat);
        floatVec.push_back(std::vector<double>());
        while(linestream>>fNum)
        {
            floatVec[counter].push_back(fNum);
            if (linestream.peek() == ',')
            linestream.ignore();
        }
        ++counter;
    }
    return floatVec;
}

//
// class declaration
//

// If the analyzer does not use TFileService, please remove
// the template argument to the base class so the class inherits
// from  edm::one::EDAnalyzer<>
// This will improve performance in multithreaded jobs.



class XGBoostExample : public edm::one::EDAnalyzer<>  {
   public:
      explicit XGBoostExample(const edm::ParameterSet&);
      ~XGBoostExample();

      static void fillDescriptions(edm::ConfigurationDescriptions& descriptions);


   private:
      virtual void beginJob() ;
      virtual void analyze(const edm::Event&, const edm::EventSetup&) ;
      virtual void endJob() ;

      // ----------member data ---------------------------

    std::string test_data_path;
    std::string model_path;




};

//
// constants, enums and typedefs
//

//
// static data member definitions
//

//
// constructors and destructor
//
XGBoostExample::XGBoostExample(const edm::ParameterSet& config):
test_data_path(config.getParameter<std::string>("test_data_path")),
model_path(config.getParameter<std::string>("model_path"))
{

}


XGBoostExample::~XGBoostExample()
{

   // do anything here that needs to be done at desctruction time
   // (e.g. close files, deallocate resources etc.)

}


//
// member functions
//

void
XGBoostExample::analyze(const edm::Event& iEvent, const edm::EventSetup& iSetup)
{
}


void
XGBoostExample::beginJob()
{
    // Create the booster and load the trained model
    BoosterHandle booster_;
    XGBoosterCreate(NULL,0,&booster_);
    XGBoosterLoadModel(booster_,model_path.c_str());

    // Read the test data and copy it into a contiguous float array
    vector<vector<double>> TestDataVector = readinCSV(test_data_path.c_str());
    float TestData[2000][8];
    for(unsigned i=0; (i < 2000); i++)
    {
        for(unsigned j=0; (j < 8); j++)
        {
            TestData[i][j] = TestDataVector[i][j];
        }
    }

    // Wrap the array in a DMatrix; -1 marks missing values
    DMatrixHandle data_;
    XGDMatrixCreateFromMat((float *)TestData,2000,8,-1,&data_);

    // Run the prediction (ver.>=1 signature) and print all scores
    bst_ulong out_len=0;
    const float *f;
    auto ret=XGBoosterPredict(booster_, data_,0, 0,0,&out_len,&f);
    if (ret != 0)
        cout<<XGBGetLastError()<<endl; // report the failure reason
    for (unsigned int i=0;i<out_len;i++)
        std::cout <<  i << "\t"<< f[i] << std::endl;

    // Release the handles
    XGDMatrixFree(data_);
    XGBoosterFree(booster_);
}

void
XGBoostExample::endJob()
{
}

void
XGBoostExample::fillDescriptions(edm::ConfigurationDescriptions& descriptions) {
  //The following says we do not know what parameters are allowed so do no validation
  // Please change this to state exactly what you do use, even if it is no parameters
  edm::ParameterSetDescription desc;
  desc.add<std::string>("test_data_path");
  desc.add<std::string>("model_path");
  descriptions.addWithDefaultLabel(desc);

  //Specify that only 'tracks' is allowed
  //To use, remove the default given above and uncomment below
  //ParameterSetDescription desc;
  //desc.addUntracked<edm::InputTag>("tracks","ctfWithMaterialTracks");
  //descriptions.addDefault(desc);
}

//define this as a plug-in
DEFINE_FWK_MODULE(XGBoostExample);

plugins/BuildFile.xml for higher version XGBoost

<use name="FWCore/Framework"/>
<use name="FWCore/PluginManager"/>
<use name="FWCore/ParameterSet"/>
<use name="DataFormats/TrackReco"/>
<use name="xgboost"/>
<flags EDM_PLUGIN="1"/>

python/xgboost_cfg.py for higher version XGBoost

# coding: utf-8

import os

import FWCore.ParameterSet.Config as cms
from FWCore.ParameterSet.VarParsing import VarParsing

# setup minimal options
#options = VarParsing("python")
#options.setDefault("inputFiles", "root://xrootd-cms.infn.it//store/mc/RunIIFall17MiniAOD/DYJetsToLL_M-10to50_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/94X_mc2017_realistic_v10-v2/00000/9A439935-1FFF-E711-AE07-D4AE5269F5FF.root")  # noqa
#options.parseArguments()

# define the process to run
process = cms.Process("TEST")

# minimal configuration
process.load("FWCore.MessageService.MessageLogger_cfi")
process.MessageLogger.cerr.FwkReport.reportEvery = 1
process.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(10))
#process.source = cms.Source("PoolSource",
#    fileNames=cms.untracked.vstring('file:/afs/cern.ch/cms/Tutorials/TWIKI_DATA/TTJets_8TeV_53X.root'))
process.source = cms.Source("EmptySource")
#process.source = cms.Source("PoolSource",
#    fileNames=cms.untracked.vstring(options.inputFiles))
# process options
process.options = cms.untracked.PSet(
    allowUnscheduled=cms.untracked.bool(True),
    wantSummary=cms.untracked.bool(True),
)

process.XGBoostExample = cms.EDAnalyzer("XGBoostExample")

# setup MyPlugin by loading the auto-generated cfi (see MyPlugin.fillDescriptions)
#process.load("XGB_Example.XGBoostExample.XGBoostExample_cfi")
process.XGBoostExample.model_path = cms.string("/Your/Path/data/highVer.model")  
process.XGBoostExample.test_data_path = cms.string("/Your/Path/data/Test_data.csv")

# define what to run in the path
process.p = cms.Path(process.XGBoostExample)

Python Usage¶

To use XGBoost's Python interface, run the snippet below in the CMSSW environment.

# importing necessary modules
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
import matplotlib.pyplot as plt


xgb = XGBClassifier()
xgb.load_model('ModelName.model')

# After loading model, usage is the same as discussed in the model preparation section.

Caveat¶

It is worth mentioning that both the behavior and the APIs can differ between XGBoost versions.

When using the c_api for C/C++ inference, for ver.<1 the API is XGB_DLL int XGBoosterPredict(BoosterHandle handle, DMatrixHandle dmat, int option_mask, unsigned int ntree_limit, bst_ulong * out_len, const float ** out_result), while for ver.>=1 it gains an additional training argument and becomes XGB_DLL int XGBoosterPredict(BoosterHandle handle, DMatrixHandle dmat, int option_mask, unsigned int ntree_limit, int training, bst_ulong * out_len, const float ** out_result).
A model from ver.>=1 cannot be used with ver.<1.
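The model-compatibility rule can be turned into a small guard. This is only a sketch of the rule stated above: `major_version` and `model_is_loadable` are hypothetical helpers, and the version strings have to be recorded by the user, for example from `xgboost.__version__` at training time.

```python
def major_version(version):
    """Return the major version number of a string such as '1.3.3' or '0.80'."""
    return int(version.split(".")[0])


def model_is_loadable(trained_with, inference_with):
    """Apply the rule above: a model from ver.>=1 cannot be used with ver.<1."""
    return not (major_version(trained_with) >= 1 and major_version(inference_with) < 1)


print(model_is_loadable("1.3.3", "0.80"))   # False
print(model_is_loadable("0.80", "0.80"))    # True
print(model_is_loadable("1.3.3", "1.3.3"))  # True
```

The guard encodes only the one incompatibility named here; using the same major version for training and inference remains the safer choice.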

Another important issue for C/C++ users is that DMatrix only takes single-precision floats (float), not double-precision floats (double).
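The effect of the single-precision requirement can be seen by rounding a double to the nearest float. The sketch below uses Python's struct module purely for illustration; `to_float32` is a hypothetical helper.

```python
import struct


def to_float32(value):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


x = 0.1
x32 = to_float32(x)
print(x32 == x)                # False: the float and double approximations of 0.1 differ
print(abs(x32 - x) < 1e-8)     # True: the difference is small but non-zero
print(to_float32(0.5) == 0.5)  # True: 0.5 is exactly representable in both formats
```

Values read as double, as done by readinCSV in the full example, therefore have to be copied into a float array before the DMatrix is built.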

Appendix: Tips for XGBoost users¶

Importance Plot¶

XGBoost uses the F-score to describe feature importance quantitatively. XGBoost's Python API provides a convenient tool, plot_importance, to plot the feature importance once training is finished.

# Once the training is done, the plot_importance function can thus be used to plot the feature importance.
from xgboost import plot_importance # Import the function

plot_importance(xgb) # suppose the xgboost object is named "xgb"
plt.savefig("importance_plot.pdf") # plot_importance is based on matplotlib, so the plot can be saved using plt.savefig()

The importance plot is consistent with our expectation: in our toy model, the data points differ most in feature "7" (see the toy model setup).

ROC Curve and AUC¶

The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are key quantities for describing model performance. For XGBoost, the ROC curve and the AUC score can be easily obtained with the help of scikit-learn (sklearn) functions, which are also available in CMSSW software.

from sklearn.metrics import roc_auc_score,roc_curve,auc
# ROC and AUC should be obtained on test set
# Suppose the ground truth is 'y_test', and the output score is named as 'y_score'

fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
# plt.show() # display the figure when not using jupyter display
plt.savefig("roc.png") # save the resulting plot to a file
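To make the meaning of the AUC concrete, it can also be computed directly as the probability that a randomly chosen point of class 1 receives a higher score than a randomly chosen point of class 0. The pure-Python sketch below is meant as an illustration, not as a replacement for the sklearn functions above. The helper `auc_by_pairs` is hypothetical, and it scales quadratically with the number of points, so it is only suitable for small samples.

```python
def auc_by_pairs(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


y_test = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(y_test, y_score))  # 0.75
```

In this small example three of the four (class 1, class 0) pairs are ordered correctly, which gives 0.75.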

Reference of XGBoost¶

XGBoost Wiki: https://en.wikipedia.org/wiki/XGBoost
XGBoost Github Repo.: https://github.com/dmlc/xgboost
XGBoost official API tutorials:
Latest, Python: https://xgboost.readthedocs.io/en/latest/python/index.html
Latest, C/C++: https://xgboost.readthedocs.io/en/latest/tutorials/c_api_tutorial.html
Older (0.80), Python: https://xgboost.readthedocs.io/en/release_0.80/python/index.html
No Tutorial for older version C/C++ api, source code: https://github.com/dmlc/xgboost/blob/release_0.80/src/c_api/c_api.cc