AOT compilation of TensorFlow models¶
Introduction¶
The usual way of evaluating TensorFlow models in CMSSW is through the TF C++ interface provided in PhysicsTools/TensorFlow
as described in the TensorFlow inference documentation. This way of model inference requires saving a trained model in the so-called frozen graph format, and then loading it through the TF C++ API within CMSSW, which builds an internal representation of the computational graph. While being straightforward and flexible, this approach entails two potential sources of overhead:
- The TF C++ library and runtime require a sizeable amount of memory.
- The internal graph representation is mostly identical to the one defined during model training, meaning that it is not necessarily optimal for fast inference.
Ahead-of-time (AOT) compilation of TensorFlow models is a way to avoid this overhead while potentially also reducing compute runtimes and memory footprint. It consists of three steps (note that you do not have to run these steps manually as there are tools provided to automate them):
- First, the computational graph is converted to a series of operations whose kernel implementations are based on the Accelerated Linear Algebra (XLA) framework.
- In this process, low-level optimization methods can be applied (kernel fusion, memory optimization, ...) that exploit the graph-structure of the model. More info can be found here and here.
- Using XLA, the sequence of operations can be compiled and converted to machine code which can then be invoked through a simple function call.
flowchart LR
SavedModel -- optimizations --> xla{XLA}
xla -- compilation --> aot{AOT}
aot --> model.h
aot --> model.o
One should note that the model integration and deployment workflow is significantly different. Since self-contained executable code is generated, any custom code (e.g. a CMSSW plugin) that depends on the generated code needs to be compiled every time the model changes - after each training, change of input / output signatures, or update of the batching options. However, the tools described below greatly simplify this process.
This approach works for most models and supports multiple inputs and outputs of different types. In general, various compute backends are supported (GPU, TPU) but for now, the implementation in CMSSW focuses on CPU only.
Further info:
- TensorFlow documentation
- Summary gist (summarizes the central steps of the compilation process and model usage)
- Talk at Core Software Meeting
The AOT mechanism was introduced in CMSSW_14_1_0_pre3 (cmssw#43941, cmssw#44519, cmsdist#9005). The interface is located at cmssw/PhysicsTools/TensorFlowAOT.
Note on dynamic batching
The compiled machine code is created with a fixed layout for buffers storing input values, intermediate layer values, and final outputs. Due to this, models have to be compiled with one or more static batch sizes. However, a mechanism is provided in the CMSSW interface that emulates dynamic batching by stitching the results of multiple smaller batch sizes for which the model was compiled. More info is given below.
Software setup¶
To run the examples shown below, create a minimal setup with the following snippet. Adapt the SCRAM_ARCH
according to your operating system and desired compiler.
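A minimal sketch of such a setup is shown below; the architecture and release name are examples, and any release containing the TensorFlowAOT interface (CMSSW_14_1_0_pre3 or later) should work.

```shell
# adapt the architecture and release to your needs
export SCRAM_ARCH="el8_amd64_gcc12"
export CMSSW_VERSION="CMSSW_14_1_0_pre3"

# source the CMS environment and create the release area
source "/cvmfs/cms.cern.ch/cmsset_default.sh"
cmsrel "${CMSSW_VERSION}"
cd "${CMSSW_VERSION}/src"

# initialize the release environment and build
cmsenv
scram b
```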
Saving your model¶
The AOT compilation process requires a TensorFlow model saved in the so-called SavedModel
format. Its output is a directory that usually contains the graph structure, weights and meta data.
Instructions on how to save your model are shown below, depending on whether you use Keras or plain TensorFlow with tf.function's. Also note that, in both examples, models are saved with a dynamic (that is, unspecified) batch size, which is taken advantage of in the compilation process in the subsequent step.
In order for Keras to build the internal graph representation before saving, make sure to either compile the model or pass an input_shape
to the first layer:
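The following sketch shows this for a small, fully-connected model; layer sizes, activations and the save path are placeholders, and TensorFlow 2.x with the built-in Keras API is assumed.

```python
import tensorflow as tf

# example network: 10 input features, two hidden layers, 3 output classes;
# the input_shape on the first layer lets Keras build the internal graph
# representation while keeping the batch size dynamic
model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="tanh", input_shape=(10,)),
    tf.keras.layers.Dense(100, activation="tanh"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# save in the SavedModel format
tf.saved_model.save(model, "/path/to/saved_model")
```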
Let's consider the case where you write your network model as a standalone function (usually a tf.function). In this case, you need to wrap its invocation inside a tf.Module instance, as shown below.
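A corresponding sketch using a tf.function wrapped in a tf.Module is shown below; variable names and shapes are illustrative.

```python
import tensorflow as tf


class MyModel(tf.Module):
    """Thin tf.Module wrapper around a standalone model function."""

    def __init__(self):
        super().__init__()
        # illustrative weights: 10 input features, 3 output classes
        self.W = tf.Variable(tf.random.normal([10, 3]), name="W")
        self.b = tf.Variable(tf.zeros([3]), name="b")

    # the input signature declares a dynamic (None) batch size
    @tf.function(input_signature=[tf.TensorSpec(shape=[None, 10], dtype=tf.float32)])
    def __call__(self, x):
        return tf.nn.softmax(tf.matmul(x, self.W) + self.b)


# instantiate and save in the SavedModel format
model = MyModel()
tf.saved_model.save(model, "/path/to/saved_model")
```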
The following files should have been created upon success.
SavedModel files
/path/to/saved_model
│
├── variables/
│ ├── variables.data-00000-of-00001
│ └── variables.index
│
├── assets/ # likely empty
│
├── fingerprint.pb
│
└── saved_model.pb
Model compatibility¶
Before the actual compilation, you can check whether your model contains any operation that is not XLA/AOT compatible. For this, simply run
> cmsml_check_aot_compatibility /path/to/saved_model --devices cpu
...
cpu: all ops compatible
and check its output. If you are interested in the full list of operations that are available (independent of your model), append --table
to the command.
Full output
> cmsml_check_aot_compatibility /path/to/saved_model --devices cpu
+----------------+-------+
| Operation | cpu |
+================+=======+
| AddV2 | yes |
+----------------+-------+
| BiasAdd | yes |
+----------------+-------+
| Const | yes |
+----------------+-------+
| Identity | yes |
+----------------+-------+
| MatMul | yes |
+----------------+-------+
| Mul | yes |
+----------------+-------+
| ReadVariableOp | yes |
+----------------+-------+
| Rsqrt | yes |
+----------------+-------+
| Softmax | yes |
+----------------+-------+
| Sub | yes |
+----------------+-------+
| Tanh | yes |
+----------------+-------+
cpu: all ops compatible
AOT compilation¶
The compilation of the model involves quite a few configuration options, as the code generation process is flexible. Therefore, this step requires a configuration file in either yaml or json format. An example is given below.
aot_config.yaml
model:
  # name of the model, required
  name: test
  # version of the model, required
  version: "1.0.0"
  # location of the saved_model directory, resolved relative to this file,
  # defaults to "./saved_model", optional
  saved_model: ./saved_model
  # serving key, defaults to "serving_default", optional
  serving_key: serving_default
  # author information, optional
  author: Your Name
  # additional description, optional
  description: Some test model

compilation:
  # list of batch sizes to compile, required
  batch_sizes: [1, 2, 4]
  # list of TF_XLA_FLAGS (for the TF -> XLA conversion), optional
  tf_xla_flags: []
  # list of XLA_FLAGS (for the XLA optimization itself), optional
  xla_flags: []
An additional example can be found here.
With that, we can initiate the compilation process.
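This is done with the cms_tfaot_compile command. A possible invocation is sketched below; the option names are assumptions and should be checked against cms_tfaot_compile --help.

```shell
# option names are assumptions, see "cms_tfaot_compile --help" for the actual interface
cms_tfaot_compile \
    --aot-config aot_config.yaml \
    --output-directory "${CMSSW_BASE}/tfaot/test"
```

Upon success, the command output should look similar to the following lines.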
saved model at '/tmp/tmpb2qnby72'
compiling for batch size 1
compiling for batch size 2
compiling for batch size 4
successfully AOT compiled model 'test' for batch sizes: 1,2,4
Upon success, all generated files can be found in $CMSSW_BASE/tfaot/test
and should look as indicated below.
Generated files
${CMSSW_BASE}/tfaot/test
│
├── lib/
│ ├── test_bs1.o # object file compiled for batch size 1
│ ├── test_bs2.o # for batch size 2
│ └── test_bs4.o # for batch size 4
│
├── include/
│ └── tfaot-model-test
│ ├── test_bs1.h # header file generated for batch size 1
│ ├── test_bs2.h # for batch size 2
│ ├── test_bs4.h # for batch size 4
│ └── model.h # entry point that should be included by CMSSW plugins
│
└── tfaot-model-test.xml # tool file that sets up your scram environment
Note that the name of the model is injected into tfaot-model-NAME
, defining the names of the include directory as well as the tool file (xml).
At the end, the cms_tfaot_compile
command prints instructions on how to proceed. They are described in more detail below.
Inference in CMSSW¶
The model integration and inference can be achieved in five steps. Please find the full code example below. Also, take a look at the AOT interface unit tests to get a better idea of the API.
1. Tool setup¶
As printed in the instructions at the end of cms_tfaot_compile
, you should register the compiled model as a software dependency via scram. For this reason, a custom tool file was created that you need to set up.
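Assuming the example model named test from above, this boils down to a single scram setup call pointing at the generated tool file:

```shell
scram setup "${CMSSW_BASE}/tfaot/test/tfaot-model-test.xml"
```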
2. CMSSW module setup¶
In the next step, you should instruct your BuildFile.xml
(in SUBSYSTEM/MODULE/plugins/BuildFile.xml
if you are writing a CMSSW plugin, or in SUBSYSTEM/MODULE/BuildFile.xml
if you intend to use the model inside the src/ or interface/ directory of your module) to depend on the new tool. This could look like the following.
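A sketch of such a plugins/BuildFile.xml, assuming the compiled model tool from above (tfaot-model-test):

```xml
<!-- dependence on the compiled model tool and the AOT interface -->
<use name="tfaot-model-test"/>
<use name="PhysicsTools/TensorFlowAOT"/>

<!-- only needed when building an EDM plugin -->
<flags EDM_PLUGIN="1"/>
```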
3. Includes¶
In your source file, include the generated header file as well as the AOT interface.
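For the test model, the includes could look as follows; the exact path of the AOT interface header is an assumption, so check the PhysicsTools/TensorFlowAOT package if it differs.

```cpp
// AOT interface in CMSSW (header path assumed)
#include "PhysicsTools/TensorFlowAOT/interface/Model.h"

// entry point generated for the compiled model named "test"
#include "tfaot-model-test/model.h"
```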
4. Initialize objects¶
Your model is exposed through a type named tfaot_model::NAME. You can use it by initializing a tfaot::Model<T> instance, providing your type as a template parameter.
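For the test model, this amounts to a single declaration, sketched below.

```cpp
// wraps all batch-size-specialized versions of the compiled "test" model
tfaot::Model<tfaot_model::test> model;
```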
When used in a plugin such as an EDProducer
, you should create one model instance per plugin instance, that is, not as part of a GlobalCache
but as a normal instance member. As shown below, the model.run()
call is not const
and thus, not thread-safe. The memory overhead is minimal though, as the model is just a thin wrapper around the compiled machine code.
At this point, one could also configure the dynamic batching strategies on the model. However, since this is optional and a performance optimization measure, it is described further below.
5. Inference¶
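The following sketch shows a simple inference call for the test model, assuming a single float input with 10 features per example and a single float output. The run() signature used here (output types as template parameters, the target batch size as first argument, followed by the input arrays) is an assumption; please consult the TensorFlowAOT unit tests for the exact API.

```cpp
// fill a batch of two examples with 10 input features each;
// tfaot::FloatArrays is a nested std::vector<std::vector<float>>
tfaot::FloatArrays input = {
    {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9},
    {0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0},
};

// evaluate the model, emulating a target batch size of 2
// (output types are declared through template parameters, signature assumed)
const auto& [output] = model.run<tfaot::FloatArrays>(2, input);

// the output is again a nested vector, e.g. output[0] contains the
// values predicted for the first example in the batch
float firstValue = output[0][0];
```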
Since, by design, we do not have access to TensorFlow's tf::Tensor
objects, the types tfaot::*Arrays
with *
being Bool
, Int32
, Int64
, Float
, or Double
are nested std::vector<std::vector<T>>
objects. This means that access is simple, but please be aware of cache locality when filling input values.
The model.run()
method is variadic in its inputs and outputs, both for the numbers and types of arguments. This means that a model with two inputs, float
and bool
, and two outputs, double
and int32_t
, would be called like this.
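A sketch of such a call, with the same caveat about the exact signature as above:

```cpp
// hypothetical inputs: float and bool values for a batch of one example
tfaot::FloatArrays in1 = {{0.1, 0.2, 0.3}};
tfaot::BoolArrays in2 = {{true, false}};

// declare both output types and pass both inputs after the target batch size
const auto& [out1, out2] = model.run<tfaot::DoubleArrays, tfaot::Int32Arrays>(1, in1, in2);
```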
Full example¶
The example assumes the following directory structure:
MySubsystem/MyModule/
│
├── plugins/
│ ├── MyPlugin.cpp
│ └── BuildFile.xml
│
└── test/
└── my_plugin_cfg.py
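A sketch of plugins/MyPlugin.cpp is shown below. It assumes the test model from above with a single float input of 10 features and a single float output; class and label names are illustrative, and the run() signature is assumed as in the inference section.

```cpp
/*
 * Sketch of a plugin running inference with an AOT compiled model.
 * Model name, input/output shapes and the run() signature are assumptions.
 */

#include <vector>

#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/Framework/interface/stream/EDAnalyzer.h"
#include "FWCore/MessageLogger/interface/MessageLogger.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"

// AOT interface and the header generated for the compiled "test" model
#include "PhysicsTools/TensorFlowAOT/interface/Model.h"
#include "tfaot-model-test/model.h"

class MyPlugin : public edm::stream::EDAnalyzer<> {
public:
  explicit MyPlugin(const edm::ParameterSet&);
  ~MyPlugin() override = default;

private:
  void analyze(const edm::Event&, const edm::EventSetup&) override;

  // one model instance per stream / plugin instance, not in a GlobalCache,
  // since model_.run() is not const
  tfaot::Model<tfaot_model::test> model_;
};

MyPlugin::MyPlugin(const edm::ParameterSet& config) {
  // optional: dynamic batch rules could be registered on model_ here
}

void MyPlugin::analyze(const edm::Event& event, const edm::EventSetup& setup) {
  // fill a single-example batch with 10 illustrative input features
  tfaot::FloatArrays input = {{0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}};

  // evaluate the model for a target batch size of 1 (signature assumed)
  const auto& [output] = model_.run<tfaot::FloatArrays>(1, input);

  // use the output values, here simply logged
  edm::LogInfo("MyPlugin") << "first output value: " << output[0][0];
}

DEFINE_FWK_MODULE(MyPlugin);
```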
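A matching sketch of plugins/BuildFile.xml:

```xml
<use name="FWCore/Framework"/>
<use name="FWCore/MessageLogger"/>
<use name="FWCore/ParameterSet"/>
<use name="PhysicsTools/TensorFlowAOT"/>
<use name="tfaot-model-test"/>

<flags EDM_PLUGIN="1"/>
```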
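And a sketch of test/my_plugin_cfg.py; since the example plugin does not read any event content, an EmptySource is sufficient.

```python
# coding: utf-8

import FWCore.ParameterSet.Config as cms

# define the process
process = cms.Process("TEST")

# minimal configuration
process.load("FWCore.MessageService.MessageLogger_cfi")
process.MessageLogger.cerr.FwkReport.reportEvery = 1
process.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(10))

# no event content is needed for this example
process.source = cms.Source("EmptySource")

# register the example plugin
process.myPlugin = cms.EDAnalyzer("MyPlugin")

# define the path to run
process.p = cms.Path(process.myPlugin)
```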
Dynamic batching strategies¶
Compiled models are specialized for a single batch size, with buffers for inputs, intermediate layer values, and outputs being statically allocated. As explained earlier, the tfaot::Model<T> class (with T being the wrapper over all batch-size-specialized, compiled models) provides a mechanism that emulates dynamic batching. More details were presented in a recent Core Software meeting contribution.
Batch rules and strategies¶
Internally, tfaot::Model<T> distinguishes between the target batch size and composite batch sizes. The former is the batch size that the model should emulate, and the latter are the batch sizes for which the model was compiled.
BatchRule
's define how a target batch size should be emulated.
- A batch rule of 5:1,4 (in its string representation) would state that the target batch size of 5 should be emulated by stitching together the results of batch sizes 1 and 4.
- A batch rule of 5:2,4 would mean that the models compiled for batch sizes 2 and 4 are evaluated, with the latter being zero-padded by 1.
The BatchStrategy
of a model defines the set of BatchRule
's that are currently active.
Default rules and optimization¶
There is no general, a-priori choice of batch sizes that works best for all models. Instead, the optimal selection of batch sizes and batch rules depends on the model and the input data, and should be determined through profiling (e.g. using the MLProf project). However, the following guidelines can be used as a starting point.
- Unlike for other batch sizes, models compiled for a batch size of 1 (model1) are subject to an additional, internal optimization step due to reductions to one-dimensional arrays and operations. It is therefore recommended to always include a batch size of 1.
- For higher batch sizes, (single core) vectorization can be exploited, which can lead to a significant speed-up.
- The exact break-even points are model dependent. This means that for a target batch size of, say, 8 it could be more performant to evaluate model1 8 times than to evaluate model8 once, model4 twice or model2 four times. If this is the case, the optimization available for model1 (taking into account the stitching overhead!) outweighs the vectorization gains entailed by e.g. model8.
Also, it should be noted that not every possible target batch size must be configured through a batch rule. In fact, tfaot::Model<T> does not require any batch rule to be pre-defined, as an algorithm is in place that constructs default batch rules for yet unseen batch sizes (see the sketch after this list).
- If the target batch size matches one of the available, composite batch sizes, this size is used as is.
- Otherwise, the smallest available, composite batch size is repeated until the target batch size, or a larger value is reached. If the value is larger, zero-padding is applied to the last evaluation.
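The following Python sketch merely illustrates this default rule construction; it is not the actual implementation.

```python
def default_batch_sizes(target_size, composite_sizes):
    """
    Returns the list of composite (compiled) batch sizes used to emulate
    target_size, following the default rule construction described above.
    """
    # use the exact size if the model was compiled for it
    if target_size in composite_sizes:
        return [target_size]

    # otherwise, repeat the smallest compiled size until the target is reached
    # or exceeded; in the latter case, the last evaluation is zero-padded
    smallest = min(composite_sizes)
    sizes = []
    while sum(sizes) < target_size:
        sizes.append(smallest)
    return sizes


# examples for a model compiled for batch sizes 1, 2 and 4
print(default_batch_sizes(4, [1, 2, 4]))  # [4]
print(default_batch_sizes(3, [1, 2, 4]))  # [1, 1, 1]
```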
For central, performance-critical models, an optimization study should be conducted to determine the optimal batch sizes and rules.
XLA optimization¶
As described above, the conversion from a TensorFlow graph to compiled machine code happens in two stages which can be separately optimized through various flags. Both sets of flags can be controlled through the aot_config.yaml
.
- The TF-XLA boundary is configured through so-called tf_xla_flags in the compilation settings. Example: tf_xla_flags: ["--tf_xla_min_cluster_size=4"]. The full list of possible flags can be found here.
- The XLA optimizer and code generator can be controlled through xla_flags in the compilation settings. Example: xla_flags: ["--xla_cpu_enable_fast_math=true"]. The full list of possible flags can be found here.
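Both lists live in the compilation section of the AOT configuration shown earlier, for example:

```yaml
compilation:
  batch_sizes: [1, 2, 4]
  tf_xla_flags: ["--tf_xla_min_cluster_size=4"]
  xla_flags: ["--xla_cpu_enable_fast_math=true"]
```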
Production workflow¶
If you intend to integrate an AOT compiled model into CMSSW production code, you need to account for the differences with respect to deployment using other direct inference methods (e.g. TF C++ or ONNX). Since the model itself is represented as compiled code rather than an external model file that can be read and interpreted at runtime, production models must be registered as a package in CMSDIST. The components are shown below.
graph LR
CMSDATA -- provides SavedModel to --> CMSDIST
CMSDIST -- compiles model for --> CMSSW
The integration process takes place in four steps.
1. Push your model (in SavedModel format) to a central CMSDATA repository.
2. Create a new spec in CMSDIST (example), named tfaot-model-NAME.spec. This spec file should define two variables:
    - %{aot_config}: The path to an AOT configuration file (required).
    - %{aot_source}: A source to fetch, containing the model to compile (optional). When provided through a CMSDATA repository, you would typically declare it as a build requirement via BuildRequires: data-NAME and just define %{aot_config}. See tfaot-compile.file for more info.
3. Add your spec to the list of tfaot models.
4. After integration into CMSDIST, a tool named tfaot-model-NAME will be provided centrally, and the instructions for setting it up and using the compiled model in your plugin are identical to the ones described above.
Links and further reading¶
- cmsml package
- cms-tfaot package
- MLProf project
- CMSSW
- TensorFlow
- Keras
Authors: Marcel Rieger, Bogdan Wiederspan