Code Annotations
This page covers Proteus's annotation-based interface for runtime specialization. It includes both attribute-based annotations and the C++ annotation helpers for specializing scalars, arrays, objects, and captured lambda state.
This interface requires compiling the user application with Clang so the Proteus pass can observe the annotations and instrument the annotated regions.
If you want to construct JIT code directly instead of annotating existing code, see the C++ frontend API or the DSL API.
Kernels / Functions
The interface for host functions and device kernel functions is the same.
Proteus relies on the annotate attribute, which allows metadata to be
attached to any function, whether it runs on the CPU or as a GPU kernel.
Here is a DAXPY (double-precision scaled vector addition) example:
__attribute__((annotate("jit", 1, 2)))
void daxpy(double A, size_t N, double *X, double *Y) {
for (size_t I = 0; I < N; ++I)
Y[I] = A * X[I] + Y[I];
}
__global__
__attribute__((annotate("jit", 1, 2)))
void daxpy(double A, size_t N, double *X, double *Y) {
int Tid = threadIdx.x + (blockIdx.x * blockDim.x);
int Stride = blockDim.x * gridDim.x;
for (int I = Tid; I < N; I += Stride)
X[I] = A * X[I] + Y[I];
}
Here, the annotate("jit", 1, 2) attribute marks the function for JIT
compilation.
The comma-separated list after jit specifies which arguments
(1-indexed) Proteus should specialize: in this case, the scaling factor A and
the vector size N.
Note
Function arguments are 1-indexed.
When specializing, Proteus folds the designated arguments into constants.
This allows the downstream compiler optimization pipeline (for example at -O3)
to apply stronger transformations.
Folding runtime values into constants enhances analysis precision and unlocks
optimizations such as constant propagation, control-flow simplification, and
loop unrolling.
Lambdas
Portability libraries such as RAJA and Kokkos make heavy use of lambdas, so captured lambda state is an important specialization target.
Lambdas are slightly more complex than plain functions because they capture variables in closures. In practice, the greatest opportunities for specialization are often in the captured variables rather than the argument list.
Proteus therefore requires lambda functions to be registered with the
runtime using proteus::register_lambda.
To mark captured variables for specialization, they must be explicitly wrapped
with proteus::jit_variable.
Here is a simple example:
auto DaxpyLambda = proteus::register_lambda(
[=, A = proteus::jit_variable(A), N = proteus::jit_variable(N)]() {
for (size_t I = 0; I < N; ++I)
Y[I] = A * X[I] + Y[I];
});
DaxpyLambda();
template <typename Lambda>
__global__
__attribute__((annotate("jit")))
void kernel(Lambda &&Func) {
Func();
}
auto DaxpyLambda = proteus::register_lambda(
[=, A = proteus::jit_variable(A), N = proteus::jit_variable(N)] __device__() {
for (size_t I = 0; I < N; ++I)
Y[I] = A * X[I] + Y[I];
});
kernel<<<blocks, threads>>>(DaxpyLambda);
Note
For device lambdas, the kernel that executes the Proteus-registered lambda
must be annotated with jit.
C++ Annotations API
Beyond attribute-based annotations, Proteus also provides a C++ API for specialization through annotation functions. This API supports specialization of scalars, arrays, and objects.
When using these APIs, the contract is that annotated arrays and objects are treated as read-only within the specialized code. Specialized contents may be folded into constants, so these helpers are for inputs that should not be mutated by the specialized region.
The C++ annotations API includes:
template<typename T> proteus::jit_arg(T Scalar)for specializing a scalar valuetemplate<typename T> proteus::jit_array(T *Arr, size_t NumElts)for specializing array contentstemplate<typename T> proteus::jit_object(T *Obj or T &Obj)for specializing a trivially copyable object (copyable viastd::memcpy)
When you annotate a function or kernel with these APIs, Proteus will JIT-compile
it automatically.
There is no need to also use the annotate attribute.
Here is the same DAXPY example written with annotation functions:
void daxpy(double A, size_t N, double *X, double *Y) {
proteus::jit_arg(A);
proteus::jit_arg(N);
for (std::size_t I = 0; I < N; ++I)
Y[I] += X[I] * A;
}
__global__
void daxpy(double A, size_t N, double *X, double *Y) {
proteus::jit_arg(A);
proteus::jit_arg(N);
int Tid = threadIdx.x + (blockIdx.x * blockDim.x);
int Stride = blockDim.x * gridDim.x;
for (int I = Tid; I < N; I += Stride)
X[I] = A * X[I] + Y[I];
}
jit_array() and jit_object() are used similarly:
void sum_prefix(const double *A, double *Out, size_t N) {
proteus::jit_array(A, N);
Out[0] = A[0];
for (size_t I = 1; I < N; ++I)
Out[I] = Out[I - 1] + A[I];
}
struct Params {
int Tile;
double Scale;
};
void saxpy(const Params &P, double *X, double *Y, size_t N) {
proteus::jit_object(P);
for (size_t I = 0; I < N; ++I)
Y[I] += X[I] * P.Scale;
}