SPV_NV_cooperative_matrix2
Table of Contents
Name Strings
Contact
Contributors
Notice
Status
Version
Dependencies
Overview
Extension Name
Modifications to the SPIR-V Specification, Version 1.6
Issues
Revision History
Name Strings
SPV_NV_cooperative_matrix2
Contact
To report problems with this extension, please open a new issue at:
Contributors
Jeff Bolz, NVIDIA
Karthik Vaidyanathan, NVIDIA
Notice
Copyright (c) 2024 NVIDIA Corp.
Status
Complete
Version
Last Modified Date: 2025-11-24
Revision
Dependencies
This extension is written against the SPIR-V Specification,
Version 1.6, Revision 3, Unified.
This extension requires SPIR-V 1.6.
This extension requires SPV_KHR_cooperative_matrix.
If CooperativeMatrixTensorAddressingNV is used, SPV_NV_tensor_addressing is
required.
Overview
This extension adds several new features building on the cooperative matrix
types added in SPV_KHR_cooperative_matrix. The goal is to add and accelerate
features beyond simple GEMM kernels, including support for type/use
conversions, reductions, per-element operations, and tensor addressing, and
to improve usability and out-of-the-box performance by supporting more
flexible matrix sizes and workgroup scope matrices with compiler-managed
staging through shared memory.
Extension Name
To use this extension within a SPIR-V module, the following OpExtension must
be present in the module:

OpExtension "SPV_NV_cooperative_matrix2"
Modifications to the SPIR-V Specification, Version 1.6
2.2 Terms
Add to the list of Tangled Instructions in section 2.2.5 Control Flow:
OpCooperativeMatrixLoadTensorNV, OpCooperativeMatrixStoreTensorNV,
OpCooperativeMatrixReduceNV, OpCooperativeMatrixConvertNV,
OpCooperativeMatrixTransposeNV, and OpCooperativeMatrixPerElementOpNV.
2.16 Validation Rules
Modify section 2.16.1, Universal Validation Rules:

Add OpCooperativeMatrixLoadTensorNV and OpCooperativeMatrixStoreTensorNV to
the list of instructions under "It is invalid for a pointer to be an operand
to any instruction other than:", when the Logical addressing model is
selected and neither the VariablePointers nor VariablePointersStorageBuffer
capability is declared.
If an OpTypeCooperativeMatrixKHR instruction uses a Scope of Workgroup, then
the workgroup size must have already been specified in the module, including
any constant instructions used by LocalSizeId.
In any function used as a DecodeFunc parameter to
OpCooperativeMatrixLoadTensorNV, as a Func parameter to
OpCooperativeMatrixPerElementOpNV, or as a CombineFunc parameter to
OpCooperativeMatrixReduceNV, and in any function called directly or
indirectly by those functions, tangled instructions are not allowed.
3.26 Memory Operands
Modify Section 3.26, "Memory Operands":

In the description of MakePointerAvailable, change "Not valid with OpLoad" to
"Not valid with OpLoad or OpCooperativeMatrixLoadKHR or
OpCooperativeMatrixLoadTensorNV".

In the description of MakePointerVisible, change "Not valid with OpStore" to
"Not valid with OpStore or OpCooperativeMatrixStoreKHR or
OpCooperativeMatrixStoreTensorNV".
3.31 Capabilities
Modify Section 3.31, "Capability", adding these rows to the Capability table:

Capability                                        Implicitly Declares

5430  CooperativeMatrixReductionsNV
      Enables cooperative matrix reduction instructions.

5431  CooperativeMatrixConversionsNV
      Enables cooperative matrix conversion/transpose instructions.

5432  CooperativeMatrixPerElementOperationsNV
      Enables cooperative matrix per-element operations.

5433  CooperativeMatrixTensorAddressingNV
      Enables cooperative matrix load/store instructions using tensor
      addressing (OpCooperativeMatrixLoadTensorNV and
      OpCooperativeMatrixStoreTensorNV).

5434  CooperativeMatrixBlockLoadsNV
      Enables the DecodeFunc parameter for OpCooperativeMatrixLoadTensorNV.
3.X Tensor Layout and View
Tensor layout and tensor view types are representations of the mapping
between matrix coordinates and tensor memory layout. They each have a
number of dimensions in the range [1,5], with dimension 0 being the
outermost dimension and the last dimension being the innermost. These types
have the following logical state:
struct tensorLayoutNV<Dim, Mode>
{
    static constexpr uint32_t LDim = Dim;
    static constexpr TensorClampMode clampMode = Mode;
    uint32_t blockSize[LDim];
    uint32_t layoutDimension[LDim];
    uint32_t stride[LDim];
    int32_t  offset[LDim];
    uint32_t span[LDim];
    uint32_t clampValue;
};

struct tensorViewNV<Dim, hasDimensions, p0, ..., pDim-1>
{
    static constexpr uint32_t VDim = Dim;
    static constexpr bool hasDim = hasDimensions;
    static constexpr uint32_t permutation[VDim] = {p0, ..., pDim-1};
    uint32_t viewDimension[VDim];
    uint32_t viewStride[VDim];
    uint32_t clipRowOffset, clipRowSpan, clipColOffset, clipColSpan;
};
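For illustration, a hypothetical instantiation of these logical types (the
names and parameters follow the pseudocode above; this is not normative
syntax):

// A 2D layout with ClampToEdge out-of-bounds behavior, and a 2D view whose
// permutation swaps the two dimensions (p0 = 1, p1 = 0).
tensorLayoutNV<2, ClampToEdge> layout;
tensorViewNV<2, /*hasDimensions*/ true, /*p0*/ 1, /*p1*/ 0> view;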
A tensor layout represents the layout of values in memory (number of
dimensions and size), along with a region being accessed (offset and span).
---------------------------------------------------------------------------
| layoutDimension1 |
| |
| |
| |
| |
| |
| |
| |
| span1 |
| ----------------- |
| | | |
| | | |
| | slice | span0 |
| | | layoutDimension0|
| | | |
| offset1 | | |
| ---------------> ----------------- |
| |
| ^ |
| | |
| | |
| | offset0 |
| | |
| | |
| | |
| | |
---------------------------------------------------------------------------
Figure: A 2D tensor layout, and a slice selecting a region within it.
A tensor view allows reinterpreting the dimensions of the region being
accessed, including changing the number of dimensions, reordering the
dimensions as they are loaded or stored, and clipping the region of the
matrix that is loaded or stored. Often the span will have the
same number of elements as the matrix, but in some more advanced uses
that may not be the case.
Loads and stores can either use just a tensor layout, or a tensor layout and
a tensor view. The addressing starts by treating the matrix itself as a 2D
"view" and mapping the (row,col) coordinate to a 1D index. If there is only a
tensor layout parameter, then that 1D index is mapped to an N-D coordinate
within the slice. If there is both a tensor layout and a tensor view, then
the 1D index is first mapped to a coordinate within the view, the coordinate
components are permuted, and the result is converted back to a 1D index,
which is then run through the tensor layout addressing calculation. The
tensor view dimensions and stride can be used to do more complex addressing
calculations. If the tensor view type has "hasDimensions" false, then the
dimensions of the tensor layout span are used instead.
The tensor view "clip" region restricts which elements of the matrix are
loaded or stored, and also affects the shape of the implicit 2D "view".
Unlike some other ML APIs, tensor layouts and views only describe
addressing calculations and never involve making copies of tensors. For
this reason, the functionality is slightly more limited (e.g. there’s no
way to slice, then permute, then slice again).
While these calculations may look expensive in their full generality,
certain calculations can be skipped when they’re not needed, and the
common cases should be quite efficient.
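As a sketch of why the common cases are cheap: assuming a 2D layout with 1x1
blocks, a span equal to the matrix dimensions, and all coordinates in bounds,
the matrixCoordToTensorElement calculation defined below collapses to
ordinary strided 2D addressing:

// Simplified result under the assumptions above (not normative):
// matrixCoordToLinear and linearToSpanCoord cancel out, leaving
index = (row + t.offset[0]) * t.stride[0] + (col + t.offset[1]) * t.stride[1];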
OpTensorLayout and OpTensorView instructions operate by copying existing
object state, updating the requested state, and returning that as a new
result. Some of these instructions initialize multiple related pieces of
state, setting some to common default values, so the order of the operations
matters.
For load and store functions with no TensorView parameter, an element index
is computed according to the matrixCoordToTensorElement function for each
(row,col) of the matrix, which has M rows and N columns. This converts the
(row,col) into a row-major index, converts that index into an N-dimensional
coordinate relative to the span, and uses the span coordinate to compute a
location within the tensor.
constexpr uint32_t MAX_DIM = 5;
using Coord = array<uint32_t, MAX_DIM>;

uint32_t matrixCoordToLinear(tensorLayoutNV t, uint32_t row, uint32_t col, uint32_t N)
{
    uint32_t index = row * N + col;
    return index;
}

Coord linearToSpanCoord(tensorLayoutNV t, uint32_t index)
{
    Coord spanCoord {};
    for (int32_t dim = t.LDim-1; dim >= 0; --dim) {
        spanCoord[dim] = index % t.span[dim];
        index /= t.span[dim];
    }
    return spanCoord;
}

auto spanCoordToTensorCoord(tensorLayoutNV t, Coord spanCoord)
{
    Coord blockCoord {};
    Coord coordInBlock {};
    for (uint32_t dim = 0; dim <= t.LDim-1; ++dim) {
        int32_t c = spanCoord[dim] + t.offset[dim];
        if (c < 0 || c >= t.layoutDimension[dim]) {
            ClampMode clampMode = t.clampMode;
            // For stores, other than Undefined, everything is treated as "discard"
            if (operation is a store && clampMode != Undefined) {
                clampMode = Constant;
            }
            // remainders are computed as defined in OpSMod
            switch (clampMode) {
            case Undefined:
                undefined behavior;
            case Constant:
                For load, set result value to t.clampValue;
                For store, discard the store;
                terminate index calculation;
            case ClampToEdge:
                c = min(max(c, 0), t.layoutDimension[dim]-1);
                break;
            case Repeat:
                c = c % t.layoutDimension[dim];
                break;
            case MirrorRepeat:
                c = c % (2*t.layoutDimension[dim]-2);
                c = (c >= t.layoutDimension[dim]) ? (2*t.layoutDimension[dim]-2-c) : c;
                break;
            }
        }
        coordInBlock[dim] = c % t.blockSize[dim];
        blockCoord[dim] = c / t.blockSize[dim];
    }
    return tuple(blockCoord, coordInBlock);
}

uint32_t tensorCoordToLinear(tensorLayoutNV t, Coord blockCoord)
{
    uint32_t index = 0;
    for (uint32_t dim = 0; dim <= t.LDim-1; ++dim) {
        index += blockCoord[dim] * t.stride[dim];
    }
    return index;
}

// map (row,col) -> linear index in span -> span coordinate -> tensor coordinate -> linear index in tensor
uint32_t matrixCoordToTensorElement(tensorLayoutNV t, uint32_t row, uint32_t col, uint32_t N)
{
    uint32_t index = matrixCoordToLinear(t, row, col, N);
    Coord spanCoord = linearToSpanCoord(t, index);
    Coord blockCoord;
    Coord coordInBlock;
    tie(blockCoord, coordInBlock) = spanCoordToTensorCoord(t, spanCoord);
    index = tensorCoordToLinear(t, blockCoord);
    return index;
}
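As a worked example (hypothetical values, not from the spec), consider a
64x64 row-major layout with 1x1 blocks, accessing a 16x16 slice at offset
(8,16):

// t.LDim == 2, with logical state:
//   t.blockSize       = {1, 1}
//   t.layoutDimension = {64, 64}
//   t.stride          = {64, 1}     (row-major: one row is 64 elements)
//   t.offset          = {8, 16}
//   t.span            = {16, 16}
// For matrix element (row,col) = (2,3) with N == 16:
//   matrixCoordToLinear:    index     = 2*16 + 3       = 35
//   linearToSpanCoord:      spanCoord = {35/16, 35%16} = {2, 3}
//   spanCoordToTensorCoord: c = {2+8, 3+16} = {10, 19}, in bounds, so
//                           blockCoord = {10, 19}, coordInBlock = {0, 0}
//   tensorCoordToLinear:    index     = 10*64 + 19*1   = 659
// The element is then loaded from or stored to
// Pointer + 659 * sizeof(component type).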
This index is then multiplied by the size of the component type of the matrix
and treated as a byte offset from the Pointer operand. The matrix element is
loaded from or stored to this location. The Pointer must be aligned to a
multiple of 16 bytes, but the region of elements selected by the span need
not be so aligned. If the OpCooperativeMatrixLoadTensorNV instruction has a
DecodeFunc parameter, then the blockCoord and coordInBlock arrays are passed
to it as parameters.
For load and store functions with a TensorView parameter, an element index is
computed according to the matrixCoordToTensorElementWithView function for
each (row,col) of the matrix, which has M rows and N columns. This computes a
row-major index relative to the clip region, converts that to an
N-dimensional coordinate relative to the permuted view dimensions, computes a
linear index from the view coordinate, and then runs through the tensor
layout calculation.
uint32_t matrixCoordToLinear(tensorLayoutNV t, tensorViewNV v, uint32_t row, uint32_t col, uint32_t N)
{
    if (row < v.clipRowOffset ||
        row >= v.clipRowOffset + v.clipRowSpan ||
        col < v.clipColOffset ||
        col >= v.clipColOffset + v.clipColSpan) {
        Load or store is skipped. For load, the matrix element is unmodified.
        terminate index calculation;
    }
    row -= v.clipRowOffset;
    col -= v.clipColOffset;
    uint32_t width = min(N, v.clipColSpan);
    uint32_t index = row * width + col;
    return index;
}

Coord linearToViewCoord(tensorLayoutNV t, tensorViewNV v, uint32_t index)
{
    auto &dimensions = v.hasDimensions ? v.viewDimension : t.span;
    Coord viewCoord {};
    for (int32_t dim = v.VDim-1; dim >= 0; --dim) {
        uint32_t i = v.permutation[dim];
        viewCoord[i] = index % dimensions[i];
        index /= dimensions[i];
    }
    return viewCoord;
}

uint32_t viewCoordToLinear(tensorLayoutNV t, tensorViewNV v, Coord viewCoord)
{
    Coord stride {};
    if (v.hasDimensions) {
        stride = v.viewStride;
    } else {
        // set stride to match t.span
        stride[v.VDim-1] = 1;
        for (int32_t dim = v.VDim-2; dim >= 0; --dim) {
            stride[dim] = stride[dim+1] * t.span[dim+1];
        }
    }
    uint32_t index = 0;
    for (int32_t dim = v.VDim-1; dim >= 0; --dim) {
        index += viewCoord[dim] * stride[dim];
    }
    return index;
}

// map (row,col) -> linear index in view -> view coordinate -> linear index in span
// -> span coordinate -> tensor coordinate -> linear index in tensor
uint32_t matrixCoordToTensorElementWithView(tensorLayoutNV t, tensorViewNV v, uint32_t row, uint32_t col, uint32_t N)
{
    uint32_t index = matrixCoordToLinear(t, v, row, col, N);
    Coord viewCoord = linearToViewCoord(t, v, index);
    index = viewCoordToLinear(t, v, viewCoord);
    Coord spanCoord = linearToSpanCoord(t, index);
    Coord blockCoord;
    Coord coordInBlock;
    tie(blockCoord, coordInBlock) = spanCoordToTensorCoord(t, spanCoord);
    index = tensorCoordToLinear(t, blockCoord);
    return index;
}
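As a worked example (hypothetical values, not from the spec), a tensor view
with permutation {1,0} can be used to read the transpose of a 16x16 region:

// v.VDim == 2, v.hasDim == true, v.permutation == {1, 0}, with state:
//   v.viewDimension = {16, 16}
//   v.viewStride    = {16, 1}
//   v.clipRowOffset = 0; v.clipRowSpan = 16;
//   v.clipColOffset = 0; v.clipColSpan = 16;
// For matrix element (row,col) = (2,3) with N == 16:
//   matrixCoordToLinear: index = 2*16 + 3 = 35 (inside the clip region)
//   linearToViewCoord:   components are assigned in permuted order, so
//                        viewCoord = {3, 2}   (row/col swapped)
//   viewCoordToLinear:   index = 3*16 + 2*1 = 50
// Index 50 then runs through the tensor layout exactly as in
// matrixCoordToTensorElement, so element (2,3) of the matrix is read from
// the location that element (3,2) of the span would normally occupy.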
The final result is then multiplied by the size of the component type of the
matrix and treated as a byte offset from Pointer. The matrix element is
loaded from or stored to this location.

For OpCooperativeMatrixLoadTensorNV instructions with a DecodeFunc operand,
rather than loading a value, the function operand is invoked for each matrix
element at least once. The function's return type must match the component
type of the result matrix type. The first parameter must be a pointer type
with storage class PhysicalStorageBuffer, and the parameter is filled with a
pointer computed by multiplying the index returned by
matrixCoordToTensorElement(WithView) by the size of the pointee type. The
second and third parameters must each be an array of 32-bit integers whose
dimension matches the tensor dimension. The second parameter is filled with
the blockCoord, and the third parameter with the coordInBlock, for the matrix
element being decoded. The return value is stored in the corresponding
element of the result matrix.

DecodeFunc is not allowed with OpCooperativeMatrixStoreTensorNV. Similarly, a
block size larger than 1 must not be used with
OpCooperativeMatrixStoreTensorNV because it would lead to data races.
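As a non-normative sketch of a decode function, assume a hypothetical 4-bit
quantized format with blockSize = {1,2}, so one byte holds a 1x2 block;
dequantize is a stand-in for whatever decoding the format needs:

// Pseudocode following the conventions above. Parameter order per the rules:
// a PhysicalStorageBuffer pointer, then blockCoord, then coordInBlock.
float16_t decodeFunc(uint8_t *ptr, uint32_t blockCoord[2], uint32_t coordInBlock[2])
{
    uint8_t packed = *ptr;  // ptr already points at this element's block
    // Select the low or high nibble based on the position within the block.
    uint8_t nibble = (coordInBlock[1] == 0) ? (packed & 0xF) : (packed >> 4);
    return dequantize(nibble);  // hypothetical dequantization helper
}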
3.X Cooperative Matrix Reduce Mode
New section in 3 "Binary Form".
Cooperative Matrix Reduce Mode               Enabling Capabilities

0x1  Row
     Elements within each row of a matrix are reduced.
0x2  Column
     Elements within each column of a matrix are reduced.
0x4  2x2
     Elements within an aligned 2x2 neighborhood are reduced.

It is invalid to combine 2x2 with Row or Column. Row and Column can be used
together.
3.X Tensor Addressing Operands
New section in 3 "Binary Form".
This is a literal mask; it can be formed by combining the bits from multiple
rows in the table below. It provides additional operands to the listed memory
instructions. Bits that are set indicate that an additional operand follows,
as described by the table. If there are multiple following operands
indicated, they are ordered: those indicated by smaller-numbered bits appear
first. An instruction needing two masks must first provide the first mask
followed by the first mask's additional operands, and then provide the second
mask followed by the second mask's additional operands.

Used by: OpCooperativeMatrixLoadTensorNV, OpCooperativeMatrixStoreTensorNV.

Tensor Addressing Operands                   Enabling Capabilities

0x0  None

0x1  TensorView
     Addressing calculations use a tensor view. The TensorView is specified
     in a subsequent operand.
     CooperativeMatrixTensorAddressingNV

0x2  DecodeFunc
     Addressing calculations use a decode function. The DecodeFunc is
     specified in a subsequent operand.
     CooperativeMatrixBlockLoadsNV
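For example (a sketch based on the ordering rule above), a mask value of 0x3
selects both TensorView and DecodeFunc, and the extra operands follow in bit
order:

// Tensor Addressing Operands == 0x3 (TensorView | DecodeFunc):
//   ..., Literal 0x3, <id> TensorView, <id> DecodeFunc
// The TensorView <id> comes first because its bit (0x1) is smaller than the
// DecodeFunc bit (0x2).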
3.49.8 Memory Instructions
OpCooperativeMatrixLoadTensorNV
Load a cooperative matrix through a pointer.

Result Type is the type of the loaded object. It must be a cooperative matrix
type.

Pointer is a pointer from which the matrix will be loaded. If the Shader
capability was declared, Pointer must point into an array and any ArrayStride
decoration on Pointer is ignored.

Addressing calculations are performed as described in the Tensor Layout and
View section.

Object is a cooperative matrix object whose values are used for clipped
loads. It must have the same type as Result Type.

TensorLayout is a tensor layout that affects addressing calculations.

Memory Operand must begin with a Memory Operand literal.

Tensor Addressing Operands must begin with a Tensor Addressing Operands
literal. If the operands include DecodeFunc, then Pointer must point to the
PhysicalStorageBuffer or StorageBuffer storage class.

All the operands to this instruction must be dynamically uniform within every
instance of the Scope of the cooperative matrix.

Capability: CooperativeMatrixTensorAddressingNV

Word Count: 8+variable
Opcode: 5367
Operands: <id> Result Type, Result <id>, <id> Pointer, <id> Object,
<id> TensorLayout, Literal Memory Operand (plus optional literals and
<id>s), Literal Tensor Addressing Operands (plus optional literals and
<id>s)
OpCooperativeMatrixStoreTensorNV
Store a cooperative matrix through a pointer.

Pointer is a pointer to which the matrix will be stored. If the Shader
capability was declared, Pointer must point into an array and any ArrayStride
decoration on Pointer is ignored.

Addressing calculations are performed as described in the Tensor Layout and
View section.

Object is the object to store. Its type must be an
OpTypeCooperativeMatrixKHR.

TensorLayout is a tensor layout that affects addressing calculations.

Memory Operand must begin with a Memory Operand literal.

Tensor Addressing Operands must begin with a Tensor Addressing Operands
literal.

All the operands to this instruction must be dynamically uniform within every
instance of the Scope of the cooperative matrix.

Capability: CooperativeMatrixTensorAddressingNV

Word Count: 6+variable
Opcode: 5368
Operands: <id> Pointer, <id> Object, <id> TensorLayout, Literal Memory
Operand (plus optional literals and <id>s), Literal Tensor Addressing
Operands (plus optional literals and <id>s)
3.49.13. Arithmetic Instructions
OpCooperativeMatrixReduceNV
Computes a matrix where each element of the result matrix is computed from a
row, column, or neighborhood of the source matrix.

Result Type must be an OpTypeCooperativeMatrixKHR type.

The type of Matrix must be an OpTypeCooperativeMatrixKHR with the same
Component Type as Result Type. The type of Matrix and Result Type must each
have a Use of MatrixAccumulatorKHR and must have matching Scope.

If Reduce includes 2x2, the dimensions of Result Type must be half of the
dimensions of Matrix. Each element (r,c) of the result matrix is calculated
by combining elements (2*r, 2*c), (2*r+1, 2*c), (2*r, 2*c+1), and
(2*r+1, 2*c+1) of Matrix.

If Reduce equals Row, then Result Type must have the same number of rows as
Matrix but can have any supported number of columns. All elements of a row in
the result matrix have the same value, which is computed by combining all
elements of the corresponding row of Matrix.

If Reduce equals Column, then Result Type must have the same number of
columns as Matrix but can have any supported number of rows. All elements of
a column in the result matrix have the same value, which is computed by
combining all elements of the corresponding column of Matrix.

If Reduce includes Row and Column, Result Type can have any number of rows
and columns. All elements of the result matrix have the same value, which is
computed by combining all elements of Matrix.

CombineFunc must be an OpFunction with two parameters whose types and result
type all match the component type of Matrix. This function is called to
combine pairs of elements (or intermediate results) when computing the
reduction. This function should be mathematically commutative and associative
(though in practice, with floating point numbers, it may not be exactly
commutative/associative).

Capability: CooperativeMatrixReductionsNV

Word Count: 6
Opcode: 5366
Operands: <id> Result Type, Result <id>, <id> Matrix, Literal Reduce,
<id> CombineFunc
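A minimal sketch of a CombineFunc (assuming a float16_t component type; the
function name is illustrative), here computing a row/column maximum when used
with OpCooperativeMatrixReduceNV:

// Both parameters and the return type match the component type of Matrix.
float16_t combineMax(float16_t a, float16_t b)
{
    return max(a, b);  // commutative and associative, as recommended
}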
3.49.11 Conversion Instructions
Relax the restrictions on Op{F,S,U,etc.}Convert from
SPV_KHR_cooperative_matrix if CooperativeMatrixConversionsNV is enabled, to
allow the Use to mismatch, where the Use of the operand can be
MatrixAccumulatorKHR and the Use of the result type can be MatrixAKHR or
MatrixBKHR. The restriction on OpBitcast is not relaxed.
OpCooperativeMatrixConvertNV
Converts a cooperative matrix to another cooperative matrix with a different
Use.

Result Type must be an OpTypeCooperativeMatrixKHR.

The type of Matrix must be an OpTypeCooperativeMatrixKHR with the same
Component Type, Scope, Rows, and Columns as Result Type. The Use of Result
Type must be MatrixAKHR or MatrixBKHR and the Use of Matrix must be
MatrixAccumulatorKHR. For conversions that change both Component Type and
Use, use Op{F,S,U,etc.}Convert.

Capability: CooperativeMatrixConversionsNV

Word Count: 4
Opcode: 5293
Operands: <id> Result Type, Result <id>, <id> Matrix
OpCooperativeMatrixTransposeNV
Converts a cooperative matrix from MatrixAccumulatorKHR Use to MatrixBKHR Use
and transposes the matrix.

Result Type must be an OpTypeCooperativeMatrixKHR.

The type of Matrix must be an OpTypeCooperativeMatrixKHR with the same Scope
as Result Type, and with Rows and Columns swapped relative to Result Type.
The Use of Result Type must be MatrixBKHR and the Use of Matrix must be
MatrixAccumulatorKHR.

Capability: CooperativeMatrixConversionsNV

Word Count: 4
Opcode: 5390
Operands: <id> Result Type, Result <id>, <id> Matrix
3.49.9 Function Instructions
OpCooperativeMatrixPerElementOpNV
Applies an operation to each element of a cooperative matrix.

The type of Matrix must be an OpTypeCooperativeMatrixKHR. Result Type must
match the type of Matrix.

Func must be an OpFunction whose return type matches the component type of
Matrix, whose first two parameters must be 32-bit integer types, whose third
parameter type must match the component type of Matrix, and which may have
additional parameters. The function is called for each element of the matrix,
where the element is passed as the third parameter to the function, the row
and column number of the matrix are passed as the first and second
parameters, and any optional operands are passed in order as the remaining
parameters.

Any additional cooperative matrix operands have the corresponding matrix
element passed to the function. The matrix type must match the type of
Matrix, and the component type of the matrix must match the type of the
corresponding parameter of Func.

The return value of Func is the corresponding element of Result.

The calls are considered unordered against each other, and calls may occur
more than once.

Capability: CooperativeMatrixPerElementOperationsNV

Word Count: 5+variable
Opcode: 5369
Operands: <id> Result Type, Result <id>, <id> Matrix, <id> Func,
Optional <id>, …
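A minimal sketch of a Func for OpCooperativeMatrixPerElementOpNV (assuming a
float16_t component type; relu is an illustrative name), applying a ReLU to
every element:

// The first two parameters are the 32-bit row and column indices; the third
// is the matrix element; the return value becomes the result element.
float16_t relu(uint32_t row, uint32_t col, float16_t x)
{
    return max(x, float16_t(0));
}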
Issues
How are matrix type conversions with a Use change handled?

Discussion: RESOLVED. We need to support conversions that change both
Component Type and Use at the same time, because there is often not a
supported intermediate type that matches one but not the other. For example,
if converting from f32 MatrixAccumulatorKHR to u8 MatrixAKHR, there may not
be support for u8 MatrixAccumulatorKHR or f32 MatrixAKHR. Conversions that
change the Component Type should use Op{F,S,U,etc.}Convert even if the Use
changes.

We also need to support conversions that only change the Use, for example
converting from f16 MatrixAccumulatorKHR to f16 MatrixAKHR. For this,
OpFConvert could be confusing/misleading, so we add a new
OpCooperativeMatrixConvertNV instruction for this case.
Revision History
Rev  Date        Author     Changes
2    2025-04-10  Jeff Bolz  Specify which instructions are tangled
1    2024-09-18  Jeff Bolz  Initial revision of SPV_NV_cooperative_matrix2