SPV_INTEL_2d_block_io
Table of Contents
Name Strings
Contact
Contributors
Notice
Status
Version
Dependencies
Overview
Extension Name
Modifications to the SPIR-V Specification, Version 1.6
Diagram
Mapping Block Data to Invocations
Out-of-Bounds Behavior
Restrictions
Issues
Revision History
Name Strings
SPV_INTEL_2d_block_io
Contact
To report problems with this extension, please open a new issue at:
Contributors
Ben Ashbaugh, Intel
Pekka Jääskeläinen, Intel
Victor Mustya, Intel
Yury Plyakhin, Intel
Notice
Copyright (c) 2025 Intel Corporation. All rights reserved.
Status
Complete
Version
Last Modified Date
2025-02-28
Revision
Dependencies
This extension is written against the SPIR-V Specification, Version 1.6, Revision 4.
This extension requires SPIR-V 1.0.
This extension interacts with the SPV_KHR_untyped_pointers extension by accepting untyped pointers as pointer operands.
This extension interacts with the SPV_INTEL_cache_controls extension by supporting cache control decorations on the pointer operands.
Overview
This extension adds subgroup block load and store instructions that read two-dimensional blocks of data from a two-dimensional region of memory, or write two-dimensional blocks of data to a two-dimensional region of memory.
This is an important operation for many machine learning algorithms, which operate on two-dimensional matrix data as part of a matrix multiplication algorithm.
The block sizes that are supported are device-specific.
A companion client API specification will describe the block sizes that are supported for a device.
This extension also adds support for two pre-processing operations that may be performed when loading a two-dimensional block of data:
The two-dimensional block may be transposed after loading and before it is written to the instruction’s destination.
The two-dimensional block may be transformed after loading and before it is written to the instruction’s destination.
The transform operation converts the two-dimensional block from a row-major layout to a packed layout by combining data elements from multiple block rows into 32-bit values.
This layout is used by some matrix multiplication instructions.
Extension Name
To use this extension within a SPIR-V module, the appropriate OpExtension must be present in the module:
OpExtension "SPV_INTEL_2d_block_io"
Modifications to the SPIR-V Specification, Version 1.6
Capabilities
Modify Section 3.31, Capability, adding rows to the Capability table:
Capability | Implicitly Declares
6228 Subgroup2DBlockIOINTEL |
6229 Subgroup2DBlockTransformINTEL | Subgroup2DBlockIOINTEL
6230 Subgroup2DBlockTransposeINTEL | Subgroup2DBlockIOINTEL
Instructions
Modify Section 3.42.21, Group Instructions, adding to the end of the list of instructions:
OpSubgroup2DBlockLoadINTEL
Loads one or more 2D blocks of data from a 2D row-major region of memory.
The 2D blocks of data are loaded collectively, as a subgroup operation.
The Element Size operand specifies the size of one block element, in bytes.
The Block Width, Block Height, and Block Count operands specify the total number of elements to load.
These operands must be constant instructions with scalar 32-bit integer type.
The Block Width specifies the number of elements in each block row.
The Block Height specifies the number of rows in each block.
The Block Count specifies the number of blocks to load.
If Block Count is greater than one, the blocks are loaded in row-major order, with the next block beginning immediately after the previous block.
Src Base Pointer is a pointer to the base of the 2D region of memory to load from.
It must be a pointer to the CrossWorkgroup storage class.
The Memory Width, Memory Height, and Memory Pitch operands specify the 2D region of memory to load from.
These operands must be integer type scalars.
The Memory Width specifies the width of the 2D region of memory, in bytes.
The Memory Height specifies the number of rows in the 2D region of memory.
The Memory Pitch specifies the number of bytes between each row in the 2D region of memory.
The Coordinate operand specifies the starting location in the 2D region of memory to load from.
It must be a vector of two integer type components.
The first component of Coordinate specifies the number of elements to skip, from the start of a row.
The second component of Coordinate specifies the number of rows to skip, from the base of the 2D region of memory.
Dst Pointer is a pointer to per-invocation storage that will hold the results of the 2D block load.
It must be a pointer to the Function storage class.
Behavior is undefined unless all invocations within the subgroup execute the same dynamic instance of this instruction.
Behavior is undefined unless Block Width, Block Height, Block Count, Src Base Pointer, Memory Width, Memory Height, Memory Pitch, and Coordinate are dynamically uniform for all invocations within the subgroup.
Follows the templated function:
template <typename T, int ElementSize, int BlockWidth, int BlockHeight, int BlockCount>
void OpSubgroup2DBlockLoadINTEL(
    const T* srcBasePointer,
    int memoryWidth,
    int memoryHeight,
    int memoryPitch,
    int2 coordinate,
    T* dstPointer);
Capability: Subgroup2DBlockIOINTEL
11 | 6231 | Element Size | Block Width | Block Height | Block Count | Src Base Pointer | Memory Width | Memory Height | Memory Pitch | Coordinate | Dst Pointer
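The operand descriptions above imply a simple byte-offset calculation for each element of a loaded block. The Python sketch below is illustrative only and is not part of this extension: the function name `element_byte_offset` and its parameter names are hypothetical, and it assumes that when Block Count is greater than one, consecutive blocks sit side by side along a row (one reading of "the next block beginning immediately after the previous block").

```python
def element_byte_offset(element_size, block_width, memory_pitch,
                        coordinate, block, row, col):
    """Byte offset, from Src Base Pointer, of element (row, col) of a block.

    Assumption: block number `block` starts `block * block_width` elements
    to the right of the column named by Coordinate.
    """
    x, y = coordinate                        # x in elements, y in rows
    column = x + block * block_width + col   # absolute column, in elements
    line = y + row                           # absolute row
    # Rows are Memory Pitch bytes apart; elements are Element Size bytes apart.
    return line * memory_pitch + column * element_size
```

Under these assumptions, with 2-byte elements, a Block Width of 4, a Memory Pitch of 64 bytes, and a Coordinate of (8, 3), the first element of the first block would sit 3 * 64 + 8 * 2 = 208 bytes past the Src Base Pointer.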
OpSubgroup2DBlockLoadTransposeINTEL
Loads and transposes one or more 2D blocks of data from a 2D row-major region of memory.
The 2D blocks of data are loaded collectively, as a subgroup operation.
The Element Size operand specifies the size of one block element, in bytes.
The Block Width, Block Height, and Block Count operands specify the total number of elements to load.
These operands must be constant instructions with scalar 32-bit integer type.
The Block Width specifies the number of elements in each block row, pre-transpose.
The Block Height specifies the number of rows in each block, pre-transpose.
The Block Count specifies the number of blocks to load.
If Block Count is greater than one, the blocks are loaded in row-major order, with the next block beginning immediately after the previous block.
Src Base Pointer is a pointer to the base of the 2D region of memory to load from.
It must be a pointer to the CrossWorkgroup storage class.
The Memory Width, Memory Height, and Memory Pitch operands specify the 2D region of memory to load from.
These operands must be integer type scalars.
The Memory Width specifies the width of the 2D region of memory, in bytes.
The Memory Height specifies the number of rows in the 2D region of memory.
The Memory Pitch specifies the number of bytes between each row in the 2D region of memory.
The Coordinate operand specifies the starting location in the 2D region of memory to load from.
It must be a vector of two integer type components.
The first component of Coordinate specifies the number of elements to skip, from the start of a row.
The second component of Coordinate specifies the number of rows to skip, from the base of the 2D region of memory.
Dst Pointer is a pointer to per-invocation storage that will hold the results of the transposed 2D block load.
It must be a pointer to the Function storage class.
Behavior is undefined unless all invocations within the subgroup execute the same dynamic instance of this instruction.
Behavior is undefined unless Block Width, Block Height, Block Count, Src Base Pointer, Memory Width, Memory Height, Memory Pitch, and Coordinate are dynamically uniform for all invocations within the subgroup.
Follows the templated function:
template <typename T, int ElementSize, int BlockWidth, int BlockHeight, int BlockCount>
void OpSubgroup2DBlockLoadTransposeINTEL(
    const T* srcBasePointer,
    int memoryWidth,
    int memoryHeight,
    int memoryPitch,
    int2 coordinate,
    T* dstPointer);
Capability: Subgroup2DBlockTransposeINTEL
11 | 6233 | Element Size | Block Width | Block Height | Block Count | Src Base Pointer | Memory Width | Memory Height | Memory Pitch | Coordinate | Dst Pointer
OpSubgroup2DBlockLoadTransformINTEL
Loads and transforms one or more 2D blocks of data into a packed format from a 2D row-major region of memory.
The transformation combines elements from multiple rows of the 2D region into packed 32-bit values.
The 2D blocks of data are loaded and transformed collectively, as a subgroup operation.
The Element Size operand specifies the size of one block element, in bytes.
The Block Width, Block Height, and Block Count operands specify the total number of elements to load.
These operands must be constant instructions with scalar 32-bit integer type.
The Block Width specifies the number of elements in each block row.
The Block Height specifies the number of rows in each block.
The Block Count specifies the number of blocks to load.
If Block Count is greater than one, the blocks are loaded in row-major order, with the next block beginning immediately after the previous block.
Src Base Pointer is a pointer to the base of the 2D region of memory to load from.
It must be a pointer to the CrossWorkgroup storage class.
The Memory Width, Memory Height, and Memory Pitch operands specify the 2D region of memory to load from.
These operands must be integer type scalars.
The Memory Width specifies the width of the 2D region of memory, in bytes.
The Memory Height specifies the number of rows in the 2D region of memory.
The Memory Pitch specifies the number of bytes between each row in the 2D region of memory.
The Coordinate operand specifies the starting location in the 2D region of memory to load from.
It must be a vector of two integer type components.
The first component of Coordinate specifies the number of elements to skip, from the start of a row.
The second component of Coordinate specifies the number of rows to skip, from the base of the 2D region of memory.
Dst Pointer is a pointer to per-invocation storage that will hold the results of the transformed 2D block load.
It must be a pointer to the Function storage class.
If it is an OpTypePointer pointer, it must point to a scalar 32-bit integer type.
Behavior is undefined unless all invocations within the subgroup execute the same dynamic instance of this instruction.
Behavior is undefined unless Block Width, Block Height, Block Count, Src Base Pointer, Memory Width, Memory Height, Memory Pitch, and Coordinate are dynamically uniform for all invocations within the subgroup.
Follows the templated function:
template <typename T, int ElementSize, int BlockWidth, int BlockHeight, int BlockCount>
void OpSubgroup2DBlockLoadTransformINTEL(
    const T* srcBasePointer,
    int memoryWidth,
    int memoryHeight,
    int memoryPitch,
    int2 coordinate,
    uint* dstPointer);
Capability: Subgroup2DBlockTransformINTEL
11 | 6232 | Element Size | Block Width | Block Height | Block Count | Src Base Pointer | Memory Width | Memory Height | Memory Pitch | Coordinate | Dst Pointer
OpSubgroup2DBlockPrefetchINTEL
Prefetches one or more blocks of data from a 2D row-major region of memory into a cache.
Prefetching does not affect the functionality of a module but may change its performance characteristics.
The 2D blocks of data are prefetched collectively, as a subgroup operation.
The Element Size operand specifies the size of one block element, in bytes.
The Block Width, Block Height, and Block Count operands specify the total number of elements to prefetch.
These operands must be constant instructions with scalar 32-bit integer type.
The Block Width specifies the number of elements in each block row.
The Block Height specifies the number of rows in each block.
The Block Count specifies the number of blocks to prefetch.
If Block Count is greater than one, the blocks are prefetched in row-major order, with the next block beginning immediately after the previous block.
Src Base Pointer is a pointer to the base of the 2D region of memory to prefetch from.
It must be a pointer to the CrossWorkgroup storage class.
The Memory Width, Memory Height, and Memory Pitch operands specify the 2D region of memory to prefetch.
These operands must be integer type scalars.
The Memory Width specifies the width of the 2D region of memory, in bytes.
The Memory Height specifies the number of rows in the 2D region of memory.
The Memory Pitch specifies the number of bytes between each row in the 2D region of memory.
The Coordinate operand specifies the starting location in the 2D region of memory to prefetch from.
It must be a vector of two integer type components.
The first component of Coordinate specifies the number of elements to skip, from the start of a row.
The second component of Coordinate specifies the number of rows to skip, from the base of the 2D region of memory.
Behavior is undefined unless all invocations within the subgroup execute the same dynamic instance of this instruction.
Behavior is undefined unless Block Width, Block Height, Block Count, Src Base Pointer, Memory Width, Memory Height, Memory Pitch, and Coordinate are dynamically uniform for all invocations within the subgroup.
Follows the templated function:
template <typename T, int ElementSize, int BlockWidth, int BlockHeight, int BlockCount>
void OpSubgroup2DBlockPrefetchINTEL(
    const T* srcBasePointer,
    int memoryWidth,
    int memoryHeight,
    int memoryPitch,
    int2 coordinate);
Capability: Subgroup2DBlockIOINTEL
10 | 6234 | Element Size | Block Width | Block Height | Block Count | Src Base Pointer | Memory Width | Memory Height | Memory Pitch | Coordinate
OpSubgroup2DBlockStoreINTEL
Stores one or more 2D blocks of data to a 2D region of memory.
The 2D blocks of data are stored collectively, as a subgroup operation.
The Element Size operand specifies the size of one block element, in bytes.
The Block Width, Block Height, and Block Count operands specify the total number of elements to store.
These operands must be constant instructions with scalar 32-bit integer type.
The Block Width specifies the number of elements in each block row.
The Block Height specifies the number of rows in each block.
The Block Count specifies the number of blocks to store.
If Block Count is greater than one, the blocks are stored in row-major order, with the next block beginning immediately after the previous block.
Src Pointer is a pointer to per-invocation storage that holds the data to store.
It must be a pointer to the Function storage class.
Dst Base Pointer is a pointer to the base of the 2D region of memory to store to.
It must be a pointer to the CrossWorkgroup storage class.
The Memory Width, Memory Height, and Memory Pitch operands specify the 2D region of memory to store to.
These operands must be integer type scalars.
The Memory Width specifies the width of the 2D region of memory, in bytes.
The Memory Height specifies the number of rows in the 2D region of memory.
The Memory Pitch specifies the number of bytes between each row in the 2D region of memory.
The Coordinate operand specifies the starting location in the 2D region of memory to store to.
It must be a vector of two integer type components.
The first component of Coordinate specifies the number of elements to skip, from the start of a row.
The second component of Coordinate specifies the number of rows to skip, from the base of the 2D region of memory.
Behavior is undefined unless all invocations within the subgroup execute the same dynamic instance of this instruction.
Behavior is undefined unless Block Width, Block Height, Block Count, Dst Base Pointer, Memory Width, Memory Height, Memory Pitch, and Coordinate are dynamically uniform for all invocations within the subgroup.
Follows the templated function:
template <typename T, int ElementSize, int BlockWidth, int BlockHeight, int BlockCount>
void OpSubgroup2DBlockStoreINTEL(
    const T* srcPointer,
    T* dstBasePointer,
    int memoryWidth,
    int memoryHeight,
    int memoryPitch,
    int2 coordinate);
Capability: Subgroup2DBlockIOINTEL
11 | 6235 | Element Size | Block Width | Block Height | Block Count | Src Pointer | Dst Base Pointer | Memory Width | Memory Height | Memory Pitch | Coordinate
Diagram
The diagram below shows the meaning of the 2D block load and store operands.
Mapping Block Data to Invocations
This section describes the mapping between the 2D block of data that is loaded or stored and the invocations in the subgroup.
First, the Block Width and Block Height are padded, if necessary.
For OpSubgroup2DBlockLoadINTEL, OpSubgroup2DBlockLoadTransformINTEL, and OpSubgroup2DBlockStoreINTEL, the Block Width is padded to the next power-of-two.
For OpSubgroup2DBlockLoadTransposeINTEL, the Block Height is padded to the next power-of-two.
For OpSubgroup2DBlockLoadTransformINTEL, the Block Height is padded to a multiple of four for 1-byte elements, and a multiple of two for 2-byte elements.
For loads, the value of any padded elements is zero.
For stores, the value of any padded elements is ignored.
For OpSubgroup2DBlockLoadTransformINTEL, the loaded block data is then transformed, by combining elements from multiple rows of a single column of the 2D region and packing them into 32-bit values.
For 2-byte elements, every two rows are combined into a 32-bit value, with the lower-numbered rows in the lower bits and the higher-numbered rows in the higher bits.
For 1-byte elements, every four rows are combined into a 32-bit value, with the lower-numbered rows in the lower bits and the higher-numbered rows in the higher bits.
This packed layout is sometimes referred to as a VNNI layout.
For OpSubgroup2DBlockLoadTransposeINTEL, the loaded block data is then transposed, by assigning the first column of the 2D block to the first row of the transposed 2D block, and so on.
Next, the rows of the 2D block are assigned to invocations in the subgroup.
Because the padded block width and the subgroup size are both powers of two, there are three scenarios to consider:
If the padded block width is equal to the subgroup size, each invocation is assigned one element of the block row.
If the padded block width is less than the subgroup size, multiple rows are assigned to the subgroup. The first row is assigned to the first set of invocations, then the next row is assigned to the next set of invocations, and so on.
If the padded block width is greater than the subgroup size, multiple elements of each block row are assigned to each invocation. The first set of elements is assigned to the first invocation, then the next set of elements is assigned to the next invocation, and so on.
In all cases, the lower numbered columns are assigned to the lower numbered invocations.
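The three scenarios above can be sketched in a few lines of Python. This is an illustration only, not normative text: the helper names `next_power_of_two` and `assign_to_invocations` are hypothetical, the block width passed in is assumed to be already padded, and both it and the subgroup size are assumed to be powers of two. The sketch reports which (row, column) positions each invocation would hold, in order.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (n must be positive)."""
    p = 1
    while p < n:
        p *= 2
    return p


def assign_to_invocations(block_width, block_height, subgroup_size):
    """Return, per invocation, the ordered list of (row, column) it holds.

    block_width is the padded width; it and subgroup_size are powers of two.
    """
    assigned = [[] for _ in range(subgroup_size)]
    if block_width <= subgroup_size:
        # Walk the block in row-major order, one element per invocation,
        # wrapping back to invocation 0 after each full subgroup.
        for row in range(block_height):
            for col in range(block_width):
                lane = (row * block_width + col) % subgroup_size
                assigned[lane].append((row, col))
    else:
        # Each invocation holds a contiguous run of elements from every row.
        per_lane = block_width // subgroup_size
        for row in range(block_height):
            for col in range(block_width):
                assigned[col // per_lane].append((row, col))
    return assigned
```

For instance, `assign_to_invocations(next_power_of_two(2), 4, 4)` models a block two elements wide and four rows high on a subgroup of four invocations, where two block rows fit across the subgroup at a time.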
Examples
Loading a two row by four column block of elements (Block Width equals four, Block Height equals two), with a subgroup size of four, using OpSubgroup2DBlockLoadINTEL:
Block data:
0,0  0,1  0,2  0,3
1,0  1,1  1,2  1,3
This is the case where the padded block width is equal to the subgroup size. In this case, each invocation is assigned one element of the block row. Therefore, because there are two rows:
Invocation 0 is assigned the values 0,0 and 1,0
Invocation 1 is assigned the values 0,1 and 1,1
Invocation 2 is assigned the values 0,2 and 1,2
Invocation 3 is assigned the values 0,3 and 1,3
Loading a four row by two column block of elements (Block Width equals two, Block Height equals four), with a subgroup size of four, using OpSubgroup2DBlockLoadINTEL:
Block data:
0,0  0,1
1,0  1,1
2,0  2,1
3,0  3,1
This is the case where the padded block width is less than the subgroup size. In this case, the first row is assigned to Invocation 0 and Invocation 1, and the second row is assigned to Invocation 2 and Invocation 3, and so on. Therefore:
Invocation 0 is assigned the values 0,0 and 2,0
Invocation 1 is assigned the values 0,1 and 2,1
Invocation 2 is assigned the values 1,0 and 3,0
Invocation 3 is assigned the values 1,1 and 3,1
Loading a two row by eight column block of elements (Block Width equals eight, Block Height equals two), with a subgroup size of four, using OpSubgroup2DBlockLoadINTEL:
Block data:
0,0  0,1  0,2  0,3  0,4  0,5  0,6  0,7
1,0  1,1  1,2  1,3  1,4  1,5  1,6  1,7
This is the case where the padded block width is greater than the subgroup size. In this case, the first set of elements of each block row is assigned to Invocation 0, the next set of elements is assigned to Invocation 1, and so on. Therefore:
Invocation 0 is assigned the values 0,0; 0,1; 1,0; and 1,1
Invocation 1 is assigned the values 0,2; 0,3; 1,2; and 1,3
Invocation 2 is assigned the values 0,4; 0,5; 1,4; and 1,5
Invocation 3 is assigned the values 0,6; 0,7; 1,6; and 1,7
Loading a four row by two column block of elements (Block Width equals two, Block Height equals four), with a subgroup size of four, using OpSubgroup2DBlockLoadTransposeINTEL:
Block data (pre-transpose):
0,0  0,1
1,0  1,1
2,0  2,1
3,0  3,1
After transposition, the block has two rows and four columns, the same shape as in the first example, so (using the pre-transpose element labels):
Invocation 0 is assigned the values 0,0 and 0,1
Invocation 1 is assigned the values 1,0 and 1,1
Invocation 2 is assigned the values 2,0 and 2,1
Invocation 3 is assigned the values 3,0 and 3,1
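The transpose example above can be reproduced with a short Python sketch. It is illustrative only: the helper names `transpose_block` and `assign_full_width_rows` are hypothetical, elements are labeled with their pre-transpose (row, column) position, and only the scenario where the transposed row width equals the subgroup size is modeled.

```python
def transpose_block(rows):
    """Column k of the loaded block becomes row k of the transposed block."""
    return [list(column) for column in zip(*rows)]


def assign_full_width_rows(rows):
    """Assignment when the row width equals the subgroup size:
    invocation i holds element i of every row, in row order."""
    subgroup_size = len(rows[0])
    return [[row[i] for row in rows] for i in range(subgroup_size)]


# Label each element with its pre-transpose (row, column) position.
loaded = [[(r, c) for c in range(2)] for r in range(4)]   # 4 rows x 2 columns
transposed = transpose_block(loaded)                       # 2 rows x 4 columns
per_invocation = assign_full_width_rows(transposed)
```

In this sketch, each invocation ends up holding one complete row of the pre-transpose block, which is what the list above shows.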
Loading a two row by four column block of two-byte elements (Block Width equals four, Block Height equals two), with a subgroup size of four, using OpSubgroup2DBlockLoadTransformINTEL:
Block data:
0,0  0,1  0,2  0,3
1,0  1,1  1,2  1,3
For two-byte elements, the transform operation combines every two rows together to form a 32-bit value. Therefore:
Invocation 0 is assigned the 32-bit value 1,0 | 0,0
Invocation 1 is assigned the 32-bit value 1,1 | 0,1
Invocation 2 is assigned the 32-bit value 1,2 | 0,2
Invocation 3 is assigned the 32-bit value 1,3 | 0,3
Loading a four row by four column block of one-byte elements (Block Width equals four, Block Height equals four), with a subgroup size of four, using OpSubgroup2DBlockLoadTransformINTEL:
Block data:
0,0  0,1  0,2  0,3
1,0  1,1  1,2  1,3
2,0  2,1  2,2  2,3
3,0  3,1  3,2  3,3
For one-byte elements, the transform operation combines every four rows together to form a 32-bit value. Therefore:
Invocation 0 is assigned the 32-bit value 3,0 | 2,0 | 1,0 | 0,0
Invocation 1 is assigned the 32-bit value 3,1 | 2,1 | 1,1 | 0,1
Invocation 2 is assigned the 32-bit value 3,2 | 2,2 | 1,2 | 0,2
Invocation 3 is assigned the 32-bit value 3,3 | 2,3 | 1,3 | 0,3
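The packing used in the two transform examples above can be sketched in Python as follows. This is a hedged illustration rather than a definition: the function name `transform_block` is hypothetical, the block is modeled as a list of rows of unsigned integers, and only the 1-byte and 2-byte element sizes described in this section are covered.

```python
def transform_block(rows, element_size):
    """Pack a row-major block (list of rows of unsigned ints) into 32-bit values.

    element_size is 1 or 2 bytes; 4 // element_size rows are combined per value.
    """
    if element_size not in (1, 2):
        raise ValueError("this sketch only covers 1-byte and 2-byte elements")
    rows_per_value = 4 // element_size
    bits = 8 * element_size
    mask = (1 << bits) - 1
    width = len(rows[0])
    # Pad the height with zero rows up to a multiple of rows_per_value.
    padded = [list(r) for r in rows]
    while len(padded) % rows_per_value:
        padded.append([0] * width)
    packed = []
    for top in range(0, len(padded), rows_per_value):
        out_row = []
        for col in range(width):
            value = 0
            for j in range(rows_per_value):
                # Lower-numbered rows land in the lower bits.
                value |= (padded[top + j][col] & mask) << (j * bits)
            out_row.append(value)
        packed.append(out_row)
    return packed
```

Each row of the result holds one 32-bit value per column, so with a Block Width equal to the subgroup size, invocation i would hold element i of each packed row.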
Out-of-Bounds Behavior
If some or all of the 2D block is out-of-bounds, where the bounds are defined by the Memory Width and Memory Height, the behavior is as follows:
For loads, any out-of-bounds elements are assigned the value zero.
For prefetches and stores, any out-of-bounds elements are ignored.
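As a rough illustration of the load behavior, the Python sketch below reads one block from a `bytes` buffer and zero-fills anything outside the region. It is a sketch under stated assumptions, not a reference implementation: the name `load_block` is hypothetical, elements are decoded as little-endian unsigned integers, negative coordinates are treated as out-of-bounds, and an element counts as in-bounds only if all of its bytes lie within Memory Width.

```python
def load_block(memory, element_size, block_width, block_height,
               memory_width, memory_height, memory_pitch, coordinate):
    """Load one block from `memory` (a bytes object), zero-filling anything
    outside the Memory Width x Memory Height region."""
    x, y = coordinate
    block = []
    for row in range(block_height):
        line = y + row
        out_row = []
        for col in range(block_width):
            start = (x + col) * element_size   # byte offset within the row
            in_bounds = (0 <= line < memory_height and
                         0 <= start and start + element_size <= memory_width)
            if in_bounds:
                offset = line * memory_pitch + start
                raw = memory[offset:offset + element_size]
                # Assumption: little-endian element encoding.
                out_row.append(int.from_bytes(raw, "little"))
            else:
                out_row.append(0)              # out-of-bounds loads read zero
        block.append(out_row)
    return block
```

A store would follow the same bounds test but skip the write for out-of-bounds elements instead of producing zeros.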
Restrictions
The following restrictions apply to the 2D block load, store and prefetch instructions added by this extension:
The Element Size must be 1, 2, 4, or 8 bytes.
The Block Width must be a multiple of four for 1-byte elements, or a multiple of two for 2-byte elements.
Behavior is undefined unless:
the first component of Coordinate is a multiple of four for 1-byte elements, or a multiple of two for 2-byte elements.
the per-subgroup source or destination base address is cache-line aligned (64 bytes).
the per-invocation source or destination address is aligned to a multiple of the Element Size.
the Memory Width is greater than or equal to 64 bytes and less than or equal to 2^24 bytes.
the Memory Width is a multiple of four for 1-byte or 2-byte elements, or a multiple of the element size otherwise.
the Memory Height is greater than zero and less than or equal to 2^24 rows.
the Memory Pitch is greater than or equal to the Memory Width and a multiple of 16 bytes.
the SubgroupMaxSize is a power of two.
the SubgroupSize is equal to the SubgroupMaxSize; in other words, this is a full subgroup.
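Several of these restrictions are simple arithmetic checks on the operands. The Python sketch below is a hypothetical validator, not part of this extension: the function name `restriction_violations` and its parameters are made up for illustration, and it does not cover the per-invocation alignment or the subgroup-size restrictions, which depend on state outside the operands.

```python
def restriction_violations(element_size, block_width, coordinate_x,
                           base_address, memory_width, memory_height,
                           memory_pitch):
    """Return a list of violated restrictions (empty if none were found)."""
    if element_size not in (1, 2, 4, 8):
        return ["Element Size must be 1, 2, 4, or 8 bytes"]
    problems = []
    # 1-byte elements need multiples of four, 2-byte elements multiples of two.
    granule = {1: 4, 2: 2}.get(element_size, 1)
    if block_width % granule:
        problems.append("Block Width is not a multiple of %d" % granule)
    if coordinate_x % granule:
        problems.append("first Coordinate component is not a multiple of %d" % granule)
    if base_address % 64:
        problems.append("base address is not 64-byte aligned")
    if not 64 <= memory_width <= 2 ** 24:
        problems.append("Memory Width is outside [64, 2^24] bytes")
    width_multiple = 4 if element_size <= 2 else element_size
    if memory_width % width_multiple:
        problems.append("Memory Width is not a multiple of %d" % width_multiple)
    if not 0 < memory_height <= 2 ** 24:
        problems.append("Memory Height is outside (0, 2^24] rows")
    if memory_pitch < memory_width or memory_pitch % 16:
        problems.append("Memory Pitch is below Memory Width or not a multiple of 16")
    return problems
```

A producer of SPIR-V could run a check of this kind on constant operands before emitting an instruction, while values only known at run time would still need to satisfy the restrictions dynamically.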
Issues
How should this functionality work with untyped pointers (AKA opaque pointers)?
RESOLVED: Added an Element Size operand to explicitly specify the amount of data to load or store, rather than inferring the element size from typed pointers.
Note that this extension does not currently include optional Memory Operands to specify pointer alignment, because the pointer must already be aligned due to hardware restrictions.
Can we use a 32-bit integer-type scalar to represent the memory width, height, and pitch, or should we allow for 64-bit integers for very large matrices?
RESOLVED: We will use 32-bit integer-type scalars to represent the block width, height, and count, but we will allow for 64-bit integers to represent the memory width, height, and pitch, and for the block start coordinates.
The client API environment specs will restrict all of these operands to 32-bit integers initially, however.
Terminology-wise, should we use "width" and "height", or "rows" and "columns"?
RESOLVED: We will use "width" and "height" to describe both the block dimensions and the memory dimensions.
Terminology-wise, how should we describe the coordinate to read?
RESOLVED: The operand will simply be described as a vector coordinate.
This avoids needing to describe "X" or "Y" or "Row" or "Column" in the operand names.
The first coordinate will be the "X" or "Column" coordinate, and the second coordinate will be the "Y" or "Row" coordinate.
Terminology-wise, should we use "load" and "store", or "read" and "write"?
RESOLVED: We will use "load" and "store" for consistency with the rest of the SPIR-V specification.
What should the behavior be if some or all of the 2D block is out-of-bounds?
RESOLVED: The behavior is well-defined.
Specifically, out-of-bounds reads are assigned the value zero, and out-of-bounds prefetches and stores are ignored.
Revision History
Rev | Date | Author | Changes
 | 2025-01-07 | Ben Ashbaugh | Initial revision for publication
 | 2025-02-28 | Ben Ashbaugh | Updated restrictions