Ascend C Study Notes

AI Processor#

Components#

  • HBM (high-bandwidth memory, storage)
  • DVPP (hardware image/video encoding and decoding)
  • CPU (4 ARM cores): instruction dispatch and other control functions
  • AI Core (the NPU proper, AI computing): executes AI computation such as matrix computation (16×16×16), scalar computation, and vector computation (128 elements per instruction)

Difference between General Computing and AI Computing#

General computing is suited to most tasks, such as addition, subtraction, multiplication, and division. For matrix multiplication, however, a general-purpose core can only execute the multiply-adds serially, which is slow.

NPU computing can instead use vector computing units to process many elements at once, or hand matrix multiplication directly to the matrix computing units, which is much faster; a serial sketch for contrast follows below.
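To make the contrast concrete, here is a minimal plain-C++ sketch (illustrative only, not Ascend C): every multiply-add in the inner loop is a separate scalar instruction, which is exactly the cost that vector and matrix units amortize.

#include <cstddef>

// Naive serial matrix multiplication: an N x N x N product costs on the
// order of N^3 scalar multiply-add instructions.
void matmul_serial(const float *A, const float *B, float *C, std::size_t N) {
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < N; ++k) {
                acc += A[i * N + k] * B[k * N + j]; // one scalar multiply-add per iteration
            }
            C[i * N + j] = acc;
        }
    }
}
// A vector unit that handles e.g. 128 elements per instruction, or a matrix
// unit that multiplies 16x16 tiles per instruction, removes most of these
// per-element instructions.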

Heterogeneous computing: CPU + NPU, with the CPU responsible for control and the NPU responsible for computation.

Ascend C#

  • Native operator development, automatic scheduling.
  • Graph optimization, automatic pipelining.
  • Operator fusion: multiple different operators are fused automatically to reduce data copies in memory.
  • Graph execution sinking: all operators are sunk to execute on the NPU.
  • Automatic optimization (AOE).

Ascend C#

AI Core#

The AI Core comprises computing units, storage units, and control units.

(Figure: AI Core architecture)

Internal structure: the scalar computing unit serves as the core's CPU and issues instructions to the other units.
(Figure: AI Core internal structure)

A tensor is described by:

  • name
  • shape
  • dtype
  • format
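For example (an illustrative tensor, not one from these notes): a batch of 8 RGB images of size 224×224 could have name x, shape (8, 3, 224, 224), dtype float16, and format NCHW.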

format#

  • N: batch size, the number of images
  • H: height
  • W: width
  • C: channels; 3 for RGB
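The format determines how these four dimensions are laid out linearly in memory. A minimal sketch of the index arithmetic for the two common layouts, NCHW and NHWC (helper names are mine):

#include <cstddef>

// Offset of element (n, c, h, w) in a flat buffer, for two common layouts.
std::size_t offset_nchw(std::size_t n, std::size_t c, std::size_t h, std::size_t w,
                        std::size_t C, std::size_t H, std::size_t W) {
    return ((n * C + c) * H + h) * W + w; // channels outermost after batch
}

std::size_t offset_nhwc(std::size_t n, std::size_t c, std::size_t h, std::size_t w,
                        std::size_t C, std::size_t H, std::size_t W) {
    return ((n * H + h) * W + w) * C + c; // channels innermost
}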

Advantages of Ascend C#

(Figure: advantages of Ascend C)

Device Module#

Host and Device:

  • Host refers to the server (CPU side).
  • Device refers to the NPU.

Ascend C Operator Development#

Add Operator Development#

Operator Registration#

extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    KernelAdd op;
    op.Init(x, y, z);
    op.Process();
}

Here we define the kernel function add_custom, which takes three parameters: x, y, and z. x and y are the input tensors and z is the output tensor.

GM_ADDR is an alias for __gm__ uint8_t*, representing the address on global memory.
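Spelled out, the alias has roughly this form (a sketch matching the description above; the actual definition lives in the Ascend C headers):

// A raw byte address in global memory; the kernel casts it to the element
// type itself, e.g. (__gm__ half *)x in Init below.
using GM_ADDR = __gm__ uint8_t *;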

Operator Implementation#

Here, we use a class to implement the operator.

class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
    {
        ...
    }
    __aicore__ inline void Process()
    {
        ...
    }

private:
    __aicore__ inline void CopyIn(int32_t progress)
    {
        ...
    }
    __aicore__ inline void Compute(int32_t progress)
    {
        ...
    }
    __aicore__ inline void CopyOut(int32_t progress)
    {
        ...
    }

private:
    ...
};

The class exposes two main functions, Init and Process.

__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    // bind this core to its own BLOCK_LENGTH-element slice of global memory
    xGm.SetGlobalBuffer((__gm__ half *)x + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
    yGm.SetGlobalBuffer((__gm__ half *)y + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
    zGm.SetGlobalBuffer((__gm__ half *)z + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
    // allocate BUFFER_NUM local buffers of TILE_LENGTH halves for each queue
    pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    pipe.InitBuffer(inQueueY, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    pipe.InitBuffer(outQueueZ, BUFFER_NUM, TILE_LENGTH * sizeof(half));
}

Init is responsible for memory allocation and initialization.
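The tiling constants used above are not defined in these notes. For reference, the canonical add_custom sample defines them roughly as follows (the concrete values are the sample's, i.e. an assumption here, not something fixed by the API):

constexpr int32_t TOTAL_LENGTH = 8 * 2048;                            // total elements in x, y and z
constexpr int32_t USE_CORE_NUM = 8;                                   // number of AI cores used
constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;         // elements handled per core
constexpr int32_t TILE_NUM = 8;                                       // tiles per core
constexpr int32_t BUFFER_NUM = 2;                                     // 2 = double buffering
constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM / BUFFER_NUM; // elements per tile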

__aicore__ inline void Process()
{
    // the per-core block is split into TILE_NUM * BUFFER_NUM tiles of TILE_LENGTH elements
    int32_t loopCount = TILE_NUM * BUFFER_NUM;
    for (int32_t i = 0; i < loopCount; i++) {
        CopyIn(i);
        Compute(i);
        CopyOut(i);
    }
}

The Process function calls CopyIn, Compute, and CopyOut in a loop; together the three stages form a pipeline.

CopyIn copies a tile of data from global memory to local memory, Compute performs the computation on it, and CopyOut copies the result from local memory back to global memory.

__aicore__ inline void CopyIn(int32_t progress)
{
    // grab free local buffers and fill them from this tile's slice of global memory
    LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
    LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
    DataCopy(xLocal, xGm[progress * TILE_LENGTH], TILE_LENGTH);
    DataCopy(yLocal, yGm[progress * TILE_LENGTH], TILE_LENGTH);
    // publish the buffers so Compute can consume them
    inQueueX.EnQue(xLocal);
    inQueueY.EnQue(yLocal);
}
__aicore__ inline void Compute(int32_t progress)
{
    // blocks until CopyIn has enqueued this tile's inputs
    LocalTensor<half> xLocal = inQueueX.DeQue<half>();
    LocalTensor<half> yLocal = inQueueY.DeQue<half>();
    LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
    // element-wise vector addition over the tile
    Add(zLocal, xLocal, yLocal, TILE_LENGTH);
    outQueueZ.EnQue<half>(zLocal);
    // recycle the input buffers for the next tile
    inQueueX.FreeTensor(xLocal);
    inQueueY.FreeTensor(yLocal);
}
__aicore__ inline void CopyOut(int32_t progress)
{
    // blocks until Compute has enqueued this tile's result
    LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
    DataCopy(zGm[progress * TILE_LENGTH], zLocal, TILE_LENGTH);
    outQueueZ.FreeTensor(zLocal);
}

The computation itself is a single call to the Add vector instruction.

Because the different hardware units execute their instructions concurrently, queues are used in this operator to enforce the data dependencies between the pipeline stages.

DeQue blocks until the required data is available in the queue.

Operator Invocation#

The kernel is launched with the <<<...>>> syntax, which takes three arguments:

blockDim: the number of cores to launch on; set according to the device.
l2ctrl: reserved; currently set to nullptr.
stream: the stream on which the kernel is submitted.

void add_custom_do(uint32_t blockDim, void *l2ctrl, void *stream, uint8_t *x, uint8_t *y, uint8_t *z)
{
    add_custom<<<blockDim, l2ctrl, stream>>>(x, y, z);
}
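For completeness, a minimal host-side sketch of how this wrapper might be driven on the NPU (the ACL runtime calls are the standard ones, but the sizes and the omitted error handling are simplified assumptions, not the notes' code):

#include <cstdint>
#include "acl/acl.h"

extern void add_custom_do(uint32_t blockDim, void *l2ctrl, void *stream,
                          uint8_t *x, uint8_t *y, uint8_t *z);

int main()
{
    constexpr size_t byteSize = 8 * 2048 * sizeof(uint16_t); // half elements, assumed total size
    aclInit(nullptr);                                        // bring up the ACL runtime
    aclrtSetDevice(0);                                       // use device 0
    aclrtStream stream = nullptr;
    aclrtCreateStream(&stream);

    uint8_t *xDev, *yDev, *zDev;
    aclrtMalloc((void **)&xDev, byteSize, ACL_MEM_MALLOC_HUGE_FIRST);
    aclrtMalloc((void **)&yDev, byteSize, ACL_MEM_MALLOC_HUGE_FIRST);
    aclrtMalloc((void **)&zDev, byteSize, ACL_MEM_MALLOC_HUGE_FIRST);
    // ... aclrtMemcpy the host inputs into xDev/yDev (ACL_MEMCPY_HOST_TO_DEVICE) ...

    add_custom_do(8, nullptr, stream, xDev, yDev, zDev);     // 8 cores, matching the tiling above
    aclrtSynchronizeStream(stream);                          // wait for the kernel to finish
    // ... aclrtMemcpy zDev back to the host (ACL_MEMCPY_DEVICE_TO_HOST) ...

    aclrtFree(xDev);
    aclrtFree(yDev);
    aclrtFree(zDev);
    aclrtDestroyStream(stream);
    aclrtResetDevice(0);
    aclFinalize();
    return 0;
}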

Sinh Operator Development#

The structure of Sinh is similar to Add; the main difference is the number of parameters (one input x and one output y, instead of two inputs and one output).

The core difference lies in the implementation of the Compute function.

Here, Muls and Exp are used to first compute $e^{-x}$, then $e^x$; subtracting the two and multiplying by 0.5 gives $\sinh(x) = \frac{e^x - e^{-x}}{2}$.

__aicore__ inline void Compute(int32_t progress)
{
    LocalTensor<DTYPE_X> xLocal = inQueueX.DeQue<DTYPE_X>();
    LocalTensor<DTYPE_Y> yLocal = outQueueY.AllocTensor<DTYPE_Y>();
    LocalTensor<DTYPE_Y> tmpLocal = tmpBuffer.AllocTensor<DTYPE_Y>();
    // e^-x
    Muls(tmpLocal, xLocal, static_cast<DTYPE_X>(-1), this->tileLength);
    Exp(tmpLocal, tmpLocal, this->tileLength);
    // e^x 
    Exp(yLocal, xLocal, this->tileLength);
    // (e^x - e^-x)
    Sub(yLocal, yLocal, tmpLocal, this->tileLength);
    // (e^x - e^-x) / 2
    Muls(yLocal, yLocal, static_cast<DTYPE_X>(0.5), this->tileLength);
    outQueueY.EnQue<DTYPE_Y>(yLocal);
    inQueueX.FreeTensor(xLocal);
    tmpBuffer.FreeTensor(tmpLocal);
}
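One difference from Add worth noting: the temporary tensor needs its own local buffer. With tmpBuffer used as a queue as above, it must also be initialized in Init, along these lines (an assumption; the notes do not show Sinh's Init):

// in Init(), alongside the input/output queues (sketch)
pipe.InitBuffer(tmpBuffer, BUFFER_NUM, this->tileLength * sizeof(DTYPE_X));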