Efficient LLM inference on CPU: the approach explained
Let's dig into the theory behind the approach used to implement Neural Speed
In the previous article I introduced Neural Speed, a new inference engine that delivers impressive performance and runs efficiently on consumer-grade CPUs, without the need for expensive graphics cards or other dedicated hardware.
Before diving into the advanced features of Neural Speed, it makes sense to pause and understand how it works under the hood. The intention of this article is therefore to walk through a few concepts taken directly from the original Neural Speed documentation.
A little Premise
The team of researchers and developers behind Neural Speed leverages Intel Neural Compressor, an open-source Python library supporting popular model compression techniques on all mainstream deep learning frameworks, because it fully supports INT4 quantization recipes such as GPTQ, AWQ, TEQ and SignRound, allowing the INT4 model to be generated automatically.
They were also inspired by the ggerganov/ggml library, a fully open-source tensor library for machine learning written in C++ that enables efficient transformer model inference at the edge (on bare metal). Building on that idea, they developed a tensor library specifically for inference on CPU, supporting the mainstream processor instruction sets such as AVX2, AVX512, AVX512_VNNI, and the Advanced Matrix Extensions (AMX for short).
The results are remarkable: an average token-generation latency of 12 to 80 ms on "old" 4th Gen Intel Xeon Scalable processors with LLMs of 6B, 8B, 13B, and 20B parameters, all while keeping the accuracy loss within just 1% of the FP32 baseline.
The Approach
The approach used to develop Neural Speed consists of two major components:
1. an automatic INT4 quantization flow
Developed as a complement to Neural Compressor, which already supported multiple INT4 quantization recipes (GPTQ, SignRound, AWQ, TEQ, and round-to-nearest), Neural Speed allows recipe tuning across those recipes with different granularities and different group sizes: each recipe generates an INT4 model that is evaluated directly in the flow. If the INT4 quantized model reaches the required accuracy, it is passed to the LLM runtime for evaluation (a minimal sketch of this tuning loop is shown at the end of this section).
2. a framework consisting of a tensor library that supports the latest major CPU instruction sets for deep learning acceleration, complemented by an efficient LLM runtime that takes full advantage of that library
As shown in the following diagram, the framework is composed of several components; the ones dedicated to executing inference are the CPU Tensor Library and the LLM Optimizations.
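To make the automatic INT4 quantization flow of point 1 a bit more concrete, here is a minimal, hypothetical C++ sketch of the recipe-tuning loop. The types and helpers (Recipe, quantize_int4, evaluate_accuracy) are my own illustrative placeholders, not the actual Neural Compressor or Neural Speed API.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for an FP32 model and an INT4 quantization recipe.
struct Model {};
struct Int4Model {};
struct Recipe {
    std::string algorithm;  // e.g. "RTN", "GPTQ", "AWQ", "TEQ", "SignRound"
    int group_size;         // e.g. 32, 64, 128
    bool symmetric;         // symmetric vs. asymmetric scheme
};

// Illustrative placeholder helpers (assumptions, not the real APIs).
Int4Model quantize_int4(const Model& /*fp32_model*/, const Recipe& /*recipe*/) {
    return Int4Model{};  // a real implementation would run the chosen recipe
}
double evaluate_accuracy(const Int4Model& /*model*/) {
    return 0.0;          // a real implementation would score the quantized model
}

// Recipe tuning: try each recipe, keep the first INT4 model that meets the
// accuracy target; that model is then handed to the LLM runtime for inference.
std::optional<Int4Model> tune_int4(const Model& fp32_model,
                                   const std::vector<Recipe>& recipes,
                                   double accuracy_target) {
    for (const Recipe& recipe : recipes) {
        Int4Model candidate = quantize_int4(fp32_model, recipe);
        if (evaluate_accuracy(candidate) >= accuracy_target) {
            return candidate;  // accuracy requirement met: pass to the LLM runtime
        }
    }
    return std::nullopt;       // no recipe reached the required accuracy
}
```

The real flow, of course, relies on Neural Compressor's recipes and evaluation harness rather than these placeholders; the sketch only shows the shape of the loop.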
CPU Tensor Library
Inspired by the template design of NVIDIA/cutlass, this CPU tensor library has been developed to execute linear algebra subroutines, offering comprehensive support of INT4 kernels specifically targeted at x86_64 CPUs.
The CPU Tensor Library supports dynamic quantization of the input tensor along the batch and/or input-channel dimensions per group (32, 64, 128, ...), as well as weight quantization (with both symmetric and asymmetric schemes).
Support Matrix for the CPU Tensor Library (note: AMX is available only in the latest Intel Xeon Scalable processors, while VNNI is available on both Intel and AMD SoCs)
where:
A: per-batch and per-K group-wise dynamic quantization for the input tensor, where the per-K group-wise granularity also corresponds to the weight-quantization group size of the weight tensor; both symmetric and asymmetric quantization are supported;
B: per-batch dynamic dequantization for the output tensor.
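To illustrate what group-wise weight quantization means in practice, here is a minimal C++ sketch of symmetric INT4 quantization with a configurable group size. It is a conceptual example under my own assumptions, not the tensor library's actual kernel or packing layout.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric group-wise INT4 quantization of a weight row of length K.
// Each group of `group_size` values (e.g. 32, 64, 128) shares one FP32 scale.
struct Int4GroupQuant {
    std::vector<int8_t> q;      // quantized values clamped to [-8, 7], one per weight
    std::vector<float> scales;  // one scale per group
    int group_size;
};

Int4GroupQuant quantize_row_int4(const std::vector<float>& w, int group_size) {
    Int4GroupQuant out{std::vector<int8_t>(w.size()), {}, group_size};
    for (size_t g = 0; g < w.size(); g += group_size) {
        size_t end = std::min(w.size(), g + group_size);
        // Per-group scale derived from the largest magnitude in the group.
        float amax = 0.0f;
        for (size_t i = g; i < end; ++i) amax = std::max(amax, std::fabs(w[i]));
        float scale = (amax > 0.0f) ? amax / 7.0f : 1.0f;
        out.scales.push_back(scale);
        for (size_t i = g; i < end; ++i) {
            int q = static_cast<int>(std::lround(w[i] / scale));
            out.q[i] = static_cast<int8_t>(std::clamp(q, -8, 7));
        }
    }
    return out;
}

// Dequantization: w ≈ q * scale of the group the weight belongs to.
float dequantize(const Int4GroupQuant& t, size_t i) {
    return static_cast<float>(t.q[i]) * t.scales[i / t.group_size];
}
```

An asymmetric scheme would additionally store a per-group zero point; the library's real kernels also pack two INT4 values per byte and exploit the CPU instruction sets listed above, which this sketch deliberately leaves out.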
LLM Optimizations
Compared with the legacy KV Cache behavior, where generating a new token requires reallocating memory for all the tokens, the LLM Optimizations introduced by Neural Speed use a KV Cache with pre-allocated K/V memory where only the new token is written at each step. This is critical for the performance of an LLM inference engine, since almost all recent LLMs are decoder-only, Transformer-based models.
At the code level, the memory for the KV Cache is allocated in neural_speed/models/model_utils/model_utils.cpp, and this is tightly connected to how Fused Attention has been implemented.
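To illustrate the difference, here is a minimal, hypothetical C++ sketch of a pre-allocated per-layer KV cache where each generation step only writes the new token's slots instead of reallocating and copying the whole history. The structure and names are my own simplification, not Neural Speed's actual implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A pre-allocated per-layer KV cache: memory for up to `max_ctx` tokens is
// reserved once, so a generation step only writes the new token's K/V slots.
struct LayerKVCache {
    size_t head_num, head_size, max_ctx;
    std::vector<float> k;  // [max_ctx, head_num, head_size]
    std::vector<float> v;  // [max_ctx, head_num, head_size]
    size_t n_past = 0;     // number of tokens already cached

    LayerKVCache(size_t heads, size_t hsize, size_t ctx)
        : head_num(heads), head_size(hsize), max_ctx(ctx),
          k(ctx * heads * hsize), v(ctx * heads * hsize) {}

    // Append the K/V vectors of `n_new` freshly generated tokens at offset
    // n_past; no reallocation and no copy of previously cached tokens.
    void append(const float* k_new, const float* v_new, size_t n_new) {
        size_t stride = head_num * head_size;  // floats per token
        std::copy(k_new, k_new + n_new * stride, k.begin() + n_past * stride);
        std::copy(v_new, v_new + n_new * stride, v.begin() + n_past * stride);
        n_past += n_new;                       // n_past = n_past + N
    }
};
```

A legacy cache would instead grow its buffers at every step and copy all previously generated tokens, which is exactly the reallocation cost the pre-allocated design avoids.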
Fused Attention
Attention is one of the key elements in understanding whether a model and its inference are performing well. Someone defined it as "the driving force behind many of today's cutting-edge Large Language Models", and the definition fits quite well: attention allows these models to focus on the relevant elements of an input sequence, such as a sentence or a document, and extract the meaningful parts accurately. As mentioned earlier, to implement the additional optimizations in Neural Speed, a fused attention layer and a set of related utilities for the customized KV-cache it uses have been developed.
A large part of the implementation of the Fused Attention Layer is located in core/layers/mha_dense.cpp; however, the model builder can enable fused attention with the operators ne_flash_attn* defined in core/ne_layers.h.
In particular, core/layers/mha_dense.cpp contains the definition of the Boolean function bestla_reordered_attn_fp32_support(): if the check passes, the memory type is set to NE_TYPE_BTLA and Fused Attention is enabled.
Then, the function get_batch_kv_elements_from_gpt_params() returns the sizes in bytes of the k-cache and the v-cache respectively, for each batch and each layer (note: if fused attention is disabled, the sizes are returned in numbers of elements instead of bytes). The KV-cache is finally prepared with these two sizes by creating an ne_new_tensor inside model.kv_self.k and model.kv_self.v (or model.layer[il].k_cache and model.layer[il].v_cache).
To reach this level of optimization, the KV Cache is appended every time a new pair of K and V tensors is produced by evaluating the inner products for the Transformers' Q, K, V vectors (ne_mul_qkv): the operation appends the additional K/V tensors along the sequence-length dimension, resulting in n_past = n_past + N, where n_past is the number of previously cached tokens and N is the length of the current tokens. The K and V tensors to append must be contiguous in the head-size dimension.
With ne_permute, both the K- and V-tensors are permuted to batch x N x head_num x head_size, and the result is passed to ne_flash_attn_update_k() and ne_flash_attn_update_v(), together with n_past as an offset in the sequence-length dimension, to be concatenated to a "view" of the cached K/V of the current layer and current batch.
Now, with the KV Cache of customized type and layout, Attention can be computed at its best performance.
Note: ne_flash_attn accepts a Q-tensor of batch x head_num x N x head_size. Similarly to the append operation, the Q-tensor must be contiguous in the head-size dimension, while its strides (ne_tensor::nb) in the other 3 dimensions are configurable. The output is a contiguous tensor of batch x N x head_num x head_size.
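As a small illustration of the layout handling described above, here is a hypothetical C++ sketch of a 4-D tensor view whose permutation from batch x head_num x N x head_size to batch x N x head_num x head_size is expressed only by reordering shapes and strides, without copying any data. The struct and names are my own; they merely echo the roles of ne_tensor::nb and ne_permute.

```cpp
#include <array>
#include <cstddef>

// A minimal 4-D tensor "view": shape plus per-dimension strides (in elements),
// loosely mirroring the role of ne_tensor::nb described above.
struct View4D {
    const float* data;
    std::array<size_t, 4> shape;   // e.g. {batch, head_num, N, head_size}
    std::array<size_t, 4> stride;  // elements to skip to advance each dimension

    float at(size_t i0, size_t i1, size_t i2, size_t i3) const {
        return data[i0 * stride[0] + i1 * stride[1] + i2 * stride[2] + i3 * stride[3]];
    }
};

// Permute a view by reordering its axes: from batch x head_num x N x head_size
// to batch x N x head_num x head_size. Only metadata changes; no data is moved,
// and the innermost (head_size) dimension stays contiguous.
View4D permute_bhnd_to_bnhd(const View4D& t) {
    return View4D{
        t.data,
        {t.shape[0], t.shape[2], t.shape[1], t.shape[3]},
        {t.stride[0], t.stride[2], t.stride[1], t.stride[3]},
    };
}
```

Such a permuted view can then be written at offset n_past along the sequence-length dimension of the pre-allocated cache, which is the role the offset plays in the update functions described above.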
Next Steps
In the next article, coming next week, I'll take you on a deep dive into the advanced features of Neural Speed, a journey for both tinkerers and software developers. BesTLA and Tensor Parallelism also deserve a dedicated article, which I am planning to add to my publishing roadmap.
References
Shen et al., Efficient LLM Inference on CPUs, arXiv:2311.00502v2, 2023
Neural Speed documentation on GitHub, https://github.com/intel/neural-speed/blob/main/README.md
ggml, a tensor library for machine learning developed by Georgi Gerganov et al., https://github.com/ggerganov/ggml
João Lages, Transformers KV Caching Explained, https://medium.com/@joaolages/kv-caching-explained-276520203249, 2023
Shobhit Agarwal, Navigating the Attention Landscape: MHA, MQA, and GQA Decoded, 2023
* This content has been produced by human hands.