LLMs (Large Language Models) are incredible: they have demonstrated phenomenal performance and proven potential across thousands of different scenarios. However, they require massive amounts of computational resources to run: several gigabytes of storage to begin with, and an astronomical quantity of GPU VRAM. This is really impractical for the many users who would like to run simple natural language tasks, such as classification or entity recognition, on the consumer-grade hardware they already own.
The most obvious solution is to rely on the mainstream cloud providers (the Amazon AWS, Microsoft Azure, Google Cloud Platform triad), which offer services with ready-to-use APIs; but the scarcity of GPUs, the chip shortage, and the high demand have raised the costs of these online platforms. Those costs are not affordable for the vast audience of software developers, data scientists, researchers, students, hobbyists, and small professionals who want to try out LLM-based solutions and build something innovative from scratch, without relying on third-party services and a continuous, stable Internet connection; not to mention that the privacy and security guarantees of external providers are not always satisfactory.
The alternative I want to propose is to develop solutions that could take full advantage of your personal hardware.
But how is it possible to deploy large language models, given their immense number of parameters? It seems technically impossible.
As the owner of several Intel Core and Xeon platforms, I am always trying to squeeze every last drop of performance out of the silicon I personally own and have invested money in.
Intel Software has provided us with a very interesting solution to achieve impressive results even on consumer-grade hardware, without bothering cloud platforms or any other online service. The solution is called Neural Speed: it uses an effective approach that makes the deployment of LLMs more efficient. It is open source, publicly available on GitHub, actively developed, and fully supported.
Follow me on this journey to discover this new tool...
Neural Speed is based on an automatic INT4 weight-only quantization flow, and Intel's engineers designed a special LLM runtime with highly optimized kernels to accelerate LLM inference on CPUs.
What is INT4 quantization?
Quantization is a common technique used to reduce the numeric precision of the weights and activations of a neural network, in order to lower the computational cost of inference (and therefore the hardware and energy resources it requires). The most widely used approach today is INT8 quantization, because it offers a good trade-off between high inference performance and reasonable model accuracy. However, as Shen et al. (2023, arXiv:2311.00502v2) observed, outlier values in the activations limit its full adoption.
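To build some intuition before going further, here is a minimal, framework-agnostic sketch of the idea (plain NumPy, not Neural Speed code): symmetric per-tensor INT8 quantization of a weight matrix, followed by dequantization to measure how much precision is lost.

import numpy as np

# Toy symmetric INT8 quantization: map FP32 values onto [-127, 127]
# with a single scale factor, then dequantize to see the rounding error.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 8)).astype(np.float32)  # fake FP32 weights

scale = np.abs(w).max() / 127.0                            # one scale per tensor
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale              # back to FP32 for compute

print("max absolute error:", np.abs(w - w_dequant).max())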
There were a few proposals to overcome this activation outlier issue; one of them, for example, was FP8, a newly introduced data type that has not (yet) taken over the market.
In the process of democratizing AI, the open-source community has proven to be incredibly active in developing alternative solutions based on weight-only quantization: low precision is applied to the weights only, while the activations are kept at a higher precision, therefore maintaining model accuracy. This drastically cuts down the hardware resources required to run model inference.
In this context, it is quite common to target a precision of 4 bits for the weights, while relying on 16-bit floating point for the activations.
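To make the idea concrete, here is a small sketch (again plain NumPy, not Neural Speed's actual kernels, which pack two 4-bit values per byte and use optimized instructions) of a weight-only quantized linear layer: the weights are stored as 4-bit integers with one scale per group, the activations stay in FP16, and the weights are dequantized on the fly for the matrix multiplication.

import numpy as np

# Weight-only quantization sketch: INT4 weights (range [-7, 7]) with one
# scale per group of columns, FP16 activations, matmul in FP16.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(8, 16)).astype(np.float32)  # FP32 weights
x = rng.normal(size=(1, 8)).astype(np.float16)              # FP16 activations

group = 8                                                   # group size for the scales
w_groups = w.reshape(w.shape[0], -1, group)
scales = np.abs(w_groups).max(axis=2, keepdims=True) / 7.0
w_int4 = np.clip(np.round(w_groups / scales), -7, 7).astype(np.int8)

# Dequantize on the fly and run the matmul at higher precision.
w_dequant = (w_int4 * scales).reshape(w.shape).astype(np.float16)
y = x @ w_dequant
print(y.shape)                                              # (1, 16)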
An example of the route that a large part of the community is following is llama.cpp: an incredibly well-designed and well-implemented inference engine, born to embrace the low-bit weight-only quantization approach.
These implementations are often optimized for the NVIDIA CUDA framework (even though it is possible to run the models in CPU-only mode).
Neural Speed, on the other hand, is clearly inspired by llama.cpp, but it is further optimized for Intel platforms.
What is Neural Speed?
Neural Speed is an innovative C/C++ library that supports efficient inference of LLMs on Intel hardware through state-of-the-art low-bit quantization and highly fine-tuned low-precision kernels for CPUs that support ISAs such as AMX, VNNI, AVX512F, AVX_VNNI, and AVX2.
In addition, it supports tensor parallelism across CPU sockets and nodes.
Neural Speed has demonstrated up to a 40x performance speedup on popular LLMs compared with llama.cpp.
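If you are not sure which of these instruction sets your CPU exposes, you can check the flags reported by the Linux kernel (flag names are lowercase; VNNI shows up as avx512_vnni on Xeon or avx_vnni on recent Core parts, and AMX as amx_tile/amx_int8/amx_bf16):

$ grep -o -w -e avx2 -e avx512f -e avx512_vnni -e avx_vnni -e amx_int8 /proc/cpuinfo | sort -u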
Supported Hardware and Models
Neural Speed is supported on Intel Xeon Scalable processors, the Intel Xeon CPU Max Series, and the widely adopted Intel Core processors.
Theoretically, almost all the models in Transformers' PyTorch format from Hugging Face are fully supported (such as Llama2, ChatGLM2, Baichuan2, Qwen, Mistral, Whisper, and so on).
From my personal perspective, the support for LLMs in the .gguf format is critical, as it is the most convenient redistributable single-file packaging for models.
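As a side note, GGUF files are easy to recognize programmatically: the format starts with a four-byte ASCII magic, so a quick check like the following (the file name is just a placeholder) can confirm that a download really is a GGUF file.

# Quick sanity check, independent of Neural Speed: a GGUF file starts
# with the ASCII magic "GGUF" in its first four bytes.
with open("model.q4_0.gguf", "rb") as f:   # placeholder file name
    print(f.read(4) == b"GGUF")            # True for a valid GGUF file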
Neural Speed Installation
Let us dig into the installation process of Neural Speed on a typical GNU/Linux Ubuntu x86-64 machine. It can also run on Microsoft Windows.
I always prefer to install experimental tools inside isolated Python virtual environments, using conda as the package manager.
First step: install GCC ≥ 10 to build the software from source.
$ sudo apt update
$ sudo apt install build-essential
Second step: create a virtual environment using Python 3.11.
$ conda create -n neural-speed python=3.11 -y
$ conda activate neural-speed
Third step: clone the repository and install the package.
(neural-speed) $ git clone https://github.com/intel/neural-speed.git
(neural-speed) $ cd neural-speed
(neural-speed) $ pip install -r requirements.txt
(neural-speed) $ pip install .
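Once the build completes, a quick way to verify that the package is importable (assuming the installed module is named neural_speed, which is what the repository's setup provides at the time of writing):

(neural-speed) $ python -c "import neural_speed; print(neural_speed.__file__)"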
Note: do not use the released version 1.0, as it contains a bug in the stdin routine that prevents the quantization step from working properly. In this example, we are using the version cloned directly from the official GitHub repository, up to date as of April 19th, 2024.
Getting Started with inference
The most interesting thing about Neural Speed is that it has been implemented with a llama.cpp-like command-line interface, so if you are familiar with llama.cpp you can keep using similar input flags.
Let's assume you would like to run inference on a 7B model using Neural Speed. For the following example I chose OpenLLaMA v2, developed by OpenLM Research, because it is freely available and not gated, so it does not require any special access to be downloaded.
A very convenient one-click script has been developed to facilitate the operations:
(neural-speed) $ python scripts/run.py openlm-research/open_llama_7b_v2 --weight_dtype int4 -p "The boy runs on the hill and see"
The run.py script executes four steps: 1) it downloads the model from Hugging Face, 2) it converts the PyTorch model contained in the folder "openlm-research_open_llama_7b_v2" into a ne_llama_f32.bin binary file, 3) it quantizes it to INT4 (4-bit integer weight-only quantization), and 4) it runs the inference on the prompt "The boy runs on the hill and see".
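If you want more control over each stage, the same pipeline can also be run step by step with the standalone scripts in the scripts/ folder. The commands below follow the repository's documentation at the time of writing; the exact flag names may change, so double-check them against your clone:

(neural-speed) $ python scripts/convert.py --outtype f32 --outfile ne_llama_f32.bin openlm-research/open_llama_7b_v2
(neural-speed) $ python scripts/quantize.py --model_name llama --model_file ne_llama_f32.bin --out_file ne_llama_q4.bin --weight_dtype int4
(neural-speed) $ python scripts/inference.py --model_name llama -m ne_llama_q4.bin -n 128 -p "The boy runs on the hill and see"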
If you prefer to use a local model that is already downloaded on your hard drive, you just need to pass the full model path on the command line:
(neural-speed) $ python scripts/run.py ./models/openlm-research_open_llama_7b_v2 --weight_dtype int4 -p "The boy runs on the hill and see"
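Besides the command-line scripts, the repository also exposes a Pythonic API. The sketch below follows the pattern shown in the project's README at the time of writing (a Model class with init() and generate()); treat the exact signatures as something to verify against your own clone:

from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

model_name = "openlm-research/open_llama_7b_v2"
prompt = "The boy runs on the hill and see"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# Download, convert and quantize the weights to INT4, then generate on CPU.
model = Model()
model.init(model_name, weight_dtype="int4", compute_dtype="int8")
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=128)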
Performance profiling
To better evaluate how the model is performing, you can obtain detailed timing information during the inference process. To do that, you need to recompile Neural Speed from source after setting the following environment variables:
(neural-speed) $ NS_PROFILING="ON"; export NS_PROFILING
(neural-speed) $ NEURAL_SPEED_VERBOSE="0"; export NEURAL_SPEED_VERBOSE
(neural-speed) $ pip install .
Then run the inference again.
The value of NEURAL_SPEED_VERBOSE controls how much detail is printed:
NEURAL_SPEED_VERBOSE=0 prints full verbose debug information: both the evaluation time and the operator profiling.
NEURAL_SPEED_VERBOSE=1 prints only the evaluation time, i.e. the time taken for each evaluation.
NEURAL_SPEED_VERBOSE=2 profiles individual operators, to identify performance bottlenecks within the model.
Note: the profiling output is meant for debugging purposes only; it is usually quite noisy and very verbose. Do not enable it unless you really need it.
* This content has been produced by human hands.
Conclusion
Neural Speed has proven to give us the opportunity to work seriously with LLMs on our own hardware, without spending huge amounts of money on cloud providers or buying expensive graphics cards or external tools. It works offline, it is open source, and you can collaborate with Intel's developers to improve the tool.
In the following articles, we will see how it works under the hood and how to do nice and more advanced things with it.
Article References and Bibliography
Shen et al. Efficient LLM Inference on CPUs. arXiv:2311.00502v2, 2023.