NeuralChat: deploy a local chatbot within minutes
Let's create your own LLM-driven AI apps with Neural Speed and the Intel Extension for Transformers framework.
After showcasing Neural Speed in my past articles, I would like to share a direct application of the theory: NeuralChat, a tool built with Neural Speed as its very first brick.
NeuralChat is presented as “A customizable framework to create your own LLM-driven AI apps within minutes”: it is available as part of the Intel® Extension for Transformers, a Transformer-based toolkit that makes it possible to accelerate Generative AI/LLM inference on both CPU and GPU.
The framework stands out for its support for a variety of hardware platforms: from basic Intel Core processors to the powerful Xeon Scalable processors, passing through the high-performance Gaudi AI processors, the Data Center GPU Max Series, and even NVIDIA GPUs.
NeuralChat is also compatible with an impressive assortment of third-party frameworks: it works as an extension for PyTorch and leverages popular domain libraries (such as Hugging Face and LangChain) with their respective extensions.
As we will see later in this article, NeuralChat supports the customized models built with the weight-only quantization method I presented in previous articles, plus alternative model customizations through parameter-efficient fine-tuning and conversion/quantization.
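To give you a taste of how that looks in practice, here is a minimal sketch of building a chatbot on top of a weight-only quantized model. It assumes your release of the toolkit still exposes WeightOnlyQuantConfig (newer releases may rename the quantization config classes), so treat it as an illustration rather than a copy-paste recipe:

# Minimal sketch: build a NeuralChat chatbot with weight-only quantization.
# Assumption: WeightOnlyQuantConfig exists in your intel-extension-for-transformers
# release; newer versions may expose differently named config classes.
from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig

config = PipelineConfig(
    optimization_config=WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
)
chatbot = build_chatbot(config)
print(chatbot.predict("What is weight-only quantization?"))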
Finally, NeuralChat offers a rich set of plugins to enhance the Retrieval-Augmented Generation (RAG) capabilities of LLMs; in particular, I would like to highlight that NeuralChat supports the Haystack v2.0 API through IntelLabs' fastRAG.
Installation
We will walk through the installation of the NeuralChat framework together, on a cheap and dated 7th-gen Core/Xeon machine (codename Kaby Lake, 14 nm, ~2017) running Ubuntu Server 22.04 LTS.
As preliminary steps, we need to install a few requirements at a system level:
~$ sudo apt-get update
~$ sudo apt-get install libsm6 libxext6 libgl1-mesa-glx libgl1-mesa-dev ffmpeg
Then, make sure you have a clean Python virtual environment to experiment with NeuralChat in. In this article I am going to use conda to create a separate virtual environment:
~$ conda create -n neuralchat python=3.11 pip
~$ conda activate neuralchat
Now, let's proceed with installing the Intel Extension for Transformers (as said, NeuralChat is part of it):
(neuralchat) ~$ pip install intel-extension-for-transformers
Note: at this point I prefer to install intel-extension-for-transformers with pip rather than conda, since the conda package raises several compatibility issues with python=3.11.
(neuralchat) ~$ pip install fastapi==0.103.2
(neuralchat) ~$ git clone https://github.com/intel/intel-extension-for-transformers
(neuralchat) ~$ cd intel-extension-for-transformers
(neuralchat) ~/intel-extension-for-transformers$ cd intel_extension_for_transformers
(neuralchat) ~/intel-extension-for-transformers/intel_extension_for_transformers$ cd neural_chat
If you look inside the neural_chat folder, you will notice several requirements*.txt files. We need to install the additional requirements for our specific scenario, as follows:
# For CPU-only device
pip install -r requirements_cpu.txt
# For HPU (Habana® Gaudi® AI Processor) device
pip install -r requirements_hpu.txt
# For XPU (Intel GPU with PyTorch support) device
pip install -r requirements_xpu.txt
# For Windows
pip install -r requirements_win.txt
# For NVIDIA CUDA-enabled device
pip install -r requirements.txt
So, in my case (CPU-only):
(neuralchat) ~/intel-extension-for-transformers/intel_extension_for_transformers/neural_chat$ pip install -r requirements_cpu.txt
Note: as you can see, neural-compressor and neural_speed are mandatory requirements for NeuralChat.
You should now be able to see 4 new binaries available:
neural_engine
neuralchat
neuralchat_client
neuralchat_server
Test the installation
You can check whether your installation was successful with this really simple command:
(neuralchat) ~/intel-extension-for-transformers$ neuralchat predict --query "Tell me more about Generative AI."
Check the additional options for neuralchat predict with:
(neuralchat) ~$ neuralchat predict -h
Getting Started: your first “Hello World!” chatbot
Let's test the NeuralChat installation by implementing a very basic chatbot query/response. To do that, you will need to fulfil some extra requirements in order to use the build_chatbot() method from the intel_extension_for_transformers.neural_chat module.
(neuralchat) ~/intel-extension-for-transformers$ pip install torch
(neuralchat) ~/intel-extension-for-transformers$ pip install accelerate
(neuralchat) ~/intel-extension-for-transformers$ pip install datasets
(neuralchat) ~/intel-extension-for-transformers$ pip install uvicorn
(neuralchat) ~/intel-extension-for-transformers$ pip install yacs
(neuralchat) ~/intel-extension-for-transformers$ pip install fschat
At this point, I suggest using good old i(nteractive)Python:
(neuralchat) ~/intel-extension-for-transformers$ pip install ipython
(neuralchat) ~/intel-extension-for-transformers$ ipython
[1] from intel_extension_for_transformers.neural_chat import build_chatbot
[2] chatbot = build_chatbot()
When the chatbot object is created, a fine-tuned (base model Mistral-7B-v0.1), pre-trained (on the Open-Orca/SlimOrca dataset), pre-converted and pre-quantized model is automatically downloaded from the Hugging Face repository Intel/neural-chat-7b-v3-1.
When finished, you should read something like: root:INFO:Model loaded.
Now, let's test your first prompt with:
[4] response = chatbot.predict("What are the top ten scifi movies of all times?")
[5] print("chatbot's response: ", response)
You should get a response from the model.
Note: to leave ipython just type quit().
Getting Started: let’s do a bit of RAG
OK, it's time to experiment with the RAG capability, available as a plugin for NeuralChat. The developers already ship a subdirectory within the code containing everything you need. As always, let's start from the requirements:
(neuralchat) ~/intel-extension-for-transformers$ cd intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval
(neuralchat) ~/intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval$ pip install -r requirements.txt
Now, let's download a PDF file in which we'll look for information to retrieve:
(neuralchat) ~/intel-extension-for-transformers$ mkdir text
(neuralchat) ~/intel-extension-for-transformers$ wget https://rauterberg.employee.id.tue.nl/lecturenotes/DDM110%20CAS/Orwell-1949%201984.pdf -O ./text/Orwell-1984.pdf
Then, let’s implement a simple solution step-by-step using ipython:
(neuralchat) ~/intel-extension-for-transformers$ ipython
[1] from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig, plugins
[2] model_path="Intel/neural-chat-7b-v3-1"
[3] plugins.retrieval.args["embedding_model"]="BAAI/bge-base-en-v1.5"
[4] plugins.retrieval.args["persist_directory"]="./output"
[5] plugins.retrieval.args["append"]=False
[6] plugins.retrieval.args["input_path"]="./text/"
[7] plugins.retrieval.enable=True
[8] config = PipelineConfig(model_name_or_path=model_path, plugins=plugins)
[9] chatbot = build_chatbot(config)
Output: root:INFO:Model loaded.
Then, you are ready to ask the model about information retrieved from the PDF:
[10] response = chatbot.predict("Who is the main character of 1984?")
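Print it just like before:
[11] print("chatbot's response: ", response)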
Did you get the right reply?
The RAG plugin APIs simplify the creation and use of models for chatbots; they are designed as an easy-to-use extension for LangChain users and, in general, as a very user-friendly deployment solution.
For file types like txt, html, markdown, doc(x), and pdf you don't need any particular predefined structure; instead, xls(x), csv, and json/jsonl must adhere to a specific structure (a small jsonl example follows this list):
xls(x): ['Questions', 'Answers'], ['question', 'answer', 'link'], or ['context', 'link']
csv: ['question', 'correct_answer']
json/jsonl: {'content': xxx, 'link': xxx}
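As an illustration, here is a tiny Python sketch (the file name and its content are made up) that writes a jsonl knowledge file matching the {'content': ..., 'link': ...} structure expected by the retrieval plugin:

import json

# Hypothetical records, each following the {'content': ..., 'link': ...} structure.
docs = [
    {"content": "1984 is a dystopian novel written by George Orwell and published in 1949.",
     "link": "https://en.wikipedia.org/wiki/Nineteen_Eighty-Four"},
]

with open("./text/orwell-notes.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

Pointing plugins.retrieval.args["input_path"] at the folder containing this file makes it part of the retrieval corpus, exactly like the PDF above.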
Customized Models
Intel has made available a set of models, initially fine-tuned from Mistral-7B and then refined iteratively from each previous version: for example, Intel/neural-chat-7b-v3-3 is a tuned version of Intel/neural-chat-7b-v3-1.
They are free, publicly available on Hugging Face, and grouped in a dedicated collection.
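Since the PipelineConfig we used in the RAG example accepts any Hugging Face model name or path, swapping one of these customized models in is straightforward; here is a short sketch (the model choice is just an example):

from intel_extension_for_transformers.neural_chat import build_chatbot, PipelineConfig

# Example only: pick any model from Intel's neural-chat collection on Hugging Face.
config = PipelineConfig(model_name_or_path="Intel/neural-chat-7b-v3-3")
chatbot = build_chatbot(config)
print(chatbot.predict("Summarize the plot of 1984 in two sentences."))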
Advanced Features
I could talk about NeuralChat for hours; there are tons of different things to cover, and just one article is not enough. So, for now, I'll leave you with a few additional tips to dig deeper into NeuralChat's advanced functionality.
NeuralChat Server
(neuralchat) ~$ neuralchat_server start -h
usage: neuralchat_server.start [-h] --config_file CONFIG_FILE [--log_file LOG_FILE]
Where:
--config_file CONFIG_FILE path to the YAML configuration file for the server
--log_file LOG_FILE path to the log file
The NeuralChat server can be deployed using a configuration file in .yaml format; you can customize its behaviour by modifying the values of its fields: server host, port, model name or path, tokenizer, PEFT model, cache, etc.
A complete description is available in intel_extension_for_transformers/neural_chat/server/README.md.
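Just to give you an idea, here is a minimal sketch of such a YAML file; the field names reflect the templates shipped in the repository at the time of writing, so double-check them against the README above:

# neuralchat_server.yaml (illustrative sketch; field names may vary between releases)
host: 0.0.0.0
port: 8000
model_name_or_path: "Intel/neural-chat-7b-v3-1"
device: "cpu"
tasks_list: ['textchat']

You would then start the server with neuralchat_server start --config_file neuralchat_server.yaml.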
NeuralChat Client
(neuralchat) ~$ neuralchat_client help
The available options are:
textchat neuralchat_client text chat command
voicechat neuralchat_client voice chat command
finetune neuralchat_client finetuning command
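Besides the CLI client, once a server is up you can also talk to it over plain HTTP from your own code. The sketch below assumes a server started locally on port 8000 with the textchat task enabled, and that your release exposes the OpenAI-compatible /v1/chat/completions route (check the server README for the exact endpoints):

import requests

# Assumption: NeuralChat server running on localhost:8000 with the textchat task enabled.
payload = {
    "model": "Intel/neural-chat-7b-v3-1",
    "messages": [{"role": "user", "content": "Tell me more about Generative AI."}],
}
resp = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload, timeout=120)
print(resp.json())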
NeuralChat General Binary
(neuralchat) ~$ neuralchat help
The available options are:
predict neuralchat text/voice chat command
usage: neuralchat.predict [-h] [--query QUERY] [--model_name_or_path MODEL_NAME_OR_PATH] [--output_audio_path OUTPUT_AUDIO_PATH] [--device DEVICE]
--query QUERY Prompt text or audio file
--model_name_or_path MODEL_NAME_OR_PATH Model name or path
--output_audio_path OUTPUT_AUDIO_PATH Audio output path if the prompt is audio file
--device DEVICE Specify chat on which device
finetune neuralchat finetuning command
usage: neuralchat.finetune [-h] [--base_model BASE_MODEL] [--device DEVICE] [--train_file TRAIN_FILE] [--max_steps MAX_STEPS]
--base_model BASE_MODEL Base model path or name for finetuning
--device DEVICE Specify finetune model on which device
--train_file TRAIN_FILE Specify train file path
--max_steps MAX_STEPS Specify max steps of finetuning
Additional Useful Resources
Advanced Features Documentation
Jupyter Notebooks (.ipynb) to start experimenting
Remember the three steps
Install intel_extension_for_transformers requirements for your machine;
install intel_extension_for_transformers;
install neural_chat specific requirements.
References
Official Intel repository for Extension for Transformers on GitHub
* This content has been produced by human hands.
License
This work is licensed under Creative Commons Attribution 4.0 International