Local Function Calling With Mistral 7B and vLLM
Setup
My server specs:
- CPU: AMD Ryzen 9 5900X 12-Core Processor
- RAM: 32GB
- GPU: RTX 3090 24GB
I recommend using Ubuntu 22.04 to simplify setting up the Nvidia drivers and CUDA. Also be sure to install the Nvidia Container Toolkit so Docker containers can access the GPU.
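To confirm the toolkit is wired up correctly, you can run a throwaway container and check that the GPU is visible (this mirrors the verification step in Nvidia's docs):

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

If nvidia-smi prints the RTX 3090, Docker can see the GPU and vLLM will be able to use it.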
I am using the full-precision bfloat16
version of Mistral-7B-Instruct-v0.3
instead of a quantized model because I want to optimize for throughput over latency.
See this article from Neural Magic for more info on quantization trade-offs.
Here is how to run the server:
sudo docker run \
  --runtime nvidia \
  --gpus all \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  -v $HOME/chat-templates/tool_chat_template_mistral.jinja:/root/tool_chat_template_mistral.jinja \
  -p 8000:8000 \
  --ipc=host \
  -it vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --served-model-name mistral-7B \
  --gpu-memory-utilization 0.95 \
  --tool-call-parser mistral \
  --chat-template /root/tool_chat_template_mistral.jinja \
  --enable-auto-tool-choice
I had to download the correct chat template for function calling and mount it into the Docker container (that is what the second -v flag above does).
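If you are setting this up yourself, the template can be fetched from the vLLM repository; at the time of writing it lives under examples/, but adjust the URL and destination path if yours differ:

mkdir -p $HOME/chat-templates
curl -o $HOME/chat-templates/tool_chat_template_mistral.jinja \
  https://raw.githubusercontent.com/vllm-project/vllm/main/examples/tool_chat_template_mistral.jinja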
Ignore this warning:
FutureWarning: It is strongly recommended to run mistral models with `--tokenizer_mode "mistral"` to ensure correct encoding and decoding.
Function calling will not work if you pass that flag to vLLM, because the mistral tokenizer mode does not allow overriding the chat template.
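Before testing function calling, it is worth confirming the server is up and serving the model under the expected name. The OpenAI-compatible API exposes a models endpoint for this (run on the server itself, or substitute its IP for localhost):

curl http://localhost:8000/v1/models

The response should list mistral-7B, the name passed via --served-model-name.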
Here is a quick script to check if function calling is working:
import argparse

import instructor
import requests
from openai import OpenAI
from openai.types.chat import ChatCompletionMessageParam
from pydantic import BaseModel, Field

# Model must match vllm --served-model-name
MODEL = "mistral-7B"

# set a default seed and temperature for more determinism
SEED = 32**4
TEMPERATURE = 0

# HOST is the IP of the vLLM server
HOST = "192.168.2.9"

client = instructor.from_openai(
    OpenAI(base_url=f"http://{HOST}:8000/v1", api_key="mistral"),
    mode=instructor.Mode.TOOLS,
)

# make pyright/mypy happy
Messages = list[ChatCompletionMessageParam]


class WeatherForecast(BaseModel):
    city: str
    state: str = Field(
        description="Either the state abbreviation if US or ISO 3166-1 alpha-2 country code"
    )

    def execute(self) -> str:
        # wttr.in returns a plain-text forecast for the given location
        url = f"https://wttr.in/{self.city},{self.state}"
        r = requests.get(url)
        return r.text


def main():
    parser = argparse.ArgumentParser(description="Mistral weather bot")
    parser.add_argument("prompt", type=str, help="Your prompt")
    args = parser.parse_args()

    messages: Messages = [
        {
            "role": "user",
            "content": args.prompt,
        }
    ]

    forecast, completion = client.chat.completions.create_with_completion(
        model=MODEL,
        seed=SEED,
        temperature=TEMPERATURE,
        response_model=WeatherForecast,
        tool_choice={"function": {"name": "auto"}},
        messages=messages,
    )

    # mistral tool call ids must be exactly 9 characters
    openai_tool_call_id = completion.choices[0].message.tool_calls[0].id
    mistral_tool_call_id = openai_tool_call_id[-9:]

    messages.append(
        {
            "role": "tool",
            "name": "WeatherForecast",
            "content": forecast.execute(),
            "tool_call_id": mistral_tool_call_id,
        }
    )

    answer = client.chat.completions.create(
        model=MODEL, response_model=str, messages=messages
    )
    print(f"\n> {answer}")


if __name__ == "__main__":
    main()
Now we can run this like:
$ python mistral-tools.py "I'm flying to new york tomorrow, How should I pack?"
> To pack for your trip to New York, consider the following: Since the weather forecast shows overcast and cloudy conditions with temperatures ranging from 62°F to 68°F, you should pack layers, including a light jacket or sweater. Rain is expected on Wednesday, so it would be a good idea to bring an umbrella and waterproof shoes. The wind is expected to be around 8-14 mph, so remember to pack a hat or headwear to protect against the wind. Finally, be prepared for some rain showers on Thursday with temperatures ranging from 73°F to 78°F. Have a great trip!