Llama Chat Completions API - vLLM engine
Use vLLM’s OpenAI-compatible Chat Completions endpoint to call HexGrid-hosted Llama 3.1 and Llama 3.3 models with a messages-based interface.
This page provides copy-pasteable cURL-only examples for standard chat, reasoning-style prompts, streaming, tool calling, tool-result continuation, JSON output, and common generation parameters.
Endpoint
POST http://<server-ip>:<port>/v1/chat/completions
Set these environment variables before running the examples:
export HEXGRID_API_KEY="your-hexgrid-api-key"
export LLAMA_BASE_URL="http://<server-ip>:<port>/v1"
export LLAMA_MODEL="meta-llama/Llama-3.3-70B-Instruct"
Set LLAMA_MODEL to the ID of any HexGrid-hosted Llama model, such as:
meta-llama/Llama-3.3-70B-Instruct
meta-llama/Meta-Llama-3.1-8B-Instruct
meta-llama/Meta-Llama-3.1-70B-Instruct
meta-llama/Meta-Llama-3.1-405B-Instruct
Use the exact model ID configured in your HexGrid deployment.
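Because every example on this page depends on the three variables above, it can help to check them before sending a request. The sketch below is illustrative only: require_env and list_models are hypothetical helper names, and list_models assumes your deployment exposes vLLM's model listing route at /v1/models, which you can use to confirm the exact model ID.

```shell
# Hypothetical helper (not part of vLLM or HexGrid): return non-zero if any
# named environment variable is unset or empty.
require_env() {
  for name in "$@"; do
    # Indirect lookup via eval keeps this portable to plain POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}

# Assumption: the deployment exposes vLLM's model listing route at /v1/models.
list_models() {
  curl -s -H "Authorization: Bearer $HEXGRID_API_KEY" "$LLAMA_BASE_URL/models"
}
```

A typical call would be: require_env HEXGRID_API_KEY LLAMA_BASE_URL LLAMA_MODEL && list_models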
Create a chat completion
Generate a standard, non-streaming response from a Llama 3.1 or Llama 3.3 model served by vLLM.
Required attributes
- model (string): The served Llama model name, for example meta-llama/Llama-3.3-70B-Instruct.
- messages (array): The conversation so far. Each message has a role and content. Common roles are system, user, assistant, and tool.
Optional attributes used here
- temperature (number): Sampling temperature. Higher values produce more varied output; lower values produce more deterministic output.
- top_p (number): Nucleus sampling probability threshold.
- max_tokens (integer): Maximum number of tokens to generate.
Request
curl -X POST "$LLAMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$LLAMA_MODEL"'",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Give me a one-sentence explanation of what Llama is."
}
],
"temperature": 0.7,
"top_p": 0.9,
"max_tokens": 256
}'
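The request above inserts $LLAMA_MODEL by closing the single-quoted JSON string, adding the double-quoted variable, and reopening the single quote. As an alternative, a heredoc lets the shell expand variables inside the JSON directly. This is a minimal sketch: build_chat_payload is a hypothetical helper, and it assumes the model ID and prompt contain no double quotes or backslashes.

```shell
# Hypothetical helper: print a chat completion request body.
# $1 = model ID, $2 = user prompt (assumed free of quotes and backslashes).
build_chat_payload() {
  cat <<EOF
{
  "model": "$1",
  "messages": [{"role": "user", "content": "$2"}],
  "max_tokens": 256
}
EOF
}
```

The output can be piped to curl by passing --data-binary @- in place of the inline -d argument, which makes curl read the request body from standard input.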
Response shape
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1760000000,
"model": "meta-llama/Llama-3.3-70B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Llama is Meta’s family of open large language models designed for tasks such as chat, reasoning, coding, multilingual generation, and tool-using applications."
},
"finish_reason": "stop",
"logprobs": null
}
],
"usage": {
"prompt_tokens": 31,
"completion_tokens": 29,
"total_tokens": 60
}
}
Reasoning-style chat completion
Unlike reasoning models such as Qwen or Gemma, Llama 3.1 and Llama 3.3 are not served by vLLM with a dedicated, parsed reasoning field.
Use a normal Chat Completions request instead: instruct the model to reason carefully, then return a concise final answer.
Required attributes
- model (string): The served Llama model name.
- messages (array): The user conversation.
Optional attributes used here
- temperature (number): Use a lower value for more deterministic reasoning-style answers.
- max_tokens (integer): Increase this value for harder reasoning tasks.
Request
curl -X POST "$LLAMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$LLAMA_MODEL"'",
"messages": [
{
"role": "system",
"content": "You are a careful reasoning assistant. Solve the problem step by step internally, then provide a concise final answer."
},
{
"role": "user",
"content": "Which is greater, 9.11 or 9.8? Explain briefly."
}
],
"temperature": 0.3,
"top_p": 0.9,
"max_tokens": 512
}'
Response shape
{
"id": "chatcmpl-reasoning-style123",
"object": "chat.completion",
"created": 1760000001,
"model": "meta-llama/Llama-3.3-70B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "9.8 is greater than 9.11. Both numbers start with 9, but 9.8 has 8 in the tenths place while 9.11 has 1 in the tenths place."
},
"finish_reason": "stop",
"logprobs": null
}
],
"usage": {
"prompt_tokens": 47,
"completion_tokens": 42,
"total_tokens": 89
}
}
Streaming
Stream tokens as Server-Sent Events instead of waiting for the complete response.
Streaming attributes
- stream (boolean): Set to true to return incremental chunks.
- stream_options (object): Optional streaming configuration. { "include_usage": true } requests usage in the final stream chunk.
- max_tokens (integer): Maximum number of tokens to generate.
Request
curl -N -X POST "$LLAMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$LLAMA_MODEL"'",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Explain Llama 3.3 in three short bullet points."
}
],
"stream": true,
"stream_options": {
"include_usage": true
},
"temperature": 0.7,
"top_p": 0.9,
"max_tokens": 256
}'
Sample streamed response shape
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{"content":"- Llama 3.3 is a 70B instruction-tuned text model from Meta."},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{"content":"\n- It is optimized for multilingual dialogue and general assistant use cases."},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{"content":"\n- It can be served through vLLM using an OpenAI-compatible API."},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":null}
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[],"usage":{"prompt_tokens":29,"completion_tokens":44,"total_tokens":73}}
data: [DONE]
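Each streamed event is a line that starts with the data: prefix, and the stream ends with the [DONE] sentinel. The sketch below shows one way to reduce such a stream to one JSON chunk per line. sse_payloads is a hypothetical helper, and it assumes events arrive as plain newline-terminated lines, as in the sample above.

```shell
# Hypothetical helper: read an SSE stream on stdin, print one JSON chunk per line.
sse_payloads() {
  while IFS= read -r line; do
    case "$line" in
      "data: [DONE]") break ;;                      # end-of-stream sentinel
      "data: "*) printf '%s\n' "${line#data: }" ;;  # strip the SSE field prefix
      *) ;;                                         # skip blank separator lines
    esac
  done
}
```

To use it, pipe the streaming curl command into sse_payloads; each output line is then a complete chat.completion.chunk object that a JSON tool can parse.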
Tool calling
Provide function tools that the model may call. The model can return tool_calls instead of a final text answer.
HexGrid-hosted Llama models have tool calling enabled by default.
Tool attributes
- tools (array): Tool definitions. Use type: "function" with a JSON schema for parameters.
- tool_choice (string | object): Controls tool use. Use "auto" to let the model decide, "none" to disable tool calls, "required" to force at least one tool call, or an object to force a named function.
- parallel_tool_calls (boolean): Set to false if you want at most one tool call in a single response.
Request
curl -X POST "$LLAMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$LLAMA_MODEL"'",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant. Use tools when needed."
},
{
"role": "user",
"content": "What is the weather in Tokyo today?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather for a city or district.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City or district, such as Tokyo, San Francisco, or Bengaluru."
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit."
}
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto",
"parallel_tool_calls": false,
"temperature": 0.2,
"max_tokens": 512
}'
Response shape
{
"id": "chatcmpl-tool123",
"object": "chat.completion",
"created": 1760000003,
"model": "meta-llama/Llama-3.3-70B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "",
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_current_weather",
"arguments": "{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"
}
}
]
},
"finish_reason": "tool_calls",
"logprobs": null
}
],
"usage": {
"prompt_tokens": 118,
"completion_tokens": 24,
"total_tokens": 142
}
}
Required tool calling
Force the model to call at least one tool by setting tool_choice to "required".
Tool attributes
- tool_choice (string): Set to "required" to force at least one tool call.
- tools (array): Tool definitions available to the model.
Request
curl -X POST "$LLAMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$LLAMA_MODEL"'",
"messages": [
{
"role": "user",
"content": "Find the current weather for Tokyo."
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather for a city or district.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City or district."
}
},
"required": ["location"]
}
}
}
],
"tool_choice": "required",
"parallel_tool_calls": false,
"temperature": 0.2,
"max_tokens": 512
}'
Response shape
{
"id": "chatcmpl-required-tool123",
"object": "chat.completion",
"created": 1760000004,
"model": "meta-llama/Llama-3.3-70B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "",
"tool_calls": [
{
"id": "call_weather_001",
"type": "function",
"function": {
"name": "get_current_weather",
"arguments": "{\"location\":\"Tokyo\"}"
}
}
]
},
"finish_reason": "tool_calls"
}
],
"usage": {
"prompt_tokens": 92,
"completion_tokens": 20,
"total_tokens": 112
}
}
Named tool calling
Force a specific tool by passing an object to tool_choice.
Tool choice object
- tool_choice (object): Use { "type": "function", "function": { "name": "..." } } to force a specific tool.
Request
curl -X POST "$LLAMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$LLAMA_MODEL"'",
"messages": [
{
"role": "user",
"content": "Use the weather tool to check Tokyo."
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather for a city or district.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}
}
],
"tool_choice": {
"type": "function",
"function": {
"name": "get_current_weather"
}
},
"parallel_tool_calls": false,
"temperature": 0.2,
"max_tokens": 512
}'
Response shape
{
"id": "chatcmpl-named-tool123",
"object": "chat.completion",
"created": 1760000005,
"model": "meta-llama/Llama-3.3-70B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "",
"tool_calls": [
{
"id": "call_weather_002",
"type": "function",
"function": {
"name": "get_current_weather",
"arguments": "{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"
}
}
]
},
"finish_reason": "tool_calls"
}
],
"usage": {
"prompt_tokens": 105,
"completion_tokens": 22,
"total_tokens": 127
}
}
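When the forced tool name comes from a variable, the tool_choice object used in this section can be generated rather than typed by hand. A minimal sketch, where named_tool_choice is a hypothetical helper and the tool name is assumed to contain no double quotes or backslashes:

```shell
# Hypothetical helper: print the tool_choice object that forces one named tool.
# $1 = function name, which must match a name declared in the tools array.
named_tool_choice() {
  printf '{"type":"function","function":{"name":"%s"}}' "$1"
}
```

For example, named_tool_choice get_current_weather prints the same object that the request above passes as tool_choice.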
Tool result
After your application executes the selected tool, send the tool result back so Llama can produce a final answer.
Tool result message
- role (string): Use tool for OpenAI-compatible tool result messages.
- tool_call_id (string): The id returned by the assistant message’s tool_calls item.
- content (string): The tool result. If the result is structured data, serialize it as a JSON string.
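Serializing structured data as a JSON string means escaping its backslashes and double quotes before placing it in content. The sketch below shows one way to do that in the shell. json_escape and tool_message are hypothetical helpers, and the escaping assumes the tool result is compact, single-line JSON with no control characters.

```shell
# Hypothetical helper: escape a value for embedding inside a JSON string.
# Handles backslashes and double quotes only (assumes single-line input).
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: print a tool result message.
# $1 = tool_call_id from the assistant's tool_calls item, $2 = raw tool result.
tool_message() {
  printf '{"role":"tool","tool_call_id":"%s","content":"%s"}' \
    "$1" "$(json_escape "$2")"
}
```

For example, tool_message call_abc123 '{"temperature":22}' prints a message whose content field holds the escaped string {\"temperature\":22}.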
Request
curl -X POST "$LLAMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$LLAMA_MODEL"'",
"messages": [
{
"role": "user",
"content": "What is the weather in Tokyo today?"
},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_current_weather",
"arguments": "{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"
}
}
]
},
{
"role": "tool",
"tool_call_id": "call_abc123",
"content": "{\"location\":\"Tokyo\",\"temperature\":22,\"condition\":\"Partly cloudy\",\"unit\":\"celsius\"}"
}
],
"max_tokens": 1024
}'
Response shape
{
"id": "chatcmpl-tool-result123",
"object": "chat.completion",
"created": 1760000006,
"model": "meta-llama/Llama-3.3-70B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The weather in Tokyo today is partly cloudy, with a temperature of 22°C."
},
"finish_reason": "stop",
"logprobs": null
}
],
"usage": {
"prompt_tokens": 156,
"completion_tokens": 20,
"total_tokens": 176
}
}
JSON mode
Request valid JSON output using vLLM’s OpenAI-compatible response_format.
JSON attributes
- response_format (object): Set to { "type": "json_object" } to request JSON object output.
- messages (array): Include an explicit instruction to return JSON in the system or user message.
Request
curl -X POST "$LLAMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$LLAMA_MODEL"'",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant. Return only valid JSON."
},
{
"role": "user",
"content": "Create a JSON object with three short product tagline ideas for a note-taking app."
}
],
"response_format": {
"type": "json_object"
},
"temperature": 0.7,
"top_p": 0.9,
"max_tokens": 512
}'
Response shape
{
"id": "chatcmpl-json123",
"object": "chat.completion",
"created": 1760000007,
"model": "meta-llama/Llama-3.3-70B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "{\"taglines\":[\"Capture ideas before they disappear.\",\"Your thoughts, organized instantly.\",\"Notes that keep up with you.\"]}"
},
"finish_reason": "stop",
"logprobs": null
}
],
"usage": {
"prompt_tokens": 48,
"completion_tokens": 36,
"total_tokens": 84
}
}
JSON schema
Request output that follows a JSON schema using vLLM’s response_format with type: "json_schema".
JSON schema attributes
- response_format (object): Output-format constraint. Set type to "json_schema" and supply a json_schema object.
- json_schema (object): Nested inside response_format. Contains a name and the schema that the response should follow.
Request
curl -X POST "$LLAMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$LLAMA_MODEL"'",
"messages": [
{
"role": "system",
"content": "Return only JSON that matches the provided schema."
},
{
"role": "user",
"content": "Create three short product tagline ideas for a note-taking app."
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "tagline_response",
"schema": {
"type": "object",
"properties": {
"taglines": {
"type": "array",
"items": {
"type": "string"
},
"minItems": 3,
"maxItems": 3
}
},
"required": ["taglines"],
"additionalProperties": false
}
}
},
"temperature": 0.7,
"top_p": 0.9,
"max_tokens": 512
}'
Response shape
{
"id": "chatcmpl-schema123",
"object": "chat.completion",
"created": 1760000008,
"model": "meta-llama/Llama-3.3-70B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "{\"taglines\":[\"Capture ideas before they disappear.\",\"Your thoughts, organized instantly.\",\"Notes that keep up with you.\"]}"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 86,
"completion_tokens": 36,
"total_tokens": 122
}
}
Multiple vLLM parameters
Use this example when you need a broader set of vLLM generation controls.
For raw HTTP/cURL, vLLM-specific parameters can be merged directly into the JSON request body.
Parameters shown here
- temperature (number): Sampling temperature.
- top_p (number): Nucleus sampling probability threshold.
- top_k (integer): vLLM-specific top-k sampling parameter. Limits sampling to the k most likely tokens.
- min_p (number): vLLM-specific minimum-probability sampling parameter. Drops tokens whose probability, relative to the most likely token, falls below this value; 0.0 disables it.
- repetition_penalty (number): vLLM-specific repetition penalty. Values above 1 discourage repeated tokens.
- presence_penalty (number): Penalizes tokens based on whether they have already appeared.
- frequency_penalty (number): Penalizes tokens based on how frequently they have already appeared.
- seed (integer): Random seed for best-effort reproducible sampling.
- n (integer): Number of candidate responses to generate.
- logprobs (boolean): Whether to return token log probabilities, if supported by the serving configuration.
- top_logprobs (integer): Number of top candidate tokens to return at each position when logprobs is enabled.
Request
curl -X POST "$LLAMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$LLAMA_MODEL"'",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant. Keep the answer concise."
},
{
"role": "user",
"content": "Give me five naming ideas for an AI inference platform."
}
],
"temperature": 0.7,
"top_p": 0.9,
"top_k": 40,
"min_p": 0.0,
"repetition_penalty": 1.05,
"presence_penalty": 0.2,
"frequency_penalty": 0.2,
"max_tokens": 512,
"seed": 1234,
"n": 1,
"logprobs": true,
"top_logprobs": 2,
"stream": false
}'
Response shape
{
"id": "chatcmpl-params123",
"object": "chat.completion",
"created": 1760000009,
"model": "meta-llama/Llama-3.3-70B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1. HexGrid Inference\n2. ModelForge\n3. TensorRoute\n4. InferaCloud\n5. LatticeAI"
},
"finish_reason": "stop",
"logprobs": {
"content": [
{
"token": "1",
"logprob": -0.02,
"bytes": [49],
"top_logprobs": [
{
"token": "1",
"logprob": -0.02,
"bytes": [49]
},
{
"token": "-",
"logprob": -4.1,
"bytes": [45]
}
]
}
]
}
}
],
"usage": {
"prompt_tokens": 38,
"completion_tokens": 34,
"total_tokens": 72
}
}
Official sources
- vLLM OpenAI-compatible server docs: https://docs.vllm.ai/en/latest/serving/openai_compatible_server/
- vLLM tool calling docs: https://docs.vllm.ai/en/latest/features/tool_calling/
- vLLM Llama tool parser docs: https://docs.vllm.ai/en/latest/api/vllm/tool_parsers/llama_tool_parser/
- vLLM Llama 3.1 recipe: https://docs.vllm.ai/projects/recipes/en/latest/Llama/Llama3.1.html
- Meta Llama 3.1 model card and prompt formats: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/
- Meta Llama 3.3 model card and prompt formats: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/
- Meta Llama 3.3 Hugging Face model page with vLLM cURL example: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct