Llama Chat Completions API - vLLM engine

Use vLLM’s OpenAI-compatible Chat Completions endpoint to call HexGrid-hosted Llama 3.1 and Llama 3.3 models with a messages-based interface.

This page provides copy-pasteable cURL-only examples for standard chat, reasoning-style prompts, streaming, tool calling, tool-result continuation, JSON output, and common generation parameters.

Endpoint

POST http://<server-ip>:<port>/v1/chat/completions

Set these environment variables before running the examples:

export HEXGRID_API_KEY="your-hexgrid-api-key"
export LLAMA_BASE_URL="http://<server-ip>:<port>/v1"
export LLAMA_MODEL="meta-llama/Llama-3.3-70B-Instruct"

Set LLAMA_MODEL to any HexGrid-hosted Llama model, such as:

meta-llama/Llama-3.3-70B-Instruct
meta-llama/Meta-Llama-3.1-8B-Instruct
meta-llama/Meta-Llama-3.1-70B-Instruct
meta-llama/Meta-Llama-3.1-405B-Instruct

Use the exact model ID configured in your HexGrid deployment.
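
Before sending requests, it can help to confirm that all three variables are actually set. The following is a minimal sketch and not part of HexGrid or vLLM: the function name check_llama_env is hypothetical, and the eval-based lookup assumes a POSIX-compatible shell.

```shell
# Sketch: report any of the three variables used on this page that are unset or empty.
# check_llama_env is a hypothetical helper name, not a HexGrid or vLLM command.
check_llama_env() {
  missing=0
  for name in HEXGRID_API_KEY LLAMA_BASE_URL LLAMA_MODEL; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  # Exit status 0 when everything is set, 1 otherwise.
  return "$missing"
}
```

Running check_llama_env && echo ready prints ready only when every variable is non-empty.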

POST /v1/chat/completions

Create a chat completion

Generate a standard, non-streaming response from a Llama 3.1 or Llama 3.3 model served by vLLM.

Required attributes

model (string): The served Llama model name, for example meta-llama/Llama-3.3-70B-Instruct.
messages (array): The conversation so far. Each message has a role and content. Common roles are system, user, assistant, and tool.

Optional attributes used here

temperature (number): Sampling temperature. Higher values produce more varied output; lower values produce more deterministic output.
top_p (number): Nucleus sampling probability threshold.
max_tokens (integer): Maximum number of tokens to generate.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Give me a one-sentence explanation of what Llama is."
      }
    ],
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 256
  }'
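
The "'"$LLAMA_MODEL"'" sequence in the request above closes the single-quoted string, inserts the double-quoted variable, and reopens the single quote. As an alternative, the body can be generated with a here-document, which expands variables directly. The sketch below assumes a POSIX shell. build_chat_body is a hypothetical helper, and it assumes the prompt contains no double quotes, backslashes, or newlines, because it does no JSON escaping.

```shell
# Sketch: build the request body with a here-document instead of quote splicing.
# Assumes LLAMA_MODEL is set and the prompt needs no JSON escaping.
build_chat_body() {
  prompt=$1
  cat <<EOF
{
  "model": "${LLAMA_MODEL}",
  "messages": [
    {"role": "user", "content": "${prompt}"}
  ],
  "max_tokens": 256
}
EOF
}

# Usage (not executed here); "-d @-" tells curl to read the body from stdin:
#   build_chat_body "Say hello." | curl -X POST "$LLAMA_BASE_URL/chat/completions" \
#     -H "Authorization: Bearer $HEXGRID_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d @-
```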

Response shape

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1760000000,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Llama is Meta’s family of open large language models designed for tasks such as chat, reasoning, coding, multilingual generation, and tool-using applications."
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 31,
    "completion_tokens": 29,
    "total_tokens": 60
  }
}

POST /v1/chat/completions

Reasoning-style chat completion

vLLM does not return a dedicated, parsed reasoning field for Llama 3.1 and Llama 3.3, unlike reasoning models such as Qwen or Gemma.

Use normal Chat Completions and instruct the model to reason carefully, then return a concise final answer.

Required attributes

model (string): The served Llama model name.
messages (array): The conversation so far.

Optional attributes used here

temperature (number): Use a lower value for more deterministic reasoning-style answers.
top_p (number): Nucleus sampling probability threshold.
max_tokens (integer): Increase this value for harder reasoning tasks.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a careful reasoning assistant. Solve the problem step by step internally, then provide a concise final answer."
      },
      {
        "role": "user",
        "content": "Which is greater, 9.11 or 9.8? Explain briefly."
      }
    ],
    "temperature": 0.3,
    "top_p": 0.9,
    "max_tokens": 512
  }'

Response shape

{
  "id": "chatcmpl-reasoning-style123",
  "object": "chat.completion",
  "created": 1760000001,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "9.8 is greater than 9.11. Both numbers start with 9, but 9.8 has 8 in the tenths place while 9.11 has 1 in the tenths place."
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 47,
    "completion_tokens": 42,
    "total_tokens": 89
  }
}

POST /v1/chat/completions

Streaming

Stream tokens as Server-Sent Events instead of waiting for the complete response.

Streaming attributes

stream (boolean): Set to true to return incremental chunks.
stream_options (object): Optional streaming configuration. { "include_usage": true } requests usage in the final stream chunk.
max_tokens (integer): Maximum number of tokens to generate.

Request

curl -N -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Explain Llama 3.3 in three short bullet points."
      }
    ],
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 256
  }'

Sample streamed response shape

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{"content":"- Llama 3.3 is a 70B instruction-tuned text model from Meta."},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{"content":"\n- It is optimized for multilingual dialogue and general assistant use cases."},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{"content":"\n- It can be served through vLLM using an OpenAI-compatible API."},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":null}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[],"usage":{"prompt_tokens":29,"completion_tokens":44,"total_tokens":73}}

data: [DONE]
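
Each event in the stream above is a line that starts with data: followed by a JSON chunk, and the stream ends with data: [DONE]. The following is a minimal sketch of stripping that framing in the shell, so that each chunk can be handed to a JSON tool such as jq. sse_to_json_lines is a hypothetical helper name, and the sketch assumes every event fits on a single data: line, as in the sample.

```shell
# Sketch: turn an SSE stream on stdin into one JSON chunk per output line.
# Blank separator lines are dropped; reading stops at the [DONE] sentinel.
sse_to_json_lines() {
  # Remove carriage returns in case the server uses CRLF line endings.
  tr -d '\r' | while IFS= read -r line; do
    case $line in
      "data: [DONE]") break ;;
      "data: "*) printf '%s\n' "${line#data: }" ;;
    esac
  done
}

# Usage (not executed here): append "| sse_to_json_lines" to the streaming
# curl command shown above.
```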

POST /v1/chat/completions

Tool calling

Provide function tools that the model may call. The model can return tool_calls instead of a final text answer.

HexGrid-hosted Llama models have tool calling enabled by default.

Tool attributes

tools (array): Tool definitions. Use type: "function" with a JSON schema for parameters.
tool_choice (string | object): Controls tool use. Use "auto" to let the model decide, "none" to disable tool calls, "required" to force at least one tool call, or an object to force a named function.
parallel_tool_calls (boolean): Set to false if you want at most one tool call in a single response.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant. Use tools when needed."
      },
      {
        "role": "user",
        "content": "What is the weather in Tokyo today?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather for a city or district.",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City or district, such as Tokyo, San Francisco, or Bengaluru."
              },
              "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit."
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "auto",
    "parallel_tool_calls": false,
    "temperature": 0.2,
    "max_tokens": 512
  }'

Response shape

{
  "id": "chatcmpl-tool123",
  "object": "chat.completion",
  "created": 1760000003,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 118,
    "completion_tokens": 24,
    "total_tokens": 142
  }
}
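
When finish_reason is "tool_calls", the application is expected to run the named function itself and send the result back in a follow-up request. The sketch below shows one way to route a tool name to a local shell function. Both dispatch_tool_call and the canned get_current_weather implementation are hypothetical, and the sketch assumes the location and unit have already been extracted from the function.arguments JSON string (for example with jq).

```shell
# Hypothetical local implementation of the get_current_weather tool.
# $1 = location, $2 = unit (defaults to celsius). Returns canned data, not a real forecast.
get_current_weather() {
  printf '{"location":"%s","temperature":22,"condition":"Partly cloudy","unit":"%s"}\n' \
    "$1" "${2:-celsius}"
}

# Route a tool name from the model's tool_calls entry to the matching function.
# $1 = function name, remaining arguments = already-parsed tool arguments.
dispatch_tool_call() {
  name=$1
  shift
  case $name in
    get_current_weather) get_current_weather "$@" ;;
    *)
      # Unknown tools return an error payload and a non-zero status.
      printf '{"error":"unknown tool: %s"}\n' "$name"
      return 1
      ;;
  esac
}
```

Rejecting unknown names explicitly keeps the application from running anything the model invents.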

POST /v1/chat/completions

Required tool calling

Force the model to call at least one tool by setting tool_choice to "required".

Tool attributes

tool_choice (string): Set to "required" to force at least one tool call.
tools (array): Tool definitions available to the model.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "user",
        "content": "Find the current weather for Tokyo."
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather for a city or district.",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City or district."
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "required",
    "parallel_tool_calls": false,
    "temperature": 0.2,
    "max_tokens": 512
  }'

Response shape

{
  "id": "chatcmpl-required-tool123",
  "object": "chat.completion",
  "created": 1760000004,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_weather_001",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location\":\"Tokyo\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 92,
    "completion_tokens": 20,
    "total_tokens": 112
  }
}

POST /v1/chat/completions

Named tool calling

Force a specific tool by passing an object to tool_choice.

Tool choice object

tool_choice (object): Use { "type": "function", "function": { "name": "..." } } to force a specific tool.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "user",
        "content": "Use the weather tool to check Tokyo."
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather for a city or district.",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string"
              },
              "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"]
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": {
      "type": "function",
      "function": {
        "name": "get_current_weather"
      }
    },
    "parallel_tool_calls": false,
    "temperature": 0.2,
    "max_tokens": 512
  }'

Response shape

{
  "id": "chatcmpl-named-tool123",
  "object": "chat.completion",
  "created": 1760000005,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_weather_002",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 105,
    "completion_tokens": 22,
    "total_tokens": 127
  }
}

POST /v1/chat/completions

Tool result

After your application executes the selected tool, send the tool result back so Llama can produce a final answer.

Tool result message

role (string): Set to "tool" for OpenAI-compatible tool result messages.
tool_call_id (string): The id returned by the assistant message’s tool_calls item.
content (string): The tool result. If the result is structured data, serialize it as a JSON string.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather in Tokyo today?"
      },
      {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"
            }
          }
        ]
      },
      {
        "role": "tool",
        "tool_call_id": "call_abc123",
        "content": "{\"location\":\"Tokyo\",\"temperature\":22,\"condition\":\"Partly cloudy\",\"unit\":\"celsius\"}"
      }
    ],
    "max_tokens": 1024
  }'
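
The content value in the tool message above is itself a JSON document, so its quotes have to be escaped before it is embedded as a string. The following is a minimal sketch of that escaping with sed. json_escape and tool_message are hypothetical helpers, and the sketch handles only backslashes and double quotes in single-line input, so control characters or newlines would need a real JSON encoder.

```shell
# Sketch: escape a single-line JSON document for use inside a JSON string value.
# Backslashes are doubled first, then double quotes are prefixed with a backslash.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build a tool-result message.
# $1 = tool_call_id from the assistant's tool_calls item, $2 = raw tool result.
tool_message() {
  printf '{"role":"tool","tool_call_id":"%s","content":"%s"}\n' \
    "$1" "$(json_escape "$2")"
}
```

The output of tool_message can be appended to the messages array after the assistant message that carried the matching tool call.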

Response shape

{
  "id": "chatcmpl-tool-result123",
  "object": "chat.completion",
  "created": 1760000006,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The weather in Tokyo today is partly cloudy, with a temperature of 22°C."
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 156,
    "completion_tokens": 20,
    "total_tokens": 176
  }
}

POST /v1/chat/completions

JSON mode

Request valid JSON output using vLLM’s OpenAI-compatible response_format.

JSON attributes

response_format (object): Set to { "type": "json_object" } to request JSON object output.
messages (array): Include an explicit instruction to return JSON in the system or user message.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant. Return only valid JSON."
      },
      {
        "role": "user",
        "content": "Create a JSON object with three short product tagline ideas for a note-taking app."
      }
    ],
    "response_format": {
      "type": "json_object"
    },
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 512
  }'

Response shape

{
  "id": "chatcmpl-json123",
  "object": "chat.completion",
  "created": 1760000007,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\"taglines\":[\"Capture ideas before they disappear.\",\"Your thoughts, organized instantly.\",\"Notes that keep up with you.\"]}"
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 48,
    "completion_tokens": 36,
    "total_tokens": 84
  }
}

POST /v1/chat/completions

JSON schema

Request output that follows a JSON schema using vLLM’s response_format with type: "json_schema".

JSON schema attributes

response_format (object): Output-format constraint. Set type to "json_schema" and supply the schema under json_schema.
json_schema (object): Nested inside response_format. Holds a name and the schema that the response should follow.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "Return only JSON that matches the provided schema."
      },
      {
        "role": "user",
        "content": "Create three short product tagline ideas for a note-taking app."
      }
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "tagline_response",
        "schema": {
          "type": "object",
          "properties": {
            "taglines": {
              "type": "array",
              "items": {
                "type": "string"
              },
              "minItems": 3,
              "maxItems": 3
            }
          },
          "required": ["taglines"],
          "additionalProperties": false
        }
      }
    },
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 512
  }'

Response shape

{
  "id": "chatcmpl-schema123",
  "object": "chat.completion",
  "created": 1760000008,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\"taglines\":[\"Capture ideas before they disappear.\",\"Your thoughts, organized instantly.\",\"Notes that keep up with you.\"]}"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 86,
    "completion_tokens": 36,
    "total_tokens": 122
  }
}

POST /v1/chat/completions

Multiple vLLM parameters

Use this example when you need a broader set of vLLM generation controls.

With raw HTTP or cURL, vLLM-specific parameters such as top_k, min_p, and repetition_penalty go directly into the top level of the JSON request body.

Parameters shown here

temperature (number): Sampling temperature.
top_p (number): Nucleus sampling probability threshold.
top_k (integer): vLLM-specific top-k sampling parameter. Restricts sampling to the k most likely tokens.
min_p (number): vLLM-specific minimum probability sampling parameter, measured relative to the most likely token.
repetition_penalty (number): vLLM-specific repetition penalty. Values above 1 discourage repeated tokens.
presence_penalty (number): Penalizes tokens based on whether they already appeared.
frequency_penalty (number): Penalizes tokens based on how frequently they appeared.
seed (integer): Random seed for sampling. Reproducibility is best-effort.
n (integer): Number of candidate responses to generate.
logprobs (boolean): Whether to return token log probabilities if supported by the serving configuration.
top_logprobs (integer): Number of top candidate tokens to return at each position when logprobs is enabled.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant. Keep the answer concise."
      },
      {
        "role": "user",
        "content": "Give me five naming ideas for an AI inference platform."
      }
    ],
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "min_p": 0.0,
    "repetition_penalty": 1.05,
    "presence_penalty": 0.2,
    "frequency_penalty": 0.2,
    "max_tokens": 512,
    "seed": 1234,
    "n": 1,
    "logprobs": true,
    "top_logprobs": 2,
    "stream": false
  }'

Response shape

{
  "id": "chatcmpl-params123",
  "object": "chat.completion",
  "created": 1760000009,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "1. HexGrid Inference\n2. ModelForge\n3. TensorRoute\n4. InferaCloud\n5. LatticeAI"
      },
      "finish_reason": "stop",
      "logprobs": {
        "content": [
          {
            "token": "1",
            "logprob": -0.02,
            "bytes": [49],
            "top_logprobs": [
              {
                "token": "1",
                "logprob": -0.02,
                "bytes": [49]
              },
              {
                "token": "-",
                "logprob": -4.1,
                "bytes": [45]
              }
            ]
          }
        ]
      }
    }
  ],
  "usage": {
    "prompt_tokens": 38,
    "completion_tokens": 34,
    "total_tokens": 72
  }
}
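
Each logprob value is the natural logarithm of the token's probability, so exponentiating it recovers the probability. For the sample above, exp(-0.02) is about 0.98 for the token "1" and exp(-4.1) is about 0.017 for "-". The following is a small sketch using awk; logprob_to_prob is a hypothetical helper name.

```shell
# Sketch: convert a log probability to a probability with four decimal places.
# LC_ALL=C keeps the decimal separator a dot regardless of the user's locale.
logprob_to_prob() {
  LC_ALL=C awk -v lp="$1" 'BEGIN { printf "%.4f\n", exp(lp) }'
}

logprob_to_prob -0.02   # prints 0.9802
logprob_to_prob -4.1    # prints 0.0166
```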

Official sources

vLLM OpenAI-compatible server docs: https://docs.vllm.ai/en/latest/serving/openai_compatible_server/
vLLM tool calling docs: https://docs.vllm.ai/en/latest/features/tool_calling/
vLLM Llama tool parser docs: https://docs.vllm.ai/en/latest/api/vllm/tool_parsers/llama_tool_parser/
vLLM Llama 3.1 recipe: https://docs.vllm.ai/projects/recipes/en/latest/Llama/Llama3.1.html
Meta Llama 3.1 model card and prompt formats: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/
Meta Llama 3.3 model card and prompt formats: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/
Meta Llama 3.3 Hugging Face model page with vLLM cURL example: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct