Qwen Chat Completions API - vLLM engine
Use vLLM’s OpenAI-compatible Chat Completions endpoint to call HexGrid-hosted Qwen models with a messages-based interface.
This page provides copy-pasteable cURL examples for standard chat, reasoning, streaming, tool calling, tool-result continuation, structured JSON output, and common vLLM generation parameters.
Endpoint
vLLM exposes the OpenAI-compatible Chat Completions API at:
POST https://<server-ip-address>:<port>/v1/chat/completions
Set these environment variables before running the examples:
export HEXGRID_API_KEY="your-hexgrid-api-key"
export QWEN_BASE_URL="https://<server-ip-address>:<port>/v1"
export QWEN_MODEL="Qwen/Qwen3.6-35B-A3B-FP8"
Use the exact model ID configured in your HexGrid deployment.
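Before running the requests, it can help to confirm that all three variables are set. The following is a minimal sketch in POSIX shell; check_qwen_env is a hypothetical helper name, not part of vLLM or HexGrid.

```shell
# Hypothetical helper: verify that the variables used by the examples are set.
check_qwen_env() {
  missing=""
  for name in HEXGRID_API_KEY QWEN_BASE_URL QWEN_MODEL; do
    # POSIX-style indirect lookup: read the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "Missing environment variables:$missing" >&2
    return 1
  fi
  echo "Environment looks complete for model $QWEN_MODEL"
}
```

Run check_qwen_env in the same shell session where you exported the variables. It returns a non-zero status and lists the missing names if any of them are empty.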
Create a chat completion
Generate a standard, non-streaming response from a Qwen model served by vLLM.
Required attributes
- Name: model
- Type: string
- Description: The model ID that your vLLM server was launched with, for example Qwen/Qwen3.6-35B-A3B-FP8.
- Name: messages
- Type: array
- Description: The conversation so far. Each message has a role and content. Common roles are system, user, assistant, and tool.
Optional attributes used here
- Name: temperature
- Type: number
- Description: Sampling temperature. Higher values produce more varied output; lower values produce more deterministic output.
- Name: top_p
- Type: number
- Description: Nucleus sampling probability threshold.
- Name: max_tokens
- Type: integer
- Description: Maximum number of tokens to generate.
Request
curl -X POST "$QWEN_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_MODEL"'",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Give me a one-sentence explanation of what Qwen is."
}
],
"temperature": 0.7,
"top_p": 0.8,
"max_tokens": 256
}'
Response
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1760000000,
"model": "Qwen/Qwen3.6-35B-A3B-FP8",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Qwen is a family of large language models developed by Alibaba Cloud for tasks such as chat, reasoning, coding, multilingual generation, and agent workflows."
},
"finish_reason": "stop",
"logprobs": null
}
],
"usage": {
"prompt_tokens": 31,
"completion_tokens": 30,
"total_tokens": 61
}
}
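The request body in this section is a single-quoted shell string, so a plain $QWEN_MODEL inside it would not expand. The '"$QWEN_MODEL"' sequence closes the single quotes, expands the variable inside double quotes, and reopens the single quotes. A small sketch of the same pattern, using a hypothetical chat_body helper that only prints a body and sends nothing:

```shell
# Hypothetical helper: print a small request body for the model in $QWEN_MODEL.
chat_body() {
  # Text inside '...' is literal. "$QWEN_MODEL" sits between two single-quoted
  # parts, so the shell expands it and it lands inside the JSON string quotes.
  printf '%s\n' '{"model": "'"$QWEN_MODEL"'", "messages": [{"role": "user", "content": "Hello"}]}'
}
```

This pattern assumes the model ID contains no double quotes or backslashes, which would need JSON escaping first.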
Reasoning
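Generate a response in which vLLM separates the Qwen reasoning from the final answer. The reasoning is returned in message.reasoning and the answer in message.content; this requires launching vLLM with --reasoning-parser qwen3.
Qwen3-series models think by default. To disable thinking for a specific request, add this to the request body: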
"chat_template_kwargs": {
"enable_thinking": false
}
Required attributes
- Name: model
- Type: string
- Description: The model ID that your vLLM server was launched with.
- Name: messages
- Type: array
- Description: The user conversation.
Reasoning attributes
- Name: chat_template_kwargs
- Type: object
- Description: Additional keyword arguments passed to the model chat template.
- Name: enable_thinking
- Type: boolean
- Description: Qwen chat-template flag, set inside chat_template_kwargs, that enables or disables thinking.
- Name: max_tokens
- Type: integer
- Description: Maximum number of output tokens. Reasoning tokens count toward this limit, so use a larger value than for non-reasoning requests.
Request
curl -X POST "$QWEN_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_MODEL"'",
"messages": [
{
"role": "user",
"content": "Which is greater, 9.11 or 9.8? Explain briefly."
}
],
"temperature": 0.6,
"top_p": 0.95,
"max_tokens": 1024,
"chat_template_kwargs": {
"enable_thinking": true
}
}'
Response
{
"id": "chatcmpl-reasoning123",
"object": "chat.completion",
"created": 1760000001,
"model": "Qwen/Qwen3.6-35B-A3B-FP8",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"reasoning": "Compare the decimal numbers digit by digit. Both start with 9. The tenths digit of 9.8 is 8, while the tenths digit of 9.11 is 1, so 9.8 is greater.",
"content": "9.8 is greater than 9.11."
},
"finish_reason": "stop",
"logprobs": null
}
],
"usage": {
"prompt_tokens": 24,
"completion_tokens": 72,
"total_tokens": 96
}
}
Streaming
Stream tokens as Server-Sent Events instead of waiting for the complete response.
Streaming attributes
- Name: stream
- Type: boolean
- Description: Set to true to return incremental chunks.
- Name: stream_options
- Type: object
- Description: Optional streaming configuration. { "include_usage": true } requests usage in the final stream chunk.
- Name: max_tokens
- Type: integer
- Description: Maximum number of tokens to generate.
Request
curl -N -X POST "$QWEN_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_MODEL"'",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Explain Qwen in three short bullet points."
}
],
"stream": true,
"stream_options": {
"include_usage": true
},
"temperature": 0.7,
"top_p": 0.8,
"max_tokens": 256
}'
Sample streamed response
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"Qwen/Qwen3.6-35B-A3B-FP8","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"Qwen/Qwen3.6-35B-A3B-FP8","choices":[{"index":0,"delta":{"content":"- Qwen is a family of large language models from Alibaba Cloud."},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"Qwen/Qwen3.6-35B-A3B-FP8","choices":[{"index":0,"delta":{"content":"\n- It supports chat, reasoning, coding, and multilingual tasks."},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"Qwen/Qwen3.6-35B-A3B-FP8","choices":[{"index":0,"delta":{"content":"\n- It can be served through vLLM using an OpenAI-compatible API."},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"Qwen/Qwen3.6-35B-A3B-FP8","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":null}
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"Qwen/Qwen3.6-35B-A3B-FP8","choices":[],"usage":{"prompt_tokens":28,"completion_tokens":43,"total_tokens":71}}
data: [DONE]
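Each event in the stream is a line that starts with data: followed by a JSON chunk, and the stream ends with data: [DONE]. As a rough sketch, the hypothetical sse_payloads function below reduces that framing to one JSON chunk per line, so the output can be passed to a JSON tool of your choice. It assumes each event fits on a single data: line, as in the sample above.

```shell
# Hypothetical helper: reduce an SSE stream on stdin to one JSON chunk per line.
sse_payloads() {
  while IFS= read -r line; do
    # Strip a trailing carriage return in case the server sends CRLF.
    line=${line%"$(printf '\r')"}
    case "$line" in
      "data: [DONE]") break ;;                      # end-of-stream sentinel
      "data: "*) printf '%s\n' "${line#data: }" ;;  # keep only the JSON payload
      *) ;;                                         # skip blank separators and other lines
    esac
  done
}
```

For example, append | sse_payloads to the streaming curl command in this section to see only the JSON chunks.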
Streaming and reasoning
Stream both reasoning and final answer chunks.
This requires launching vLLM with:
--reasoning-parser qwen3
vLLM places reasoning tokens in the streaming chunk’s delta.reasoning field and final answer tokens in delta.content.
Request
curl -N -X POST "$QWEN_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_MODEL"'",
"messages": [
{
"role": "user",
"content": "Which is greater, 9.11 or 9.8?"
}
],
"stream": true,
"stream_options": {
"include_usage": true
},
"max_tokens": 1024,
"chat_template_kwargs": {
"enable_thinking": true
}
}'
Sample streamed response
data: {"id":"chatcmpl-reason-stream123","object":"chat.completion.chunk","created":1760000003,"model":"Qwen/Qwen3.6-35B-A3B-FP8","choices":[{"index":0,"delta":{"role":"assistant","reasoning":"Compare the decimal values digit by digit."},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-reason-stream123","object":"chat.completion.chunk","created":1760000003,"model":"Qwen/Qwen3.6-35B-A3B-FP8","choices":[{"index":0,"delta":{"reasoning":" 9.8 has a tenths digit of 8, while 9.11 has a tenths digit of 1."},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-reason-stream123","object":"chat.completion.chunk","created":1760000003,"model":"Qwen/Qwen3.6-35B-A3B-FP8","choices":[{"index":0,"delta":{"content":"9.8 is greater than 9.11."},"finish_reason":null}],"usage":null}
data: {"id":"chatcmpl-reason-stream123","object":"chat.completion.chunk","created":1760000003,"model":"Qwen/Qwen3.6-35B-A3B-FP8","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":null}
data: {"id":"chatcmpl-reason-stream123","object":"chat.completion.chunk","created":1760000003,"model":"Qwen/Qwen3.6-35B-A3B-FP8","choices":[],"usage":{"prompt_tokens":16,"completion_tokens":64,"total_tokens":80}}
data: [DONE]
Tool calling
Provide function tools that the model may call. The model can return tool_calls instead of a final text answer.
This requires launching vLLM with:
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
Tool attributes
- Name: tools
- Type: array
- Description: Tool definitions. Use type: "function" with a JSON schema for parameters.
- Name: tool_choice
- Type: string | object
- Description: Controls tool use. Use "auto" to let the model decide, "none" to disable tool calls, "required" to force at least one tool call, or an object to force a named function.
- Name: parallel_tool_calls
- Type: boolean
- Description: Set to false if you want vLLM to return at most one tool call in a single response.
Request
curl -X POST "$QWEN_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_MODEL"'",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant. Use tools when needed."
},
{
"role": "user",
"content": "What is the weather in Hangzhou?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather for a city or district.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City or district, such as Beijing, Hangzhou, or Yuhang District."
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit."
}
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto",
"parallel_tool_calls": false,
"max_tokens": 512
}'
Response
{
"id": "chatcmpl-tool123",
"object": "chat.completion",
"created": 1760000004,
"model": "Qwen/Qwen3.6-35B-A3B-FP8",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "",
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_current_weather",
"arguments": "{\"location\":\"Hangzhou\",\"unit\":\"celsius\"}"
}
}
]
},
"finish_reason": "tool_calls",
"logprobs": null
}
],
"usage": {
"prompt_tokens": 112,
"completion_tokens": 22,
"total_tokens": 134
}
}
Required tool calling
Force the model to call at least one tool by setting tool_choice to "required".
vLLM supports tool_choice: "required" in the Chat Completions API. This uses structured outputs to generate tool calls that follow the schema defined in tools.
Request
curl -X POST "$QWEN_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_MODEL"'",
"messages": [
{
"role": "user",
"content": "Find the current weather for Hangzhou."
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather for a city or district.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City or district."
}
},
"required": ["location"]
}
}
}
],
"tool_choice": "required",
"parallel_tool_calls": false,
"max_tokens": 512
}'
Response
{
"id": "chatcmpl-required-tool123",
"object": "chat.completion",
"created": 1760000005,
"model": "Qwen/Qwen3.6-35B-A3B-FP8",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "",
"tool_calls": [
{
"id": "call_weather_001",
"type": "function",
"function": {
"name": "get_current_weather",
"arguments": "{\"location\":\"Hangzhou\"}"
}
}
]
},
"finish_reason": "tool_calls",
"logprobs": null
}
],
"usage": {
"prompt_tokens": 92,
"completion_tokens": 20,
"total_tokens": 112
}
}
Tool result
After your application executes the selected tool, send the tool result back to the model using a tool message.
Tool result message
- Name: role
- Type: string
- Description: Must be tool.
- Name: tool_call_id
- Type: string
- Description: The id returned in the assistant message’s tool_calls item.
- Name: content
- Type: string
- Description: The tool output. If your tool returns structured data, serialize it as a JSON string.
Request
curl -X POST "$QWEN_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_MODEL"'",
"messages": [
{
"role": "user",
"content": "What is the weather in Hangzhou?"
},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_current_weather",
"arguments": "{\"location\":\"Hangzhou\",\"unit\":\"celsius\"}"
}
}
]
},
{
"role": "tool",
"tool_call_id": "call_abc123",
"content": "{\"location\":\"Hangzhou\",\"weather\":\"cloudy\",\"temperature\":23,\"unit\":\"celsius\"}"
}
],
"max_tokens": 256
}'
Response
{
"id": "chatcmpl-tool-result123",
"object": "chat.completion",
"created": 1760000006,
"model": "Qwen/Qwen3.6-35B-A3B-FP8",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The current weather in Hangzhou is cloudy, with a temperature of about 23°C."
},
"finish_reason": "stop",
"logprobs": null
}
],
"usage": {
"prompt_tokens": 151,
"completion_tokens": 19,
"total_tokens": 170
}
}
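The content of a tool message is a JSON string, so structured tool output has to be escaped before it is embedded in the request body. A minimal sketch, assuming the tool output contains no newlines or other control characters (only backslashes and double quotes are escaped); json_escape and tool_message are hypothetical helper names:

```shell
# Hypothetical helper: escape backslashes and double quotes for a JSON string.
# Assumption: the input has no newlines or other control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: print a tool message for a tool_call_id ($1) and output ($2).
tool_message() {
  printf '{"role":"tool","tool_call_id":"%s","content":"%s"}\n' "$1" "$(json_escape "$2")"
}
```

Calling tool_message call_abc123 '{"location":"Hangzhou","temperature":23}' prints a message object whose content field holds the escaped JSON text, in the same shape as the tool message in the request above.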
JSON mode
Request valid JSON output using vLLM’s OpenAI-compatible response_format.
JSON attributes
- Name: response_format
- Type: object
- Description: Set to { "type": "json_object" } to request JSON object output.
- Name: messages
- Type: array
- Description: Include an explicit instruction to return JSON in the system or user message.
Request
curl -X POST "$QWEN_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_MODEL"'",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant. Return only valid JSON."
},
{
"role": "user",
"content": "Create a JSON object with three short product tagline ideas for a note-taking app."
}
],
"response_format": {
"type": "json_object"
},
"temperature": 0.7,
"top_p": 0.8,
"max_tokens": 512,
"chat_template_kwargs": {
"enable_thinking": false
}
}'
Response
{
"id": "chatcmpl-json123",
"object": "chat.completion",
"created": 1760000007,
"model": "Qwen/Qwen3.6-35B-A3B-FP8",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "{\"taglines\":[\"Capture ideas before they disappear.\",\"Your thoughts, organized instantly.\",\"Notes that keep up with you.\"]}"
},
"finish_reason": "stop",
"logprobs": null
}
],
"usage": {
"prompt_tokens": 48,
"completion_tokens": 36,
"total_tokens": 84
}
}
JSON schema
Request output that follows a JSON schema using vLLM’s response_format with type: "json_schema".
JSON schema attributes
- Name: response_format
- Type: object
- Description: Output-format constraint. vLLM supports json_object, json_schema, structural_tag, and text.
- Name: json_schema
- Type: object
- Description: The schema that the response should follow.
Request
curl -X POST "$QWEN_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_MODEL"'",
"messages": [
{
"role": "system",
"content": "Return only JSON that matches the provided schema."
},
{
"role": "user",
"content": "Create three short product tagline ideas for a note-taking app."
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "tagline_response",
"schema": {
"type": "object",
"properties": {
"taglines": {
"type": "array",
"items": {
"type": "string"
},
"minItems": 3,
"maxItems": 3
}
},
"required": ["taglines"],
"additionalProperties": false
}
}
},
"temperature": 0.7,
"top_p": 0.8,
"max_tokens": 512,
"chat_template_kwargs": {
"enable_thinking": false
}
}'
Response
{
"id": "chatcmpl-schema123",
"object": "chat.completion",
"created": 1760000008,
"model": "Qwen/Qwen3.6-35B-A3B-FP8",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "{\"taglines\":[\"Capture ideas before they disappear.\",\"Your thoughts, organized instantly.\",\"Notes that keep up with you.\"]}"
},
"finish_reason": "stop",
"logprobs": null
}
],
"usage": {
"prompt_tokens": 86,
"completion_tokens": 36,
"total_tokens": 122
}
}
Multiple vLLM parameters
Use this example when you need a broader set of generation controls.
vLLM supports OpenAI-compatible parameters and additional vLLM sampling parameters. With raw HTTP/cURL, these can be sent directly in the JSON request body.
Parameters shown here
- Name: temperature
- Type: number
- Description: Sampling temperature.
- Name: top_p
- Type: number
- Description: Nucleus sampling probability threshold.
- Name: top_k
- Type: integer
- Description: vLLM-specific top-k sampling parameter.
- Name: min_p
- Type: number
- Description: vLLM-specific minimum probability sampling parameter.
- Name: repetition_penalty
- Type: number
- Description: vLLM-specific repetition penalty.
- Name: presence_penalty
- Type: number
- Description: Penalizes tokens based on whether they already appeared.
- Name: frequency_penalty
- Type: number
- Description: Penalizes tokens based on how frequently they appeared.
- Name: seed
- Type: integer
- Description: Best-effort deterministic seed.
- Name: n
- Type: integer
- Description: Number of candidate responses to generate.
- Name: logprobs
- Type: boolean
- Description: Whether to return token log probabilities if supported by the serving configuration.
- Name: top_logprobs
- Type: integer
- Description: Number of top candidate tokens to return when logprobs is enabled.
- Name: chat_template_kwargs
- Type: object
- Description: Additional keyword arguments passed to the model chat template.
Request
curl -X POST "$QWEN_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_MODEL"'",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant. Keep the answer concise."
},
{
"role": "user",
"content": "Give me five naming ideas for an AI inference platform."
}
],
"temperature": 0.7,
"top_p": 0.8,
"top_k": 20,
"min_p": 0.0,
"repetition_penalty": 1.05,
"presence_penalty": 0.2,
"frequency_penalty": 0.2,
"max_tokens": 512,
"seed": 1234,
"n": 1,
"logprobs": true,
"top_logprobs": 2,
"chat_template_kwargs": {
"enable_thinking": false
}
}'
Response
{
"id": "chatcmpl-params123",
"object": "chat.completion",
"created": 1760000009,
"model": "Qwen/Qwen3.6-35B-A3B-FP8",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1. HexGrid Inference\n2. ModelForge\n3. TensorRoute\n4. InferaCloud\n5. LatticeAI"
},
"finish_reason": "stop",
"logprobs": {
"content": [
{
"token": "1",
"logprob": -0.02,
"bytes": [49],
"top_logprobs": [
{
"token": "1",
"logprob": -0.02,
"bytes": [49]
},
{
"token": "-",
"logprob": -4.1,
"bytes": [45]
}
]
}
]
}
}
],
"usage": {
"prompt_tokens": 38,
"completion_tokens": 34,
"total_tokens": 72
}
}
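Each logprob value is the natural logarithm of a token probability, so exponentiating it recovers the probability. A small sketch using awk, applied to the two values in the sample response; logprob_to_prob is a hypothetical helper name:

```shell
# Hypothetical helper: convert a natural-log probability to a probability.
logprob_to_prob() {
  awk -v lp="$1" 'BEGIN { printf "%.4f\n", exp(lp) }'
}

logprob_to_prob -0.02   # prints 0.9802
logprob_to_prob -4.1    # prints 0.0166
```

In the sample, the chosen token "1" therefore had a probability of about 98%, and the alternative "-" about 1.7%.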
Official sources
- vLLM OpenAI-compatible server docs: https://docs.vllm.ai/en/v0.19.0/serving/openai_compatible_server/
- vLLM Qwen3.5 & Qwen3.6 usage guide: https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3.5.html
- vLLM reasoning outputs docs: https://docs.vllm.ai/en/latest/features/reasoning_outputs/
- vLLM tool calling docs: https://docs.vllm.ai/en/latest/features/tool_calling/
- Qwen official vLLM deployment docs: https://qwen.readthedocs.io/en/latest/deployment/vllm.html