Gemma 4 Chat Completions API - vLLM engine
Use vLLM’s OpenAI-compatible Chat Completions endpoint to call HexGrid-hosted Gemma 4 models with a messages-based interface.
This page provides copy-pasteable cURL-only examples for standard chat, reasoning, streaming, multimodal input, tool calling, and common generation parameters.
Endpoint
POST http://<server-ip>:<port>/v1/chat/completions
Set these environment variables before running the examples:
export GEMMA_BASE_URL="http://<server-ip>:<port>/v1"
export GEMMA_MODEL="google/gemma-4-31B-it"
export HEXGRID_API_KEY="<your-api-key>"
Use the exact model ID configured in your HexGrid deployment.
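If you are unsure which model ID your deployment serves, vLLM's OpenAI-compatible server also exposes a model listing at /v1/models. The following is a minimal sketch, assuming jq is installed; list_model_ids and fetch_model_ids are hypothetical helper names, not part of vLLM or HexGrid.

```shell
# Print one served model ID per line from a /v1/models response on stdin.
# Assumes jq is installed; list_model_ids is a hypothetical helper name.
list_model_ids() {
  jq -r '.data[].id'
}

# Fetch the model list from the server and print the IDs.
# Defined here but not called, so sourcing this block makes no network request.
fetch_model_ids() {
  curl -s "$GEMMA_BASE_URL/models" \
    -H "Authorization: Bearer $HEXGRID_API_KEY" | list_model_ids
}
```

Running fetch_model_ids after setting the environment variables prints the IDs you can assign to GEMMA_MODEL.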
Create a chat completion
Generate a normal non-streaming response from a Gemma 4 model served by vLLM.
Required attributes
- model (string): The served Gemma 4 model name, for example google/gemma-4-31B-it.
- messages (array): The conversation so far. Each message has a role and content.
Optional attributes used here
- temperature (number): Sampling temperature.
- top_p (number): Nucleus sampling probability threshold.
- max_tokens (integer): Maximum number of tokens to generate.
Request
curl -X POST "$GEMMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$GEMMA_MODEL"'",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Give me a one-sentence explanation of what Gemma 4 is."
}
],
"temperature": 0.7,
"top_p": 0.95,
"max_tokens": 256
}'
Response shape
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1760000000,
"model": "google/gemma-4-31B-it",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Gemma 4 is a family of open Google DeepMind models designed for text generation, reasoning, coding, multimodal understanding, and agentic workflows."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 34,
"completion_tokens": 29,
"total_tokens": 63
}
}
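In scripts you usually want only the assistant text or the token counts rather than the full JSON document. A minimal sketch of pulling both out of the response shape above, assuming jq is installed; assistant_text and usage_summary are hypothetical helper names.

```shell
# Read a chat.completion JSON document on stdin and print the assistant text.
# Assumes jq is installed; both function names are hypothetical helpers.
assistant_text() {
  jq -r '.choices[0].message.content'
}

# Print "prompt completion total" token counts from the usage object.
usage_summary() {
  jq -r '.usage | "\(.prompt_tokens) \(.completion_tokens) \(.total_tokens)"'
}
```

Add -s to the cURL request above and pipe its output into either function.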
Reasoning
Enable Gemma 4 thinking mode for reasoning-heavy prompts.
This requires starting vLLM with:
--reasoning-parser gemma4
For cURL, pass chat_template_kwargs as a top-level JSON field.
Reasoning attributes
- chat_template_kwargs (object): vLLM chat-template options.
- enable_thinking (boolean): Gemma 4 thinking flag inside chat_template_kwargs.
- max_tokens (integer): Reasoning uses extra tokens, so increase this value.
Request
curl -X POST "$GEMMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$GEMMA_MODEL"'",
"messages": [
{
"role": "user",
"content": "What is the derivative of x^3 * ln(x)?"
}
],
"max_tokens": 4096,
"chat_template_kwargs": {
"enable_thinking": true
}
}'
Response shape
{
"id": "chatcmpl-reasoning123",
"object": "chat.completion",
"created": 1760000001,
"model": "google/gemma-4-31B-it",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"reasoning": "Use the product rule: d/dx[x^3 ln(x)] = x^3 * d/dx[ln(x)] + ln(x) * d/dx[x^3].",
"content": "The derivative is x^2 + 3x^2 ln(x), or x^2(1 + 3ln(x))."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 24,
"completion_tokens": 96,
"total_tokens": 120
}
}
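Because reasoning tokens count against max_tokens, a response can end before the final answer is written. In the OpenAI-compatible schema, a finish_reason of "length" signals that the limit was reached. A minimal sketch for reading the reasoning field and detecting truncation, assuming jq is installed; reasoning_text and hit_token_limit are hypothetical helper names.

```shell
# Print the reasoning trace from a chat.completion response on stdin.
# Prints nothing if the field is absent or null.
reasoning_text() {
  jq -r '.choices[0].message.reasoning // empty'
}

# Succeed (exit status 0) when generation stopped at the max_tokens limit.
hit_token_limit() {
  jq -e '.choices[0].finish_reason == "length"' > /dev/null
}
```

If hit_token_limit succeeds on a reasoning response, retry the request with a larger max_tokens value.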
Streaming
Stream tokens as Server-Sent Events instead of waiting for the complete response.
Streaming attributes
- stream (boolean): Set to true to return incremental chunks.
- max_tokens (integer): Maximum number of tokens to generate.
Request
curl -N -X POST "$GEMMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$GEMMA_MODEL"'",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Explain Gemma 4 in three short bullet points."
}
],
"stream": true,
"max_tokens": 256
}'
Sample streamed response shape
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"google/gemma-4-31B-it","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"google/gemma-4-31B-it","choices":[{"index":0,"delta":{"content":"- Gemma 4 is an open model family from Google DeepMind."},"finish_reason":null}]}
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"google/gemma-4-31B-it","choices":[{"index":0,"delta":{"content":"\n- It supports reasoning, coding, and multimodal tasks."},"finish_reason":null}]}
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"google/gemma-4-31B-it","choices":[{"index":0,"delta":{"content":"\n- It is designed for local-first and agentic workflows."},"finish_reason":null}]}
data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"google/gemma-4-31B-it","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
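Each event line carries one JSON chunk after the "data: " prefix, and the stream ends with a [DONE] sentinel that is not JSON. A minimal sketch for reassembling the generated text from such a stream, assuming jq is installed; join_stream is a hypothetical helper name.

```shell
# Read a Server-Sent Events stream on stdin and print the concatenated text.
# Assumes jq is installed; join_stream is a hypothetical helper name.
join_stream() {
  # 1. Keep only "data: " lines and strip the prefix.
  # 2. Drop the non-JSON [DONE] sentinel.
  # 3. Join every delta.content fragment; chunks without content are skipped.
  sed -n 's/^data: //p' \
    | grep -v '^\[DONE\]$' \
    | jq -j '.choices[0].delta.content // empty'
}
```

Piping the curl -N request above into join_stream prints only the generated text. Output may appear in bursts because the intermediate tools buffer their input.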
Multimodal image input
Send text and image content parts to a multimodal Gemma 4 model served by vLLM.
Multimodal message content
- content (array): A user message can contain an array of content parts.
- type (string): Use "text" for text parts and "image_url" for image parts.
- image_url (object): Image input object containing a url.
Request
curl -X POST "$GEMMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$GEMMA_MODEL"'",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
],
"max_tokens": 128
}'
Response shape
{
"id": "chatcmpl-vision123",
"object": "chat.completion",
"created": 1760000003,
"model": "google/gemma-4-31B-it",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The image shows the Statue of Liberty standing on Liberty Island in New York Harbor."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 512,
"completion_tokens": 18,
"total_tokens": 530
}
}
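The request above passes a public image URL. For a local file, one option is to inline the image as a base64 data URL in image_url.url. This sketch assumes your deployment accepts data: URLs, which vLLM's multimodal documentation describes but which you should verify against your server; image_data_url is a hypothetical helper name.

```shell
# Print a base64 data URL for a local image file.
#   $1: path to the image file
#   $2: MIME type (defaults to image/jpeg)
# image_data_url is a hypothetical helper name.
image_data_url() {
  # tr removes the line wrapping that some base64 implementations add.
  printf 'data:%s;base64,%s' "${2:-image/jpeg}" "$(base64 < "$1" | tr -d '\n')"
}
```

Place the printed string in the url field of the image_url part. For large images, write the request body to a file and send it with curl -d @payload.json, since a long inline body can exceed the shell's argument length limit.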
Tool calling
Provide function tools that the model may call.
This requires starting vLLM with:
--enable-auto-tool-choice
--tool-call-parser gemma4
--chat-template examples/tool_chat_template_gemma4.jinja
Tool attributes
- tools (array): Tool definitions. Use type: "function" with a JSON schema for parameters.
- tool_choice (string | object): Controls tool use. Use "auto" to let the model decide.
Request
curl -X POST "$GEMMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$GEMMA_MODEL"'",
"messages": [
{
"role": "user",
"content": "What is the weather in Tokyo today?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto",
"max_tokens": 1024
}'
Response shape
{
"id": "chatcmpl-tool123",
"object": "chat.completion",
"created": 1760000004,
"model": "google/gemma-4-31B-it",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "",
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"
}
}
]
},
"finish_reason": "tool_calls"
}
],
"usage": {
"prompt_tokens": 118,
"completion_tokens": 24,
"total_tokens": 142
}
}
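Note that function.arguments is a JSON document encoded as a string, so it needs a second decoding step before your application can read individual arguments. A minimal sketch, assuming jq is installed; parse_tool_calls is a hypothetical helper name.

```shell
# Read a chat.completion response on stdin and print one compact JSON object
# per tool call, with the arguments string decoded into a real object.
# Assumes jq is installed; parse_tool_calls is a hypothetical helper name.
parse_tool_calls() {
  jq -c '.choices[0].message.tool_calls[]?
         | {id, name: .function.name, args: (.function.arguments | fromjson)}'
}
```

The trailing ? makes the function print nothing, rather than fail, when the model answered directly without calling a tool.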
Tool result
After your application executes the selected tool, send the tool result back so Gemma 4 can produce a final answer.
Tool result message
- role (string): Use tool for OpenAI-compatible tool result messages.
- tool_call_id (string): The id returned by the assistant message’s tool_calls item.
- content (string): The tool result. If the result is structured data, serialize it as a JSON string.
Request
curl -X POST "$GEMMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$GEMMA_MODEL"'",
"messages": [
{
"role": "user",
"content": "What is the weather in Tokyo today?"
},
{
"role": "assistant",
"content": "",
"tool_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"
}
}
]
},
{
"role": "tool",
"tool_call_id": "call_abc123",
"content": "{\"temperature\":22,\"condition\":\"Partly cloudy\",\"unit\":\"celsius\"}"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}
}
],
"max_tokens": 1024
}'
Response shape
{
"id": "chatcmpl-tool-result123",
"object": "chat.completion",
"created": 1760000005,
"model": "google/gemma-4-31B-it",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The weather in Tokyo today is partly cloudy, with a temperature of 22°C."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 156,
"completion_tokens": 20,
"total_tokens": 176
}
}
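Hand-escaping the tool result into a JSON string, as in the request above, is easy to get wrong. A minimal sketch that lets jq do the escaping, assuming jq is installed; tool_message is a hypothetical helper name.

```shell
# Print a tool result message as compact JSON.
#   $1: the tool_call_id from the assistant's tool_calls item
#   $2: the tool result as text (for structured data, a JSON document)
# Assumes jq is installed; tool_message is a hypothetical helper name.
tool_message() {
  # --arg passes values as strings, so jq escapes quotes in $2 correctly.
  jq -cn --arg id "$1" --arg result "$2" \
    '{role: "tool", tool_call_id: $id, content: $result}'
}
```

For example, tool_message call_abc123 '{"temperature":22,"condition":"Partly cloudy","unit":"celsius"}' prints a message object you can append to the messages array.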
Multiple vLLM parameters
Use this example when you need a broader set of vLLM generation controls.
For raw HTTP/cURL, vLLM-specific parameters can be merged directly into the JSON request body.
Parameters shown here
- temperature (number): Sampling temperature.
- top_p (number): Nucleus sampling probability threshold.
- top_k (integer): vLLM-specific top-k sampling parameter.
- repetition_penalty (number): vLLM-specific repetition penalty.
- max_tokens (integer): Maximum number of tokens to generate.
- seed (integer): Best-effort deterministic sampling seed.
- chat_template_kwargs (object): vLLM chat-template options, such as Gemma 4 thinking mode.
Request
curl -X POST "$GEMMA_BASE_URL/chat/completions" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$GEMMA_MODEL"'",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant. Return concise JSON."
},
{
"role": "user",
"content": "Create a JSON object with three short tagline ideas for a local-first AI coding assistant."
}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 40,
"repetition_penalty": 1.05,
"max_tokens": 512,
"seed": 1234,
"stream": false,
"chat_template_kwargs": {
"enable_thinking": false
}
}'
Response shape
{
"id": "chatcmpl-params123",
"object": "chat.completion",
"created": 1760000006,
"model": "google/gemma-4-31B-it",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "{\"taglines\":[\"Code locally. Think globally.\",\"Your workstation, your AI copilot.\",\"Agentic coding without leaving your machine.\"]}"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 46,
"completion_tokens": 38,
"total_tokens": 84
}
}
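When prompts come from user input or files, splicing them into a single-quoted JSON body breaks as soon as the text contains quotes. A minimal sketch that builds the same request body with jq instead, assuming jq is installed; build_payload is a hypothetical helper name, and the parameter values are copied from the request above.

```shell
# Print a chat completion request body for the given user prompt.
#   $1: user prompt (any text; jq handles the JSON escaping)
# Reads the model ID from GEMMA_MODEL.
# Assumes jq is installed; build_payload is a hypothetical helper name.
build_payload() {
  jq -n --arg model "$GEMMA_MODEL" --arg prompt "$1" '{
    model: $model,
    messages: [{role: "user", content: $prompt}],
    temperature: 0.6,
    top_p: 0.95,
    top_k: 40,
    repetition_penalty: 1.05,
    max_tokens: 512,
    seed: 1234,
    chat_template_kwargs: {enable_thinking: false}
  }'
}
```

Pipe the output into the same cURL command with -d @- in place of the inline body, so that curl reads the request from standard input.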
Official sources
- vLLM Gemma 4 usage guide: https://docs.vllm.ai/projects/recipes/en/stable/Google/Gemma4.html
- vLLM OpenAI-compatible server docs: https://docs.vllm.ai/en/latest/serving/openai_compatible_server/
- vLLM tool calling docs: https://docs.vllm.ai/en/latest/features/tool_calling/
- vLLM Gemma 4 tool parser docs: https://docs.vllm.ai/en/latest/api/vllm/tool_parsers/gemma4_tool_parser/
- Google Gemma 4 model overview: https://ai.google.dev/gemma/docs/core