Gemma 4 Chat Completions API - vLLM engine

Use vLLM’s OpenAI-compatible Chat Completions endpoint to call HexGrid-hosted Gemma 4 models with a messages-based interface.

This page provides copy-pasteable cURL examples for standard chat, reasoning, streaming, multimodal image input, tool calling, and common generation parameters.

Endpoint

POST http://<server-ip>:<port>/v1/chat/completions

Set these environment variables before running the examples. Every request sends HEXGRID_API_KEY as a Bearer token:

export GEMMA_BASE_URL="http://<server-ip>:<port>/v1"
export GEMMA_MODEL="google/gemma-4-31B-it"
export HEXGRID_API_KEY="<your-api-key>"

Use the exact model ID configured in your HexGrid deployment.
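
Because every example on this page expands these variables inside the request, an unset one tends to surface as a confusing HTTP or JSON error. The following is a minimal sketch of a pre-flight check; the function name check_gemma_env is hypothetical and not part of vLLM or HexGrid.

```shell
# Hypothetical helper: fail early if a variable used by the examples is unset.
check_gemma_env() {
  missing=""
  for name in GEMMA_BASE_URL GEMMA_MODEL HEXGRID_API_KEY; do
    # Indirect lookup of the variable named by $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "Missing environment variables:$missing" >&2
    return 1
  fi
  return 0
}

# Usage: check_gemma_env && echo "environment looks complete"
```

The check only confirms that the variables are non-empty; it does not verify that the URL is reachable or that the key is valid.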

POST /v1/chat/completions

Create a chat completion

Generate a normal non-streaming response from a Gemma 4 model served by vLLM.

Required attributes

model (string): The served Gemma 4 model name, for example google/gemma-4-31B-it.
messages (array): The conversation so far. Each message has a role and content.

Optional attributes used here

temperature (number): Sampling temperature.
top_p (number): Nucleus sampling probability threshold.
max_tokens (integer): Maximum number of tokens to generate.

Request

curl -X POST "$GEMMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$GEMMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Give me a one-sentence explanation of what Gemma 4 is."
      }
    ],
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 256
  }'
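
The inline quoting used in the request works for fixed text, but a prompt that contains a double quote or a backslash would produce invalid JSON. As a rough sketch, the hypothetical helpers below (json_escape and chat_body are illustrative names, not part of any tool) escape those two characters and build a minimal request body. Newlines and other control characters are not handled, so a JSON-aware tool is a safer choice for arbitrary input.

```shell
# Hypothetical helper: escape backslashes and double quotes for use inside a
# JSON string. Newlines and other control characters are NOT handled here.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: build a minimal chat body.
# $1 = model ID, $2 = user message text.
chat_body() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":256}' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# Usage (sends a request, so it is left commented out):
# curl -X POST "$GEMMA_BASE_URL/chat/completions" \
#   -H "Authorization: Bearer $HEXGRID_API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$(chat_body "$GEMMA_MODEL" 'Define "nucleus sampling" in one sentence.')"
```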

Response shape

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1760000000,
  "model": "google/gemma-4-31B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Gemma 4 is a family of open Google DeepMind models designed for text generation, reasoning, coding, multimodal understanding, and agentic workflows."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 34,
    "completion_tokens": 29,
    "total_tokens": 63
  }
}
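
In the OpenAI-compatible response format, a finish_reason of "stop" means the model ended its answer on its own, while "length" means generation was cut off at max_tokens. The sketch below pulls that field out with plain text matching; finish_reason and is_complete are hypothetical helper names, and the matching assumes the response shape shown above rather than parsing JSON properly.

```shell
# Hypothetical helper: print the finish_reason found in a response on stdin.
# Text matching only; a JSON-aware parser is more robust for real use.
finish_reason() {
  sed -n 's/.*"finish_reason": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: succeed only if the response on stdin ended naturally.
is_complete() {
  [ "$(finish_reason)" = "stop" ]
}

# Usage: curl ... | is_complete || echo "truncated: consider raising max_tokens" >&2
```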

Reasoning

Enable Gemma 4 thinking mode for reasoning-heavy prompts.

This requires starting vLLM with:

--reasoning-parser gemma4

For cURL, pass chat_template_kwargs as a top-level JSON field.

Reasoning attributes

chat_template_kwargs (object): vLLM chat-template options.
enable_thinking (boolean): Gemma 4 thinking flag, nested inside chat_template_kwargs. Set to true to enable thinking mode.
max_tokens (integer): Maximum number of tokens to generate. Reasoning consumes extra tokens, so raise this value.

Request

curl -X POST "$GEMMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$GEMMA_MODEL"'",
    "messages": [
      {
        "role": "user",
        "content": "What is the derivative of x^3 * ln(x)?"
      }
    ],
    "max_tokens": 4096,
    "chat_template_kwargs": {
      "enable_thinking": true
    }
  }'
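
To make the "top-level field" placement concrete, here is a small sketch that adds the thinking flag to an existing request body by rewriting its closing brace. The function name with_thinking is hypothetical, and the approach assumes the body is a non-empty JSON object whose last non-space character is "}"; it is a text edit, not a JSON merge.

```shell
# Hypothetical helper: add the thinking flag as a top-level field by
# rewriting the closing brace of a JSON object read from stdin.
# Assumes a non-empty object that does not already set chat_template_kwargs.
with_thinking() {
  sed '$ s/}[[:space:]]*$/,"chat_template_kwargs":{"enable_thinking":true}}/'
}

# Usage: printf '%s' '{"max_tokens":4096}' | with_thinking
```

The flag ends up next to model, messages, and max_tokens rather than inside any of them, which is the placement the request above uses.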

Response shape

{
  "id": "chatcmpl-reasoning123",
  "object": "chat.completion",
  "created": 1760000001,
  "model": "google/gemma-4-31B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning": "Use the product rule: d/dx[x^3 ln(x)] = x^3 * d/dx[ln(x)] + ln(x) * d/dx[x^3].",
        "content": "The derivative is x^2 + 3x^2 ln(x), or x^2(1 + 3ln(x))."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 96,
    "total_tokens": 120
  }
}

Streaming

Stream tokens as Server-Sent Events instead of waiting for the complete response.

Streaming attributes

stream (boolean): Set to true to return incremental chunks.
max_tokens (integer): Maximum number of tokens to generate.

Request

curl -N -X POST "$GEMMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$GEMMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Explain Gemma 4 in three short bullet points."
      }
    ],
    "stream": true,
    "max_tokens": 256
  }'

Sample streamed response shape

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"google/gemma-4-31B-it","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"google/gemma-4-31B-it","choices":[{"index":0,"delta":{"content":"- Gemma 4 is an open model family from Google DeepMind."},"finish_reason":null}]}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"google/gemma-4-31B-it","choices":[{"index":0,"delta":{"content":"\n- It supports reasoning, coding, and multimodal tasks."},"finish_reason":null}]}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"google/gemma-4-31B-it","choices":[{"index":0,"delta":{"content":"\n- It is designed for local-first and agentic workflows."},"finish_reason":null}]}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"google/gemma-4-31B-it","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
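
Each event in the stream is a line that starts with "data: ", and the stream ends with the "data: [DONE]" sentinel. As a minimal sketch of consuming it in the shell, the hypothetical function sse_payloads below strips the prefix and stops at the sentinel. It assumes LF-terminated lines as in the sample above and leaves the JSON chunks themselves unparsed.

```shell
# Hypothetical helper: read an SSE stream on stdin and print one JSON chunk
# per line, stopping at the [DONE] sentinel. Blank separator lines are skipped.
sse_payloads() {
  while IFS= read -r line; do
    case "$line" in
      "data: [DONE]") break ;;
      "data: "*) printf '%s\n' "${line#data: }" ;;
    esac
  done
}

# Usage: curl -N ... | sse_payloads
```

Reading line by line keeps output incremental, so chunks are printed as they arrive rather than after the stream closes.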

Multimodal image input

Send text and image content parts to a multimodal Gemma 4 model served by vLLM.

Multimodal message content

content (array): A user message can contain an array of content parts.
type (string): Use "text" for text parts and "image_url" for image parts.
image_url (object): Image input object containing a url.

Request

curl -X POST "$GEMMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$GEMMA_MODEL"'",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 128
  }'
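
The request above references a public image by URL. For a local file, one common approach with OpenAI-compatible servers is to pass a base64 data: URL in the same url field. The sketch below assumes your HexGrid deployment accepts data: URLs; image_data_url is a hypothetical helper name.

```shell
# Hypothetical helper: print a data: URL for a local image file.
# $1 = path to the file, $2 = MIME type (defaults to image/jpeg).
image_data_url() {
  # tr removes the line wrapping that some base64 implementations add.
  printf 'data:%s;base64,%s' "${2:-image/jpeg}" "$(base64 < "$1" | tr -d '\n')"
}

# Usage: place "$(image_data_url ./photo.jpg)" in the "url" field of the
# image_url content part in place of the https:// address.
```

Large images produce very long request bodies, so writing the JSON to a file and sending it with curl's -d @file form may be more practical than inlining it on the command line.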

Response shape

{
  "id": "chatcmpl-vision123",
  "object": "chat.completion",
  "created": 1760000003,
  "model": "google/gemma-4-31B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The image shows the Statue of Liberty standing on Liberty Island in New York Harbor."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 512,
    "completion_tokens": 18,
    "total_tokens": 530
  }
}

Tool calling

Provide function tools that the model may call.

This requires starting vLLM with:

--enable-auto-tool-choice
--tool-call-parser gemma4
--chat-template examples/tool_chat_template_gemma4.jinja

Tool attributes

tools (array): Tool definitions. Use type: "function" with a JSON schema for parameters.
tool_choice (string | object): Controls tool use. Use "auto" to let the model decide.

Request

curl -X POST "$GEMMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$GEMMA_MODEL"'",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather in Tokyo today?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City name, e.g. San Francisco, CA"
              },
              "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit"
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "auto",
    "max_tokens": 1024
  }'

Response shape

{
  "id": "chatcmpl-tool123",
  "object": "chat.completion",
  "created": 1760000004,
  "model": "google/gemma-4-31B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 118,
    "completion_tokens": 24,
    "total_tokens": 142
  }
}
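
The model only names the function and supplies arguments; running it is your application's job. A rough sketch of that step follows, assuming you have already extracted function.name and function.arguments from the response with a JSON-aware tool. Both dispatch_tool and the local get_weather stub are hypothetical, and the stub returns canned data instead of calling a weather service.

```shell
# Hypothetical local implementation of the get_weather tool.
# It ignores its arguments and returns canned data for illustration.
get_weather() {
  printf '{"temperature":22,"condition":"Partly cloudy","unit":"celsius"}'
}

# Hypothetical dispatcher: $1 = function name from tool_calls,
# $2 = its JSON-encoded arguments string. Only names listed here can run.
dispatch_tool() {
  case "$1" in
    get_weather) get_weather "$2" ;;
    *) printf '{"error":"unknown tool"}'; return 1 ;;
  esac
}

# Usage: result="$(dispatch_tool get_weather '{"location":"Tokyo","unit":"celsius"}')"
```

Dispatching through an explicit case list means a name the model invents cannot execute an arbitrary command.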

Tool result

After your application executes the selected tool, send the tool result back so Gemma 4 can produce a final answer.

Tool result message

role (string): Use "tool" for OpenAI-compatible tool result messages.
tool_call_id (string): The id returned by the assistant message’s tool_calls item.
content (string): The tool result. If the result is structured data, serialize it as a JSON string.
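
Serializing structured data "as a JSON string" means the result's own quotes must be escaped inside content. The sketch below shows one way to build the tool message in the shell; tool_message and escape_json_string are hypothetical names, and only backslashes and double quotes are escaped, so results containing newlines or control characters need a JSON-aware tool instead.

```shell
# Hypothetical helper: escape backslashes and double quotes so that text can
# sit inside a JSON string. Newlines and control characters are NOT handled.
escape_json_string() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: build a tool result message.
# $1 = tool_call_id from the assistant message, $2 = tool result text.
tool_message() {
  printf '{"role":"tool","tool_call_id":"%s","content":"%s"}' \
    "$(escape_json_string "$1")" "$(escape_json_string "$2")"
}

# Usage: tool_message call_abc123 '{"temperature":22,"unit":"celsius"}'
```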

Request

curl -X POST "$GEMMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$GEMMA_MODEL"'",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather in Tokyo today?"
      },
      {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"
            }
          }
        ]
      },
      {
        "role": "tool",
        "tool_call_id": "call_abc123",
        "content": "{\"temperature\":22,\"condition\":\"Partly cloudy\",\"unit\":\"celsius\"}"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string"
              },
              "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"]
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "max_tokens": 1024
  }'

Response shape

{
  "id": "chatcmpl-tool-result123",
  "object": "chat.completion",
  "created": 1760000005,
  "model": "google/gemma-4-31B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The weather in Tokyo today is partly cloudy, with a temperature of 22°C."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 156,
    "completion_tokens": 20,
    "total_tokens": 176
  }
}

Multiple vLLM parameters

Use this example when you need a broader set of vLLM generation controls.

For raw HTTP/cURL, vLLM-specific parameters can be merged directly into the JSON request body.

Parameters shown here

temperature (number): Sampling temperature.
top_p (number): Nucleus sampling probability threshold.
top_k (integer): vLLM-specific top-k sampling parameter.
repetition_penalty (number): vLLM-specific repetition penalty.
max_tokens (integer): Maximum number of tokens to generate.
seed (integer): Best-effort deterministic sampling seed.
chat_template_kwargs (object): vLLM chat-template options, such as Gemma 4 thinking mode.

Request

curl -X POST "$GEMMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$GEMMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant. Return concise JSON."
      },
      {
        "role": "user",
        "content": "Create a JSON object with three short tagline ideas for a local-first AI coding assistant."
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 40,
    "repetition_penalty": 1.05,
    "max_tokens": 512,
    "seed": 1234,
    "stream": false,
    "chat_template_kwargs": {
      "enable_thinking": false
    }
  }'

Response shape

{
  "id": "chatcmpl-params123",
  "object": "chat.completion",
  "created": 1760000006,
  "model": "google/gemma-4-31B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\"taglines\":[\"Code locally. Think globally.\",\"Your workstation, your AI copilot.\",\"Agentic coding without leaving your machine.\"]}"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 46,
    "completion_tokens": 38,
    "total_tokens": 84
  }
}

Official sources

vLLM Gemma 4 usage guide: https://docs.vllm.ai/projects/recipes/en/stable/Google/Gemma4.html
vLLM OpenAI-compatible server docs: https://docs.vllm.ai/en/latest/serving/openai_compatible_server/
vLLM tool calling docs: https://docs.vllm.ai/en/latest/features/tool_calling/
vLLM Gemma 4 tool parser docs: https://docs.vllm.ai/en/latest/api/vllm/tool_parsers/gemma4_tool_parser/
Google Gemma 4 model overview: https://ai.google.dev/gemma/docs/core