Gemma 4 Chat Completions API - vLLM engine

Use vLLM’s OpenAI-compatible Chat Completions endpoint to call HexGrid-hosted Gemma 4 models with a messages-based interface.

This page provides copy-pasteable cURL-only examples for standard chat, reasoning, streaming, multimodal input, tool calling, and common generation parameters.


Endpoint

POST http://<server-ip>:<port>/v1/chat/completions

Set these environment variables before running the examples:

export GEMMA_BASE_URL="http://<server-ip>:<port>/v1"
export GEMMA_MODEL="google/gemma-4-31B-it"
export HEXGRID_API_KEY="<your-api-key>"

Use the exact model ID configured in your HexGrid deployment.
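
Before sending chat requests, it can help to confirm that the variables are set and that the model ID matches what the server reports. The sketch below assumes the deployment exposes vLLM's standard GET /v1/models route and accepts the same HEXGRID_API_KEY bearer token that the examples on this page send. The helper names require_env and list_models are hypothetical.

```shell
# Fail early if a variable used by the examples is unset or empty.
require_env() {
  for name in "$@"; do
    # Indirect lookup that works in both bash and POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}

# Fetch the served-model list as JSON (assumes GET /v1/models is exposed).
# Each entry's "id" field should match the value of GEMMA_MODEL.
list_models() {
  curl -sS "$GEMMA_BASE_URL/models" \
    -H "Authorization: Bearer $HEXGRID_API_KEY"
}

# Usage:
#   require_env GEMMA_BASE_URL GEMMA_MODEL HEXGRID_API_KEY && list_models
```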


POST /v1/chat/completions

Create a chat completion

Generate a normal non-streaming response from a Gemma 4 model served by vLLM.

Required attributes

  • model (string)
    The served Gemma 4 model name, for example google/gemma-4-31B-it.

  • messages (array)
    The conversation so far. Each message has a role and content.

Optional attributes used here

  • temperature (number)
    Sampling temperature.

  • top_p (number)
    Nucleus sampling probability threshold.

  • max_tokens (integer)
    Maximum number of tokens to generate.

Request

curl -X POST "$GEMMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$GEMMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Give me a one-sentence explanation of what Gemma 4 is."
      }
    ],
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 256
  }'

Response shape

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1760000000,
  "model": "google/gemma-4-31B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Gemma 4 is a family of open Google DeepMind models designed for text generation, reasoning, coding, multimodal understanding, and agentic workflows."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 34,
    "completion_tokens": 29,
    "total_tokens": 63
  }
}

POST /v1/chat/completions

Reasoning

Enable Gemma 4 thinking mode for reasoning-heavy prompts.

This requires starting vLLM with:

--reasoning-parser gemma4

For cURL, pass chat_template_kwargs as a top-level JSON field.

Reasoning attributes

  • chat_template_kwargs (object)
    vLLM chat-template options.

  • enable_thinking (boolean)
    Set to true inside chat_template_kwargs to turn on Gemma 4 thinking mode.

  • max_tokens (integer)
    Reasoning tokens count toward this limit, so set it higher than for a standard chat request.

Request

curl -X POST "$GEMMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$GEMMA_MODEL"'",
    "messages": [
      {
        "role": "user",
        "content": "What is the derivative of x^3 * ln(x)?"
      }
    ],
    "max_tokens": 4096,
    "chat_template_kwargs": {
      "enable_thinking": true
    }
  }'

Response shape

{
  "id": "chatcmpl-reasoning123",
  "object": "chat.completion",
  "created": 1760000001,
  "model": "google/gemma-4-31B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning": "Use the product rule: d/dx[x^3 ln(x)] = x^3 * d/dx[ln(x)] + ln(x) * d/dx[x^3].",
        "content": "The derivative is x^2 + 3x^2 ln(x), or x^2(1 + 3ln(x))."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 96,
    "total_tokens": 120
  }
}
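
Because thinking tokens count against max_tokens, a response can stop before the final answer is written. A minimal sketch of a check follows. It assumes jq is installed and that the server reports finish_reason "length" when the token limit is hit, which is the OpenAI-compatible convention. The helper name check_finish is hypothetical.

```shell
# Read a chat completion from stdin and report whether it hit max_tokens.
check_finish() {
  jq -r '
    .choices[0] as $c
    | if $c.finish_reason == "length"
      then "truncated: raise max_tokens (used \(.usage.completion_tokens) completion tokens)"
      else "complete: \($c.message.content)"
      end
  '
}

# Usage:
#   curl -sS -X POST "$GEMMA_BASE_URL/chat/completions" ... | check_finish
```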

POST /v1/chat/completions

Streaming

Stream tokens as Server-Sent Events instead of waiting for the complete response.

Streaming attributes

  • stream (boolean)
    Set to true to return incremental chunks.

  • max_tokens (integer)
    Maximum number of tokens to generate.

Request

curl -N -X POST "$GEMMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$GEMMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Explain Gemma 4 in three short bullet points."
      }
    ],
    "stream": true,
    "max_tokens": 256
  }'

Sample streamed response shape

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"google/gemma-4-31B-it","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"google/gemma-4-31B-it","choices":[{"index":0,"delta":{"content":"- Gemma 4 is an open model family from Google DeepMind."},"finish_reason":null}]}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"google/gemma-4-31B-it","choices":[{"index":0,"delta":{"content":"\n- It supports reasoning, coding, and multimodal tasks."},"finish_reason":null}]}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"google/gemma-4-31B-it","choices":[{"index":0,"delta":{"content":"\n- It is designed for local-first and agentic workflows."},"finish_reason":null}]}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"google/gemma-4-31B-it","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
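
To turn the event stream back into plain text, a client strips the "data: " prefix from each line, stops at the [DONE] sentinel, and concatenates the delta.content values. The sketch below handles the first two steps in plain shell. The helper name sse_payloads is hypothetical, and the jq call in the usage comment assumes jq is installed.

```shell
# Print the JSON payload of each SSE "data:" line and stop at the [DONE] sentinel.
sse_payloads() {
  while IFS= read -r line; do
    line=${line%"$(printf '\r')"}   # tolerate CRLF line endings
    case "$line" in
      "data: [DONE]") break ;;
      "data: "*)      printf '%s\n' "${line#data: }" ;;
    esac
  done
}

# Usage (jq is an assumption; any JSON tool works):
#   curl -sS -N -X POST "$GEMMA_BASE_URL/chat/completions" ... \
#     | sse_payloads \
#     | jq -j '.choices[0].delta.content // empty'
```

The "// empty" filter skips chunks that carry no content, such as the initial role chunk and the final chunk that only sets finish_reason.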

POST /v1/chat/completions

Multimodal image input

Send text and image content parts to a multimodal Gemma 4 model served by vLLM.

Multimodal message content

  • content (array)
    A user message can contain an array of content parts.

  • type (string)
    Use "text" for text parts and "image_url" for image parts.

  • image_url (object)
    Image input object containing a url.

Request

curl -X POST "$GEMMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$GEMMA_MODEL"'",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe this image in one sentence."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
            }
          }
        ]
      }
    ],
    "max_tokens": 128
  }'

Response shape

{
  "id": "chatcmpl-vision123",
  "object": "chat.completion",
  "created": 1760000003,
  "model": "google/gemma-4-31B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The image shows the Statue of Liberty standing on Liberty Island in New York Harbor."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 512,
    "completion_tokens": 18,
    "total_tokens": 530
  }
}
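
When the image is a local file rather than a public URL, one option is to inline it as a base64 data URL. This sketch assumes the deployed vLLM version accepts data: URLs in image_url.url. It writes the request body to a file so that a large image does not exceed the shell's argument-length limit. The helper name image_request and the file names are hypothetical.

```shell
# Write a chat request with an inlined image to stdout.
# $1 = image path, $2 = MIME type (for example image/jpeg)
image_request() {
  # base64 implementations differ on line wrapping, so strip newlines explicitly.
  b64=$(base64 < "$1" | tr -d '\n')
  cat <<EOF
{
  "model": "$GEMMA_MODEL",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        {"type": "image_url", "image_url": {"url": "data:$2;base64,$b64"}}
      ]
    }
  ],
  "max_tokens": 128
}
EOF
}

# Usage:
#   image_request photo.jpg image/jpeg > request.json
#   curl -X POST "$GEMMA_BASE_URL/chat/completions" \
#     -H "Authorization: Bearer $HEXGRID_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d @request.json
```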

POST /v1/chat/completions

Tool calling

Provide function tools that the model may call.

This requires starting vLLM with:

--enable-auto-tool-choice
--tool-call-parser gemma4
--chat-template examples/tool_chat_template_gemma4.jinja
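
The three flags above belong to a single vllm serve invocation. A sketch of how they might be combined, assuming GEMMA_MODEL holds the model to serve and that the template path resolves from the directory vLLM is started in. The helper name launch_vllm_with_tools and the DRY_RUN variable are hypothetical. By default the function only prints the command so it can be reviewed first.

```shell
# Build the launch command as an argument list so it can be reviewed before running.
launch_vllm_with_tools() {
  set -- vllm serve "$GEMMA_MODEL" \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --chat-template examples/tool_chat_template_gemma4.jinja
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"    # default: print the command only
  else
    "$@"         # DRY_RUN=0: start the server
  fi
}

# Usage:
#   launch_vllm_with_tools              # prints the command
#   DRY_RUN=0 launch_vllm_with_tools    # runs it
```

If thinking mode is also needed, --reasoning-parser gemma4 can be appended to the same argument list.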

Tool attributes

  • tools (array)
    Tool definitions. Use type: "function" with a JSON schema for parameters.

  • tool_choice (string | object)
    Controls tool use. Use "auto" to let the model decide.

Request

curl -X POST "$GEMMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$GEMMA_MODEL"'",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather in Tokyo today?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City name, e.g. San Francisco, CA"
              },
              "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit"
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "auto",
    "max_tokens": 1024
  }'

Response shape

{
  "id": "chatcmpl-tool123",
  "object": "chat.completion",
  "created": 1760000004,
  "model": "google/gemma-4-31B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 118,
    "completion_tokens": 24,
    "total_tokens": 142
  }
}
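
The application has to read each tool_calls entry, run the matching function, and keep the id for the follow-up request. A minimal sketch of the extraction step, assuming jq is installed. The helper name tool_calls_tsv is hypothetical. Note that arguments arrives as a JSON-encoded string, so it is decoded with fromjson and re-emitted as compact JSON.

```shell
# Read a chat completion from stdin and print one line per tool call:
# <id> TAB <function name> TAB <arguments as compact JSON>
tool_calls_tsv() {
  jq -r '
    .choices[0].message.tool_calls // []
    | .[]
    | [.id, .function.name, (.function.arguments | fromjson | tojson)]
    | join("\t")
  '
}

# Usage:
#   curl -sS -X POST "$GEMMA_BASE_URL/chat/completions" ... | tool_calls_tsv
```

A response without tool calls produces no output, so the caller can treat an empty result as "no tool requested".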

POST /v1/chat/completions

Tool result

After your application executes the selected tool, send the tool result back so Gemma 4 can produce a final answer.

Tool result message

  • role (string)
    Use "tool" for OpenAI-compatible tool result messages.

  • tool_call_id (string)
    The id returned in the assistant message's tool_calls item.

  • content (string)
    The tool result. If the result is structured data, serialize it as a JSON string.
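
Hand-escaping the nested quotes in content is error-prone. A sketch that lets jq do the string encoding, assuming jq is installed. The helper name tool_message is hypothetical.

```shell
# Build a tool result message.
# $1 = tool_call_id, $2 = raw tool result (JSON text or plain text)
tool_message() {
  jq -cn --arg id "$1" --arg content "$2" \
    '{role: "tool", tool_call_id: $id, content: $content}'
}

# Usage:
#   tool_message call_abc123 '{"temperature":22,"condition":"Partly cloudy","unit":"celsius"}'
```

Because --arg always passes its value as a string, the result is embedded as an escaped JSON string rather than as a nested object.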

Request

curl -X POST "$GEMMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$GEMMA_MODEL"'",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather in Tokyo today?"
      },
      {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"
            }
          }
        ]
      },
      {
        "role": "tool",
        "tool_call_id": "call_abc123",
        "content": "{\"temperature\":22,\"condition\":\"Partly cloudy\",\"unit\":\"celsius\"}"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a location",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string"
              },
              "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"]
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "max_tokens": 1024
  }'

Response shape

{
  "id": "chatcmpl-tool-result123",
  "object": "chat.completion",
  "created": 1760000005,
  "model": "google/gemma-4-31B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The weather in Tokyo today is partly cloudy, with a temperature of 22°C."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 156,
    "completion_tokens": 20,
    "total_tokens": 176
  }
}

POST /v1/chat/completions

Multiple vLLM parameters

Use this example when you need a broader set of vLLM generation controls.

With raw HTTP or cURL, vLLM-specific parameters such as top_k and repetition_penalty go at the top level of the JSON request body, alongside the standard OpenAI fields.

Parameters shown here

  • temperature (number)
    Sampling temperature.

  • top_p (number)
    Nucleus sampling probability threshold.

  • top_k (integer)
    vLLM-specific top-k sampling parameter.

  • repetition_penalty (number)
    vLLM-specific repetition penalty.

  • max_tokens (integer)
    Maximum number of tokens to generate.

  • seed (integer)
    Best-effort deterministic sampling seed.

  • chat_template_kwargs (object)
    vLLM chat-template options, such as Gemma 4 thinking mode.

Request

curl -X POST "$GEMMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$GEMMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant. Return concise JSON."
      },
      {
        "role": "user",
        "content": "Create a JSON object with three short tagline ideas for a local-first AI coding assistant."
      }
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 40,
    "repetition_penalty": 1.05,
    "max_tokens": 512,
    "seed": 1234,
    "stream": false,
    "chat_template_kwargs": {
      "enable_thinking": false
    }
  }'

Response shape

{
  "id": "chatcmpl-params123",
  "object": "chat.completion",
  "created": 1760000006,
  "model": "google/gemma-4-31B-it",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\"taglines\":[\"Code locally. Think globally.\",\"Your workstation, your AI copilot.\",\"Agentic coding without leaving your machine.\"]}"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 46,
    "completion_tokens": 38,
    "total_tokens": 84
  }
}
