Llama Chat Completions API - vLLM engine

Use vLLM’s OpenAI-compatible Chat Completions endpoint to call HexGrid-hosted Llama 3.1 and Llama 3.3 models with a messages-based interface.

This page provides copy-pasteable cURL-only examples for standard chat, reasoning-style prompts, streaming, tool calling, tool-result continuation, JSON output, and common generation parameters.


Endpoint

POST http://<server-ip>:<port>/v1/chat/completions

Set these environment variables before running the examples:

export HEXGRID_API_KEY="your-hexgrid-api-key"
export LLAMA_BASE_URL="http://<server-ip>:<port>/v1"
export LLAMA_MODEL="meta-llama/Llama-3.3-70B-Instruct"

You can replace meta-llama/Llama-3.3-70B-Instruct with another HexGrid-hosted Llama model, such as:

meta-llama/Meta-Llama-3.1-8B-Instruct
meta-llama/Meta-Llama-3.1-70B-Instruct
meta-llama/Meta-Llama-3.1-405B-Instruct

Use the exact model ID configured in your HexGrid deployment.
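
Before running the examples, it can help to confirm that the three variables above are set. The following is a minimal sketch for a POSIX shell, not part of any HexGrid tooling: check_llama_env is a hypothetical helper name, and the commented-out call assumes your deployment also exposes the OpenAI-compatible model-listing route (GET /models under the base URL) that vLLM serves, which you should verify for your setup.

```shell
# Hypothetical helper: report which of the required variables are unset or empty.
# Returns 0 when all are set, 1 otherwise.
check_llama_env() {
  missing=0
  for name in HEXGRID_API_KEY LLAMA_BASE_URL LLAMA_MODEL; do
    # Look up the variable whose name is stored in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Assumption: the deployment exposes the model-listing route.
# The call is commented out because it performs a network request.
# check_llama_env && curl -s "$LLAMA_BASE_URL/models" \
#   -H "Authorization: Bearer $HEXGRID_API_KEY"
```

If the listing route is available, the model IDs it returns are the values to use for LLAMA_MODEL.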


POST /v1/chat/completions

Create a chat completion

Generate a normal non-streaming response from a Llama 3.1 or Llama 3.3 model served by vLLM.

Required attributes

  • model (string): The served Llama model name, for example meta-llama/Llama-3.3-70B-Instruct.
  • messages (array): The conversation so far. Each message has a role and content. Common roles are system, user, assistant, and tool.

Optional attributes used here

  • temperature (number): Sampling temperature. Higher values produce more varied output; lower values produce more deterministic output.
  • top_p (number): Nucleus sampling probability threshold.
  • max_tokens (integer): Maximum number of tokens to generate.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Give me a one-sentence explanation of what Llama is."
      }
    ],
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 256
  }'

Response shape

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1760000000,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Llama is Meta’s family of open large language models designed for tasks such as chat, reasoning, coding, multilingual generation, and tool-using applications."
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 31,
    "completion_tokens": 29,
    "total_tokens": 60
  }
}
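
The request above splices $LLAMA_MODEL into the body by closing and reopening the single-quoted string. As an alternative, here is a sketch that builds the same body with an unquoted heredoc, so the variable expands in place, and hands it to curl on standard input with -d @-. build_chat_body is a hypothetical helper name. The sketch does no JSON escaping, so it assumes the prompt contains no double quotes, backslashes, or newlines.

```shell
# Hypothetical helper: print a Chat Completions request body.
# $1 is the user prompt. It is inserted verbatim, so it must not contain
# double quotes, backslashes, or newlines (no JSON escaping is done here).
build_chat_body() {
  cat <<EOF
{
  "model": "$LLAMA_MODEL",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "$1"}
  ],
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 256
}
EOF
}

# Usage (commented out because it performs a network request):
# build_chat_body "Give me a one-sentence explanation of what Llama is." |
#   curl -X POST "$LLAMA_BASE_URL/chat/completions" \
#     -H "Authorization: Bearer $HEXGRID_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d @-
```

For prompts that may contain arbitrary text, a JSON-aware tool is a safer way to build the body than string interpolation.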


Reasoning-style chat completion

Unlike Qwen or Gemma reasoning models, Llama 3.1 and Llama 3.3 are not exposed by vLLM with a dedicated, parsed reasoning field.

Use normal Chat Completions and instruct the model to reason carefully, then return a concise final answer.

Required attributes

  • model (string): The served Llama model name.
  • messages (array): The conversation so far, here a system instruction followed by the user question.

Optional attributes used here

  • temperature (number): Use a lower value for more deterministic reasoning-style answers.
  • max_tokens (integer): Increase this value for harder reasoning tasks.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a careful reasoning assistant. Solve the problem step by step internally, then provide a concise final answer."
      },
      {
        "role": "user",
        "content": "Which is greater, 9.11 or 9.8? Explain briefly."
      }
    ],
    "temperature": 0.3,
    "top_p": 0.9,
    "max_tokens": 512
  }'

Response shape

{
  "id": "chatcmpl-reasoning-style123",
  "object": "chat.completion",
  "created": 1760000001,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "9.8 is greater than 9.11. Both numbers start with 9, but 9.8 has 8 in the tenths place while 9.11 has 1 in the tenths place."
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 47,
    "completion_tokens": 42,
    "total_tokens": 89
  }
}
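
When max_tokens is too low for a harder task, an OpenAI-compatible server typically reports a finish_reason of "length" rather than "stop", which means the answer was cut off. The sketch below shows one way to branch on that field. explain_finish_reason is a hypothetical helper name, and it covers only the values shown on this page plus "length".

```shell
# Hypothetical helper: describe a finish_reason value taken from a response.
explain_finish_reason() {
  case "$1" in
    stop)       echo "complete: the model finished its answer" ;;
    length)     echo "truncated: max_tokens was reached, consider raising it" ;;
    tool_calls) echo "tool call: run the requested tool and send back the result" ;;
    *)          echo "unrecognized finish_reason: $1" ;;
  esac
}

explain_finish_reason length
```

A caller could retry with a larger max_tokens when the result is "length" instead of showing a partial answer.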


Streaming

Stream tokens as Server-Sent Events instead of waiting for the complete response.

Streaming attributes

  • stream (boolean): Set to true to return incremental chunks.
  • stream_options (object): Optional streaming configuration. { "include_usage": true } requests usage in the final stream chunk.
  • max_tokens (integer): Maximum number of tokens to generate.

Request

curl -N -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Explain Llama 3.3 in three short bullet points."
      }
    ],
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 256
  }'

Sample streamed response shape

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{"content":"- Llama 3.3 is a 70B instruction-tuned text model from Meta."},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{"content":"\n- It is optimized for multilingual dialogue and general assistant use cases."},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{"content":"\n- It can be served through vLLM using an OpenAI-compatible API."},"finish_reason":null}],"usage":null}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":null}

data: {"id":"chatcmpl-stream123","object":"chat.completion.chunk","created":1760000002,"model":"meta-llama/Llama-3.3-70B-Instruct","choices":[],"usage":{"prompt_tokens":29,"completion_tokens":44,"total_tokens":73}}

data: [DONE]
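
The stream above is a sequence of "data: " lines separated by blank lines and terminated by a [DONE] sentinel. Here is a minimal sketch of reading it in a POSIX shell: it strips the "data: " prefix, prints one JSON chunk per line, and stops at the sentinel. read_sse_chunks is a hypothetical helper name, and the sample input is shortened placeholder data rather than real server output.

```shell
# Hypothetical helper: read an SSE stream on stdin and print one JSON chunk
# per line, stopping at the [DONE] sentinel.
read_sse_chunks() {
  while IFS= read -r line; do
    line=${line%"$(printf '\r')"}   # tolerate CRLF line endings
    case "$line" in
      "data: [DONE]") break ;;
      "data: "*)      printf '%s\n' "${line#data: }" ;;
      *)              ;;            # skip blank separators and other fields
    esac
  done
}

# Demo with placeholder chunks; prints {"a":1} then {"b":2}.
printf 'data: {"a":1}\n\ndata: {"b":2}\n\ndata: [DONE]\n' | read_sse_chunks
```

With a live request, the same function can be placed after the curl -N command in a pipeline. Each printed line is a complete JSON chunk, so a JSON-aware tool can then extract choices[0].delta.content. With include_usage enabled, the last chunk has an empty choices array and carries only usage.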


Tool calling

Provide function tools that the model may call. The model can return tool_calls instead of a final text answer.

HexGrid-hosted Llama models have tool calling enabled by default.

Tool attributes

  • tools (array): Tool definitions. Use type: "function" with a JSON schema for parameters.
  • tool_choice (string | object): Controls tool use. Use "auto" to let the model decide, "none" to disable tool calls, "required" to force at least one tool call, or an object to force a named function.
  • parallel_tool_calls (boolean): Set to false if you want at most one tool call in a single response.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant. Use tools when needed."
      },
      {
        "role": "user",
        "content": "What is the weather in Tokyo today?"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather for a city or district.",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City or district, such as Tokyo, San Francisco, or Bengaluru."
              },
              "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit."
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "auto",
    "parallel_tool_calls": false,
    "temperature": 0.2,
    "max_tokens": 512
  }'

Response shape

{
  "id": "chatcmpl-tool123",
  "object": "chat.completion",
  "created": 1760000003,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 118,
    "completion_tokens": 24,
    "total_tokens": 142
  }
}


Required tool calling

Force the model to call at least one tool by setting tool_choice to "required".

Tool attributes

  • tool_choice (string): Set to "required" to force at least one tool call.
  • tools (array): Tool definitions available to the model.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "user",
        "content": "Find the current weather for Tokyo."
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather for a city or district.",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City or district."
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": "required",
    "parallel_tool_calls": false,
    "temperature": 0.2,
    "max_tokens": 512
  }'

Response shape

{
  "id": "chatcmpl-required-tool123",
  "object": "chat.completion",
  "created": 1760000004,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_weather_001",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location\":\"Tokyo\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 92,
    "completion_tokens": 20,
    "total_tokens": 112
  }
}


Named tool calling

Force a specific tool by passing an object to tool_choice.

Tool choice object

  • tool_choice (object): Use { "type": "function", "function": { "name": "..." } } to force a specific tool.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "user",
        "content": "Use the weather tool to check Tokyo."
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather for a city or district.",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string"
              },
              "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"]
              }
            },
            "required": ["location"]
          }
        }
      }
    ],
    "tool_choice": {
      "type": "function",
      "function": {
        "name": "get_current_weather"
      }
    },
    "parallel_tool_calls": false,
    "temperature": 0.2,
    "max_tokens": 512
  }'

Response shape

{
  "id": "chatcmpl-named-tool123",
  "object": "chat.completion",
  "created": 1760000005,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_weather_002",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 105,
    "completion_tokens": 22,
    "total_tokens": 127
  }
}


Tool result

After your application executes the selected tool, send the tool result back so Llama can produce a final answer.

Tool result message

  • role (string): Set to "tool" for OpenAI-compatible tool result messages.
  • tool_call_id (string): The id returned by the assistant message’s tool_calls item.
  • content (string): The tool result. If the result is structured data, serialize it as a JSON string.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "user",
        "content": "What is the weather in Tokyo today?"
      },
      {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\"location\":\"Tokyo\",\"unit\":\"celsius\"}"
            }
          }
        ]
      },
      {
        "role": "tool",
        "tool_call_id": "call_abc123",
        "content": "{\"location\":\"Tokyo\",\"temperature\":22,\"condition\":\"Partly cloudy\",\"unit\":\"celsius\"}"
      }
    ],
    "max_tokens": 1024
  }'

Response shape

{
  "id": "chatcmpl-tool-result123",
  "object": "chat.completion",
  "created": 1760000006,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The weather in Tokyo today is partly cloudy, with a temperature of 22°C."
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 156,
    "completion_tokens": 20,
    "total_tokens": 176
  }
}
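
The step in which your application executes the selected tool can be sketched as a small dispatcher keyed on the function name from tool_calls. This is an illustration only: run_tool is a hypothetical helper name, and the weather data is a canned value copied from the request example above. A real tool would parse the arguments string and call an actual weather service.

```shell
# Hypothetical dispatcher: $1 is the function name from tool_calls,
# $2 is its JSON-encoded arguments string (unused by this stub).
run_tool() {
  case "$1" in
    get_current_weather)
      # Canned result for illustration; a real tool would parse "$2".
      echo '{"location":"Tokyo","temperature":22,"condition":"Partly cloudy","unit":"celsius"}'
      ;;
    *)
      # Emit an error payload the model can read instead of failing silently.
      echo "{\"error\":\"unknown tool: $1\"}"
      return 1
      ;;
  esac
}

run_tool get_current_weather '{"location":"Tokyo","unit":"celsius"}'
```

The printed JSON becomes the content of the role "tool" message, serialized as a JSON string, with tool_call_id copied from the assistant's tool call.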


JSON mode

Request valid JSON output using vLLM’s OpenAI-compatible response_format.

JSON attributes

  • response_format (object): Set to { "type": "json_object" } to request JSON object output.
  • messages (array): Include an explicit instruction to return JSON in the system or user message.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant. Return only valid JSON."
      },
      {
        "role": "user",
        "content": "Create a JSON object with three short product tagline ideas for a note-taking app."
      }
    ],
    "response_format": {
      "type": "json_object"
    },
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 512
  }'

Response shape

{
  "id": "chatcmpl-json123",
  "object": "chat.completion",
  "created": 1760000007,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\"taglines\":[\"Capture ideas before they disappear.\",\"Your thoughts, organized instantly.\",\"Notes that keep up with you.\"]}"
      },
      "finish_reason": "stop",
      "logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 48,
    "completion_tokens": 36,
    "total_tokens": 84
  }
}


JSON schema

Request output that follows a JSON schema using vLLM’s response_format with type: "json_schema".

JSON schema attributes

  • response_format (object): Output-format constraint. Set type to "json_schema" and supply a json_schema object.
  • response_format.json_schema (object): Nested inside response_format. Holds a name and the schema that the response should follow.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "Return only JSON that matches the provided schema."
      },
      {
        "role": "user",
        "content": "Create three short product tagline ideas for a note-taking app."
      }
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "tagline_response",
        "schema": {
          "type": "object",
          "properties": {
            "taglines": {
              "type": "array",
              "items": {
                "type": "string"
              },
              "minItems": 3,
              "maxItems": 3
            }
          },
          "required": ["taglines"],
          "additionalProperties": false
        }
      }
    },
    "temperature": 0.7,
    "top_p": 0.9,
    "max_tokens": 512
  }'

Response shape

{
  "id": "chatcmpl-schema123",
  "object": "chat.completion",
  "created": 1760000008,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "{\"taglines\":[\"Capture ideas before they disappear.\",\"Your thoughts, organized instantly.\",\"Notes that keep up with you.\"]}"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 86,
    "completion_tokens": 36,
    "total_tokens": 122
  }
}


Multiple vLLM parameters

Use this example when you need a broader set of vLLM generation controls.

For raw HTTP/cURL, vLLM-specific parameters can be merged directly into the JSON request body.

Parameters shown here

  • temperature (number): Sampling temperature.
  • top_p (number): Nucleus sampling probability threshold.
  • top_k (integer): vLLM-specific. Limits sampling to the k most likely tokens.
  • min_p (number): vLLM-specific. Minimum probability, relative to the most likely token, that a token needs in order to be considered.
  • repetition_penalty (number): vLLM-specific. Values above 1 discourage repeating tokens that already appear in the prompt or the output.
  • presence_penalty (number): Penalizes tokens based on whether they already appeared.
  • frequency_penalty (number): Penalizes tokens based on how frequently they appeared.
  • seed (integer): Random seed for sampling. Determinism is best-effort.
  • n (integer): Number of candidate responses to generate.
  • logprobs (boolean): Whether to return token log probabilities if supported by the serving configuration.
  • top_logprobs (integer): Number of top candidate tokens to return when logprobs is enabled.

Request

curl -X POST "$LLAMA_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$LLAMA_MODEL"'",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant. Keep the answer concise."
      },
      {
        "role": "user",
        "content": "Give me five naming ideas for an AI inference platform."
      }
    ],
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "min_p": 0.0,
    "repetition_penalty": 1.05,
    "presence_penalty": 0.2,
    "frequency_penalty": 0.2,
    "max_tokens": 512,
    "seed": 1234,
    "n": 1,
    "logprobs": true,
    "top_logprobs": 2,
    "stream": false
  }'

Response shape

{
  "id": "chatcmpl-params123",
  "object": "chat.completion",
  "created": 1760000009,
  "model": "meta-llama/Llama-3.3-70B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "1. HexGrid Inference\n2. ModelForge\n3. TensorRoute\n4. InferaCloud\n5. LatticeAI"
      },
      "finish_reason": "stop",
      "logprobs": {
        "content": [
          {
            "token": "1",
            "logprob": -0.02,
            "bytes": [49],
            "top_logprobs": [
              {
                "token": "1",
                "logprob": -0.02,
                "bytes": [49]
              },
              {
                "token": "-",
                "logprob": -4.1,
                "bytes": [45]
              }
            ]
          }
        ]
      }
    }
  ],
  "usage": {
    "prompt_tokens": 38,
    "completion_tokens": 34,
    "total_tokens": 72
  }
}
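
The logprob values in the response are natural logarithms of token probabilities, so exponentiating one recovers the probability. A small sketch using awk, which is assumed to be available as on most POSIX systems. logprob_to_prob is a hypothetical helper name, and the inputs are the two values from the sample response above.

```shell
# Hypothetical helper: convert a natural-log probability to a probability,
# rounded to four decimal places.
logprob_to_prob() {
  awk -v lp="$1" 'BEGIN { printf "%.4f\n", exp(lp) }'
}

logprob_to_prob -0.02   # prints 0.9802
logprob_to_prob -4.1    # prints 0.0166
```

In the sample, the chosen token "1" therefore had a probability of about 98%, and the runner-up "-" about 1.7%.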
