Qwen3 Embeddings API - vLLM engine

Use vLLM’s OpenAI-compatible Embeddings endpoint to call HexGrid-hosted Qwen3 Embedding models and convert text into dense vector representations.

This page provides copy-pasteable cURL examples for single-text embeddings, batch embeddings, retrieval-query embeddings with instructions, document embeddings, custom output dimensions, truncation controls, token-ID input, an end-to-end retrieval pipeline, and a request that combines several parameters.

Endpoint

POST http://<server-ip>:<port>/v1/embeddings

Set these environment variables before running the examples:

export HEXGRID_API_KEY="your-hexgrid-api-key"
export QWEN_EMBED_BASE_URL="http://<server-ip>:<port>/v1"
export QWEN_EMBED_MODEL="Qwen/Qwen3-Embedding-4B"

You can replace Qwen/Qwen3-Embedding-4B with another HexGrid-hosted Qwen3 Embedding model, such as:

Qwen/Qwen3-Embedding-0.6B
Qwen/Qwen3-Embedding-4B
Qwen/Qwen3-Embedding-8B

Use the exact model ID configured in your HexGrid deployment.

Qwen3 Embedding model dimensions

Model	Context length	Default / maximum embedding dimension	MRL support	Instruction aware
`Qwen/Qwen3-Embedding-0.6B`	32K	1024	Yes	Yes
`Qwen/Qwen3-Embedding-4B`	32K	2560	Yes	Yes
`Qwen/Qwen3-Embedding-8B`	32K	4096	Yes	Yes

Qwen3 Embedding models support Matryoshka Representation Learning, so you can request smaller output dimensions with the dimensions field when supported by your deployment.
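
If a deployment does not accept the dimensions field, a common MRL fallback is to shorten the full vector on the client. The Python sketch below keeps the first N values and rescales them to unit length. It assumes that truncate-then-normalize is acceptable for your use case; truncate_embedding is a hypothetical helper, and the page does not confirm that this matches what the server produces when dimensions is set.

```python
import math

def truncate_embedding(vector, dims):
    """Keep the first `dims` values and rescale them to unit L2 norm."""
    if not 0 < dims <= len(vector):
        raise ValueError(f"dims must be between 1 and {len(vector)}")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale for an all-zero prefix
    return [x / norm for x in head]

# Toy 4-value vector (made up for illustration, not real model output).
full = [3.0, 4.0, 12.0, 0.5]
small = truncate_embedding(full, 2)
print(small)  # the first two components, rescaled to length 1
```

Apply the same target size to queries and documents so the resulting vectors stay comparable.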

Create an embedding

Generate one embedding vector for a single input string.

Required attributes

Name	Type	Description
`model`	string	The served Qwen3 Embedding model name, for example `Qwen/Qwen3-Embedding-4B`.
`input`	string | array	Text to embed. For a single embedding, pass a string.

Optional attributes used here

Name	Type	Description
`encoding_format`	string	Output encoding format. Use "float" to return an array of floating-point values.

Request

POST /v1/embeddings

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": "Qwen3 Embedding models convert text into dense vectors for search, retrieval, clustering, and classification.",
    "encoding_format": "float"
  }'

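Response shape

The embedding arrays in the sample responses on this page are shortened to three values for readability. A real response contains one value per output dimension, for example 2560 values by default for Qwen/Qwen3-Embedding-4B.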

{
  "id": "embd-abc123",
  "object": "list",
  "created": 1760000000,
  "model": "Qwen/Qwen3-Embedding-4B",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        -0.01234,
        0.04567,
        0.00891
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 21,
    "total_tokens": 21,
    "completion_tokens": 0
  }
}
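
For applications that call the endpoint from code rather than a shell, the same request can be sketched with Python's standard library. This is a minimal sketch that only builds the request object; build_embedding_request and first_embedding are hypothetical helper names, and the fallback URL http://localhost:8000/v1 is a placeholder, not a value from this page.

```python
import json
import os
import urllib.request

def build_embedding_request(base_url, api_key, model, text):
    """Build (but do not send) a POST request for the embeddings endpoint."""
    body = json.dumps({
        "model": model,
        "input": text,
        "encoding_format": "float",
    }).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

def first_embedding(response_json):
    """Return the vector of the first item in a parsed response body."""
    return response_json["data"][0]["embedding"]

# Reads the environment variables defined earlier; defaults are placeholders.
req = build_embedding_request(
    os.environ.get("QWEN_EMBED_BASE_URL", "http://localhost:8000/v1"),
    os.environ.get("HEXGRID_API_KEY", "your-hexgrid-api-key"),
    os.environ.get("QWEN_EMBED_MODEL", "Qwen/Qwen3-Embedding-4B"),
    "Qwen3 Embedding models convert text into dense vectors.",
)
# To send it: with urllib.request.urlopen(req) as resp: data = json.load(resp)
```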

Batch embeddings

Generate embeddings for multiple strings in one request.

Required attributes

Name	Type	Description
`model`	string	The served Qwen3 Embedding model name.
`input`	array	Array of strings to embed. The response returns one embedding object per input item.

Optional attributes used here

Name	Type	Description
`encoding_format`	string	Use "float" to return floating-point vectors.

Request

POST /v1/embeddings

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other.",
      "Qwen3 Embedding supports multilingual and code retrieval tasks."
    ],
    "encoding_format": "float"
  }'

Response shape

{
  "id": "embd-batch123",
  "object": "list",
  "created": 1760000001,
  "model": "Qwen/Qwen3-Embedding-4B",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [-0.0101, 0.0202, 0.0303]
    },
    {
      "index": 1,
      "object": "embedding",
      "embedding": [0.0404, -0.0505, 0.0606]
    },
    {
      "index": 2,
      "object": "embedding",
      "embedding": [0.0707, 0.0808, -0.0909]
    }
  ],
  "usage": {
    "prompt_tokens": 31,
    "total_tokens": 31,
    "completion_tokens": 0
  }
}
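
Each item in data carries an index that identifies which input it belongs to. The Python sketch below pairs vectors with their inputs by that field instead of relying on list order. The helper name embeddings_in_input_order is hypothetical, and the sample dictionary reuses the shortened vectors from the response above.

```python
def embeddings_in_input_order(inputs, response_json):
    """Pair each input string with its vector, using the `index` field."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    if len(items) != len(inputs):
        raise ValueError("response item count does not match input count")
    return [(inputs[item["index"]], item["embedding"]) for item in items]

inputs = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies toward each other.",
    "Qwen3 Embedding supports multilingual and code retrieval tasks.",
]
# Items deliberately shuffled to show that the index field drives the pairing.
sample = {"data": [
    {"index": 2, "object": "embedding", "embedding": [0.0707, 0.0808, -0.0909]},
    {"index": 0, "object": "embedding", "embedding": [-0.0101, 0.0202, 0.0303]},
    {"index": 1, "object": "embedding", "embedding": [0.0404, -0.0505, 0.0606]},
]}
pairs = embeddings_in_input_order(inputs, sample)
```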

Retrieval query embedding with instruction

Qwen recommends adding an instruction to retrieval queries. The official Qwen3 Embedding format is:

Instruct: <task description>
Query:<query text>

Use this pattern for search queries, user questions, and other inputs that should retrieve matching documents.

Required attributes

Name	Type	Description
`model`	string	The served Qwen3 Embedding model name.
`input`	string	Instruction-aware retrieval query text.

Recommended query format

Name	Type	Description
`Instruct`	string	One-sentence task description.
`Query`	string	The actual user query.
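
When queries come from user input, the instruction prefix is usually assembled in code. The Python sketch below follows the two-line template above literally, with no space after Query:, and shows that JSON serialization turns the line break into the \n escape seen in the cURL body. Both function names are hypothetical.

```python
import json

def build_instructed_query(task, query):
    """Format a retrieval query using the Instruct/Query template."""
    return f"Instruct: {task}\nQuery:{query}"

def build_payload(model, text):
    """Serialize the request body; json.dumps escapes the newline as \\n."""
    return json.dumps({"model": model, "input": text, "encoding_format": "float"})

task = "Given a web search query, retrieve relevant passages that answer the query"
text = build_instructed_query(task, "What is the capital of China?")
payload = build_payload("Qwen/Qwen3-Embedding-4B", text)
print(payload)
```

Only queries get this prefix; documents are embedded as raw text.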

Request

POST /v1/embeddings

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: What is the capital of China?",
    "encoding_format": "float"
  }'

Response shape

{
  "id": "embd-query123",
  "object": "list",
  "created": 1760000002,
  "model": "Qwen/Qwen3-Embedding-4B",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        0.02145,
        -0.03456,
        0.07891
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 22,
    "total_tokens": 22,
    "completion_tokens": 0
  }
}

Document embeddings for retrieval

For retrieval documents, Qwen’s official examples do not add the query instruction. Embed the document text directly.

Required attributes

Name	Type	Description
`model`	string	The served Qwen3 Embedding model name.
`input`	array	Document strings to embed.

Request

POST /v1/embeddings

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ],
    "encoding_format": "float"
  }'

Response shape

{
  "id": "embd-docs123",
  "object": "list",
  "created": 1760000003,
  "model": "Qwen/Qwen3-Embedding-4B",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        0.0521,
        -0.0142,
        0.0098
      ]
    },
    {
      "index": 1,
      "object": "embedding",
      "embedding": [
        -0.0063,
        0.0317,
        0.0444
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 34,
    "total_tokens": 34,
    "completion_tokens": 0
  }
}

Embedding with custom dimensions

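Qwen3 Embedding models are trained with Matryoshka Representation Learning (MRL), so the leading components of each vector remain useful on their own and you can request an output dimension smaller than the model default.

Smaller vectors reduce storage and memory use and speed up indexing and similarity search, usually at some cost in retrieval quality.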
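
To put a number on the saving, the arithmetic can be sketched in Python. This assumes vectors are stored as 4-byte float32 values with no index overhead or compression, which may not match your vector store; index_size_bytes is a hypothetical helper and the corpus size of one million vectors is only an example.

```python
BYTES_PER_FLOAT32 = 4  # assumption: raw float32 storage, no compression

def index_size_bytes(num_vectors, dims, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of the stored vectors, ignoring any index overhead."""
    return num_vectors * dims * bytes_per_value

full = index_size_bytes(1_000_000, 2560)   # default size for the 4B model
small = index_size_bytes(1_000_000, 1024)  # reduced size used on this page
print(f"{full:,} bytes at 2560 dims, {small:,} bytes at 1024 dims")
```

Under these assumptions, one million vectors take about 10.24 GB at 2560 dimensions and about 4.1 GB at 1024.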

Dimension limits

Name	Type	Description
`dimensions`	integer	Requested output vector dimension.
`Qwen/Qwen3-Embedding-0.6B`	integer	Up to 1024.
`Qwen/Qwen3-Embedding-4B`	integer	Up to 2560.
`Qwen/Qwen3-Embedding-8B`	integer	Up to 4096.
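
A client can check a requested size against these limits before sending the request. The Python sketch below hard-codes the maxima listed above; check_dimensions is a hypothetical helper, and the lower bound of 1 is a placeholder, since this page does not state a minimum.

```python
# Maximum output dimensions, copied from the limits listed on this page.
MAX_DIMENSIONS = {
    "Qwen/Qwen3-Embedding-0.6B": 1024,
    "Qwen/Qwen3-Embedding-4B": 2560,
    "Qwen/Qwen3-Embedding-8B": 4096,
}

def check_dimensions(model, dimensions):
    """Return `dimensions` if it is within the model's limit, else raise."""
    limit = MAX_DIMENSIONS.get(model)
    if limit is None:
        raise KeyError(f"no dimension limit known for model {model!r}")
    # Lower bound of 1 is an assumption; the page gives only the maximum.
    if not 1 <= dimensions <= limit:
        raise ValueError(f"{model} supports at most {limit} dimensions")
    return dimensions

dims = check_dimensions("Qwen/Qwen3-Embedding-4B", 1024)
```

If your deployment serves models under different IDs, the dictionary keys need to match those IDs.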

Request

POST /v1/embeddings

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": "Embed this text into a smaller vector for retrieval indexing.",
    "encoding_format": "float",
    "dimensions": 1024
  }'

Response shape

{
  "id": "embd-dim123",
  "object": "list",
  "created": 1760000004,
  "model": "Qwen/Qwen3-Embedding-4B",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        -0.0182,
        0.0274,
        0.0411
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 12,
    "completion_tokens": 0
  }
}

Embeddings with truncation controls

Use truncation controls when inputs may exceed your deployment’s token limit.

Qwen3 Embedding models support long context, but your HexGrid deployment may enforce a lower maximum input length depending on model size, GPU capacity, and server configuration.

Truncation attributes

Name	Type	Description
`truncate_prompt_tokens`	integer	Maximum number of prompt tokens to keep. Omit it to disable explicit truncation. In vLLM, -1 truncates to the model's maximum length rather than turning truncation off; confirm the behavior for your vLLM version.
`truncation_side`	string	Which side to truncate from when truncate_prompt_tokens is active. Use "right" to keep the first N tokens, or "left" to keep the last N tokens.
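
The effect of the two truncation_side values can be pictured with a plain-Python sketch. It only illustrates which tokens survive, following the description above, and is not vLLM's implementation; truncate_tokens is a hypothetical helper.

```python
def truncate_tokens(token_ids, max_tokens, side="right"):
    """Illustrate which tokens are kept for each truncation side."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    if len(token_ids) <= max_tokens:
        return list(token_ids)  # short inputs pass through unchanged
    if side == "right":
        return list(token_ids[:max_tokens])   # drop the tail, keep the first N
    return list(token_ids[-max_tokens:])      # drop the head, keep the last N

tokens = [3838, 374, 279, 6864, 315, 5736, 30]
kept_right = truncate_tokens(tokens, 3, side="right")
kept_left = truncate_tokens(tokens, 3, side="left")
```

For documents whose key content comes first, "right" is usually the natural choice; "left" suits inputs where the most recent text matters most.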

Request

POST /v1/embeddings

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": "This is a long document that may need to be truncated before embedding. Replace this string with your full document text.",
    "encoding_format": "float",
    "truncate_prompt_tokens": 8192,
    "truncation_side": "right"
  }'

Response shape

{
  "id": "embd-truncate123",
  "object": "list",
  "created": 1760000005,
  "model": "Qwen/Qwen3-Embedding-4B",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        0.0111,
        -0.0222,
        0.0333
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "total_tokens": 23,
    "completion_tokens": 0
  }
}

Token-ID input

vLLM’s OpenAI-compatible Embeddings API accepts token IDs as input.

Use this only if your application already tokenizes text with the same tokenizer used by the served Qwen3 Embedding model.

Token input attributes

Name	Type	Description
`input`	array	Token IDs for one input, or an array of token-ID arrays for multiple inputs.
`encoding_format`	string	Use "float" to return floating-point vectors.

Request

POST /v1/embeddings

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": [3838, 374, 279, 6864, 315, 5736, 30],
    "encoding_format": "float"
  }'

Response shape

{
  "id": "embd-token123",
  "object": "list",
  "created": 1760000006,
  "model": "Qwen/Qwen3-Embedding-4B",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        -0.0155,
        0.0266,
        0.0377
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 7,
    "completion_tokens": 0
  }
}

Retrieval pipeline example

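A typical retrieval pipeline embeds the user query with an instruction and embeds documents without the query instruction. Request the same dimensions value for both so the vectors are comparable.

The API returns only vectors. Your application computes the cosine similarity or dot product between the query vector and each document vector, then ranks the documents by that score.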

Query input

Name	Type	Description
`input`	string	Use `Instruct: ...\nQuery: ...` for the query.

Document input

Name	Type	Description
`input`	array	Use raw document text for the candidate documents.

Query embedding request

POST /v1/embeddings

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: Explain gravity",
    "encoding_format": "float",
    "dimensions": 1024
  }'

Document embedding request

POST /v1/embeddings

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ],
    "encoding_format": "float",
    "dimensions": 1024
  }'

Similarity result computed by your application

{
  "query": "Explain gravity",
  "top_match": {
    "document_index": 1,
    "document": "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
    "similarity": 0.624
  }
}
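
The similarity step runs in your application, not on the server. A minimal Python sketch of cosine-similarity ranking is shown below. The vectors are made-up three-value toys rather than real model output, and cosine_similarity and rank_documents are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (document_index, score) pairs, best match first."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for the query and the two documents.
query_vec = [0.1, 0.9, 0.0]
doc_vecs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
ranking = rank_documents(query_vec, doc_vecs)
best_index, best_score = ranking[0]
```

For large document sets, a vector database or approximate nearest-neighbor index would normally replace this linear scan.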

Multiple embedding parameters

Use this example when you need a broader set of vLLM embedding controls.

Parameters shown here

Name	Type	Description
`model`	string	The served Qwen3 Embedding model name.
`input`	string | array	Text, array of texts, token IDs, or array of token-ID arrays.
`encoding_format`	string	Output format for embeddings. Use "float" for numeric vectors.
`dimensions`	integer	Requested output vector size for Matryoshka-capable models.
`truncate_prompt_tokens`	integer	Maximum prompt tokens to keep before embedding.
`truncation_side`	string	Use "right" to keep the first N tokens or "left" to keep the last N tokens.
`user`	string	Optional end-user identifier. vLLM accepts this OpenAI-compatible field.

Request

POST /v1/embeddings

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": [
      "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: What is vLLM?",
      "vLLM is an inference and serving engine that provides OpenAI-compatible APIs for language and embedding models."
    ],
    "encoding_format": "float",
    "dimensions": 1024,
    "truncate_prompt_tokens": 8192,
    "truncation_side": "right",
    "user": "user_123"
  }'

Response shape

{
  "id": "embd-params123",
  "object": "list",
  "created": 1760000007,
  "model": "Qwen/Qwen3-Embedding-4B",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        0.0123,
        -0.0456,
        0.0789
      ]
    },
    {
      "index": 1,
      "object": "embedding",
      "embedding": [
        -0.0234,
        0.0567,
        -0.0891
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "total_tokens": 32,
    "completion_tokens": 0
  }
}

Official sources

vLLM OpenAI-compatible server docs: https://docs.vllm.ai/en/latest/serving/online_serving/openai_compatible_server/
vLLM embedding usage docs: https://docs.vllm.ai/en/latest/models/pooling_models/embed/
Qwen3 Embedding official GitHub repository: https://github.com/QwenLM/Qwen3-Embedding
Qwen/Qwen3-Embedding-0.6B model card: https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
Qwen/Qwen3-Embedding-4B model card: https://huggingface.co/Qwen/Qwen3-Embedding-4B
Qwen/Qwen3-Embedding-8B model card: https://huggingface.co/Qwen/Qwen3-Embedding-8B