Qwen3 Embeddings API - vLLM engine
Use vLLM’s OpenAI-compatible Embeddings endpoint to call HexGrid-hosted Qwen3 Embedding models and convert text into dense vector representations.
This page provides copy-pasteable cURL examples for single-text embeddings, batch embeddings, retrieval-query embeddings with instructions, document embeddings, custom output dimensions, truncation controls, and token-ID input.
Endpoint
POST http://<server-ip>:<port>/v1/embeddings
Set these environment variables before running the examples:
export HEXGRID_API_KEY="your-hexgrid-api-key"
export QWEN_EMBED_BASE_URL="http://<server-ip>:<port>/v1"
export QWEN_EMBED_MODEL="Qwen/Qwen3-Embedding-4B"
You can set QWEN_EMBED_MODEL to any HexGrid-hosted Qwen3 Embedding model, such as:
Qwen/Qwen3-Embedding-0.6B
Qwen/Qwen3-Embedding-4B
Qwen/Qwen3-Embedding-8B
Use the exact model ID configured in your HexGrid deployment.
Qwen3 Embedding model dimensions
| Model | Context length | Default / maximum embedding dimension | MRL support | Instruction aware |
|---|---|---|---|---|
| Qwen/Qwen3-Embedding-0.6B | 32K | 1024 | Yes | Yes |
| Qwen/Qwen3-Embedding-4B | 32K | 2560 | Yes | Yes |
| Qwen/Qwen3-Embedding-8B | 32K | 4096 | Yes | Yes |
Qwen3 Embedding models support Matryoshka Representation Learning, so you can request smaller output dimensions with the dimensions field when supported by your deployment.
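As a rough illustration of the table above, a client could validate a requested dimensions value before sending it. The helper name check_dimensions and the lookup table are hypothetical and not part of vLLM or HexGrid; the limits are copied from the table, and only the upper bound is checked.

```python
# Maximum embedding dimension per model, copied from the table above.
MAX_DIMENSIONS = {
    "Qwen/Qwen3-Embedding-0.6B": 1024,
    "Qwen/Qwen3-Embedding-4B": 2560,
    "Qwen/Qwen3-Embedding-8B": 4096,
}


def check_dimensions(model: str, dimensions: int) -> int:
    """Return dimensions unchanged if it fits the model's maximum, else raise."""
    if model not in MAX_DIMENSIONS:
        raise ValueError(f"unknown model: {model}")
    maximum = MAX_DIMENSIONS[model]
    if not 1 <= dimensions <= maximum:
        raise ValueError(f"dimensions must be between 1 and {maximum} for {model}")
    return dimensions
```

A deployment may impose stricter limits than this sketch checks, so treat a passing check as necessary rather than sufficient.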
Create an embedding
Generate one embedding vector for a single input string.
Required attributes
- model (string): The served Qwen3 Embedding model name, for example Qwen/Qwen3-Embedding-4B.
- input (string | array): Text to embed. For a single embedding, pass a string.
Optional attributes used here
- encoding_format (string): Output encoding format. Use "float" to return an array of floating-point values.
Request
curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_EMBED_MODEL"'",
"input": "Qwen3 Embedding models convert text into dense vectors for search, retrieval, clustering, and classification.",
"encoding_format": "float"
}'
Response shape
{
"id": "embd-abc123",
"object": "list",
"created": 1760000000,
"model": "Qwen/Qwen3-Embedding-4B",
"data": [
{
"index": 0,
"object": "embedding",
"embedding": [
-0.01234,
0.04567,
0.00891
]
}
],
"usage": {
"prompt_tokens": 21,
"total_tokens": 21,
"completion_tokens": 0
}
}
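For readers calling the endpoint from Python instead of cURL, the same request could be sketched with the standard library. The function names build_embedding_request and first_embedding are hypothetical; the request is built but not sent, and the response parsing assumes the shape shown above.

```python
import json
import urllib.request


def build_embedding_request(base_url: str, api_key: str, model: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a single-string embedding."""
    payload = {"model": model, "input": text, "encoding_format": "float"}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def first_embedding(response_body: dict) -> list:
    """Return the vector of the first item in a parsed embeddings response."""
    return response_body["data"][0]["embedding"]
```

To send the request, pass it to urllib.request.urlopen, parse the body with json.load, and hand the result to first_embedding.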
Batch embeddings
Generate embeddings for multiple strings in one request.
Required attributes
- model (string): The served Qwen3 Embedding model name.
- input (array): Array of strings to embed. The response returns one embedding object per input item.
Optional attributes used here
- encoding_format (string): Use "float" to return floating-point vectors.
Request
curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_EMBED_MODEL"'",
"input": [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies toward each other.",
"Qwen3 Embedding supports multilingual and code retrieval tasks."
],
"encoding_format": "float"
}'
Response shape
{
"id": "embd-batch123",
"object": "list",
"created": 1760000001,
"model": "Qwen/Qwen3-Embedding-4B",
"data": [
{
"index": 0,
"object": "embedding",
"embedding": [-0.0101, 0.0202, 0.0303]
},
{
"index": 1,
"object": "embedding",
"embedding": [0.0404, -0.0505, 0.0606]
},
{
"index": 2,
"object": "embedding",
"embedding": [0.0707, 0.0808, -0.0909]
}
],
"usage": {
"prompt_tokens": 31,
"total_tokens": 31,
"completion_tokens": 0
}
}
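Because each item in data carries an index, a client can pair vectors with their inputs without relying on array order. The following is a minimal sketch assuming the response shape shown above; pair_inputs_with_vectors is a hypothetical helper name.

```python
def pair_inputs_with_vectors(inputs, response_body):
    """Return (text, vector) pairs, matching each data item to its input via "index"."""
    vectors = [None] * len(inputs)
    for item in response_body["data"]:
        vectors[item["index"]] = item["embedding"]
    if any(vector is None for vector in vectors):
        raise ValueError("response is missing embeddings for some inputs")
    return list(zip(inputs, vectors))
```

Matching on index keeps the pairing correct even if the items were ever returned in a different order than they were sent.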
Retrieval query embedding with instruction
Qwen recommends adding an instruction to retrieval queries. The official Qwen3 Embedding format is:
Instruct: <task description>
Query:<query text>
Use this pattern for search queries, user questions, and other inputs that should retrieve matching documents.
Required attributes
- model (string): The served Qwen3 Embedding model name.
- input (string): Instruction-aware retrieval query text.
Recommended query format
- Instruct (string): One-sentence task description.
- Query (string): The actual user query.
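A small helper can keep the query format consistent across an application. This sketch follows the two-line format given above literally (a newline between the parts, and the query text placed directly after "Query:"); format_query and build_query_payload are hypothetical names.

```python
import json


def format_query(task: str, query: str) -> str:
    """Combine a task description and a query into the instruction format."""
    return f"Instruct: {task}\nQuery:{query}"


def build_query_payload(model: str, task: str, query: str) -> str:
    """Return the JSON request body for an instruction-aware query embedding."""
    payload = {
        "model": model,
        "input": format_query(task, query),
        "encoding_format": "float",
    }
    # json.dumps escapes the newline as \n, as required inside a JSON string.
    return json.dumps(payload)
```

Only queries go through format_query; documents are embedded as raw text.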
Request
curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_EMBED_MODEL"'",
"input": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:What is the capital of China?",
"encoding_format": "float"
}'
Response shape
{
"id": "embd-query123",
"object": "list",
"created": 1760000002,
"model": "Qwen/Qwen3-Embedding-4B",
"data": [
{
"index": 0,
"object": "embedding",
"embedding": [
0.02145,
-0.03456,
0.07891
]
}
],
"usage": {
"prompt_tokens": 22,
"total_tokens": 22,
"completion_tokens": 0
}
}
Document embeddings for retrieval
For retrieval documents, Qwen’s official examples do not add the query instruction. Embed the document text directly.
Required attributes
- model (string): The served Qwen3 Embedding model name.
- input (array): Document strings to embed.
Request
curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_EMBED_MODEL"'",
"input": [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
],
"encoding_format": "float"
}'
Response shape
{
"id": "embd-docs123",
"object": "list",
"created": 1760000003,
"model": "Qwen/Qwen3-Embedding-4B",
"data": [
{
"index": 0,
"object": "embedding",
"embedding": [
0.0521,
-0.0142,
0.0098
]
},
{
"index": 1,
"object": "embedding",
"embedding": [
-0.0063,
0.0317,
0.0444
]
}
],
"usage": {
"prompt_tokens": 34,
"total_tokens": 34,
"completion_tokens": 0
}
}
Embedding with custom dimensions
Qwen3 Embedding models support Matryoshka Representation Learning, so you can request a smaller output dimension.
This is useful when you want smaller vectors that are cheaper to store and faster to index and search.
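To make the storage saving concrete, the raw size of a vector collection can be estimated as count × dimensions × bytes per value. The sketch below assumes vectors are stored as 4-byte float32 values and ignores any index overhead; index_size_bytes is a hypothetical helper name.

```python
def index_size_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw size of num_vectors vectors, assuming float32 storage by default."""
    return num_vectors * dimensions * bytes_per_value


# One million vectors at the 4B model's full size versus a reduced size.
full = index_size_bytes(1_000_000, 2560)     # 10,240,000,000 bytes
reduced = index_size_bytes(1_000_000, 1024)  # 4,096,000,000 bytes
```

Under these assumptions, dropping from 2560 to 1024 dimensions cuts raw vector storage by a factor of 2.5.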
Dimension limits
- dimensions (integer): Requested output vector dimension.
- Qwen/Qwen3-Embedding-0.6B (integer): Up to 1024.
- Qwen/Qwen3-Embedding-4B (integer): Up to 2560.
- Qwen/Qwen3-Embedding-8B (integer): Up to 4096.
Request
curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_EMBED_MODEL"'",
"input": "Embed this text into a smaller vector for retrieval indexing.",
"encoding_format": "float",
"dimensions": 1024
}'
Response shape
{
"id": "embd-dim123",
"object": "list",
"created": 1760000004,
"model": "Qwen/Qwen3-Embedding-4B",
"data": [
{
"index": 0,
"object": "embedding",
"embedding": [
-0.0182,
0.0274,
0.0411
]
}
],
"usage": {
"prompt_tokens": 12,
"total_tokens": 12,
"completion_tokens": 0
}
}
Embeddings with truncation controls
Use truncation controls when inputs may exceed your deployment’s token limit.
Qwen3 Embedding models support long context, but your HexGrid deployment may enforce a lower maximum input length depending on model size, GPU capacity, and server configuration.
Truncation attributes
- truncate_prompt_tokens (integer): Maximum number of prompt tokens to keep. Set it to -1 to truncate to the model’s maximum supported length, or omit it to leave truncation disabled.
- truncation_side (string): Which side to truncate from when truncate_prompt_tokens is active. Use "right" to keep the first N tokens, or "left" to keep the last N tokens.
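The effect of the two sides can be illustrated on a plain list of token IDs. This is a client-side sketch of the behavior described above, not vLLM's implementation; truncate_tokens is a hypothetical name, and the sketch only covers non-negative limits.

```python
def truncate_tokens(token_ids, max_tokens, side="right"):
    """Keep at most max_tokens IDs: "right" keeps the first N, "left" keeps the last N."""
    if max_tokens < 0:
        raise ValueError("this sketch only handles non-negative limits")
    if len(token_ids) <= max_tokens:
        return list(token_ids)
    if side == "right":
        # Truncate from the right: drop the tail, keep the head.
        return list(token_ids[:max_tokens])
    if side == "left":
        # Truncate from the left: drop the head, keep the tail.
        return list(token_ids[len(token_ids) - max_tokens:])
    raise ValueError('side must be "right" or "left"')
```

For documents whose key content is at the start, "right" is usually the safer choice; "left" suits inputs where the most recent text matters most.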
Request
curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_EMBED_MODEL"'",
"input": "This is a long document that may need to be truncated before embedding. Replace this string with your full document text.",
"encoding_format": "float",
"truncate_prompt_tokens": 8192,
"truncation_side": "right"
}'
Response shape
{
"id": "embd-truncate123",
"object": "list",
"created": 1760000005,
"model": "Qwen/Qwen3-Embedding-4B",
"data": [
{
"index": 0,
"object": "embedding",
"embedding": [
0.0111,
-0.0222,
0.0333
]
}
],
"usage": {
"prompt_tokens": 23,
"total_tokens": 23,
"completion_tokens": 0
}
}
Token-ID input
vLLM’s OpenAI-compatible Embeddings API accepts token IDs as input.
Use this only if your application already tokenizes text with the same tokenizer used by the served Qwen3 Embedding model.
Token input attributes
- input (array): Token IDs for one input, or an array of token-ID arrays for multiple inputs.
- encoding_format (string): Use "float" to return floating-point vectors.
Request
curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_EMBED_MODEL"'",
"input": [3838, 374, 279, 6864, 315, 5736, 30],
"encoding_format": "float"
}'
Response shape
{
"id": "embd-token123",
"object": "list",
"created": 1760000006,
"model": "Qwen/Qwen3-Embedding-4B",
"data": [
{
"index": 0,
"object": "embedding",
"embedding": [
-0.0155,
0.0266,
0.0377
]
}
],
"usage": {
"prompt_tokens": 7,
"total_tokens": 7,
"completion_tokens": 0
}
}
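The input field has now appeared in four shapes: a string, an array of strings, an array of token IDs, and an array of token-ID arrays. A client that accepts all four might classify the value before sending it, as in this sketch; describe_input and its return labels are hypothetical.

```python
def describe_input(value):
    """Name the shape of an embeddings "input" value, or raise ValueError."""

    def is_token(item):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(item, int) and not isinstance(item, bool)

    if isinstance(value, str):
        return "text"
    if not isinstance(value, list) or not value:
        raise ValueError("input must be a string or a non-empty list")
    if all(isinstance(item, str) for item in value):
        return "texts"
    if all(is_token(item) for item in value):
        return "token_ids"
    if all(
        isinstance(item, list) and item and all(is_token(t) for t in item)
        for item in value
    ):
        return "token_id_batches"
    raise ValueError("input mixes strings, token IDs, or nested lists")
```

Rejecting mixed lists early gives a clearer error than letting the server refuse the request.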
Retrieval pipeline example
A typical retrieval pipeline embeds the user query with an instruction and embeds documents without the query instruction.
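Your application then computes cosine similarity or dot product between the query vector and each document vector. Request the same dimensions value for queries and documents so the vectors are comparable.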
Query input
- input (string): Use the Instruct: ...\nQuery:... format for the query.
Document input
- input (array): Use raw document text for the candidate documents.
Query embedding request
curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_EMBED_MODEL"'",
"input": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:Explain gravity",
"encoding_format": "float",
"dimensions": 1024
}'
Document embedding request
curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_EMBED_MODEL"'",
"input": [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
],
"encoding_format": "float",
"dimensions": 1024
}'
Similarity result computed by your application
{
"query": "Explain gravity",
"top_match": {
"document_index": 1,
"document": "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
"similarity": 0.624
}
}
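The similarity step is left to the application. A minimal pure-Python sketch of it follows, assuming the query and document vectors have already been extracted from the two responses; cosine_similarity and top_match are hypothetical names, and a production system would more likely use a vector database or a numeric library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def top_match(query_vector, document_vectors):
    """Return (document_index, similarity) for the best-scoring document."""
    scores = [cosine_similarity(query_vector, doc) for doc in document_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

The returned index corresponds to the position of the document in the input array of the document embedding request.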
Multiple embedding parameters
Use this example when you need a broader set of vLLM embedding controls.
Parameters shown here
- model (string): The served Qwen3 Embedding model name.
- input (string | array): Text, array of texts, token IDs, or array of token-ID arrays.
- encoding_format (string): Output format for embeddings. Use "float" for numeric vectors.
- dimensions (integer): Requested output vector size for Matryoshka-capable models.
- truncate_prompt_tokens (integer): Maximum prompt tokens to keep before embedding.
- truncation_side (string): Use "right" to keep the first N tokens or "left" to keep the last N tokens.
- user (string): Optional end-user identifier. vLLM accepts this OpenAI-compatible field.
Request
curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$QWEN_EMBED_MODEL"'",
"input": [
"Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:What is vLLM?",
"vLLM is an inference and serving engine that provides OpenAI-compatible APIs for language and embedding models."
],
"encoding_format": "float",
"dimensions": 1024,
"truncate_prompt_tokens": 8192,
"truncation_side": "right",
"user": "user_123"
}'
Response shape
{
"id": "embd-params123",
"object": "list",
"created": 1760000007,
"model": "Qwen/Qwen3-Embedding-4B",
"data": [
{
"index": 0,
"object": "embedding",
"embedding": [
0.0123,
-0.0456,
0.0789
]
},
{
"index": 1,
"object": "embedding",
"embedding": [
-0.0234,
0.0567,
-0.0891
]
}
],
"usage": {
"prompt_tokens": 32,
"total_tokens": 32,
"completion_tokens": 0
}
}
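When several optional parameters are in play, it can help to build the request body in one place and leave out anything not set. The following sketch uses only the field names listed in this section; build_payload is a hypothetical helper, and whether a given deployment honors each field is an assumption to verify.

```python
def build_payload(model, inputs, *, encoding_format="float", dimensions=None,
                  truncate_prompt_tokens=None, truncation_side=None, user=None):
    """Assemble an embeddings request body, omitting optional fields left as None."""
    payload = {"model": model, "input": inputs, "encoding_format": encoding_format}
    optional = {
        "dimensions": dimensions,
        "truncate_prompt_tokens": truncate_prompt_tokens,
        "truncation_side": truncation_side,
        "user": user,
    }
    # Only send fields the caller actually set, so server defaults apply otherwise.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```

Serialize the result with json.dumps and send it as the request body in place of the inline JSON used in the cURL example.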
Official sources
- vLLM OpenAI-compatible server docs: https://docs.vllm.ai/en/latest/serving/online_serving/openai_compatible_server/
- vLLM embedding usage docs: https://docs.vllm.ai/en/latest/models/pooling_models/embed/
- Qwen3 Embedding official GitHub repository: https://github.com/QwenLM/Qwen3-Embedding
- Qwen/Qwen3-Embedding-0.6B model card: https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
- Qwen/Qwen3-Embedding-4B model card: https://huggingface.co/Qwen/Qwen3-Embedding-4B
- Qwen/Qwen3-Embedding-8B model card: https://huggingface.co/Qwen/Qwen3-Embedding-8B