Qwen3 Embeddings API - vLLM engine

Use vLLM’s OpenAI-compatible Embeddings endpoint to call HexGrid-hosted Qwen3 Embedding models and convert text into dense vector representations.

This page provides copy-pasteable cURL-only examples for single-text embeddings, batch embeddings, retrieval-query embeddings with instructions, document embeddings, custom output dimensions, truncation controls, and token-ID input.


Endpoint

POST http://<server-ip>:<port>/v1/embeddings

Set these environment variables before running the examples:

export HEXGRID_API_KEY="your-hexgrid-api-key"
export QWEN_EMBED_BASE_URL="http://<server-ip>:<port>/v1"
export QWEN_EMBED_MODEL="Qwen/Qwen3-Embedding-4B"

You can replace Qwen/Qwen3-Embedding-4B with another HexGrid-hosted Qwen3 Embedding model, such as:

Qwen/Qwen3-Embedding-0.6B
Qwen/Qwen3-Embedding-4B
Qwen/Qwen3-Embedding-8B

Use the exact model ID configured in your HexGrid deployment.
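
Before running the examples, it can help to confirm that the variables are set and to see which model IDs the server reports. The following is a minimal sketch, assuming your deployment exposes the GET /v1/models route that vLLM's OpenAI-compatible server provides. The helper names check_embed_env and list_models, and the CURL override variable, are hypothetical and not part of any API.

```shell
# Fail early if a variable used by the examples on this page is missing.
check_embed_env() {
  for name in HEXGRID_API_KEY QWEN_EMBED_BASE_URL QWEN_EMBED_MODEL; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      return 1
    fi
  done
}

# Ask the server which model IDs it serves (assumed route: GET /v1/models).
# CURL may be pointed at a stub command for a dry run without network access.
list_models() {
  "${CURL:-curl}" -sS "$QWEN_EMBED_BASE_URL/models" \
    -H "Authorization: Bearer $HEXGRID_API_KEY"
}
```

Run check_embed_env && list_models and compare the returned IDs with the value of QWEN_EMBED_MODEL.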


Qwen3 Embedding model dimensions

Model                      Context length  Default / maximum embedding dimension  MRL support  Instruction aware
Qwen/Qwen3-Embedding-0.6B  32K             1024                                   Yes          Yes
Qwen/Qwen3-Embedding-4B    32K             2560                                   Yes          Yes
Qwen/Qwen3-Embedding-8B    32K             4096                                   Yes          Yes

Qwen3 Embedding models support Matryoshka Representation Learning, so you can request smaller output dimensions with the dimensions field when supported by your deployment.
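
To make the Matryoshka idea concrete, here is a minimal client-side sketch, assuming the common approach of keeping the leading components of a vector and rescaling the result to unit length. It illustrates the concept only; whether the server returns identical values for a given dimensions request is not confirmed here. The helper name mrl_truncate is hypothetical.

```shell
# Keep the first K components of a vector and rescale them to unit length.
# Usage: mrl_truncate K v1 v2 v3 ...
mrl_truncate() {
  k=$1; shift
  printf '%s\n' "$*" | awk -v k="$k" '{
    if (k < 1 || k > NF) { print "k out of range" > "/dev/stderr"; exit 1 }
    for (i = 1; i <= k; i++) norm += $i * $i
    norm = sqrt(norm)
    if (norm == 0) { print "zero vector" > "/dev/stderr"; exit 1 }
    for (i = 1; i <= k; i++) printf "%s%.6f", (i > 1 ? " " : ""), $i / norm
    printf "\n"
  }'
}
```

For example, mrl_truncate 2 3 4 12 0 keeps the components 3 and 4 and divides both by their norm of 5.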


POST /v1/embeddings

Create an embedding

Generate one embedding vector for a single input string.

Required attributes

  • model (string)
    The served Qwen3 Embedding model name, for example Qwen/Qwen3-Embedding-4B.

  • input (string | array)
    Text to embed. For a single embedding, pass a string.

Optional attributes used here

  • encoding_format (string)
    Output encoding format. Use "float" to return an array of floating-point values.

Request

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": "Qwen3 Embedding models convert text into dense vectors for search, retrieval, clustering, and classification.",
    "encoding_format": "float"
  }'

Response shape

{
  "id": "embd-abc123",
  "object": "list",
  "created": 1760000000,
  "model": "Qwen/Qwen3-Embedding-4B",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        -0.01234,
        0.04567,
        0.00891
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 21,
    "total_tokens": 21,
    "completion_tokens": 0
  }
}

POST /v1/embeddings

Batch embeddings

Generate embeddings for multiple strings in one request.

Required attributes

  • model (string)
    The served Qwen3 Embedding model name.

  • input (array)
    Array of strings to embed. The response returns one embedding object per input item.

Optional attributes used here

  • encoding_format (string)
    Use "float" to return floating-point vectors.

Request

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other.",
      "Qwen3 Embedding supports multilingual and code retrieval tasks."
    ],
    "encoding_format": "float"
  }'

Response shape

{
  "id": "embd-batch123",
  "object": "list",
  "created": 1760000001,
  "model": "Qwen/Qwen3-Embedding-4B",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [-0.0101, 0.0202, 0.0303]
    },
    {
      "index": 1,
      "object": "embedding",
      "embedding": [0.0404, -0.0505, 0.0606]
    },
    {
      "index": 2,
      "object": "embedding",
      "embedding": [0.0707, 0.0808, -0.0909]
    }
  ],
  "usage": {
    "prompt_tokens": 31,
    "total_tokens": 31,
    "completion_tokens": 0
  }
}
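
When the batch comes from shell variables rather than hand-written JSON, the request body has to be assembled with the strings escaped. The following is a rough sketch, assuming inputs contain no newlines or other control characters (only backslashes and double quotes are escaped). The helper names json_escape and batch_payload are hypothetical.

```shell
# Escape backslashes and double quotes for use inside a JSON string.
# Assumption: the text has no newlines or other control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Print a batch request body. Usage: batch_payload MODEL TEXT...
batch_payload() {
  model=$1; shift
  items=""
  for text in "$@"; do
    # Join the quoted items with commas; no comma before the first one.
    items="$items${items:+,}\"$(json_escape "$text")\""
  done
  printf '{"model":"%s","input":[%s],"encoding_format":"float"}\n' \
    "$(json_escape "$model")" "$items"
}
```

The output can be passed to the request shown earlier with -d "$(batch_payload "$QWEN_EMBED_MODEL" "first text" "second text")".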

POST /v1/embeddings

Retrieval query embedding with instruction

Qwen recommends adding an instruction to retrieval queries. The official Qwen3 Embedding format is:

Instruct: <task description>
Query:<query text>

Use this pattern for search queries, user questions, and other inputs that should retrieve matching documents.

Required attributes

  • model (string)
    The served Qwen3 Embedding model name.

  • input (string)
    Instruction-aware retrieval query text.

Recommended query format

  • Instruct (string)
    One-sentence task description.

  • Query (string)
    The actual user query.
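
The query format can be produced by a small helper instead of being typed by hand. This is a minimal sketch, assuming the task and query contain no double quotes, backslashes, or line breaks. It follows the template above literally, with no space after Query:, whereas the cURL examples on this page include one. The helper name instruct_query is hypothetical.

```shell
# Print an instruction-formatted query for use inside a JSON string value.
# The line break is emitted as the two characters \n, as JSON requires.
# Usage: instruct_query "TASK DESCRIPTION" "QUERY TEXT"
instruct_query() {
  printf 'Instruct: %s\\nQuery:%s\n' "$1" "$2"
}
```

The printed text can be placed directly between the double quotes of the input field.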

Request

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: What is the capital of China?",
    "encoding_format": "float"
  }'

Response shape

{
  "id": "embd-query123",
  "object": "list",
  "created": 1760000002,
  "model": "Qwen/Qwen3-Embedding-4B",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        0.02145,
        -0.03456,
        0.07891
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 22,
    "total_tokens": 22,
    "completion_tokens": 0
  }
}

POST /v1/embeddings

Document embeddings for retrieval

For retrieval documents, Qwen’s official examples do not add the query instruction. Embed the document text directly.

Required attributes

  • model (string)
    The served Qwen3 Embedding model name.

  • input (array)
    Document strings to embed.

Request

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ],
    "encoding_format": "float"
  }'

Response shape

{
  "id": "embd-docs123",
  "object": "list",
  "created": 1760000003,
  "model": "Qwen/Qwen3-Embedding-4B",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        0.0521,
        -0.0142,
        0.0098
      ]
    },
    {
      "index": 1,
      "object": "embedding",
      "embedding": [
        -0.0063,
        0.0317,
        0.0444
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 34,
    "total_tokens": 34,
    "completion_tokens": 0
  }
}

POST /v1/embeddings

Embedding with custom dimensions

Qwen3 Embedding models support Matryoshka Representation Learning, so you can request a smaller output dimension.

This is useful when you want smaller vectors to reduce storage and speed up indexing and retrieval, usually at some cost in retrieval quality.

Dimension limits

  • dimensions (integer)
    Requested output vector dimension. It must not exceed the full embedding dimension of the served model:

      Qwen/Qwen3-Embedding-0.6B: up to 1024
      Qwen/Qwen3-Embedding-4B: up to 2560
      Qwen/Qwen3-Embedding-8B: up to 4096
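
A request can be checked against these limits before it is sent. The following is a minimal sketch that hard-codes the three limits listed on this page. The helper names max_dimension and check_dimensions are hypothetical, and a deployment may apply additional restrictions of its own.

```shell
# Print the maximum dimension listed on this page for a model ID.
max_dimension() {
  case "$1" in
    Qwen/Qwen3-Embedding-0.6B) echo 1024 ;;
    Qwen/Qwen3-Embedding-4B)   echo 2560 ;;
    Qwen/Qwen3-Embedding-8B)   echo 4096 ;;
    *) return 1 ;;
  esac
}

# Succeed only if 1 <= DIMENSIONS <= the model's maximum.
# Usage: check_dimensions MODEL DIMENSIONS
check_dimensions() {
  max=$(max_dimension "$1") || { echo "unknown model: $1" >&2; return 2; }
  case "$2" in
    ''|*[!0-9]*) echo "dimensions must be a positive integer" >&2; return 2 ;;
  esac
  [ "$2" -ge 1 ] && [ "$2" -le "$max" ]
}
```

For example, check_dimensions "$QWEN_EMBED_MODEL" 1024 succeeds for all three models, while 2048 fails for the 0.6B model.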

Request

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": "Embed this text into a smaller vector for retrieval indexing.",
    "encoding_format": "float",
    "dimensions": 1024
  }'

Response shape

{
  "id": "embd-dim123",
  "object": "list",
  "created": 1760000004,
  "model": "Qwen/Qwen3-Embedding-4B",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        -0.0182,
        0.0274,
        0.0411
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 12,
    "completion_tokens": 0
  }
}

POST /v1/embeddings

Embeddings with truncation controls

Use truncation controls when inputs may exceed your deployment’s token limit.

Qwen3 Embedding models support long context, but your HexGrid deployment may enforce a lower maximum input length depending on model size, GPU capacity, and server configuration.

Truncation attributes

  • truncate_prompt_tokens (integer)
    Maximum number of prompt tokens to keep. Omit it to disable explicit truncation. In vLLM, -1 does not disable truncation; it truncates to the maximum length the served model supports.

  • truncation_side (string)
    Which side to truncate from when truncate_prompt_tokens is active. Use "right" to keep the first N tokens, or "left" to keep the last N tokens.
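
The two sides are easy to mix up, so here is a small illustration of the semantics described above, applied to a plain list of tokens. It is a sketch of the behavior as this page describes it, not of the server's implementation, and the helper name truncate_tokens is hypothetical.

```shell
# Usage: truncate_tokens SIDE N TOKEN...
# SIDE "right" drops tokens from the end and keeps the first N.
# SIDE "left" drops tokens from the start and keeps the last N.
truncate_tokens() {
  side=$1; n=$2; shift 2
  case "$side" in
    left|right) ;;
    *) echo "side must be left or right" >&2; return 2 ;;
  esac
  printf '%s\n' "$*" | awk -v side="$side" -v n="$n" '{
    if (n >= NF) { start = 1; end = NF }          # nothing to cut
    else if (side == "right") { start = 1; end = n }
    else { start = NF - n + 1; end = NF }
    for (i = start; i <= end; i++) printf "%s%s", (i > start ? " " : ""), $i
    printf "\n"
  }'
}
```

With the tokens a b c d e and N set to 3, "right" yields a b c and "left" yields c d e.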

Request

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": "This is a long document that may need to be truncated before embedding. Replace this string with your full document text.",
    "encoding_format": "float",
    "truncate_prompt_tokens": 8192,
    "truncation_side": "right"
  }'

Response shape

{
  "id": "embd-truncate123",
  "object": "list",
  "created": 1760000005,
  "model": "Qwen/Qwen3-Embedding-4B",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        0.0111,
        -0.0222,
        0.0333
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 23,
    "total_tokens": 23,
    "completion_tokens": 0
  }
}

POST /v1/embeddings

Token-ID input

vLLM’s OpenAI-compatible Embeddings API accepts token IDs as input.

Use this only if your application already tokenizes text with the same tokenizer used by the served Qwen3 Embedding model.

Token input attributes

  • input (array)
    Token IDs for one input, or an array of token-ID arrays for multiple inputs.

  • encoding_format (string)
    Use "float" to return floating-point vectors.
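
If the token IDs arrive as separate shell arguments, the body for a single token-ID input could be assembled as follows. This is a minimal sketch, assuming the model name needs no JSON escaping; it rejects anything that is not a plain non-negative integer. The helper name token_payload is hypothetical.

```shell
# Print a request body whose input is one array of token IDs.
# Usage: token_payload MODEL ID...
token_payload() {
  model=$1; shift
  ids=""
  for id in "$@"; do
    case "$id" in
      # Reject empty values, non-digits, and leading zeros (invalid in JSON).
      ''|*[!0-9]*|0?*) echo "not a token ID: $id" >&2; return 1 ;;
    esac
    ids="$ids${ids:+,}$id"
  done
  [ -n "$ids" ] || { echo "no token IDs given" >&2; return 1; }
  printf '{"model":"%s","input":[%s],"encoding_format":"float"}\n' "$model" "$ids"
}
```

For example, token_payload "$QWEN_EMBED_MODEL" 3838 374 279 prints a body that can be sent with -d.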

Request

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": [3838, 374, 279, 6864, 315, 5736, 30],
    "encoding_format": "float"
  }'

Response shape

{
  "id": "embd-token123",
  "object": "list",
  "created": 1760000006,
  "model": "Qwen/Qwen3-Embedding-4B",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        -0.0155,
        0.0266,
        0.0377
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 7,
    "completion_tokens": 0
  }
}

POST /v1/embeddings

Retrieval pipeline example

A typical retrieval pipeline embeds the user query with an instruction and embeds documents without the query instruction.

The API returns only vectors. Your application ranks documents by computing the cosine similarity or dot product between the query vector and each document vector.
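
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal sketch that computes it with awk, assuming each vector has already been extracted from the response as space-separated numbers. The helper name cosine is hypothetical.

```shell
# Print the cosine similarity of two equal-length vectors.
# Usage: cosine "v1 v2 ..." "w1 w2 ..."
cosine() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { n = NF; for (i = 1; i <= n; i++) a[i] = $i }
    NR == 2 {
      if (NF != n) { print "length mismatch" > "/dev/stderr"; exit 1 }
      for (i = 1; i <= n; i++) { dot += a[i] * $i; na += a[i] * a[i]; nb += $i * $i }
      if (na == 0 || nb == 0) { print "zero vector" > "/dev/stderr"; exit 1 }
      printf "%.6f\n", dot / (sqrt(na) * sqrt(nb))
    }'
}
```

Vectors that point in the same direction score 1, orthogonal vectors score 0, and opposite vectors score -1.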

Query input

  • input (string)
    Use Instruct: ...\nQuery: ... for the query.

Document input

  • input (array)
    Use raw document text for the candidate documents.

Query embedding request

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: Explain gravity",
    "encoding_format": "float",
    "dimensions": 1024
  }'

Document embedding request

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ],
    "encoding_format": "float",
    "dimensions": 1024
  }'

Example similarity result computed by your application (values are illustrative)

{
  "query": "Explain gravity",
  "top_match": {
    "document_index": 1,
    "document": "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
    "similarity": 0.624
  }
}
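
Picking the top match amounts to scoring every document vector against the query vector and keeping the best one. Here is a rough sketch that uses the dot product, which ranks the same way as cosine similarity only if the vectors are unit length (an assumption you should verify for your deployment). It reads one document vector per line as space-separated numbers; the helper name top_match is hypothetical.

```shell
# Print "<zero-based document index> <score>" for the highest dot product.
# Usage: top_match "q1 q2 ..." < one_document_vector_per_line
top_match() {
  awk -v q="$1" '
    BEGIN { n = split(q, qv, " ") }
    NF == n {                      # lines of the wrong length are skipped
      score = 0
      for (i = 1; i <= n; i++) score += qv[i] * $i
      if (best == "" || score > max) { max = score; best = NR - 1 }
    }
    END { if (best == "") exit 1; printf "%d %.3f\n", best, max }'
}
```

The index printed corresponds to the position of the document in the input array of the document embedding request.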

POST /v1/embeddings

Multiple embedding parameters

Use this example when you need a broader set of vLLM embedding controls.

Parameters shown here

  • model (string)
    The served Qwen3 Embedding model name.

  • input (string | array)
    Text, array of texts, token IDs, or array of token-ID arrays.

  • encoding_format (string)
    Output format for embeddings. Use "float" for numeric vectors.

  • dimensions (integer)
    Requested output vector size for Matryoshka-capable models.

  • truncate_prompt_tokens (integer)
    Maximum prompt tokens to keep before embedding.

  • truncation_side (string)
    Use "right" to keep the first N tokens or "left" to keep the last N tokens.

  • user (string)
    Optional end-user identifier. vLLM accepts this OpenAI-compatible field.
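
Because most of these parameters are optional, a script may want to include a field only when a value is provided. The following is a minimal sketch for a single text input, assuming the model name and text need no JSON escaping. The helper name embed_payload and the variables DIMENSIONS, TRUNCATE_TOKENS, TRUNCATION_SIDE, and END_USER are hypothetical names chosen for this example.

```shell
# Print a request body; optional fields are added only when their
# (hypothetical) variables are set and non-empty.
# Usage: embed_payload MODEL TEXT
embed_payload() {
  body="\"model\":\"$1\",\"input\":\"$2\",\"encoding_format\":\"float\""
  [ -z "${DIMENSIONS:-}" ] || body="$body,\"dimensions\":$DIMENSIONS"
  [ -z "${TRUNCATE_TOKENS:-}" ] || body="$body,\"truncate_prompt_tokens\":$TRUNCATE_TOKENS"
  [ -z "${TRUNCATION_SIDE:-}" ] || body="$body,\"truncation_side\":\"$TRUNCATION_SIDE\""
  [ -z "${END_USER:-}" ] || body="$body,\"user\":\"$END_USER\""
  printf '{%s}\n' "$body"
}
```

For example, DIMENSIONS=1024 embed_payload "$QWEN_EMBED_MODEL" "some text" adds only the dimensions field to the three base fields.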

Request

curl -X POST "$QWEN_EMBED_BASE_URL/embeddings" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_EMBED_MODEL"'",
    "input": [
      "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: What is vLLM?",
      "vLLM is an inference and serving engine that provides OpenAI-compatible APIs for language and embedding models."
    ],
    "encoding_format": "float",
    "dimensions": 1024,
    "truncate_prompt_tokens": 8192,
    "truncation_side": "right",
    "user": "user_123"
  }'

Response shape

{
  "id": "embd-params123",
  "object": "list",
  "created": 1760000007,
  "model": "Qwen/Qwen3-Embedding-4B",
  "data": [
    {
      "index": 0,
      "object": "embedding",
      "embedding": [
        0.0123,
        -0.0456,
        0.0789
      ]
    },
    {
      "index": 1,
      "object": "embedding",
      "embedding": [
        -0.0234,
        0.0567,
        -0.0891
      ]
    }
  ],
  "usage": {
    "prompt_tokens": 32,
    "total_tokens": 32,
    "completion_tokens": 0
  }
}
