Qwen3 Reranker API - vLLM engine

Use vLLM’s reranking endpoint to call HexGrid-hosted Qwen3 Reranker models and score how relevant each document is to a query.

This page provides copy-pasteable cURL examples for basic reranking, top-N reranking, instruction-aware reranking, pairwise scoring, batch pair scoring, and common reranker parameters.


Endpoint

POST http://<server-ip>:<port>/v1/rerank

Set these environment variables before running the examples:

export HEXGRID_API_KEY="your-hexgrid-api-key"
export QWEN_RERANK_BASE_URL="http://<server-ip>:<port>/v1"
export QWEN_RERANK_MODEL="Qwen/Qwen3-Reranker-4B"

You can replace Qwen/Qwen3-Reranker-4B with another HexGrid-hosted Qwen3 Reranker model, such as:

Qwen/Qwen3-Reranker-0.6B
Qwen/Qwen3-Reranker-4B
Qwen/Qwen3-Reranker-8B

Use the exact model ID configured in your HexGrid deployment.


Qwen3 Reranker model list

Model                      Type             Context length   Instruction aware
Qwen/Qwen3-Reranker-0.6B   Text reranking   32K              Yes
Qwen/Qwen3-Reranker-4B     Text reranking   32K              Yes
Qwen/Qwen3-Reranker-8B     Text reranking   32K              Yes

Qwen3 Reranker models are cross-encoder-style reranking models. They take a query and one or more candidate documents, then return one relevance score per document that you can use to sort the documents.


POST /v1/rerank

Rerank documents

Rerank a list of documents for a single query.

Required attributes

  • model (string)
    The served Qwen3 Reranker model name, for example Qwen/Qwen3-Reranker-4B.

  • query (string)
    The query to compare against each document.

  • documents (array)
    Candidate documents to rerank. Each item can be a string document.

Response fields

  • results (array)
    Reranked results sorted by relevance. Each item carries the index and relevance_score fields described below.

  • index (integer)
    Original index of the document in the input documents array.

  • relevance_score (number)
    Relevance score for the query-document pair.

Request

curl -X POST "$QWEN_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "query": "What is the capital of China?",
    "documents": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other.",
      "Paris is the capital of France."
    ]
  }'

Response shape

{
  "id": "rerank-abc123",
  "model": "Qwen/Qwen3-Reranker-4B",
  "usage": {
    "total_tokens": 48
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "The capital of China is Beijing."
      },
      "relevance_score": 0.9985
    },
    {
      "index": 2,
      "document": {
        "text": "Paris is the capital of France."
      },
      "relevance_score": 0.1024
    },
    {
      "index": 1,
      "document": {
        "text": "Gravity is a force that attracts two bodies toward each other."
      },
      "relevance_score": 0.0187
    }
  ]
}
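
For readers calling the endpoint from application code, the same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the helper names build_rerank_request and send are hypothetical, and the base URL, API key, and model are assumed to come from the environment variables set earlier on this page.

```python
import json
import urllib.request


def build_rerank_request(base_url, api_key, model, query, documents):
    """Build (but do not send) a POST request for the rerank endpoint."""
    payload = {"model": model, "query": query, "documents": list(documents)}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

To use it, read QWEN_RERANK_BASE_URL, HEXGRID_API_KEY, and QWEN_RERANK_MODEL from os.environ, pass them to build_rerank_request together with the query and documents, and hand the result to send. Keeping request construction separate from sending makes the payload easy to inspect without network access.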

POST /v1/rerank

Top-N reranking

Return only the top N most relevant documents.

Optional attributes used here

  • top_n (integer)
    Maximum number of ranked documents to return. If omitted, vLLM returns all documents sorted by relevance.

Request

curl -X POST "$QWEN_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "query": "Explain gravity",
    "documents": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
      "A neural network is a machine learning model inspired by biological neurons.",
      "Photosynthesis is the process by which plants convert light energy into chemical energy."
    ],
    "top_n": 2
  }'

Response shape

{
  "id": "rerank-topn123",
  "model": "Qwen/Qwen3-Reranker-4B",
  "usage": {
    "total_tokens": 78
  },
  "results": [
    {
      "index": 1,
      "document": {
        "text": "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
      },
      "relevance_score": 0.9971
    },
    {
      "index": 3,
      "document": {
        "text": "Photosynthesis is the process by which plants convert light energy into chemical energy."
      },
      "relevance_score": 0.0418
    }
  ]
}

POST /v1/rerank

Instruction-aware reranking

Qwen3 Reranker models are instruction-aware. Qwen’s official reranker examples use an instruction format like:

<Instruct>: Given a web search query, retrieve relevant passages that answer the query

<Query>: What is the capital of China?

<Document>: The capital of China is Beijing.

With vLLM’s /v1/rerank endpoint, the server-side chat template handles query-document formatting for the deployed reranker. To make the task explicit from the client side, include the instruction in the query string.

Required attributes

  • query (string)
    Instruction-aware query text.

  • documents (array)
    Candidate documents to rerank.

Recommended instruction

  • instruction (string)
    A short task description, such as "Given a web search query, retrieve relevant passages that answer the query". In the request below it is not a separate request field; it is embedded at the start of the query string.

Request

curl -X POST "$QWEN_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "query": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: What is the capital of China?",
    "documents": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other.",
      "Paris is the capital of France."
    ],
    "top_n": 2
  }'

Response shape

{
  "id": "rerank-instruct123",
  "model": "Qwen/Qwen3-Reranker-4B",
  "usage": {
    "total_tokens": 61
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "The capital of China is Beijing."
      },
      "relevance_score": 0.9988
    },
    {
      "index": 2,
      "document": {
        "text": "Paris is the capital of France."
      },
      "relevance_score": 0.1089
    }
  ]
}
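
The instruction-prefixed query string used in this request can be assembled with a small helper. The sketch below is illustrative only: the function name build_instructed_query and the DEFAULT_INSTRUCTION constant are hypothetical, and the "Instruct: ...\nQuery: ..." layout simply mirrors the query string in the cURL example above.

```python
DEFAULT_INSTRUCTION = (
    "Given a web search query, retrieve relevant passages that answer the query"
)


def build_instructed_query(query, instruction=DEFAULT_INSTRUCTION):
    """Prefix a query with a task instruction, as in the request above."""
    instruction = instruction.strip()
    query = query.strip()
    if not query:
        raise ValueError("query must not be empty")
    # A real newline separates the two parts; JSON encoding turns it into \n.
    return f"Instruct: {instruction}\nQuery: {query}"
```

Centralizing the format in one function keeps the instruction wording consistent across requests, which matters when you compare scores between runs.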

POST /v1/rerank

Rerank search results

Use reranking after a first-stage retrieval step, such as BM25, vector search, or hybrid search.

Your application sends the top candidate chunks to the reranker, then uses the returned index values to map ranked results back to the original documents.

Required attributes

  • query (string)
    User search query.

  • documents (array)
    Candidate search results from your first-stage retriever.

  • top_n (integer)
    Number of final results to keep.

Request

curl -X POST "$QWEN_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "query": "How do I configure streaming in vLLM?",
    "documents": [
      "vLLM exposes an OpenAI-compatible Chat Completions endpoint. Set stream to true to receive Server-Sent Events.",
      "The capital of France is Paris.",
      "Embeddings convert text into dense vectors for semantic search.",
      "Use top_k and top_p to control sampling diversity in generation."
    ],
    "top_n": 3
  }'

Response shape

{
  "id": "rerank-search123",
  "model": "Qwen/Qwen3-Reranker-4B",
  "usage": {
    "total_tokens": 72
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "vLLM exposes an OpenAI-compatible Chat Completions endpoint. Set stream to true to receive Server-Sent Events."
      },
      "relevance_score": 0.9821
    },
    {
      "index": 3,
      "document": {
        "text": "Use top_k and top_p to control sampling diversity in generation."
      },
      "relevance_score": 0.2137
    },
    {
      "index": 2,
      "document": {
        "text": "Embeddings convert text into dense vectors for semantic search."
      },
      "relevance_score": 0.1105
    }
  ]
}
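
Mapping the returned index values back to your own records can be sketched as follows. This is a minimal sketch under assumptions: candidates is a list of dicts in the same order as the documents array that was sent, the id and text keys are hypothetical, and rerank_response is the parsed JSON body with the shape shown above.

```python
def attach_results(candidates, rerank_response):
    """Return the original candidate records in reranked order, with scores."""
    ranked = []
    for result in rerank_response["results"]:
        # "index" points into the documents array that was sent in the request.
        original = candidates[result["index"]]
        # Copy the record so the caller's candidate list is left unchanged.
        ranked.append({**original, "relevance_score": result["relevance_score"]})
    return ranked
```

Because the lookup relies on position, send the documents in exactly the order of your candidate list and avoid filtering or deduplicating between building the request and reading the response.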

POST /v1/score

Pairwise scoring

Use the vLLM score endpoint when you want a relevance score for each aligned query-document pair.

Unlike /v1/rerank, which compares one query against many documents and returns sorted results, /v1/score pairs each query with the document at the same index and returns one score per pair, identified by its index.

Required attributes

  • model (string)
    The served Qwen3 Reranker model name.

  • queries (array)
    Queries to score.

  • documents (array)
    Documents to score. Each document is paired with the query at the same index.

Request

curl -X POST "$QWEN_RERANK_BASE_URL/score" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "queries": [
      "What is the capital of China?",
      "Explain gravity"
    ],
    "documents": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
  }'

Response shape

{
  "id": "score-abc123",
  "object": "list",
  "model": "Qwen/Qwen3-Reranker-4B",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.9984
    },
    {
      "index": 1,
      "object": "score",
      "score": 0.9969
    }
  ],
  "usage": {
    "total_tokens": 58
  }
}
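
On the client side, each score can be joined back to its query-document pair through the index field. The helper below is a hypothetical sketch that assumes score_response is the parsed JSON body with the shape shown above.

```python
def pair_scores(queries, documents, score_response):
    """Return (query, document, score) tuples ordered by pair index."""
    if len(queries) != len(documents):
        raise ValueError("queries and documents must have the same length")
    pairs = []
    # Sort by index so the output order does not depend on the response order.
    for item in sorted(score_response["data"], key=lambda entry: entry["index"]):
        position = item["index"]
        pairs.append((queries[position], documents[position], item["score"]))
    return pairs
```

Checking the lengths before sending the request catches misaligned inputs early, since a missing document would otherwise shift every later pair.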

POST /v1/score

One query against many documents with score API

Use /v1/score when you want unsorted scores and your application will handle sorting.

This is useful when you want to preserve the original candidate order or combine the reranker score with another scoring signal.

Required attributes

  • queries (array)
    Repeat the same query once per document.

  • documents (array)
    Candidate documents to score.

Request

curl -X POST "$QWEN_RERANK_BASE_URL/score" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "queries": [
      "What is vLLM?",
      "What is vLLM?",
      "What is vLLM?"
    ],
    "documents": [
      "vLLM is an inference and serving engine for large language models.",
      "Beijing is the capital of China.",
      "A vector database stores embeddings for semantic search."
    ]
  }'

Response shape

{
  "id": "score-many123",
  "object": "list",
  "model": "Qwen/Qwen3-Reranker-4B",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.9812
    },
    {
      "index": 1,
      "object": "score",
      "score": 0.0241
    },
    {
      "index": 2,
      "object": "score",
      "score": 0.2183
    }
  ],
  "usage": {
    "total_tokens": 47
  }
}
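
Combining the reranker score with another signal while preserving the original candidate order could look like the sketch below. Everything here is an assumption for illustration: the helper names, the linear blend, the default weight of 0.7, and the premise that the first-stage scores are already normalized to the 0 to 1 range.

```python
def repeat_query(query, documents):
    """Build the queries array: the same query once per document."""
    return [query] * len(documents)


def blend_scores(score_response, first_stage_scores, weight=0.7):
    """Blend reranker and first-stage scores, keeping the input order.

    weight is the share given to the reranker score (hypothetical default).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    rerank_by_index = {item["index"]: item["score"] for item in score_response["data"]}
    return [
        weight * rerank_by_index[position] + (1.0 - weight) * first_stage
        for position, first_stage in enumerate(first_stage_scores)
    ]
```

The blended list lines up with the documents array, so your application can sort it, threshold it, or leave it in retrieval order as needed.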

POST /v1/rerank

Rerank with activation control

vLLM supports the use_activation rerank parameter.

When enabled, the score is typically transformed into a normalized range such as 0–1, depending on the model pooler configuration. When disabled, the endpoint may return raw model scores.

Optional attributes used here

  • use_activation (boolean)
    Whether to apply the model pooler’s activation function to the score.

  • top_n (integer)
    Number of top results to return.

Request

curl -X POST "$QWEN_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "query": "What is semantic search?",
    "documents": [
      "Semantic search retrieves results based on meaning rather than exact keyword overlap.",
      "The boiling point of water is 100°C at sea level.",
      "Reranking improves retrieval quality by rescoring candidate documents."
    ],
    "top_n": 2,
    "use_activation": true
  }'

Response shape

{
  "id": "rerank-activation123",
  "model": "Qwen/Qwen3-Reranker-4B",
  "usage": {
    "total_tokens": 63
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "Semantic search retrieves results based on meaning rather than exact keyword overlap."
      },
      "relevance_score": 0.9914
    },
    {
      "index": 2,
      "document": {
        "text": "Reranking improves retrieval quality by rescoring candidate documents."
      },
      "relevance_score": 0.6532
    }
  ]
}
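
To illustrate what an activation does to a raw score, the sketch below applies a logistic sigmoid. This is an assumption for illustration only: the actual activation depends on the model pooler configuration of your deployment, and the function is not part of the vLLM API.

```python
import math


def sigmoid(raw_score):
    """Map a raw score (logit) into the range 0 to 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if raw_score >= 0:
        return 1.0 / (1.0 + math.exp(-raw_score))
    exp_score = math.exp(raw_score)
    return exp_score / (1.0 + exp_score)
```

A sigmoid is monotonic, so under this assumption the ranking order is the same with or without the activation; only the scale of the scores changes. That matters mainly when you apply a fixed score threshold or blend the score with other signals.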

POST /v1/rerank

Rerank with document objects

Some rerank-compatible clients send documents as objects with a text field.

Use this when your application wants to keep the document shape close to the response shape.

Required attributes

  • documents (array)
    Candidate document objects. Each object should contain text.

  • text (string)
    Document text to score.

Request

curl -X POST "$QWEN_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "query": "What is the role of a reranker in RAG?",
    "documents": [
      {
        "text": "A reranker rescoring stage improves RAG by evaluating retrieved documents against the query more precisely."
      },
      {
        "text": "A tokenizer converts text into tokens before model inference."
      },
      {
        "text": "A load balancer distributes traffic across multiple servers."
      }
    ],
    "top_n": 2
  }'

Response shape

{
  "id": "rerank-objects123",
  "model": "Qwen/Qwen3-Reranker-4B",
  "usage": {
    "total_tokens": 69
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "A reranker rescoring stage improves RAG by evaluating retrieved documents against the query more precisely."
      },
      "relevance_score": 0.9887
    },
    {
      "index": 1,
      "document": {
        "text": "A tokenizer converts text into tokens before model inference."
      },
      "relevance_score": 0.1238
    }
  ]
}
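
Converting between plain strings and text objects on the client side can be sketched as below. Both helper names are hypothetical, and whether a given deployment accepts object-shaped documents is something to verify; the second helper offers a fallback to plain strings.

```python
def to_document_objects(documents):
    """Wrap plain strings as {"text": ...}; keep only text from existing objects."""
    wrapped = []
    for document in documents:
        if isinstance(document, str):
            wrapped.append({"text": document})
        elif isinstance(document, dict) and isinstance(document.get("text"), str):
            wrapped.append({"text": document["text"]})
        else:
            raise TypeError("each document must be a string or an object with text")
    return wrapped


def to_plain_strings(documents):
    """Reduce strings or {"text": ...} objects to a list of plain strings."""
    return [item["text"] for item in to_document_objects(documents)]
```

Normalizing at the boundary means the rest of your application can use either shape without changing the request-building code.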

POST /v1/rerank

Retrieval pipeline example

A typical retrieval pipeline uses a fast first-stage retriever, then Qwen3 Reranker as a second-stage precision filter.

Pipeline

  • first_stage_retriever (string)
    Use BM25, vector search, hybrid search, or another retriever to fetch candidate documents.

  • reranker (string)
    Send the top candidates to Qwen3 Reranker.

  • final_context (array)
    Use the top reranked documents as context for downstream generation.

Request

curl -X POST "$QWEN_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "query": "How does reranking improve RAG answers?",
    "documents": [
      "RAG retrieves external documents and passes relevant context to a language model.",
      "Reranking improves RAG by rescoring first-stage retrieval candidates using a stronger cross-encoder relevance model.",
      "Embeddings are dense vectors that represent text meaning.",
      "KV cache stores attention keys and values during autoregressive decoding.",
      "Hybrid search combines sparse keyword matching and dense semantic retrieval."
    ],
    "top_n": 2
  }'

Response shape

{
  "id": "rerank-pipeline123",
  "model": "Qwen/Qwen3-Reranker-4B",
  "usage": {
    "total_tokens": 89
  },
  "results": [
    {
      "index": 1,
      "document": {
        "text": "Reranking improves RAG by rescoring first-stage retrieval candidates using a stronger cross-encoder relevance model."
      },
      "relevance_score": 0.9962
    },
    {
      "index": 0,
      "document": {
        "text": "RAG retrieves external documents and passes relevant context to a language model."
      },
      "relevance_score": 0.8127
    }
  ]
}
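
The two-stage flow described above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions: keyword_overlap is a stand-in for a real first-stage retriever such as BM25 or vector search, and rerank is a caller-supplied function (hypothetical signature: query, documents, top_n) that calls /v1/rerank and returns the parsed response.

```python
def keyword_overlap(query, text):
    """Toy first-stage score: number of distinct query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words)


def retrieve_then_rerank(query, corpus, rerank, candidates=5, top_n=2):
    """Fetch candidates with a cheap scorer, then keep the reranker's top_n."""
    # Stage 1: cheap retrieval over the whole corpus.
    first_stage = sorted(
        corpus, key=lambda text: keyword_overlap(query, text), reverse=True
    )[:candidates]
    # Stage 2: precise scoring of the short candidate list.
    response = rerank(query, first_stage, top_n)
    # Result indexes refer to the candidate list that was sent, not the corpus.
    return [first_stage[result["index"]] for result in response["results"]]
```

Passing the rerank call in as a function keeps the pipeline testable without a running server; the returned list is the final context for downstream generation.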

POST /v1/rerank

Multiple rerank parameters

Use this example when you need a broader set of reranker controls.

Parameters shown here

  • model (string)
    The served Qwen3 Reranker model name.

  • query (string)
    Query to compare with each document.

  • documents (array)
    Candidate documents to rerank.

  • top_n (integer)
    Maximum number of results to return.

  • use_activation (boolean)
    Whether to apply the pooler activation function to relevance scores.

  • user (string)
    Optional end-user identifier accepted by vLLM’s rerank API.

Request

curl -X POST "$QWEN_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "query": "Instruct: Given a technical support question, retrieve passages that directly answer the issue\nQuery: Why is my vLLM request returning 404 for chat completions?",
    "documents": [
      "Use the base URL ending in /v1 and send chat requests to /v1/chat/completions.",
      "The /v1/embeddings endpoint returns vector embeddings for input text.",
      "A 404 can occur when the client appends /chat/completions to a URL that already includes /chat/completions.",
      "Reranking endpoints return relevance scores, not generated text."
    ],
    "top_n": 3,
    "use_activation": true,
    "user": "user_123"
  }'

Response shape

{
  "id": "rerank-params123",
  "model": "Qwen/Qwen3-Reranker-4B",
  "usage": {
    "total_tokens": 97
  },
  "results": [
    {
      "index": 2,
      "document": {
        "text": "A 404 can occur when the client appends /chat/completions to a URL that already includes /chat/completions."
      },
      "relevance_score": 0.9941
    },
    {
      "index": 0,
      "document": {
        "text": "Use the base URL ending in /v1 and send chat requests to /v1/chat/completions."
      },
      "relevance_score": 0.9683
    },
    {
      "index": 3,
      "document": {
        "text": "Reranking endpoints return relevance scores, not generated text."
      },
      "relevance_score": 0.0872
    }
  ]
}
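
When several optional parameters are in play, it can help to build the payload in one place and leave out anything that was not set. The sketch below is a hypothetical helper, not part of any client library; the validation rules (non-empty documents, top_n of at least 1) are assumptions about sensible client-side checks rather than documented server behavior.

```python
def build_rerank_payload(model, query, documents,
                         top_n=None, use_activation=None, user=None):
    """Build a rerank payload, omitting optional fields that are not set."""
    if not documents:
        raise ValueError("documents must not be empty")
    if top_n is not None and top_n < 1:
        raise ValueError("top_n must be at least 1")
    payload = {"model": model, "query": query, "documents": list(documents)}
    optional = {"top_n": top_n, "use_activation": use_activation, "user": user}
    # Only send fields the caller set, so server defaults apply to the rest.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```

Serialize the returned dict with json.dumps and send it as the request body, exactly as the cURL example does with its inline JSON.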
