Qwen3 Reranker API - vLLM engine

Use vLLM’s reranking endpoint to call HexGrid-hosted Qwen3 Reranker models and score how relevant each document is to a query.

This page provides copy-pasteable cURL-only examples for basic reranking, top-N reranking, instruction-aware reranking, pairwise scoring, batch pair scoring, and common reranker parameters.

Endpoints

POST http://<server-ip>:<port>/v1/rerank
POST http://<server-ip>:<port>/v1/score (used by the pairwise scoring examples)

Set these environment variables before running the examples:

export HEXGRID_API_KEY="your-hexgrid-api-key"
export QWEN_RERANK_BASE_URL="http://<server-ip>:<port>/v1"
export QWEN_RERANK_MODEL="Qwen/Qwen3-Reranker-4B"

You can replace Qwen/Qwen3-Reranker-4B with another HexGrid-hosted Qwen3 Reranker model, such as:

Qwen/Qwen3-Reranker-0.6B
Qwen/Qwen3-Reranker-4B
Qwen/Qwen3-Reranker-8B

Use the exact model ID configured in your HexGrid deployment.

Qwen3 Reranker model list

Model	Type	Context length	Instruction aware
`Qwen/Qwen3-Reranker-0.6B`	Text reranking	32K	Yes
`Qwen/Qwen3-Reranker-4B`	Text reranking	32K	Yes
`Qwen/Qwen3-Reranker-8B`	Text reranking	32K	Yes

Qwen3 Reranker models are cross-encoder-style reranking models. They take a query and one or more candidate documents, then return one relevance score per document that you can use to sort the candidates.

POST /v1/rerank

Instruction-aware reranking

Qwen3 Reranker models are instruction-aware. Qwen’s official reranker examples use an instruction format like:

<Instruct>: Given a web search query, retrieve relevant passages that answer the query

<Query>: What is the capital of China?

<Document>: The capital of China is Beijing.

With vLLM’s /v1/rerank endpoint, the server-side chat template handles query-document formatting for the deployed reranker. To make the task explicit from the client side, include the instruction in the query string.

Required attributes

Name	Type	Description
`query`	string	Instruction-aware query text.
`documents`	array	Candidate documents to rerank.

Recommended instruction

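The instruction is not a separate request field. Prepend it to the query string, as in the request below.

Name	Type	Description
instruction	string	A short task description, such as "Given a web search query, retrieve relevant passages that answer the query".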

Request

POST /v1/rerank

curl -X POST "$QWEN_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "query": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: What is the capital of China?",
    "documents": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other.",
      "Paris is the capital of France."
    ],
    "top_n": 2
  }'

Response shape

{
  "id": "rerank-instruct123",
  "model": "Qwen/Qwen3-Reranker-4B",
  "usage": {
    "total_tokens": 61
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "The capital of China is Beijing."
      },
      "relevance_score": 0.9988
    },
    {
      "index": 2,
      "document": {
        "text": "Paris is the capital of France."
      },
      "relevance_score": 0.1089
    }
  ]
}
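
The instruction-prefixed query in the request above can be assembled with a small shell helper. This is a minimal sketch: `build_rerank_query` is a hypothetical name, not part of vLLM or HexGrid, and it only mirrors the `Instruct: ...\nQuery: ...` layout used in this section's request. It emits a literal `\n` so the result can be placed inside a JSON string, and it does not escape quotes or backslashes in its inputs.

```shell
# Hypothetical helper: joins an instruction and a query into the
# "Instruct: ...\nQuery: ..." layout used in the request above.
# $1 = task instruction, $2 = user query.
# The separator is a literal backslash followed by "n" (a JSON newline escape),
# not a real newline, so the output can be pasted into a JSON string value.
build_rerank_query() {
  printf 'Instruct: %s\\nQuery: %s' "$1" "$2"
}

# Example call; prints the string used as "query" in the request above.
build_rerank_query \
  "Given a web search query, retrieve relevant passages that answer the query" \
  "What is the capital of China?"
```

Keeping the instruction in one helper makes it easier to reuse the same task description across requests, which matters if you compare scores between queries.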

POST /v1/rerank

Rerank search results

Use reranking after a first-stage retrieval step, such as BM25, vector search, or hybrid search.

Your application sends the top candidate chunks to the reranker, then uses the returned index values to map ranked results back to the original documents.

Required attributes

Name	Type	Description
`query`	string	User search query.
`documents`	array	Candidate search results from your first-stage retriever.
`top_n`	integer	Number of final results to keep.

Request

POST /v1/rerank

curl -X POST "$QWEN_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "query": "How do I configure streaming in vLLM?",
    "documents": [
      "vLLM exposes an OpenAI-compatible Chat Completions endpoint. Set stream to true to receive Server-Sent Events.",
      "The capital of France is Paris.",
      "Embeddings convert text into dense vectors for semantic search.",
      "Use top_k and top_p to control sampling diversity in generation."
    ],
    "top_n": 3
  }'

Response shape

{
  "id": "rerank-search123",
  "model": "Qwen/Qwen3-Reranker-4B",
  "usage": {
    "total_tokens": 72
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "vLLM exposes an OpenAI-compatible Chat Completions endpoint. Set stream to true to receive Server-Sent Events."
      },
      "relevance_score": 0.9821
    },
    {
      "index": 3,
      "document": {
        "text": "Use top_k and top_p to control sampling diversity in generation."
      },
      "relevance_score": 0.2137
    },
    {
      "index": 2,
      "document": {
        "text": "Embeddings convert text into dense vectors for semantic search."
      },
      "relevance_score": 0.1105
    }
  ]
}
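
Mapping the returned index values back to your own document IDs can be sketched in plain shell. The helper names and the ID file are hypothetical, and the extraction is a text match rather than a real JSON parser: it assumes `"index"` appears only as the result field and not inside document text. A JSON-aware tool is a safer choice for production use.

```shell
# Extract the zero-based result indices from a rerank response on stdin,
# one per line, in ranked order. Plain-text match, not a JSON parser.
extract_indices() {
  grep -o '"index": *[0-9][0-9]*' | grep -o '[0-9][0-9]*$'
}

# Read ranked indices on stdin and print the matching document ID for each.
# $1 = file with one document ID per line, in the same order as the
#      "documents" array that was sent to /v1/rerank (hypothetical file).
map_ranked_ids() {
  while read -r idx; do
    # Index 0 corresponds to line 1 of the ID file.
    sed -n "$((idx + 1))p" "$1"
  done
}
```

With the response saved to a file, `extract_indices < response.json | map_ranked_ids ids.txt` would print your IDs in ranked order; both file names are placeholders.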

POST /v1/score

Pairwise scoring

Use the vLLM score endpoint when you want a relevance score for each aligned query-document pair.

Unlike /v1/rerank, which compares one query against many documents and returns results sorted by relevance, /v1/score pairs each query with the document at the same index and returns one score per pair in input order.

Required attributes

Name	Type	Description
`model`	string	The served Qwen3 Reranker model name.
`queries`	array	Queries to score.
`documents`	array	Documents to score. Each document is paired with the query at the same index.

Request

POST /v1/score

curl -X POST "$QWEN_RERANK_BASE_URL/score" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "queries": [
      "What is the capital of China?",
      "Explain gravity"
    ],
    "documents": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
  }'

Response shape

{
  "id": "score-abc123",
  "object": "list",
  "model": "Qwen/Qwen3-Reranker-4B",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.9984
    },
    {
      "index": 1,
      "object": "score",
      "score": 0.9969
    }
  ],
  "usage": {
    "total_tokens": 58
  }
}

POST /v1/score

One query against many documents with score API

Use /v1/score when you want unsorted scores and your application will handle sorting.

This is useful when you want to preserve the original candidate order or combine the reranker score with another scoring signal.

Required attributes

Name	Type	Description
`queries`	array	Repeat the same query once per document.
`documents`	array	Candidate documents to score.

Request

POST /v1/score

curl -X POST "$QWEN_RERANK_BASE_URL/score" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "queries": [
      "What is vLLM?",
      "What is vLLM?",
      "What is vLLM?"
    ],
    "documents": [
      "vLLM is an inference and serving engine for large language models.",
      "Beijing is the capital of China.",
      "A vector database stores embeddings for semantic search."
    ]
  }'

Response shape

{
  "id": "score-many123",
  "object": "list",
  "model": "Qwen/Qwen3-Reranker-4B",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.9812
    },
    {
      "index": 1,
      "object": "score",
      "score": 0.0241
    },
    {
      "index": 2,
      "object": "score",
      "score": 0.2183
    }
  ],
  "usage": {
    "total_tokens": 47
  }
}
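
Combining the unsorted scores with another signal and sorting them yourself might look like the sketch below. It assumes the scores have already been extracted into text lines of the form `index rerank_score retrieval_score`. The weighted sum and the retrieval scores in the example call are illustrative assumptions, not something the score endpoint prescribes.

```shell
# Blend reranker scores with a first-stage retrieval score, then sort by the
# blended score in descending order.
# stdin: one line per document, "index rerank_score retrieval_score".
# $1:    weight given to the reranker score, between 0 and 1 (assumed scheme).
blend_and_sort() {
  LC_ALL=C awk -v w="$1" '{ printf "%s %.4f\n", $1, w * $2 + (1 - w) * $3 }' |
    LC_ALL=C sort -k2,2nr
}

# Reranker scores are taken from the response above; the third column holds
# made-up retrieval scores used only for illustration.
printf '0 0.9812 0.40\n1 0.0241 0.90\n2 0.2183 0.60\n' | blend_and_sort 0.7
```

Because /v1/score preserves input order, the first column can stay the original candidate position throughout, so no extra bookkeeping is needed after sorting.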

POST /v1/rerank

Rerank with activation control

vLLM supports the use_activation rerank parameter.

When enabled, the score is typically transformed into a normalized range such as 0–1, depending on the model pooler configuration. When disabled, the endpoint may return raw model scores.

Optional attributes used here

Name	Type	Description
`use_activation`	boolean	Whether to apply the model pooler’s activation function to the score.
`top_n`	integer	Number of top results to return.

Request

POST /v1/rerank

curl -X POST "$QWEN_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "query": "What is semantic search?",
    "documents": [
      "Semantic search retrieves results based on meaning rather than exact keyword overlap.",
      "The boiling point of water is 100°C at sea level.",
      "Reranking improves retrieval quality by rescoring candidate documents."
    ],
    "top_n": 2,
    "use_activation": true
  }'

Response shape

{
  "id": "rerank-activation123",
  "model": "Qwen/Qwen3-Reranker-4B",
  "usage": {
    "total_tokens": 63
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "Semantic search retrieves results based on meaning rather than exact keyword overlap."
      },
      "relevance_score": 0.9914
    },
    {
      "index": 2,
      "document": {
        "text": "Reranking improves retrieval quality by rescoring candidate documents."
      },
      "relevance_score": 0.6532
    }
  ]
}
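
To get a feel for what an activation does to a raw score, the sketch below applies a logistic sigmoid. This is an assumption for illustration only: this page does not state which activation the deployed pooler uses, so check your model's pooler configuration before relying on it. `sigmoid` is a hypothetical helper name.

```shell
# Logistic sigmoid: maps any real-valued raw score into the range 0..1.
# ASSUMPTION: shown only to illustrate a 0-1 normalization; the activation
# actually applied by the server depends on the model pooler configuration.
# $1 = raw score.
sigmoid() {
  LC_ALL=C awk -v x="$1" 'BEGIN { printf "%.4f\n", 1 / (1 + exp(-x)) }'
}

# A raw score of 0 sits exactly in the middle of the normalized range.
sigmoid 0
```

Under this assumption, large positive raw scores approach 1 and large negative ones approach 0, which is why normalized scores are easier to threshold than raw ones.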

POST /v1/rerank

Rerank with document objects

Some rerank-compatible clients send documents as objects with a text field.

Use this when your application wants to keep the document shape close to the response shape.

Required attributes

Name	Type	Description
`documents`	array	Candidate document objects. Each object should contain a `text` field.
`text`	string	Document text to score.

Request

POST /v1/rerank

curl -X POST "$QWEN_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "query": "What is the role of a reranker in RAG?",
    "documents": [
      {
        "text": "A reranker rescoring stage improves RAG by evaluating retrieved documents against the query more precisely."
      },
      {
        "text": "A tokenizer converts text into tokens before model inference."
      },
      {
        "text": "A load balancer distributes traffic across multiple servers."
      }
    ],
    "top_n": 2
  }'

Response shape

{
  "id": "rerank-objects123",
  "model": "Qwen/Qwen3-Reranker-4B",
  "usage": {
    "total_tokens": 69
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "A reranker rescoring stage improves RAG by evaluating retrieved documents against the query more precisely."
      },
      "relevance_score": 0.9887
    },
    {
      "index": 1,
      "document": {
        "text": "A tokenizer converts text into tokens before model inference."
      },
      "relevance_score": 0.1238
    }
  ]
}

POST /v1/rerank

Retrieval pipeline example

A typical retrieval pipeline uses a fast first-stage retriever, then Qwen3 Reranker as a second-stage precision filter.

Pipeline

Name	Type	Description
`first_stage_retriever`	string	Use BM25, vector search, hybrid search, or another retriever to fetch candidate documents.
`reranker`	string	Send the top candidates to Qwen3 Reranker.
`final_context`	array	Use the top reranked documents as context for downstream generation.

Request

POST /v1/rerank

curl -X POST "$QWEN_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "query": "How does reranking improve RAG answers?",
    "documents": [
      "RAG retrieves external documents and passes relevant context to a language model.",
      "Reranking improves RAG by rescoring first-stage retrieval candidates using a stronger cross-encoder relevance model.",
      "Embeddings are dense vectors that represent text meaning.",
      "KV cache stores attention keys and values during autoregressive decoding.",
      "Hybrid search combines sparse keyword matching and dense semantic retrieval."
    ],
    "top_n": 2
  }'

Response shape

{
  "id": "rerank-pipeline123",
  "model": "Qwen/Qwen3-Reranker-4B",
  "usage": {
    "total_tokens": 89
  },
  "results": [
    {
      "index": 1,
      "document": {
        "text": "Reranking improves RAG by rescoring first-stage retrieval candidates using a stronger cross-encoder relevance model."
      },
      "relevance_score": 0.9962
    },
    {
      "index": 0,
      "document": {
        "text": "RAG retrieves external documents and passes relevant context to a language model."
      },
      "relevance_score": 0.8127
    }
  ]
}
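
The hand-off from the first-stage retriever to the reranker can be sketched as a pair of shell functions that turn a list of candidates into a /v1/rerank request body. All function names are hypothetical, and the candidates are assumed to arrive one per line on stdin. The escaping covers only backslashes and double quotes, so text containing tabs or other control characters needs a real JSON encoder.

```shell
# Escape backslashes and double quotes so a string is safe inside JSON quotes.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build a /v1/rerank request body.
# $1 = model ID, $2 = query, $3 = top_n; stdin = one candidate document per line.
build_rerank_body() {
  docs=$(while IFS= read -r line; do
    printf '"%s",' "$(json_escape "$line")"
  done)
  # ${docs%,} drops the trailing comma left by the loop.
  printf '{"model":"%s","query":"%s","documents":[%s],"top_n":%s}' \
    "$(json_escape "$1")" "$(json_escape "$2")" "${docs%,}" "$3"
}

# Send first-stage candidates (stdin) to the reranker.
# $1 = query, $2 = top_n. Uses the environment variables set at the top of this page.
rerank_candidates() {
  build_rerank_body "$QWEN_RERANK_MODEL" "$1" "$2" |
    curl -s -X POST "$QWEN_RERANK_BASE_URL/rerank" \
      -H "Authorization: Bearer $HEXGRID_API_KEY" \
      -H "Content-Type: application/json" \
      -d @-
}
```

For example, `rerank_candidates "How does reranking improve RAG answers?" 2 < candidates.txt` would rerank the lines of a placeholder file `candidates.txt` written by your retriever.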

POST /v1/rerank

Multiple rerank parameters

This example combines an instruction-prefixed query with top_n, use_activation, and user in a single request.

Parameters shown here

Name	Type	Description
`model`	string	The served Qwen3 Reranker model name.
`query`	string	Query to compare with each document.
`documents`	array	Candidate documents to rerank.
`top_n`	integer	Maximum number of results to return.
`use_activation`	boolean	Whether to apply the pooler activation function to relevance scores.
`user`	string	Optional end-user identifier accepted by vLLM’s rerank API.

Request

POST /v1/rerank

curl -X POST "$QWEN_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$QWEN_RERANK_MODEL"'",
    "query": "Instruct: Given a technical support question, retrieve passages that directly answer the issue\nQuery: Why is my vLLM request returning 404 for chat completions?",
    "documents": [
      "Use the base URL ending in /v1 and send chat requests to /v1/chat/completions.",
      "The /v1/embeddings endpoint returns vector embeddings for input text.",
      "A 404 can occur when the client appends /chat/completions to a URL that already includes /chat/completions.",
      "Reranking endpoints return relevance scores, not generated text."
    ],
    "top_n": 3,
    "use_activation": true,
    "user": "user_123"
  }'

Response shape

{
  "id": "rerank-params123",
  "model": "Qwen/Qwen3-Reranker-4B",
  "usage": {
    "total_tokens": 97
  },
  "results": [
    {
      "index": 2,
      "document": {
        "text": "A 404 can occur when the client appends /chat/completions to a URL that already includes /chat/completions."
      },
      "relevance_score": 0.9941
    },
    {
      "index": 0,
      "document": {
        "text": "Use the base URL ending in /v1 and send chat requests to /v1/chat/completions."
      },
      "relevance_score": 0.9683
    },
    {
      "index": 3,
      "document": {
        "text": "Reranking endpoints return relevance scores, not generated text."
      },
      "relevance_score": 0.0872
    }
  ]
}

Official sources

vLLM OpenAI-compatible server docs: https://docs.vllm.ai/en/latest/serving/openai_compatible_server/
vLLM score and rerank example docs: https://docs.vllm.ai/en/latest/examples/pooling/score/
vLLM scoring usage docs: https://docs.vllm.ai/en/latest/models/pooling_models/scoring/
Qwen3 Embedding official GitHub repository: https://github.com/QwenLM/Qwen3-Embedding
Qwen/Qwen3-Reranker-0.6B model card: https://huggingface.co/Qwen/Qwen3-Reranker-0.6B
Qwen/Qwen3-Reranker-4B model card: https://huggingface.co/Qwen/Qwen3-Reranker-4B
Qwen/Qwen3-Reranker-8B model card: https://huggingface.co/Qwen/Qwen3-Reranker-8B
