BGE Reranker V2 Gemma API - vLLM engine
Use vLLM’s reranking endpoint to call HexGrid-hosted BGE Reranker V2 Gemma and score how relevant each document is to a query.
This page provides copy-pasteable cURL examples for basic reranking, top-N reranking, search-result reranking, pairwise scoring, one-to-many scoring, activated and raw scores, document-object reranking, a retrieval pipeline, and combined reranker parameters.
Endpoint
POST http://<server-ip>:<port>/v1/rerank
Set these environment variables before running the examples:
export HEXGRID_API_KEY="your-hexgrid-api-key"
export BGE_RERANK_BASE_URL="http://<server-ip>:<port>/v1"
export BGE_RERANK_MODEL="BAAI/bge-reranker-v2-gemma"
Use the exact model ID configured in your HexGrid deployment.
Model
| Model | Base model | Type | Language support | Output |
|---|---|---|---|---|
| BAAI/bge-reranker-v2-gemma | Gemma 2B | Reranker / cross-encoder | Multilingual | Query-document relevance score |
BGE rerankers differ from embedding models: they take a query and a document together as input and output a relevance score directly, rather than returning an embedding vector.
Rerank documents
Rerank a list of documents for a single query.
Required attributes
- model (string): The served reranker model name, for example BAAI/bge-reranker-v2-gemma.
- query (string): The query to compare against each document.
- documents (array): Candidate documents to rerank. Each item can be a string document.
Response fields
- results (array): Reranked results sorted by relevance.
- index (integer): Original index of the document in the input documents array.
- relevance_score (number): Relevance score for the query-document pair.
Request
curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$BGE_RERANK_MODEL"'",
"query": "What is the capital of China?",
"documents": [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies toward each other.",
"Paris is the capital of France."
]
}'
Response shape
{
"id": "rerank-abc123",
"model": "BAAI/bge-reranker-v2-gemma",
"usage": {
"total_tokens": 48
},
"results": [
{
"index": 0,
"document": {
"text": "The capital of China is Beijing."
},
"relevance_score": 0.9985
},
{
"index": 2,
"document": {
"text": "Paris is the capital of France."
},
"relevance_score": 0.1024
},
{
"index": 1,
"document": {
"text": "Gravity is a force that attracts two bodies toward each other."
},
"relevance_score": 0.0187
}
]
}
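For applications that call the endpoint from code rather than a shell, the same request can be sketched in Python using only the standard library. The helper names `build_rerank_payload` and `rerank` are illustrative and not part of vLLM or HexGrid, and the sketch assumes the environment variables defined earlier on this page are set.

```python
import json
import os
import urllib.request


def build_rerank_payload(model, query, documents, top_n=None):
    """Build the JSON body for POST /v1/rerank (hypothetical helper)."""
    payload = {"model": model, "query": query, "documents": list(documents)}
    if top_n is not None:
        payload["top_n"] = top_n
    return payload


def rerank(payload, base_url=None, api_key=None, timeout=30):
    """Send the payload to the rerank endpoint and return the parsed JSON body."""
    base_url = base_url or os.environ["BGE_RERANK_BASE_URL"]
    api_key = api_key or os.environ["HEXGRID_API_KEY"]
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


payload = build_rerank_payload(
    model=os.environ.get("BGE_RERANK_MODEL", "BAAI/bge-reranker-v2-gemma"),
    query="What is the capital of China?",
    documents=[
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies toward each other.",
        "Paris is the capital of France.",
    ],
)

# Uncomment when a deployment is reachable:
# response = rerank(payload)
# for result in response["results"]:
#     print(result["index"], result["relevance_score"])
```

Keeping payload construction separate from the network call makes the request body easy to inspect or log before it is sent.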
Top-N reranking
Return only the top N most relevant documents.
Optional attributes used here
- top_n (integer): Maximum number of ranked documents to return. If omitted, vLLM returns all documents sorted by relevance.
Request
curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$BGE_RERANK_MODEL"'",
"query": "Explain gravity",
"documents": [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
"A neural network is a machine learning model inspired by biological neurons.",
"Photosynthesis is the process by which plants convert light energy into chemical energy."
],
"top_n": 2
}'
Response shape
{
"id": "rerank-topn123",
"model": "BAAI/bge-reranker-v2-gemma",
"usage": {
"total_tokens": 78
},
"results": [
{
"index": 1,
"document": {
"text": "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
},
"relevance_score": 0.9971
},
{
"index": 3,
"document": {
"text": "Photosynthesis is the process by which plants convert light energy into chemical energy."
},
"relevance_score": 0.0418
}
]
}
Rerank search results
Use reranking after a first-stage retrieval step, such as BM25, vector search, or hybrid search.
Your application sends the top candidate chunks to the reranker, then uses the returned index values to map ranked results back to the original documents.
Attributes used here
- query (string): User search query.
- documents (array): Candidate search results from your first-stage retriever.
- top_n (integer, optional): Number of final results to keep.
Request
curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$BGE_RERANK_MODEL"'",
"query": "How do I configure streaming in vLLM?",
"documents": [
"vLLM exposes an OpenAI-compatible Chat Completions endpoint. Set stream to true to receive Server-Sent Events.",
"The capital of France is Paris.",
"Embeddings convert text into dense vectors for semantic search.",
"Use top_k and top_p to control sampling diversity in generation."
],
"top_n": 3
}'
Response shape
{
"id": "rerank-search123",
"model": "BAAI/bge-reranker-v2-gemma",
"usage": {
"total_tokens": 72
},
"results": [
{
"index": 0,
"document": {
"text": "vLLM exposes an OpenAI-compatible Chat Completions endpoint. Set stream to true to receive Server-Sent Events."
},
"relevance_score": 0.9821
},
{
"index": 3,
"document": {
"text": "Use top_k and top_p to control sampling diversity in generation."
},
"relevance_score": 0.2137
},
{
"index": 2,
"document": {
"text": "Embeddings convert text into dense vectors for semantic search."
},
"relevance_score": 0.1105
}
]
}
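The index values in a response like the one above can be mapped back to the original search hits on the client. The following Python sketch shows one way to do that; the candidate IDs and the `attach_scores` helper are hypothetical, and the response is abbreviated to the fields the mapping needs.

```python
# Hypothetical first-stage hits. The "id" values are illustrative only.
candidates = [
    {"id": "doc-streaming", "text": "vLLM exposes an OpenAI-compatible Chat Completions endpoint. Set stream to true to receive Server-Sent Events."},
    {"id": "doc-paris", "text": "The capital of France is Paris."},
    {"id": "doc-embeddings", "text": "Embeddings convert text into dense vectors for semantic search."},
    {"id": "doc-sampling", "text": "Use top_k and top_p to control sampling diversity in generation."},
]

# The documents array sent to /v1/rerank must keep the candidate order,
# because the response refers to documents by their position in it.
documents = [candidate["text"] for candidate in candidates]

# Abbreviated response in the shape shown above.
response = {
    "results": [
        {"index": 0, "relevance_score": 0.9821},
        {"index": 3, "relevance_score": 0.2137},
        {"index": 2, "relevance_score": 0.1105},
    ]
}


def attach_scores(candidates, response):
    """Return candidates in reranked order with the relevance score attached."""
    ranked = []
    for result in response["results"]:
        hit = dict(candidates[result["index"]])  # copy so the input is not mutated
        hit["relevance_score"] = result["relevance_score"]
        ranked.append(hit)
    return ranked


ranked = attach_scores(candidates, response)
print([hit["id"] for hit in ranked])
```

Because only positions are exchanged with the server, any metadata (IDs, URLs, chunk offsets) can stay on the client side.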
Pairwise scoring
Use the vLLM score endpoint when you want a relevance score for each aligned query-document pair.
Unlike /v1/rerank, which compares one query against many documents and returns results sorted by relevance, /v1/score scores each aligned query-document pair and returns the scores in input order.
Required attributes
- model (string): The served reranker model name.
- queries (array): Queries to score.
- documents (array): Documents to score. Each document is paired with the query at the same index.
Request
curl -X POST "$BGE_RERANK_BASE_URL/score" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$BGE_RERANK_MODEL"'",
"queries": [
"What is the capital of China?",
"Explain gravity"
],
"documents": [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
}'
Response shape
{
"id": "score-abc123",
"object": "list",
"model": "BAAI/bge-reranker-v2-gemma",
"data": [
{
"index": 0,
"object": "score",
"score": 0.9984
},
{
"index": 1,
"object": "score",
"score": 0.9969
}
],
"usage": {
"total_tokens": 58
}
}
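Since each score is identified only by its index, a client usually joins the scores back to the pairs it sent. The Python sketch below shows one possible way; `pair_scores` is a hypothetical helper, and the length check simply mirrors the index-aligned pairing described above.

```python
def pair_scores(queries, documents, response):
    """Join each score in a /v1/score response to the pair at the same index."""
    if len(queries) != len(documents):
        raise ValueError("queries and documents must have the same length")
    scores = {item["index"]: item["score"] for item in response["data"]}
    return [
        (query, document, scores[i])
        for i, (query, document) in enumerate(zip(queries, documents))
    ]


queries = ["What is the capital of China?", "Explain gravity"]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies toward each other.",
]

# Abbreviated response in the shape shown above.
response = {
    "data": [
        {"index": 0, "object": "score", "score": 0.9984},
        {"index": 1, "object": "score", "score": 0.9969},
    ]
}

for query, document, score in pair_scores(queries, documents, response):
    print(f"{score:.4f}  {query!r} -> {document!r}")
```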
One query against many documents with the score API
Use /v1/score when you want unsorted scores and your application will handle sorting.
This is useful when you want to preserve the original candidate order or combine the reranker score with another scoring signal.
Required attributes
- queries (array): Repeat the same query once per document.
- documents (array): Candidate documents to score.
Request
curl -X POST "$BGE_RERANK_BASE_URL/score" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$BGE_RERANK_MODEL"'",
"queries": [
"What is vLLM?",
"What is vLLM?",
"What is vLLM?"
],
"documents": [
"vLLM is an inference and serving engine for large language models.",
"Beijing is the capital of China.",
"A vector database stores embeddings for semantic search."
]
}'
Response shape
{
"id": "score-many123",
"object": "list",
"model": "BAAI/bge-reranker-v2-gemma",
"data": [
{
"index": 0,
"object": "score",
"score": 0.9812
},
{
"index": 1,
"object": "score",
"score": 0.0241
},
{
"index": 2,
"object": "score",
"score": 0.2183
}
],
"usage": {
"total_tokens": 47
}
}
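The two uses mentioned above, keeping the original candidate order and combining the reranker score with another signal, can be sketched in Python as follows. The first-stage scores, the weight of 0.7, and all helper names are assumptions made for illustration; a real blend would need scores on comparable scales.

```python
def build_score_payload(model, query, documents):
    """Repeat the query once per document so the two arrays stay aligned."""
    documents = list(documents)
    return {
        "model": model,
        "queries": [query] * len(documents),
        "documents": documents,
    }


def scores_in_input_order(response):
    """Return a list where position i holds the score of documents[i]."""
    ordered = sorted(response["data"], key=lambda item: item["index"])
    return [item["score"] for item in ordered]


def blend(rerank_scores, other_scores, weight=0.7):
    """Weighted average of two signals (both assumed to lie in [0, 1])."""
    return [
        weight * rerank + (1.0 - weight) * other
        for rerank, other in zip(rerank_scores, other_scores)
    ]


# Abbreviated response in the shape shown above.
response = {
    "data": [
        {"index": 0, "score": 0.9812},
        {"index": 1, "score": 0.0241},
        {"index": 2, "score": 0.2183},
    ]
}

first_stage_scores = [0.40, 0.90, 0.55]  # hypothetical retriever scores

rerank_scores = scores_in_input_order(response)
combined = blend(rerank_scores, first_stage_scores)

# Sort candidate positions by the blended score, best first.
final_order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
print(final_order)
```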
Rerank with activation control
vLLM supports the use_activation rerank parameter.
For BGE rerankers, the model card notes that scores can be mapped to the [0, 1] range with a sigmoid function. Use use_activation: true when you want activated relevance scores from the serving layer.
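For reference, the sigmoid mapping mentioned above is sigmoid(x) = 1 / (1 + e^(-x)). The Python sketch below applies it on the client to a few raw scores. It assumes the activation in question is a plain sigmoid; whether the deployed pooler uses exactly this function depends on the deployment and should be checked against actual responses.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping a raw score into [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow in exp(-x) for very negative x
    return z / (1.0 + z)


# Illustrative raw scores, not taken from a real deployment.
for raw in (-2.0, 0.0, 0.632, 5.214):
    print(f"{raw:+.3f} -> {sigmoid(raw):.4f}")
```

A raw score of 0 maps to 0.5, large positive scores approach 1, and large negative scores approach 0, so the ordering of documents is the same with or without activation.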
Optional attributes used here
- use_activation (boolean): Whether to apply the model pooler’s activation function to the score.
- top_n (integer): Number of top results to return.
Request
curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$BGE_RERANK_MODEL"'",
"query": "What is semantic search?",
"documents": [
"Semantic search retrieves results based on meaning rather than exact keyword overlap.",
"The boiling point of water is 100°C at sea level.",
"Reranking improves retrieval quality by rescoring candidate documents."
],
"top_n": 2,
"use_activation": true
}'
Response shape
{
"id": "rerank-activation123",
"model": "BAAI/bge-reranker-v2-gemma",
"usage": {
"total_tokens": 63
},
"results": [
{
"index": 0,
"document": {
"text": "Semantic search retrieves results based on meaning rather than exact keyword overlap."
},
"relevance_score": 0.9914
},
{
"index": 2,
"document": {
"text": "Reranking improves retrieval quality by rescoring candidate documents."
},
"relevance_score": 0.6532
}
]
}
Raw score mode
Set use_activation to false when you want the raw, unactivated scores from the deployed pooler.
Raw scores may be useful if your application calibrates thresholds separately or combines reranker scores with other ranking signals.
Optional attributes used here
- use_activation (boolean): Set to false to request raw relevance scores when supported by the deployment.
Request
curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$BGE_RERANK_MODEL"'",
"query": "What is semantic search?",
"documents": [
"Semantic search retrieves results based on meaning rather than exact keyword overlap.",
"The boiling point of water is 100°C at sea level.",
"Reranking improves retrieval quality by rescoring candidate documents."
],
"top_n": 2,
"use_activation": false
}'
Response shape
{
"id": "rerank-raw123",
"model": "BAAI/bge-reranker-v2-gemma",
"usage": {
"total_tokens": 63
},
"results": [
{
"index": 0,
"document": {
"text": "Semantic search retrieves results based on meaning rather than exact keyword overlap."
},
"relevance_score": 5.214
},
{
"index": 2,
"document": {
"text": "Reranking improves retrieval quality by rescoring candidate documents."
},
"relevance_score": 0.632
}
]
}
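One way to calibrate a threshold separately is to choose a cutoff in probability terms and convert it to the raw scale. The Python sketch below does this with the inverse sigmoid (logit). It assumes the raw scores are logits whose activation would be a sigmoid, and the 0.9 cutoff and helper names are illustrative only.

```python
import math


def logit(p: float) -> float:
    """Inverse of the sigmoid: the raw score whose sigmoid equals p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def keep_above(results, raw_threshold):
    """Keep results whose raw relevance score meets the threshold."""
    return [r for r in results if r["relevance_score"] >= raw_threshold]


# Abbreviated results in the shape shown above.
raw_results = [
    {"index": 0, "relevance_score": 5.214},
    {"index": 2, "relevance_score": 0.632},
]

raw_threshold = logit(0.9)  # hypothetical cutoff expressed as a probability
kept = keep_above(raw_results, raw_threshold)
print(round(raw_threshold, 3), [r["index"] for r in kept])
```

Thresholding on the raw scale and thresholding on the activated scale select the same documents as long as the two cutoffs correspond under the sigmoid.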
Rerank with document objects
Some rerank-compatible clients send documents as objects with a text field.
Use this when your application wants to keep the document shape close to the response shape.
Required attributes
- documents (array): Candidate document objects. Each object should contain text.
- text (string): Document text to score.
Request
curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$BGE_RERANK_MODEL"'",
"query": "What is the role of a reranker in RAG?",
"documents": [
{
"text": "A reranker rescoring stage improves RAG by evaluating retrieved documents against the query more precisely."
},
{
"text": "A tokenizer converts text into tokens before model inference."
},
{
"text": "A load balancer distributes traffic across multiple servers."
}
],
"top_n": 2
}'
Response shape
{
"id": "rerank-objects123",
"model": "BAAI/bge-reranker-v2-gemma",
"usage": {
"total_tokens": 69
},
"results": [
{
"index": 0,
"document": {
"text": "A reranker rescoring stage improves RAG by evaluating retrieved documents against the query more precisely."
},
"relevance_score": 0.9887
},
{
"index": 1,
"document": {
"text": "A tokenizer converts text into tokens before model inference."
},
"relevance_score": 0.1238
}
]
}
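When local records carry metadata, they can be reduced to the object form above before sending and joined back afterwards. The Python sketch below assumes hypothetical `id` and `source` fields and helper names; it also checks that the text returned for each index matches what was sent, as a simple guard against misaligned arrays.

```python
# Hypothetical local records. Only "text" is sent to the reranker.
records = [
    {"id": "kb-101", "source": "rag-guide.md", "text": "A reranker rescoring stage improves RAG by evaluating retrieved documents against the query more precisely."},
    {"id": "kb-202", "source": "tokenizers.md", "text": "A tokenizer converts text into tokens before model inference."},
    {"id": "kb-303", "source": "networking.md", "text": "A load balancer distributes traffic across multiple servers."},
]


def to_document_objects(records):
    """Strip metadata so each document is an object with a text field."""
    return [{"text": record["text"]} for record in records]


def check_echo(records, response):
    """Raise if a returned document text differs from the text sent at that index."""
    for result in response["results"]:
        sent = records[result["index"]]["text"]
        echoed = result.get("document", {}).get("text")
        if echoed is not None and echoed != sent:
            raise ValueError(f"document mismatch at index {result['index']}")


documents = to_document_objects(records)

# Abbreviated response in the shape shown above.
response = {
    "results": [
        {"index": 0, "document": {"text": records[0]["text"]}, "relevance_score": 0.9887},
        {"index": 1, "document": {"text": records[1]["text"]}, "relevance_score": 0.1238},
    ]
}

check_echo(records, response)
top_sources = [records[r["index"]]["source"] for r in response["results"]]
print(top_sources)
```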
Retrieval pipeline example
A typical retrieval pipeline uses a fast first-stage retriever, then BGE Reranker V2 Gemma as a second-stage precision filter.
Pipeline
- first_stage_retriever (string): Use BM25, vector search, hybrid search, or another retriever to fetch candidate documents.
- reranker (string): Send the top candidates to BGE Reranker V2 Gemma.
- final_context (array): Use the top reranked documents as context for downstream generation.
Request
curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$BGE_RERANK_MODEL"'",
"query": "How does reranking improve RAG answers?",
"documents": [
"RAG retrieves external documents and passes relevant context to a language model.",
"Reranking improves RAG by rescoring first-stage retrieval candidates using a stronger cross-encoder relevance model.",
"Embeddings are dense vectors that represent text meaning.",
"KV cache stores attention keys and values during autoregressive decoding.",
"Hybrid search combines sparse keyword matching and dense semantic retrieval."
],
"top_n": 2
}'
Response shape
{
"id": "rerank-pipeline123",
"model": "BAAI/bge-reranker-v2-gemma",
"usage": {
"total_tokens": 89
},
"results": [
{
"index": 1,
"document": {
"text": "Reranking improves RAG by rescoring first-stage retrieval candidates using a stronger cross-encoder relevance model."
},
"relevance_score": 0.9962
},
{
"index": 0,
"document": {
"text": "RAG retrieves external documents and passes relevant context to a language model."
},
"relevance_score": 0.8127
}
]
}
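The three pipeline stages can be sketched end to end in Python. Everything here is a stand-in chosen so the example runs offline: `keyword_retrieve` is a toy word-overlap retriever rather than BM25, and `stub_rerank` only imitates the response shape of /v1/rerank. In a real pipeline the `rerank_fn` argument would call the deployed endpoint instead.

```python
def keyword_retrieve(query, corpus, k):
    """Toy first-stage retriever: rank documents by shared lowercase words."""
    query_terms = set(query.lower().split())
    overlaps = [
        (len(query_terms & set(doc.lower().split())), position)
        for position, doc in enumerate(corpus)
    ]
    overlaps.sort(key=lambda pair: (-pair[0], pair[1]))
    return [corpus[position] for _, position in overlaps[:k]]


def stub_rerank(query, documents, top_n):
    """Offline stand-in for POST /v1/rerank. Returns a response-shaped dict."""
    scored = [
        {"index": i, "relevance_score": 1.0 if "rerank" in doc.lower() else 0.1}
        for i, doc in enumerate(documents)
    ]
    scored.sort(key=lambda result: (-result["relevance_score"], result["index"]))
    return {"results": scored[:top_n]}


def build_context(query, corpus, rerank_fn, k=4, top_n=2):
    """Retrieve k candidates, rerank them, and return the top_n texts."""
    candidates = keyword_retrieve(query, corpus, k)
    response = rerank_fn(query, candidates, top_n)
    # Indices refer to the candidates list that was sent, not to the corpus.
    return [candidates[result["index"]] for result in response["results"]]


corpus = [
    "RAG retrieves external documents and passes relevant context to a language model.",
    "Reranking improves RAG by rescoring first-stage retrieval candidates using a stronger cross-encoder relevance model.",
    "Embeddings are dense vectors that represent text meaning.",
    "KV cache stores attention keys and values during autoregressive decoding.",
    "Hybrid search combines sparse keyword matching and dense semantic retrieval.",
]

context = build_context("How does reranking improve RAG answers?", corpus, stub_rerank)
for passage in context:
    print(passage)
```

Passing the reranker in as a function keeps the pipeline testable without network access and makes it easy to swap the stub for a real HTTP call.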
Multiple rerank parameters
This example combines top_n, use_activation, and user in a single rerank request.
Parameters shown here
- model (string): The served BGE Reranker V2 Gemma model name.
- query (string): Query to compare with each document.
- documents (array): Candidate documents to rerank.
- top_n (integer): Maximum number of results to return.
- use_activation (boolean): Whether to apply the pooler activation function to relevance scores.
- user (string): Optional end-user identifier accepted by vLLM’s rerank API.
Request
curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
-H "Authorization: Bearer $HEXGRID_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "'"$BGE_RERANK_MODEL"'",
"query": "Why is my vLLM request returning 404 for chat completions?",
"documents": [
"Use the base URL ending in /v1 and send chat requests to /v1/chat/completions.",
"The /v1/embeddings endpoint returns vector embeddings for input text.",
"A 404 can occur when the client appends /chat/completions to a URL that already includes /chat/completions.",
"Reranking endpoints return relevance scores, not generated text."
],
"top_n": 3,
"use_activation": true,
"user": "user_123"
}'
Response shape
{
"id": "rerank-params123",
"model": "BAAI/bge-reranker-v2-gemma",
"usage": {
"total_tokens": 97
},
"results": [
{
"index": 2,
"document": {
"text": "A 404 can occur when the client appends /chat/completions to a URL that already includes /chat/completions."
},
"relevance_score": 0.9941
},
{
"index": 0,
"document": {
"text": "Use the base URL ending in /v1 and send chat requests to /v1/chat/completions."
},
"relevance_score": 0.9683
},
{
"index": 3,
"document": {
"text": "Reranking endpoints return relevance scores, not generated text."
},
"relevance_score": 0.0872
}
]
}
Official sources
- vLLM OpenAI-compatible server docs: https://docs.vllm.ai/en/latest/serving/openai_compatible_server/
- vLLM scoring usage docs: https://docs.vllm.ai/en/latest/models/pooling_models/scoring/
- vLLM score and rerank example docs: https://docs.vllm.ai/en/latest/examples/pooling/score/
- BAAI/bge-reranker-v2-gemma model card: https://huggingface.co/BAAI/bge-reranker-v2-gemma
- BAAI/bge-reranker-v2-m3 model card and BGE reranker model list: https://huggingface.co/BAAI/bge-reranker-v2-m3
- FlagEmbedding repository: https://github.com/FlagOpen/FlagEmbedding