BGE Reranker V2 Gemma API - vLLM engine

Use vLLM’s reranking endpoint to call HexGrid-hosted BGE Reranker V2 Gemma and score how relevant each document is to a query.

This page provides copy-pasteable cURL examples for basic reranking, top-N reranking, pairwise and batch pair scoring, activated and raw scores, document-object reranking, retrieval pipelines, and common reranker parameters.

Endpoints

POST http://<server-ip>:<port>/v1/rerank
POST http://<server-ip>:<port>/v1/score

Set these environment variables before running the examples:

export HEXGRID_API_KEY="your-hexgrid-api-key"
export BGE_RERANK_BASE_URL="http://<server-ip>:<port>/v1"
export BGE_RERANK_MODEL="BAAI/bge-reranker-v2-gemma"

Use the exact model ID configured in your HexGrid deployment.

Model

Model	Base model	Type	Language support	Output
`BAAI/bge-reranker-v2-gemma`	Gemma 2B	Reranker / cross-encoder	Multilingual	Query-document relevance score

Unlike embedding models, BGE rerankers take a query and a document together as input and output a relevance score directly instead of returning an embedding vector.

POST /v1/rerank

Rerank documents

Rerank a list of documents for a single query.

Required attributes

Name	Type	Description
model	string	The served reranker model name, for example BAAI/bge-reranker-v2-gemma.
query	string	The query to compare against each document.
documents	array	Candidate documents to rerank. Each item can be a string document.

Response fields

Name	Type	Description
results	array	Reranked results, sorted by relevance from highest to lowest.
index	integer	Per result: original index of the document in the input documents array.
relevance_score	number	Per result: relevance score for the query-document pair.

Request

POST /v1/rerank

curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "query": "What is the capital of China?",
    "documents": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other.",
      "Paris is the capital of France."
    ]
  }'

Response shape

{
  "id": "rerank-abc123",
  "model": "BAAI/bge-reranker-v2-gemma",
  "usage": {
    "total_tokens": 48
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "The capital of China is Beijing."
      },
      "relevance_score": 0.9985
    },
    {
      "index": 2,
      "document": {
        "text": "Paris is the capital of France."
      },
      "relevance_score": 0.1024
    },
    {
      "index": 1,
      "document": {
        "text": "Gravity is a force that attracts two bodies toward each other."
      },
      "relevance_score": 0.0187
    }
  ]
}

POST /v1/rerank

Top-N reranking

Return only the top N most relevant documents.

Optional attributes used here

Name	Type	Description
top_n	integer	Maximum number of ranked documents to return. If omitted, vLLM returns all documents sorted by relevance.

Request

POST /v1/rerank

curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "query": "Explain gravity",
    "documents": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
      "A neural network is a machine learning model inspired by biological neurons.",
      "Photosynthesis is the process by which plants convert light energy into chemical energy."
    ],
    "top_n": 2
  }'

Response shape

{
  "id": "rerank-topn123",
  "model": "BAAI/bge-reranker-v2-gemma",
  "usage": {
    "total_tokens": 78
  },
  "results": [
    {
      "index": 1,
      "document": {
        "text": "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
      },
      "relevance_score": 0.9971
    },
    {
      "index": 3,
      "document": {
        "text": "Photosynthesis is the process by which plants convert light energy into chemical energy."
      },
      "relevance_score": 0.0418
    }
  ]
}

POST /v1/rerank

Rerank search results

Use reranking after a first-stage retrieval step, such as BM25, vector search, or hybrid search.

Your application sends the top candidate chunks to the reranker, then uses the returned index values to map ranked results back to the original documents.
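The index-to-document mapping can be sketched in shell as follows. This is a minimal sketch, not part of the vLLM API: it assumes your candidate IDs are kept one per line in the same order as the documents array. The helper name map_ranked_ids and the kb-* IDs are hypothetical, and the jq command in the closing comment assumes jq is installed.

```shell
# Candidate IDs, one per line, in the same order as the "documents" array.
# These IDs are hypothetical placeholders.
CANDIDATE_IDS='kb-101
kb-205
kb-317
kb-442'

# map_ranked_ids: read result "index" values (one per line, already in ranked
# order) on stdin and print the matching candidate ID for each.
map_ranked_ids() {
  while IFS= read -r idx; do
    # "index" is 0-based; sed line numbers are 1-based.
    printf '%s\n' "$CANDIDATE_IDS" | sed -n "$((idx + 1))p"
  done
}

# Indices in the order used by this section's sample response: 0, 3, 2.
printf '0\n3\n2\n' | map_ranked_ids

# With jq installed, the indices can come straight from a saved response:
#   jq -r '.results[].index' response.json | map_ranked_ids
```

Keeping the IDs outside the request means the reranker only ever sees document text, while your application stays responsible for the lookup.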

Required attributes

Name	Type	Description
query	string	User search query.
documents	array	Candidate search results from your first-stage retriever.

Optional attributes used here

Name	Type	Description
top_n	integer	Number of final results to keep.

Request

POST /v1/rerank

curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "query": "How do I configure streaming in vLLM?",
    "documents": [
      "vLLM exposes an OpenAI-compatible Chat Completions endpoint. Set stream to true to receive Server-Sent Events.",
      "The capital of France is Paris.",
      "Embeddings convert text into dense vectors for semantic search.",
      "Use top_k and top_p to control sampling diversity in generation."
    ],
    "top_n": 3
  }'

Response shape

{
  "id": "rerank-search123",
  "model": "BAAI/bge-reranker-v2-gemma",
  "usage": {
    "total_tokens": 72
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "vLLM exposes an OpenAI-compatible Chat Completions endpoint. Set stream to true to receive Server-Sent Events."
      },
      "relevance_score": 0.9821
    },
    {
      "index": 3,
      "document": {
        "text": "Use top_k and top_p to control sampling diversity in generation."
      },
      "relevance_score": 0.2137
    },
    {
      "index": 2,
      "document": {
        "text": "Embeddings convert text into dense vectors for semantic search."
      },
      "relevance_score": 0.1105
    }
  ]
}

POST /v1/score

Pairwise scoring

Use the vLLM score endpoint when you want a relevance score for each aligned query-document pair.

Unlike /v1/rerank, which compares one query against many documents and returns results sorted by relevance, /v1/score scores each query against the document at the same index and returns the scores in input order.

Required attributes

Name	Type	Description
model	string	The served reranker model name.
queries	array	Queries to score.
documents	array	Documents to score. Each document is paired with the query at the same index.

Request

POST /v1/score

curl -X POST "$BGE_RERANK_BASE_URL/score" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "queries": [
      "What is the capital of China?",
      "Explain gravity"
    ],
    "documents": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
  }'

Response shape

{
  "id": "score-abc123",
  "object": "list",
  "model": "BAAI/bge-reranker-v2-gemma",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.9984
    },
    {
      "index": 1,
      "object": "score",
      "score": 0.9969
    }
  ],
  "usage": {
    "total_tokens": 58
  }
}

POST /v1/score

One query against many documents with the score API

Use /v1/score when you want unsorted scores and your application will handle sorting.

This is useful when you want to preserve the original candidate order or combine the reranker score with another scoring signal.
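One way to combine the reranker score with another signal and sort client-side is a weighted sum. The sketch below is illustrative only: the helper name rank_blended, the 0.7 weight, and the second score column are assumptions, and it expects the other signal to be normalized to [0, 1] already.

```shell
# rank_blended: read lines of "index rerank_score other_score" on stdin and
# print "index blended_score", highest blended score first.
# $1 is the weight given to the reranker score (hypothetical default: 0.7).
rank_blended() {
  awk -v w="${1:-0.7}" '{ printf "%d %.4f\n", $1, w * $2 + (1 - w) * $3 }' |
    sort -k2,2nr
}

# Column 2 holds the reranker scores from this section's sample response.
# Column 3 is a made-up first-stage score, already normalized to [0, 1].
printf '0 0.9812 0.60\n1 0.0241 0.90\n2 0.2183 0.40\n' | rank_blended 0.7
```

With these inputs, document 1 moves ahead of document 2 even though its reranker score is lower, which is the kind of behavior a blended ranking is meant to allow.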

Required attributes

Name	Type	Description
queries	array	Repeat the same query once per document.
documents	array	Candidate documents to score.

Request

POST /v1/score

curl -X POST "$BGE_RERANK_BASE_URL/score" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "queries": [
      "What is vLLM?",
      "What is vLLM?",
      "What is vLLM?"
    ],
    "documents": [
      "vLLM is an inference and serving engine for large language models.",
      "Beijing is the capital of China.",
      "A vector database stores embeddings for semantic search."
    ]
  }'

Response shape

{
  "id": "score-many123",
  "object": "list",
  "model": "BAAI/bge-reranker-v2-gemma",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.9812
    },
    {
      "index": 1,
      "object": "score",
      "score": 0.0241
    },
    {
      "index": 2,
      "object": "score",
      "score": 0.2183
    }
  ],
  "usage": {
    "total_tokens": 47
  }
}

POST /v1/rerank

Rerank with activation control

vLLM supports the use_activation rerank parameter.

For BGE rerankers, the model card notes that scores can be mapped to the [0, 1] range with a sigmoid function. Use use_activation: true when you want activated relevance scores from the serving layer.
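For reference, the sigmoid mapping itself can be reproduced client-side. This is a minimal sketch, assuming the raw score is a logit as the model card describes. The helper name sigmoid is hypothetical, and whether the deployed pooler applies exactly this function is an assumption to verify against your deployment.

```shell
# sigmoid: map a raw score (logit) to the (0, 1) range using 1 / (1 + e^-x).
sigmoid() {
  awk -v x="$1" 'BEGIN { printf "%.4f\n", 1 / (1 + exp(-x)) }'
}

sigmoid 0       # a raw score of 0 maps to 0.5000
sigmoid 2.0     # positive raw scores map above 0.5
sigmoid -2.0    # negative raw scores map below 0.5
```

Because the sigmoid is monotonic, applying it changes the scale of the scores but not the ranking order.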

Optional attributes used here

Name	Type	Description
use_activation	boolean	Whether to apply the model pooler’s activation function to the score.
top_n	integer	Number of top results to return.

Request

POST /v1/rerank

curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "query": "What is semantic search?",
    "documents": [
      "Semantic search retrieves results based on meaning rather than exact keyword overlap.",
      "The boiling point of water is 100°C at sea level.",
      "Reranking improves retrieval quality by rescoring candidate documents."
    ],
    "top_n": 2,
    "use_activation": true
  }'

Response shape

{
  "id": "rerank-activation123",
  "model": "BAAI/bge-reranker-v2-gemma",
  "usage": {
    "total_tokens": 63
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "Semantic search retrieves results based on meaning rather than exact keyword overlap."
      },
      "relevance_score": 0.9914
    },
    {
      "index": 2,
      "document": {
        "text": "Reranking improves retrieval quality by rescoring candidate documents."
      },
      "relevance_score": 0.6532
    }
  ]
}

POST /v1/rerank

Raw score mode

Set use_activation to false when you want the raw, unactivated scores from the deployed pooler.

Raw scores may be useful if your application calibrates thresholds separately or combines reranker scores with other ranking signals.
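A separately calibrated threshold can be applied to raw scores with a small filter. The sketch below is illustrative: the helper name filter_by_threshold and the cut-off of 1.0 are hypothetical, and a real threshold would come from your own evaluation data.

```shell
# filter_by_threshold: read "index score" lines on stdin and keep only the
# lines whose score is at least $1. The "+ 0" forces a numeric comparison.
filter_by_threshold() {
  awk -v t="$1" '$2 + 0 >= t + 0'
}

# Raw scores as in this section's sample response, with a hypothetical
# cut-off of 1.0: only the first line passes.
printf '0 5.214\n2 0.632\n' | filter_by_threshold 1.0
```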

Optional attributes used here

Name	Type	Description
use_activation	boolean	Set to false to request raw relevance scores when supported by the deployment.

Request

POST /v1/rerank

curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "query": "What is semantic search?",
    "documents": [
      "Semantic search retrieves results based on meaning rather than exact keyword overlap.",
      "The boiling point of water is 100°C at sea level.",
      "Reranking improves retrieval quality by rescoring candidate documents."
    ],
    "top_n": 2,
    "use_activation": false
  }'

Response shape

{
  "id": "rerank-raw123",
  "model": "BAAI/bge-reranker-v2-gemma",
  "usage": {
    "total_tokens": 63
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "Semantic search retrieves results based on meaning rather than exact keyword overlap."
      },
      "relevance_score": 5.214
    },
    {
      "index": 2,
      "document": {
        "text": "Reranking improves retrieval quality by rescoring candidate documents."
      },
      "relevance_score": 0.632
    }
  ]
}

POST /v1/rerank

Rerank with document objects

Some rerank-compatible clients send documents as objects with a text field.

Use this when your application wants to keep the document shape close to the response shape.

Required attributes

Name	Type	Description
documents	array	Candidate document objects. Each object should contain text.
text	string	Document text to score.

Request

POST /v1/rerank

curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "query": "What is the role of a reranker in RAG?",
    "documents": [
      {
        "text": "A reranker rescoring stage improves RAG by evaluating retrieved documents against the query more precisely."
      },
      {
        "text": "A tokenizer converts text into tokens before model inference."
      },
      {
        "text": "A load balancer distributes traffic across multiple servers."
      }
    ],
    "top_n": 2
  }'

Response shape

{
  "id": "rerank-objects123",
  "model": "BAAI/bge-reranker-v2-gemma",
  "usage": {
    "total_tokens": 69
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "A reranker rescoring stage improves RAG by evaluating retrieved documents against the query more precisely."
      },
      "relevance_score": 0.9887
    },
    {
      "index": 1,
      "document": {
        "text": "A tokenizer converts text into tokens before model inference."
      },
      "relevance_score": 0.1238
    }
  ]
}

POST /v1/rerank

Retrieval pipeline example

A typical retrieval pipeline uses a fast first-stage retriever, then BGE Reranker V2 Gemma as a second-stage precision filter.

Pipeline

Name	Type	Description
first_stage_retriever	string	Use BM25, vector search, hybrid search, or another retriever to fetch candidate documents.
reranker	string	Send the top candidates to BGE Reranker V2 Gemma.
final_context	array	Use the top reranked documents as context for downstream generation.
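The hand-off from the first stage to the reranker can be sketched as a helper that turns a list of candidates into the request body. This is a minimal sketch under stated assumptions: the function names json_escape and build_rerank_payload are hypothetical, candidates arrive one per line, and the escaping covers only backslashes and double quotes, so text containing tabs or control characters would need a proper JSON tool.

```shell
# json_escape: escape backslashes and double quotes so a plain-text value can
# sit inside a JSON string. Assumes no tabs, newlines, or control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# build_rerank_payload MODEL QUERY TOP_N
# Reads first-stage candidates from stdin, one per line, and prints the
# JSON body for /v1/rerank.
build_rerank_payload() {
  docs=""
  while IFS= read -r line; do
    [ -n "$line" ] || continue            # skip blank lines
    docs="${docs:+$docs,}\"$(json_escape "$line")\""
  done
  printf '{"model":"%s","query":"%s","documents":[%s],"top_n":%s}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")" "$docs" "$3"
}

# Example with two candidates; this prints the JSON body only.
printf '%s\n' \
  'RAG retrieves external documents and passes relevant context to a language model.' \
  'Hybrid search combines sparse keyword matching and dense semantic retrieval.' |
  build_rerank_payload "BAAI/bge-reranker-v2-gemma" "How does reranking improve RAG answers?" 2

# To send it, pipe the body into curl, which reads it from stdin with -d @-:
#   build_rerank_payload "$BGE_RERANK_MODEL" "$QUERY" 2 < candidates.txt |
#     curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
#       -H "Authorization: Bearer $HEXGRID_API_KEY" \
#       -H "Content-Type: application/json" -d @-
```

Here candidates.txt and QUERY stand in for whatever your first-stage retriever produces.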

Request

POST /v1/rerank

curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "query": "How does reranking improve RAG answers?",
    "documents": [
      "RAG retrieves external documents and passes relevant context to a language model.",
      "Reranking improves RAG by rescoring first-stage retrieval candidates using a stronger cross-encoder relevance model.",
      "Embeddings are dense vectors that represent text meaning.",
      "KV cache stores attention keys and values during autoregressive decoding.",
      "Hybrid search combines sparse keyword matching and dense semantic retrieval."
    ],
    "top_n": 2
  }'

Response shape

{
  "id": "rerank-pipeline123",
  "model": "BAAI/bge-reranker-v2-gemma",
  "usage": {
    "total_tokens": 89
  },
  "results": [
    {
      "index": 1,
      "document": {
        "text": "Reranking improves RAG by rescoring first-stage retrieval candidates using a stronger cross-encoder relevance model."
      },
      "relevance_score": 0.9962
    },
    {
      "index": 0,
      "document": {
        "text": "RAG retrieves external documents and passes relevant context to a language model."
      },
      "relevance_score": 0.8127
    }
  ]
}

POST /v1/rerank

Multiple rerank parameters

Use this example when you need several reranker controls in one request: top_n, use_activation, and user.

Parameters shown here

Name	Type	Description
model	string	The served BGE Reranker V2 Gemma model name.
query	string	Query to compare with each document.
documents	array	Candidate documents to rerank.
top_n	integer	Maximum number of results to return.
use_activation	boolean	Whether to apply the pooler activation function to relevance scores.
user	string	Optional end-user identifier accepted by vLLM’s rerank API.

Request

POST /v1/rerank

curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "query": "Why is my vLLM request returning 404 for chat completions?",
    "documents": [
      "Use the base URL ending in /v1 and send chat requests to /v1/chat/completions.",
      "The /v1/embeddings endpoint returns vector embeddings for input text.",
      "A 404 can occur when the client appends /chat/completions to a URL that already includes /chat/completions.",
      "Reranking endpoints return relevance scores, not generated text."
    ],
    "top_n": 3,
    "use_activation": true,
    "user": "user_123"
  }'

Response shape

{
  "id": "rerank-params123",
  "model": "BAAI/bge-reranker-v2-gemma",
  "usage": {
    "total_tokens": 97
  },
  "results": [
    {
      "index": 2,
      "document": {
        "text": "A 404 can occur when the client appends /chat/completions to a URL that already includes /chat/completions."
      },
      "relevance_score": 0.9941
    },
    {
      "index": 0,
      "document": {
        "text": "Use the base URL ending in /v1 and send chat requests to /v1/chat/completions."
      },
      "relevance_score": 0.9683
    },
    {
      "index": 3,
      "document": {
        "text": "Reranking endpoints return relevance scores, not generated text."
      },
      "relevance_score": 0.0872
    }
  ]
}

Official sources

vLLM OpenAI-compatible server docs: https://docs.vllm.ai/en/latest/serving/openai_compatible_server/
vLLM scoring usage docs: https://docs.vllm.ai/en/latest/models/pooling_models/scoring/
vLLM score and rerank example docs: https://docs.vllm.ai/en/latest/examples/pooling/score/
BAAI/bge-reranker-v2-gemma model card: https://huggingface.co/BAAI/bge-reranker-v2-gemma
BAAI/bge-reranker-v2-m3 model card and BGE reranker model list: https://huggingface.co/BAAI/bge-reranker-v2-m3
FlagEmbedding repository: https://github.com/FlagOpen/FlagEmbedding