BGE Reranker V2 Gemma API - vLLM engine

Use vLLM’s reranking endpoint to call HexGrid-hosted BGE Reranker V2 Gemma and score how relevant each document is to a query.

This page provides copy-pasteable cURL examples for basic reranking, top-N reranking, pairwise scoring, batch pair scoring, activation-controlled scores, document-object reranking, retrieval pipelines, and common reranker parameters.


Endpoints

POST http://<server-ip>:<port>/v1/rerank
POST http://<server-ip>:<port>/v1/score

Set these environment variables before running the examples:

export HEXGRID_API_KEY="your-hexgrid-api-key"
export BGE_RERANK_BASE_URL="http://<server-ip>:<port>/v1"
export BGE_RERANK_MODEL="BAAI/bge-reranker-v2-gemma"

Use the exact model ID configured in your HexGrid deployment.


Model

Model: BAAI/bge-reranker-v2-gemma
Base model: Gemma 2B
Type: Reranker / cross-encoder
Language support: Multilingual
Output: Query-document relevance score

BGE rerankers differ from embedding models. They take a query and a document together as input and output a relevance score for that pair, rather than returning an embedding vector.


POST /v1/rerank

Rerank documents

Rerank a list of documents for a single query.

Required attributes

  • Name
    model
    Type
    string
    Description

    The served reranker model name, for example BAAI/bge-reranker-v2-gemma.

  • Name
    query
    Type
    string
    Description

    The query to compare against each document.

  • Name
    documents
    Type
    array
    Description

    Candidate documents to rerank. In the basic form, each item is a plain string.

Response fields

  • Name
    results
    Type
    array
    Description

    Reranked results, sorted by relevance_score from highest to lowest.

  • Name
    index
    Type
    integer
    Description

    Original index of the document in the input documents array.

  • Name
    relevance_score
    Type
    number
    Description

    Relevance score for the query-document pair.

Request

POST /v1/rerank
curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "query": "What is the capital of China?",
    "documents": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other.",
      "Paris is the capital of France."
    ]
  }'

Response shape

{
  "id": "rerank-abc123",
  "model": "BAAI/bge-reranker-v2-gemma",
  "usage": {
    "total_tokens": 48
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "The capital of China is Beijing."
      },
      "relevance_score": 0.9985
    },
    {
      "index": 2,
      "document": {
        "text": "Paris is the capital of France."
      },
      "relevance_score": 0.1024
    },
    {
      "index": 1,
      "document": {
        "text": "Gravity is a force that attracts two bodies toward each other."
      },
      "relevance_score": 0.0187
    }
  ]
}
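
The same request can also be assembled with Python's standard library. The sketch below is only an illustration: build_rerank_request is a hypothetical helper name, the localhost address is a placeholder, and the function builds the request without sending it so the payload can be inspected first.

```python
import json
import urllib.request


def build_rerank_request(base_url, api_key, model, query, documents):
    """Build (but do not send) a POST request for the rerank endpoint.

    Hypothetical helper for illustration. base_url is expected to end in /v1,
    like BGE_RERANK_BASE_URL in the environment setup.
    """
    payload = {"model": model, "query": query, "documents": list(documents)}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_rerank_request(
    base_url="http://localhost:8000/v1",  # placeholder address (assumption)
    api_key="your-hexgrid-api-key",
    model="BAAI/bge-reranker-v2-gemma",
    query="What is the capital of China?",
    documents=[
        "The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies toward each other.",
        "Paris is the capital of France.",
    ],
)
print(request.get_method(), request.full_url)
```

To send it, pass the object to urllib.request.urlopen(request) and parse the response body with json.load.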

POST /v1/rerank

Top-N reranking

Return only the top N most relevant documents.

Optional attributes used here

  • Name
    top_n
    Type
    integer
    Description

    Maximum number of ranked documents to return. If omitted, vLLM returns all documents sorted by relevance.

Request

POST /v1/rerank
curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "query": "Explain gravity",
    "documents": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
      "A neural network is a machine learning model inspired by biological neurons.",
      "Photosynthesis is the process by which plants convert light energy into chemical energy."
    ],
    "top_n": 2
  }'

Response shape

{
  "id": "rerank-topn123",
  "model": "BAAI/bge-reranker-v2-gemma",
  "usage": {
    "total_tokens": 78
  },
  "results": [
    {
      "index": 1,
      "document": {
        "text": "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
      },
      "relevance_score": 0.9971
    },
    {
      "index": 3,
      "document": {
        "text": "Photosynthesis is the process by which plants convert light energy into chemical energy."
      },
      "relevance_score": 0.0418
    }
  ]
}

POST /v1/rerank

Rerank search results

Use reranking after a first-stage retrieval step, such as BM25, vector search, or hybrid search.

Your application sends the top candidate chunks to the reranker, then uses the returned index values to map ranked results back to the original documents.

Attributes used here

  • Name
    query
    Type
    string
    Description

    User search query.

  • Name
    documents
    Type
    array
    Description

    Candidate search results from your first-stage retriever.

  • Name
    top_n
    Type
    integer
    Description

    Optional. Number of final results to keep.

Request

POST /v1/rerank
curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "query": "How do I configure streaming in vLLM?",
    "documents": [
      "vLLM exposes an OpenAI-compatible Chat Completions endpoint. Set stream to true to receive Server-Sent Events.",
      "The capital of France is Paris.",
      "Embeddings convert text into dense vectors for semantic search.",
      "Use top_k and top_p to control sampling diversity in generation."
    ],
    "top_n": 3
  }'

Response shape

{
  "id": "rerank-search123",
  "model": "BAAI/bge-reranker-v2-gemma",
  "usage": {
    "total_tokens": 72
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "vLLM exposes an OpenAI-compatible Chat Completions endpoint. Set stream to true to receive Server-Sent Events."
      },
      "relevance_score": 0.9821
    },
    {
      "index": 3,
      "document": {
        "text": "Use top_k and top_p to control sampling diversity in generation."
      },
      "relevance_score": 0.2137
    },
    {
      "index": 2,
      "document": {
        "text": "Embeddings convert text into dense vectors for semantic search."
      },
      "relevance_score": 0.1105
    }
  ]
}
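
Mapping the returned index values back to your own records might look like the following sketch. The candidate records and their "id" values are invented for illustration, and attach_sources is a hypothetical helper name. The only behavior taken from this page is that index refers to a position in the request's documents array.

```python
def attach_sources(results, candidates):
    """Pair each reranked result with the first-stage candidate it came from."""
    ranked = []
    for item in results:
        # "index" is the position of the document in the request's documents array.
        source = candidates[item["index"]]
        ranked.append({"id": source["id"], "relevance_score": item["relevance_score"]})
    return ranked


# Hypothetical first-stage candidates; the "id" values are made up for this sketch.
candidates = [
    {"id": "docs/streaming", "text": "vLLM exposes an OpenAI-compatible Chat Completions endpoint. Set stream to true to receive Server-Sent Events."},
    {"id": "docs/geography", "text": "The capital of France is Paris."},
    {"id": "docs/embeddings", "text": "Embeddings convert text into dense vectors for semantic search."},
    {"id": "docs/sampling", "text": "Use top_k and top_p to control sampling diversity in generation."},
]

# Only the text is sent to the reranker, in the same order as the candidates.
documents = [candidate["text"] for candidate in candidates]

# "results" as in the response shape above, trimmed to the fields used here.
results = [
    {"index": 0, "relevance_score": 0.9821},
    {"index": 3, "relevance_score": 0.2137},
    {"index": 2, "relevance_score": 0.1105},
]

ranked = attach_sources(results, candidates)
for entry in ranked:
    print(entry["id"], entry["relevance_score"])
```

Keeping the candidate list in the same order as the documents array is what makes the index lookup safe.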

POST /v1/score

Pairwise scoring

Use the vLLM score endpoint when you want a relevance score for each aligned query-document pair.

Unlike /v1/rerank, which compares one query against many documents and returns sorted results, /v1/score pairs the query at each index with the document at the same index and returns one score per pair.

Required attributes

  • Name
    model
    Type
    string
    Description

    The served reranker model name.

  • Name
    queries
    Type
    array
    Description

    Queries to score.

  • Name
    documents
    Type
    array
    Description

    Documents to score. Each document is paired with the query at the same index.

Request

POST /v1/score
curl -X POST "$BGE_RERANK_BASE_URL/score" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "queries": [
      "What is the capital of China?",
      "Explain gravity"
    ],
    "documents": [
      "The capital of China is Beijing.",
      "Gravity is a force that attracts two bodies toward each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
    ]
  }'

Response shape

{
  "id": "score-abc123",
  "object": "list",
  "model": "BAAI/bge-reranker-v2-gemma",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.9984
    },
    {
      "index": 1,
      "object": "score",
      "score": 0.9969
    }
  ],
  "usage": {
    "total_tokens": 58
  }
}
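
Because the two arrays must stay aligned, it can help to keep pairs together in application code and split them only when building the payload. The following is a minimal sketch under that assumption. build_score_payload and pair_scores are hypothetical helper names, and the second document is shortened here.

```python
def build_score_payload(model, pairs):
    """Split (query, document) pairs into the aligned arrays the request uses."""
    return {
        "model": model,
        "queries": [query for query, _ in pairs],
        "documents": [document for _, document in pairs],
    }


def pair_scores(pairs, data):
    """Join each score back to its query using the "index" field of the response."""
    return [(pairs[item["index"]][0], item["score"]) for item in data]


pairs = [
    ("What is the capital of China?", "The capital of China is Beijing."),
    ("Explain gravity", "Gravity is a force that attracts two bodies toward each other."),
]
payload = build_score_payload("BAAI/bge-reranker-v2-gemma", pairs)

# "data" as in the response shape above.
data = [
    {"index": 0, "object": "score", "score": 0.9984},
    {"index": 1, "object": "score", "score": 0.9969},
]
for query, score in pair_scores(pairs, data):
    print(f"{score:.4f}  {query}")
```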

POST /v1/score

One query against many documents with score API

Use /v1/score when you want unsorted scores and your application will handle sorting.

This is useful when you want to preserve the original candidate order or combine the reranker score with another scoring signal.

Required attributes

  • Name
    queries
    Type
    array
    Description

    Repeat the same query once per document.

  • Name
    documents
    Type
    array
    Description

    Candidate documents to score.

Request

POST /v1/score
curl -X POST "$BGE_RERANK_BASE_URL/score" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "queries": [
      "What is vLLM?",
      "What is vLLM?",
      "What is vLLM?"
    ],
    "documents": [
      "vLLM is an inference and serving engine for large language models.",
      "Beijing is the capital of China.",
      "A vector database stores embeddings for semantic search."
    ]
  }'

Response shape

{
  "id": "score-many123",
  "object": "list",
  "model": "BAAI/bge-reranker-v2-gemma",
  "data": [
    {
      "index": 0,
      "object": "score",
      "score": 0.9812
    },
    {
      "index": 1,
      "object": "score",
      "score": 0.0241
    },
    {
      "index": 2,
      "object": "score",
      "score": 0.2183
    }
  ],
  "usage": {
    "total_tokens": 47
  }
}
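
One way to combine the reranker score with another signal is a weighted sum computed in the original candidate order. The sketch below is an assumption-laden illustration: the first-stage scores are invented, they are assumed to be on a comparable 0 to 1 scale, and blend_scores and the 0.5 weight are hypothetical choices rather than recommendations.

```python
def blend_scores(data, first_stage_scores, weight=0.5):
    """Weighted sum of reranker and first-stage scores, kept in candidate order."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    blended = [None] * len(first_stage_scores)
    for item in data:
        i = item["index"]  # position of the pair in the request arrays
        blended[i] = weight * item["score"] + (1.0 - weight) * first_stage_scores[i]
    return blended


# "data" as in the response shape above.
data = [
    {"index": 0, "object": "score", "score": 0.9812},
    {"index": 1, "object": "score", "score": 0.0241},
    {"index": 2, "object": "score", "score": 0.2183},
]

# Hypothetical first-stage scores, assumed to be normalized to the 0-1 range already.
first_stage_scores = [0.40, 0.90, 0.60]

blended = blend_scores(data, first_stage_scores, weight=0.5)

# Sort candidate positions by blended score only when a final ranking is needed.
order = sorted(range(len(blended)), key=lambda i: blended[i], reverse=True)
print(blended, order)
```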

POST /v1/rerank

Rerank with activation control

vLLM supports the use_activation rerank parameter.

For BGE rerankers, the model card notes that scores can be mapped to the [0, 1] range with a sigmoid function. Use use_activation: true when you want activated relevance scores from the serving layer.

Optional attributes used here

  • Name
    use_activation
    Type
    boolean
    Description

    Whether to apply the model pooler’s activation function to the score.

  • Name
    top_n
    Type
    integer
    Description

    Number of top results to return.

Request

POST /v1/rerank
curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "query": "What is semantic search?",
    "documents": [
      "Semantic search retrieves results based on meaning rather than exact keyword overlap.",
      "The boiling point of water is 100°C at sea level.",
      "Reranking improves retrieval quality by rescoring candidate documents."
    ],
    "top_n": 2,
    "use_activation": true
  }'

Response shape

{
  "id": "rerank-activation123",
  "model": "BAAI/bge-reranker-v2-gemma",
  "usage": {
    "total_tokens": 63
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "Semantic search retrieves results based on meaning rather than exact keyword overlap."
      },
      "relevance_score": 0.9914
    },
    {
      "index": 2,
      "document": {
        "text": "Reranking improves retrieval quality by rescoring candidate documents."
      },
      "relevance_score": 0.6532
    }
  ]
}
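
The sigmoid mapping mentioned in the model card can be sketched in a few lines of Python. This shows only the mathematical function. Whether the deployed pooler applies exactly this activation is an assumption to verify against your deployment.

```python
import math


def sigmoid(raw_score):
    """Map a real-valued score into the (0, 1) range.

    Written in a numerically stable form so large negative inputs
    do not overflow math.exp.
    """
    if raw_score >= 0:
        return 1.0 / (1.0 + math.exp(-raw_score))
    z = math.exp(raw_score)
    return z / (1.0 + z)


# A raw score of 0 sits at the midpoint; larger raw scores approach 1.
for raw in (-4.0, 0.0, 4.0):
    print(raw, round(sigmoid(raw), 4))
```

The function is monotonic, so applying it changes the scale of the scores but not the ranking order.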

POST /v1/rerank

Raw score mode

Set use_activation to false when you want the raw, unactivated scores from the deployed pooler. Raw scores are not confined to the [0, 1] range.

Raw scores may be useful if your application calibrates thresholds separately or combines reranker scores with other ranking signals.

Optional attributes used here

  • Name
    use_activation
    Type
    boolean
    Description

    Set to false to request raw relevance scores when supported by the deployment.

Request

POST /v1/rerank
curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "query": "What is semantic search?",
    "documents": [
      "Semantic search retrieves results based on meaning rather than exact keyword overlap.",
      "The boiling point of water is 100°C at sea level.",
      "Reranking improves retrieval quality by rescoring candidate documents."
    ],
    "top_n": 2,
    "use_activation": false
  }'

Response shape

{
  "id": "rerank-raw123",
  "model": "BAAI/bge-reranker-v2-gemma",
  "usage": {
    "total_tokens": 63
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "Semantic search retrieves results based on meaning rather than exact keyword overlap."
      },
      "relevance_score": 5.214
    },
    {
      "index": 2,
      "document": {
        "text": "Reranking improves retrieval quality by rescoring candidate documents."
      },
      "relevance_score": 0.632
    }
  ]
}
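
If your application thinks in terms of a [0, 1] threshold but receives raw scores, one option is to convert the threshold into raw-score space once, using the inverse of the sigmoid. The sketch below assumes that the activated score is the sigmoid of the raw score, as the model card's note suggests. logit and filter_by_probability are hypothetical helper names, and the 0.9 threshold is arbitrary.

```python
import math


def logit(probability):
    """Inverse of the sigmoid: convert a (0, 1) threshold to a raw-score cutoff."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return math.log(probability / (1.0 - probability))


def filter_by_probability(results, min_probability):
    """Keep results whose raw score clears the cutoff (assumes sigmoid activation)."""
    cutoff = logit(min_probability)
    return [item for item in results if item["relevance_score"] >= cutoff]


# "results" as in the raw-score response shape above, trimmed to the fields used here.
results = [
    {"index": 0, "relevance_score": 5.214},
    {"index": 2, "relevance_score": 0.632},
]

kept = filter_by_probability(results, min_probability=0.9)
print([item["index"] for item in kept])
```

Any cutoff chosen this way is only a starting point; thresholds still need to be calibrated on your own data.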

POST /v1/rerank

Rerank with document objects

Some rerank-compatible clients send documents as objects with a text field.

Use this form when you want the request documents to mirror the document objects returned in the response.

Required attributes

  • Name
    documents
    Type
    array
    Description

    Candidate document objects. Each object should contain text.

  • Name
    text
    Type
    string
    Description

    Document text to score.

Request

POST /v1/rerank
curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "query": "What is the role of a reranker in RAG?",
    "documents": [
      {
        "text": "A reranker rescoring stage improves RAG by evaluating retrieved documents against the query more precisely."
      },
      {
        "text": "A tokenizer converts text into tokens before model inference."
      },
      {
        "text": "A load balancer distributes traffic across multiple servers."
      }
    ],
    "top_n": 2
  }'

Response shape

{
  "id": "rerank-objects123",
  "model": "BAAI/bge-reranker-v2-gemma",
  "usage": {
    "total_tokens": 69
  },
  "results": [
    {
      "index": 0,
      "document": {
        "text": "A reranker rescoring stage improves RAG by evaluating retrieved documents against the query more precisely."
      },
      "relevance_score": 0.9887
    },
    {
      "index": 1,
      "document": {
        "text": "A tokenizer converts text into tokens before model inference."
      },
      "relevance_score": 0.1238
    }
  ]
}
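
If your candidates arrive in mixed forms, a small normalization step can produce the object shape used in this request. This is a minimal sketch with a hypothetical helper name, to_document_objects. It keeps only the text field and makes no claim about other fields a deployment might accept.

```python
def to_document_objects(documents):
    """Accept plain strings or {"text": ...} objects and return objects only."""
    objects = []
    for doc in documents:
        if isinstance(doc, str):
            objects.append({"text": doc})
        elif isinstance(doc, dict) and isinstance(doc.get("text"), str):
            # Copy only the text field; other keys are dropped in this sketch.
            objects.append({"text": doc["text"]})
        else:
            raise ValueError(f"unsupported document: {doc!r}")
    return objects


mixed = [
    "A tokenizer converts text into tokens before model inference.",
    {"text": "A load balancer distributes traffic across multiple servers.", "source": "wiki"},
]
normalized = to_document_objects(mixed)
print(normalized)
```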

POST /v1/rerank

Retrieval pipeline example

A typical retrieval pipeline uses a fast first-stage retriever, then BGE Reranker V2 Gemma as a second-stage precision filter.

Pipeline stages (conceptual steps in your application, not request parameters)

  • Name
    first_stage_retriever
    Type
    string
    Description

    Use BM25, vector search, hybrid search, or another retriever to fetch candidate documents.

  • Name
    reranker
    Type
    string
    Description

    Send the top candidates to BGE Reranker V2 Gemma.

  • Name
    final_context
    Type
    array
    Description

    Use the top reranked documents as context for downstream generation.

Request

POST /v1/rerank
curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "query": "How does reranking improve RAG answers?",
    "documents": [
      "RAG retrieves external documents and passes relevant context to a language model.",
      "Reranking improves RAG by rescoring first-stage retrieval candidates using a stronger cross-encoder relevance model.",
      "Embeddings are dense vectors that represent text meaning.",
      "KV cache stores attention keys and values during autoregressive decoding.",
      "Hybrid search combines sparse keyword matching and dense semantic retrieval."
    ],
    "top_n": 2
  }'

Response shape

{
  "id": "rerank-pipeline123",
  "model": "BAAI/bge-reranker-v2-gemma",
  "usage": {
    "total_tokens": 89
  },
  "results": [
    {
      "index": 1,
      "document": {
        "text": "Reranking improves RAG by rescoring first-stage retrieval candidates using a stronger cross-encoder relevance model."
      },
      "relevance_score": 0.9962
    },
    {
      "index": 0,
      "document": {
        "text": "RAG retrieves external documents and passes relevant context to a language model."
      },
      "relevance_score": 0.8127
    }
  ]
}
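
The two-stage flow can be sketched with the retriever and the reranker passed in as plain functions. Everything named here is hypothetical: retrieve_then_rerank is an illustrative helper, stub_retrieve stands in for BM25 or vector search, and stub_rerank returns the canned results from the response shape above instead of calling the endpoint.

```python
def retrieve_then_rerank(query, retrieve, rerank, candidate_count=5, top_n=2):
    """Fetch candidates, rerank them, and return the top documents as context."""
    candidates = retrieve(query, candidate_count)
    if not candidates:
        return []
    results = rerank(query, candidates, top_n)
    # Each result's "index" points back into the candidates list.
    return [candidates[item["index"]] for item in results]


CORPUS = [
    "RAG retrieves external documents and passes relevant context to a language model.",
    "Reranking improves RAG by rescoring first-stage retrieval candidates using a stronger cross-encoder relevance model.",
    "Embeddings are dense vectors that represent text meaning.",
    "KV cache stores attention keys and values during autoregressive decoding.",
    "Hybrid search combines sparse keyword matching and dense semantic retrieval.",
]


def stub_retrieve(query, limit):
    """Stand-in for a first-stage retriever: returns the first `limit` entries."""
    return CORPUS[:limit]


def stub_rerank(query, documents, top_n):
    """Stand-in for the rerank call: returns canned results, not real scores."""
    canned = [
        {"index": 1, "relevance_score": 0.9962},
        {"index": 0, "relevance_score": 0.8127},
    ]
    return canned[:top_n]


final_context = retrieve_then_rerank(
    "How does reranking improve RAG answers?", stub_retrieve, stub_rerank
)
for passage in final_context:
    print(passage)
```

Passing the two stages in as functions keeps the flow testable without a running server. In a real application the rerank function would issue the request shown above.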

POST /v1/rerank

Multiple rerank parameters

Use this example when you need a broader set of reranker controls.

Parameters shown here

  • Name
    model
    Type
    string
    Description

    The served BGE Reranker V2 Gemma model name.

  • Name
    query
    Type
    string
    Description

    Query to compare with each document.

  • Name
    documents
    Type
    array
    Description

    Candidate documents to rerank.

  • Name
    top_n
    Type
    integer
    Description

    Maximum number of results to return.

  • Name
    use_activation
    Type
    boolean
    Description

    Whether to apply the pooler activation function to relevance scores.

  • Name
    user
    Type
    string
    Description

    Optional end-user identifier accepted by vLLM’s rerank API.

Request

POST /v1/rerank
curl -X POST "$BGE_RERANK_BASE_URL/rerank" \
  -H "Authorization: Bearer $HEXGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$BGE_RERANK_MODEL"'",
    "query": "Why is my vLLM request returning 404 for chat completions?",
    "documents": [
      "Use the base URL ending in /v1 and send chat requests to /v1/chat/completions.",
      "The /v1/embeddings endpoint returns vector embeddings for input text.",
      "A 404 can occur when the client appends /chat/completions to a URL that already includes /chat/completions.",
      "Reranking endpoints return relevance scores, not generated text."
    ],
    "top_n": 3,
    "use_activation": true,
    "user": "user_123"
  }'

Response shape

{
  "id": "rerank-params123",
  "model": "BAAI/bge-reranker-v2-gemma",
  "usage": {
    "total_tokens": 97
  },
  "results": [
    {
      "index": 2,
      "document": {
        "text": "A 404 can occur when the client appends /chat/completions to a URL that already includes /chat/completions."
      },
      "relevance_score": 0.9941
    },
    {
      "index": 0,
      "document": {
        "text": "Use the base URL ending in /v1 and send chat requests to /v1/chat/completions."
      },
      "relevance_score": 0.9683
    },
    {
      "index": 3,
      "document": {
        "text": "Reranking endpoints return relevance scores, not generated text."
      },
      "relevance_score": 0.0872
    }
  ]
}
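
When several of these parameters are optional, a payload builder that omits unset values keeps requests minimal. The sketch below uses only the parameter names listed above. build_rerank_payload is a hypothetical helper, and the positive-integer check on top_n is an assumption of this sketch rather than documented server behavior.

```python
def build_rerank_payload(model, query, documents,
                         top_n=None, use_activation=None, user=None):
    """Assemble a rerank request body, leaving out optional fields that are unset."""
    if top_n is not None and top_n < 1:
        # Client-side sanity check; an assumption of this sketch.
        raise ValueError("top_n must be a positive integer")
    payload = {"model": model, "query": query, "documents": list(documents)}
    optional = {"top_n": top_n, "use_activation": use_activation, "user": user}
    # Compare against None so that use_activation=False is still sent.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


minimal = build_rerank_payload(
    "BAAI/bge-reranker-v2-gemma",
    "What is semantic search?",
    ["Semantic search retrieves results based on meaning rather than exact keyword overlap."],
)
raw_mode = build_rerank_payload(
    "BAAI/bge-reranker-v2-gemma",
    "What is semantic search?",
    ["Semantic search retrieves results based on meaning rather than exact keyword overlap."],
    top_n=1,
    use_activation=False,
    user="user_123",
)
print(sorted(minimal), sorted(raw_mode))
```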
