Ask Questions about Your PDFs with Cohere Embeddings + Gemini LLM

🔍 Experimenting with Image Embedding Using Large AI Models

Recently, I experimented with embedding images using major AI models to build a multimodal semantic search system, where users can search images with text (and vice versa).

🧐 A Surprising Discovery

I was surprised to find that, of the major providers I tried in 2025, Cohere was the only one offering direct image embedding through its API.
Other major models like OpenAI and Gemini (by Google) accept image input in general, but do not provide a dedicated embedding API for images.


Reasons for Choosing Cohere

I chose to try Cohere’s embed-v4.0 because:

  • It supports embedding text, images, and even PDF documents (converted to images) into the same vector space.

  • You can choose the embedding size (the default is 1536; I used 1024 in this project).

  • It returns normalized embeddings that are ready to use for search and classification tasks.


⚙️ How I Built the System

I used Python for implementation. The system has two main flows:

1️⃣ Document Preparation Flow

  • Load documents, images, or text data that I want to store.

  • Use the Cohere API to embed them into vector representations.

  • Save these vectors in a database or vector store for future search queries.

2️⃣ User Query Flow

  • When a user asks a question or types a query:

    • Use Cohere to embed the query into a vector.

    • Search for the most similar documents in the vector space.

    • Return the results to the user using an LLM (Large Language Model) such as Gemini by Google.


🔑 How to Get API Keys

You will need an API key for each service: create a Cohere key in the Cohere dashboard and a Gemini key in Google AI Studio.

🔧 Setup: Installing Cohere and Gemini in Python

✅ Step 1: Install and Set Up Cohere

Run the following command in your terminal to install the Cohere Python SDK:

pip install -q cohere

Then, initialize the Cohere client in your Python script:

import cohere

# Replace <<YOUR_COHERE_KEY>> with your actual Cohere API key
cohere_api_key = "<<YOUR_COHERE_KEY>>"
co = cohere.ClientV2(api_key=cohere_api_key)


✅ Step 2: Install and Set Up Gemini (Google Generative AI)

Install the Gemini client library with:

pip install -q google-genai

Then, initialize the Gemini client in your Python script:

from google import genai

# Replace <<YOUR_GEMINI_KEY>> with your actual Gemini API key
gemini_api_key = "<<YOUR_GEMINI_KEY>>"
client = genai.Client(api_key=gemini_api_key)

📌 Flow 1: Document Preparation and Embedding

We will walk through the steps to turn a PDF into embedding data using Cohere.


📥 Step 1: Download the PDF

We start by downloading the PDF from a given URL.

python

import requests

def download_pdf_from_url(url, save_path="downloaded.pdf"):
    # Fetch the PDF and write it to disk
    response = requests.get(url)
    if response.status_code == 200:
        with open(save_path, "wb") as f:
            f.write(response.content)
        print("PDF downloaded successfully.")
        return save_path
    else:
        raise Exception(f"PDF download failed. Error code: {response.status_code}")

# Example usage
pdf_url = "https://sgp.fas.org/crs/misc/IF10244.pdf"
local_pdf_path = download_pdf_from_url(pdf_url)


🖼️ Step 2: Convert PDF Pages to Text + Image

We extract both the text and a rendered image of each page using PyMuPDF.

python

import fitz  # PyMuPDF
import base64
import io
from PIL import Image

def extract_page_data(pdf_path):
    doc = fitz.open(pdf_path)
    pages_data = []
    img_paths = []

    for i, page in enumerate(doc):
        # Extract the page text
        text = page.get_text()

        # Render the page to a PNG image and encode it as a base64 data URL
        pix = page.get_pixmap()
        image = Image.open(io.BytesIO(pix.tobytes("png")))

        buffered = io.BytesIO()
        image.save(buffered, format="PNG")
        encoded_img = base64.b64encode(buffered.getvalue()).decode("utf-8")
        data_url = f"data:image/png;base64,{encoded_img}"

        # Fused input: the page text plus the page image
        content = [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": data_url}},
        ]

        pages_data.append({"content": content})
        img_paths.append({"data_url": data_url})

    return pages_data, img_paths

# Example usage
pages, img_paths = extract_page_data(local_pdf_path)


📤 Step 3: Embed Using Cohere

Now, send the fused text + image inputs to Cohere’s embed-v4.0 model.

python

res = co.embed(
    model="embed-v4.0",
    inputs=pages,  # fused text + image inputs, one per page
    input_type="search_document",
    embedding_types=["float"],
    output_dimension=1024,
)

embeddings = res.embeddings.float_
print(f"Number of embedded pages: {len(embeddings)}")


Flow 1 complete: You now have the embedded vector representations of your PDF pages.

👉 Before proceeding to Flow 2, you may want to store or index the embeddings; a minimal sketch follows.
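For a small document like this one, the vectors can simply be kept in a NumPy array and persisted to disk; a production system would typically use a vector database instead. A minimal sketch, assuming the embeddings and img_paths variables from the steps above:

python

import json
import numpy as np

# Persist the page embeddings and page-image data URLs to disk
np.save("page_embeddings.npy", np.asarray(embeddings))
with open("img_paths.json", "w") as f:
    json.dump(img_paths, f)

# Later (e.g., in a new session), reload them before running Flow 2
embeddings = np.load("page_embeddings.npy")
with open("img_paths.json") as f:
    img_paths = json.load(f)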

🔍 Flow 2: Ask a Question and Retrieve the Answer Using Image + LLM

This flow lets the user ask a natural-language question, finds the most relevant page image using Cohere Embed v4, and then answers the question using Gemini 2.5 Flash, a vision-capable LLM.


💬 Step 1: Ask the Question

We define the user query in plain English.

python
question = "What was the total number of wildfires in the United States from 2007 to 2015?"

🧠 Step 2: Convert the Question to Embedding & Find Relevant Image

We use embed-v4.0 with input type search_query, then calculate cosine similarity between the question embedding and the previously embedded document pages. Because the embeddings are normalized, cosine similarity reduces to a plain dot product.

python

import base64
import io

import numpy as np
from IPython.display import display
from PIL import Image

def search(question, max_img_size=800):
    # Get the embedding for the query
    api_response = co.embed(
        model="embed-v4.0",
        input_type="search_query",
        embedding_types=["float"],
        texts=[question],
        output_dimension=1024,
    )

    query_emb = np.asarray(api_response.embeddings.float_[0])

    # Cosine similarity reduces to a dot product since the embeddings are normalized
    cos_sim_scores = np.dot(embeddings, query_emb)
    top_idx = np.argmax(cos_sim_scores)  # Index of the most relevant page

    hit_img_path = img_paths[top_idx]
    base64url = hit_img_path["data_url"]

    print("Question:", question)
    print("Most relevant page index:", top_idx)

    # Display the matched page image
    if base64url.startswith("data:image"):
        base64_str = base64url.split(",")[1]
    else:
        base64_str = base64url

    image_data = base64.b64decode(base64_str)
    image = Image.open(io.BytesIO(image_data))

    image.thumbnail((max_img_size, max_img_size))
    display(image)

    return base64url


🤖 Step 3: Use Vision-LLM (Gemini 2.5) to Answer

We use Gemini 2.5 Flash to answer the question based on the most relevant image.

python

def answer(question, base64_img_str):
    # Strip the data-URL prefix if present
    if base64_img_str.startswith("data:image"):
        base64_img_str = base64_img_str.split(",")[1]

    image_bytes = base64.b64decode(base64_img_str)
    image = Image.open(io.BytesIO(image_bytes))

    prompt = [
        f"""Answer the question based on the following image.
Don't use markdown.
Please provide enough context for your answer.

Question: {question}""",
        image,
    ]

    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-04-17",
        contents=prompt,
    )

    answer_text = response.text
    print("LLM Answer:", answer_text)
    return answer_text


▶️ Step 4: Run the Full Flow

python

# Step 1: Find the best-matching page image for the question
top_image_data_url = search(question)

# Step 2: Answer the question from that image with Gemini
answer(question, top_image_data_url)


🧾 Output:

Question: What was the total number of wildfires in the United States from 2007 to 2015?

Most relevant image: [the matched PDF page is displayed here; it contains the wildfire figures referenced in the answer below]

LLM Answer: Based on the provided image, to find the total number of wildfires in the United States from 2007 to 2015, we need to sum the number of wildfires for each year in this period. Figure 1 shows the annual number of fires in thousands from 1993 to 2022, which covers the requested period. Figure 2 provides the specific number of fires for 2007 and 2015 among other years. Using the specific values from Figure 2 for 2007 and 2015, and estimating the number of fires for the years from 2008 to 2014 from Figure 1, we can calculate the total.

 

The number of wildfires in 2007 was 67.8 thousand (from Figure 2).

Estimating from Figure 1:

2008 was approximately 75 thousand fires.

2009 was approximately 75 thousand fires.

2010 was approximately 67 thousand fires.

2011 was approximately 74 thousand fires.

2012 was approximately 68 thousand fires.

2013 was approximately 47 thousand fires.

2014 was approximately 64 thousand fires.

The number of wildfires in 2015 was 68.2 thousand (from Figure 2).

 

Summing these values:

Total = 67.8 + 75 + 75 + 67 + 74 + 68 + 47 + 64 + 68.2 = 606 thousand fires.

 

Therefore, the total number of wildfires in the United States from 2007 to 2015 was approximately 606,000. This number is based on the sum of the annual number of fires obtained from Figure 2 for 2007 and 2015, and estimates from Figure 1 for the years 2008 through 2014.

Try this full pipeline on Google Colab: https://colab.research.google.com/drive/1kdIO-Xi0MnB1c8JrtF26Do3T54dij8Sf

🧩 Final Thoughts

This simple yet powerful two-step pipeline demonstrates how you can combine Cohere’s Embed v4 with Gemini’s Vision-Language capabilities to build a system that understands both text and images. By embedding documents (including large images) and using semantic similarity to retrieve relevant content, we can create a more intuitive, multimodal question-answering experience.

This approach is especially useful in scenarios where information is stored in visual formats like financial reports, dashboards, or charts — allowing LLMs to not just “see” the image but reason over it in context.

Multimodal retrieval-augmented generation (RAG) is no longer just theoretical — it’s practical, fast, and deployable today.

CoRAG: Revolutionizing RAG Systems with Intelligent Retrieval Chains

Large Language Models (LLMs) have demonstrated powerful content generation capabilities, but they often struggle with accessing the latest information, leading to hallucinations. Retrieval-Augmented Generation (RAG) addresses this issue by using external data sources, enabling models to provide more accurate and context-aware responses.

Key Advantages of RAG:

  • Improves factual accuracy by retrieving up-to-date information.
  • Enhances context comprehension by incorporating external data sources.
  • Reduces reliance on pre-trained memorization, allowing more flexible responses.

Despite these advantages, conventional RAG has notable drawbacks that limit its effectiveness in complex reasoning tasks (a minimal single-step sketch follows the list):

  1. Single Retrieval Step: Traditional RAG retrieves information only once before generating a response. If the retrieval is incorrect or incomplete, the model cannot refine its search.
  2. Limited Context Understanding: Since retrieval is static, it fails in multi-hop reasoning tasks that require step-by-step information gathering.
  3. Susceptibility to Hallucinations: If relevant information is not retrieved, the model may generate inaccurate or misleading responses.
  4. Inefficiency in Long Queries: For complex queries requiring multiple reasoning steps, a single retrieval step is often insufficient, leading to incomplete or incorrect answers.
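To make the single-retrieval limitation concrete, here is a minimal sketch of a traditional RAG pipeline. The embed, vector_store.top_k, and llm.generate calls are hypothetical stand-ins for whatever embedding model, vector index, and LLM are in use:

python

def traditional_rag(question, vector_store, llm):
    # One retrieval step: embed the question and fetch the top-k chunks once
    query_vector = embed(question)  # hypothetical embedding call
    chunks = vector_store.top_k(query_vector, k=5)  # hypothetical vector index

    # One generation step: answer from whatever was retrieved.
    # If the needed facts span chunks this single query missed,
    # the model cannot go back and search again.
    context = "\n".join(chunks)
    return llm.generate(f"Context:\n{context}\n\nQuestion: {question}")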

CoRAG (Chain-of-Retrieval Augmented Generation) is proposed to address these issues by leveraging search strategies such as Monte Carlo Tree Search (MCTS) to optimize the information retrieval process.

CoRAG Solution

CoRAG is an enhanced version of RAG that introduces iterative retrieval and reasoning. Instead of retrieving information once, CoRAG performs multiple retrieval steps, dynamically reformulating queries based on evolving context.

How CoRAG Solves RAG’s Limitations

  • Step-by-step retrieval: Instead of relying on a single search, CoRAG retrieves information iteratively, refining the query at each step.
  • Query Reformulation: The system learns to modify its search queries based on previously retrieved results, enhancing accuracy.
  • Adaptive Reasoning: CoRAG dynamically determines the number of retrieval steps needed, ensuring more complete responses.
  • Better Performance in Multi-hop Tasks: CoRAG significantly outperforms RAG in tasks requiring multiple steps of logical reasoning.

CoRAG operates by employing a retrieval chain mechanism, where each retrieval step is informed by the results of previous steps. This allows the system to refine queries dynamically instead of relying on a single retrieval attempt as in traditional RAG. One of the most crucial aspects of CoRAG is query reformulation, which adjusts search queries in real time to retrieve the most relevant information. Thanks to this iterative approach, CoRAG significantly enhances its ability to handle complex, multi-hop reasoning tasks, leading to improved accuracy and reduced misinformation. The sketch below illustrates this loop.
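A minimal sketch of this retrieval chain, with embed, vector_store.top_k, and the llm object again as hypothetical stand-ins (the actual CoRAG model is trained to emit sub-queries itself; here a prompt simulates that behavior):

python

def corag(question, vector_store, llm, max_steps=4):
    # Retrieval chain: each sub-query is conditioned on what was found so far
    findings = []
    for _ in range(max_steps):
        # Query reformulation: ask the model for the next sub-query,
        # given the original question and the evidence gathered so far
        sub_query = llm.generate(
            f"Question: {question}\nKnown so far: {findings}\n"
            "What should we search for next? Reply DONE if we can answer."
        )
        if sub_query.strip() == "DONE":
            break  # adaptive reasoning: stop once the evidence suffices

        chunks = vector_store.top_k(embed(sub_query), k=3)
        findings.append({"sub_query": sub_query, "evidence": chunks})

    # Final answer conditioned on the whole retrieval chain
    return llm.generate(f"Question: {question}\nEvidence: {findings}\nAnswer:")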

Training CoRAG involves the use of rejection sampling to generate intermediate retrieval chains, allowing the model to learn how to optimize search and filter information more effectively. Instead of only predicting the final answer, CoRAG is trained to retrieve information step by step, refining queries based on newly gathered knowledge. This method strengthens the model’s reasoning ability and improves performance on knowledge-intensive tasks.

Fine-tuning the model on optimized datasets is another crucial aspect of CoRAG training. Performance evaluation is conducted using metrics such as Exact Match (EM) score and F1-score, which assess the accuracy and comprehensiveness of responses compared to traditional RAG models.
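For reference, these two metrics are computed roughly as follows for a predicted answer against a gold answer (a standard token-level F1 as used in QA benchmarks, not code from the CoRAG paper itself):

python

from collections import Counter

def exact_match(prediction, gold):
    # EM: 1 if the normalized strings match exactly, else 0
    return int(prediction.strip().lower() == gold.strip().lower())

def f1_score(prediction, gold):
    # Token-level F1: harmonic mean of precision and recall over shared tokens
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("606 thousand", "606 thousand"))                     # 1
print(round(f1_score("about 606 thousand", "606 thousand fires"), 2))  # 0.67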

Overview of CoRAG

Overview of CoRAG (Source: https://arxiv.org/html/2501.14342v1)

A key feature of CoRAG is its decoding strategies, which control how the model retrieves and processes information at inference time. These strategies include (a best-of-N sketch follows the list):

  • Greedy Decoding: Selecting the most relevant information at each step without exploring alternative options.
  • Best-of-N Sampling: Running multiple retrieval attempts and choosing the most optimal result.
  • Tree Search: Using a structured search approach to explore different reasoning paths and enhance inference quality.
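As one illustration, here is a minimal best-of-N sampling sketch. It assumes the corag function from the earlier sketch and a hypothetical score function (for example, a reranker or the model's own likelihood of the answer):

python

def best_of_n(question, vector_store, llm, n=4):
    # Run the retrieval chain N times (with sampling enabled in the LLM),
    # then keep the candidate answer the scorer rates highest
    candidates = [corag(question, vector_store, llm) for _ in range(n)]
    return max(candidates, key=lambda ans: score(question, ans))  # score is hypothetical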

With its enhanced retrieval and reasoning mechanisms, CoRAG represents a major advancement in AI, enabling models to retrieve and synthesize information more effectively.

Comparison Between CoRAG and Traditional RAG

The following table provides a concise comparison between Traditional RAG and CoRAG. While Traditional RAG is more efficient in terms of computational cost, CoRAG excels in accuracy and adaptability for complex tasks. The iterative retrieval process in CoRAG ensures more precise results, making it suitable for specialized applications requiring deep contextual understanding.

Feature                   Traditional RAG            CoRAG
Retrieval Strategy        Single-step retrieval      Iterative retrieval
Query Reformulation       Fixed query                Dynamic query adjustment
Multi-Hop Reasoning       Limited                    Strong
Handling Hallucinations   Prone to errors            Reduces errors
Computational Cost        Lower                      Higher
Adaptability              Good for simple queries    Ideal for complex domains

Key Differences Between CoRAG and Traditional RAG

  1. Retrieval Strategy
    • Traditional RAG: Performs a single retrieval step, fetching relevant documents once before generating a response. This limits its ability to refine searches based on partial information. Example:
      • Query: "Who wrote book X, and when was it published?"
      • Traditional RAG: Fails if the author and the publication year sit in separate chunks.
    • CoRAG: Uses an iterative retrieval process in which multiple search steps refine the query dynamically, leading to more accurate and contextually appropriate responses. Example (the date arithmetic is worked out in the sketch after this list):
      • Query: "How many months apart are Johan Mjallby and Neil Lennon in age?"
      • CoRAG:
        1. Retrieve Johan Mjallby's birth date.
        2. Retrieve Neil Lennon's birth date.
        3. Calculate the time difference.
  2. Query Reformulation
    • Traditional RAG: Uses a fixed query that remains unchanged throughout the retrieval process.
    • CoRAG: Continuously modifies queries based on retrieved results, improving the relevance of later search steps.
  3. Multi-Hop Reasoning
    • Traditional RAG: Struggles with tasks requiring multiple steps of reasoning, as it retrieves all information at once.
    • CoRAG: Adapts to multi-hop queries, progressively retrieving and synthesizing information step by step.
  4. Handling Hallucinations
    • Traditional RAG: More prone to hallucinations due to incomplete or inaccurate retrieval.
    • CoRAG: Reduces hallucinations by iteratively validating retrieved knowledge before generating responses.
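The final step of that example chain is plain date arithmetic. A quick sketch, using the birth dates the two retrieval steps would return (9 February 1971 for Mjallby and 25 June 1971 for Lennon, per their public profiles):

python

from datetime import date

mjallby_born = date(1971, 2, 9)   # retrieved in step 1
lennon_born = date(1971, 6, 25)   # retrieved in step 2

# Whole months between the two birth dates
months = (lennon_born.year - mjallby_born.year) * 12 \
    + (lennon_born.month - mjallby_born.month)
if lennon_born.day < mjallby_born.day:
    months -= 1  # the last month is not yet complete

print(f"They are about {months} months apart in age.")  # -> 4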

Performance Comparison

Experiments on the WikiPassageQA and MARCO datasets show that CoRAG improves accuracy by up to 30% over traditional RAG methods. The system achieves higher ROUGE scores than baselines like RAPTOR and NaiveRAG while optimizing retrieval costs.

Efficiency Comparison

Efficiency Comparison (Source: https://arxiv.org/html/2411.00744v1)

Additionally, CoRAG demonstrates excellent scalability, with retrieval time increasing by only about 10% even when the input data volume grows significantly.

  1. Accuracy and Relevance
    • Benchmark Results: Studies show that CoRAG achieves higher accuracy scores in question-answering tasks, outperforming RAG on datasets requiring multi-step reasoning.
    • Real-World Application: AI chatbots and research assistants using CoRAG provide more contextually aware and reliable answers compared to those using traditional RAG.
  2. Computational Cost
    • Traditional RAG: Less computationally expensive as it performs only a single retrieval step.
    • CoRAG: Higher computational demands due to iterative retrieval but offers significantly improved response quality.
  3. Adaptability to Different Domains
    • Traditional RAG: Works well for simple fact-based queries but struggles with domain-specific knowledge that requires iterative retrieval.
    • CoRAG: Excels in complex domains such as legal, medical, and academic research where deep contextual understanding is necessary.

When to Use CoRAG vs. Traditional RAG?

Choosing between CoRAG and traditional RAG depends on the nature of the tasks at hand. Each method has its own advantages and is suited for different use cases.

  • Best Use Cases for Traditional RAG
    • Simple question-answering tasks where a single retrieval suffices.
    • Use cases with strict computational constraints where efficiency is prioritized over deep reasoning.
    • Applications requiring quick but approximate answers, such as customer support chatbots handling FAQ-based interactions.
  • Best Use Cases for CoRAG
    • Complex queries requiring multi-hop reasoning and deep contextual understanding.
    • Research and academic applications where iterative refinement improves information accuracy.
    • AI-driven assistants handling specialized tasks such as legal document analysis and medical diagnosis support.

Conclusion

CoRAG (Chain-of-Retrieval Augmented Generation) represents a significant advancement in AI-driven knowledge retrieval and synthesis. By combining iterative retrieval, dynamic query reformulation, and tree-based search strategies, CoRAG enhances the accuracy, relevance, and structure of the information provided to large language models. This systematic approach not only reduces hallucinations but also optimizes AI-generated responses, making it a powerful tool for applications requiring high-quality knowledge retrieval.

With its intelligent ability to retrieve, rank, and organize information, CoRAG opens new possibilities in enterprise search, research assistance, and AI-driven decision-making. As AI continues to evolve, systems like CoRAG will play a crucial role in bridging raw data with actionable knowledge, fostering more intelligent and reliable AI applications.