🔍 File Search Tool in Gemini API

Build Smart RAG Applications with Google Gemini

🎯 What is File Search Tool?

Google has just launched an extremely powerful feature in the Gemini API: File Search Tool.
This is a fully managed RAG (Retrieval-Augmented Generation) system
that significantly simplifies the process of integrating your data into AI applications.

💡 What is RAG?

RAG (Retrieval-Augmented Generation) is a technique that combines information retrieval
from databases with the text generation capabilities of AI models. Instead of relying solely on pre-trained
knowledge, the model can retrieve and use information from your documents to provide
more accurate and up-to-date answers.
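In code, the core RAG loop is only a few lines. Here is a minimal, provider-agnostic sketch; embed() and generate() are hypothetical placeholders for whatever embedding model and LLM you use:

import numpy as np

# Minimal RAG loop (sketch). embed() and generate() are hypothetical
# placeholders for your embedding model and LLM of choice.
def rag_answer(query, doc_texts, doc_vectors, embed, generate, top_k=3):
    q = embed(query)                              # embed the user question
    scores = doc_vectors @ q                      # similarity vs. all chunks (unit vectors)
    best = np.argsort(scores)[-top_k:][::-1]      # indices of the top-k chunks
    context = "\n\n".join(doc_texts[i] for i in best)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)                       # the model answers from your documents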

If you’ve ever wanted to build:

  • 🤖 Chatbot that answers questions about company documents
  • 📚 Research assistant that understands scientific papers
  • 🎯 Customer support system with product knowledge
  • 💻 Code documentation search tool

Then File Search Tool is the solution you need!

✨ Key Features

🚀 Simple Integration

Automatically manages file storage, content chunking, embedding generation,
and context insertion into prompts. No complex infrastructure setup required.

🔍 Powerful Vector Search

Uses the latest Gemini Embedding models for semantic search.
Finds relevant information even without exact keyword matches.

📚 Built-in Citations

Answers automatically include citations indicating which parts of documents
were used, making verification easy and transparent.

📄 Multiple Format Support

Supports PDF, DOCX, TXT, JSON, and many programming language files.
Build a comprehensive knowledge base easily.

🎉 Main Benefits

  • Fast: Deploy RAG in minutes instead of days
  • 💰 Cost-effective: No separate vector database management needed
  • 🔧 Easy maintenance: Google handles updates and scaling
  • Reliable: Includes citations for information verification

⚙️ How It Works

File Search Tool operates in three simple steps (sketched in code right after this list):

  1. Create a File Search Store
     This is the “storage” for your processed data. The store maintains embeddings
     and search indices for fast retrieval.
  2. Upload and Import Files
     Upload your documents and the system automatically:
     • Splits content into chunks
     • Creates vector embeddings for each chunk
     • Builds an index for fast searching
  3. Query with File Search
     Use the File Search tool in API calls to perform semantic searches
     and receive accurate answers with citations.
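In SDK terms, the three steps correspond to three calls. The sketch below assumes the google-genai client setup from the Complete Code Examples section later in this post; the file name and query are illustrative:

# Sketch only: the three workflow steps as google-genai calls
# (assumes `client = genai.Client()` and `from google.genai import types`)

# 1. Create a File Search Store
store = client.file_search_stores.create(config={'display_name': 'my-store'})

# 2. Upload and import a document (file name is illustrative)
operation = client.file_search_stores.upload_to_file_search_store(
    file='handbook.pdf',
    file_search_store_name=store.name,
)

# 3. Query with the File Search tool once the import operation is done
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents='What is the leave policy?',
    config=types.GenerateContentConfig(
        tools=[types.Tool(file_search=types.FileSearch(
            file_search_store_names=[store.name]))],
    ),
)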

File Search Tool Workflow Diagram

Figure 1: File Search Tool Workflow Process

🛠️ Detailed Installation Guide

Step 1: Environment Preparation

✅ System Requirements

  • Python 3.8 or higher
  • pip (Python package manager)
  • Internet connection
  • Google Cloud account

📦 Required Tools

  • Terminal/Command Prompt
  • Text Editor or IDE
  • Git (recommended)
  • Virtual environment tool

Step 2: Install Python and Dependencies

2.1. Check Python

python --version

Expected output: Python 3.8.x or higher

2.2. Create Virtual Environment (Recommended)

# Create virtual environment
python -m venv gemini-env

# Activate (Windows)
gemini-env\Scripts\activate

# Activate (Linux/Mac)
source gemini-env/bin/activate

2.3. Install Google Genai SDK

pip install google-genai

Wait for the installation to complete. Upon success, you’ll see:

# Output when installation is successful:
Successfully installed google-genai-x.x.x

Package installation output

Figure 2: Successful Google Genai SDK installation

Step 3: Get API Key

  • Access Google AI Studio
    Open your browser and go to:
    https://aistudio.google.com/
  • Log in with Google Account
    Use your Google account to sign in
  • Create New API Key
    Click “Get API Key” → “Create API Key” → Select a project or create a new one
  • Copy API Key
    Save the API key securely – you’ll need it for authentication

Google AI Studio - Get API Key

Figure 3: Google AI Studio page to create API Key

Step 4: Configure API Key

Method 1: Use Environment Variable (Recommended)

On Windows:

set GEMINI_API_KEY=your_api_key_here

On Linux/Mac:

export GEMINI_API_KEY='your_api_key_here'

Method 2: Use .env File

# Create .env file
GEMINI_API_KEY=your_api_key_here

Then load in Python:

from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.getenv("GEMINI_API_KEY")

⚠️ Security Notes

  • 🔒 DO NOT commit API keys to Git
  • 📝 Add .env to .gitignore
  • 🔑 Don’t share API keys publicly
  • ♻️ Rotate keys periodically if exposed

Step 5: Verify Setup

Run the test script to verify the complete setup:

python test_connection.py

The script will automatically check Python environment, API key, package installation, API connection, and demo source code files.
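The script itself is not listed in this post; a minimal version you could write yourself looks roughly like this (the checks mirror the list above; the model name is an assumption):

# test_connection.py: hypothetical minimal version of the setup check
import os
import sys

if not os.getenv("GEMINI_API_KEY"):
    sys.exit("❌ GEMINI_API_KEY is not set")          # API key check

try:
    from google import genai                          # package installation check
except ImportError:
    sys.exit("❌ google-genai is not installed (run: pip install google-genai)")

client = genai.Client()                               # reads GEMINI_API_KEY automatically
reply = client.models.generate_content(model="gemini-2.5-flash", contents="ping")
print("✅ API connection OK:", reply.text[:60])       # API connection check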

Successful setup test result

Figure 4: Successful setup test result

🎮 Demo and Screenshots

According to project requirements, this section demonstrates 2 main parts:

  • Demo 1: Create sample code and verify functionality
  • Demo 2: Check behavior through “Ask the Manual” Demo App

Demo 1: Sample Code – Create and Verify Operation

We’ll write our own code to test how File Search Tool works.

Step 1: Create File Search Store

Code to create File Search Store

Figure 5: Code to create File Search Store

Output when store is successfully created

Figure 6: Output when store is successfully created
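The code in Figure 5 appears only as a screenshot; it is presumably close to this minimal version (the display name is illustrative):

from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Create a File Search Store; the display name is just a human-readable label
file_search_store = client.file_search_stores.create(
    config={'display_name': 'demo-store'}
)
print(file_search_store.name)  # the store's resource name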

Step 2: Upload and Process File

Upload and process file

Figure 7: File processing workflow

Step 3: Query and Receive Response with Citations

Query and Response with citations

Figure 8: Answer with citations
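To verify the citations programmatically, you can inspect the grounding metadata on the response. The field layout below is an assumption based on how other Gemini grounding tools report sources; check the official docs before relying on it:

# Sketch: listing which chunks a File Search answer cited
# (grounding_metadata layout assumed, not verified here)
candidate = response.candidates[0]
metadata = candidate.grounding_metadata
if metadata and metadata.grounding_chunks:
    for chunk in metadata.grounding_chunks:
        print(chunk)  # each grounding chunk points back to a source document
print(response.text)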

Demo 2: Check Behavior with “Ask the Manual” Demo App

Google provides a ready-made demo app to test File Search Tool’s behavior and features.
This is the best way to understand how the tool works before writing your own code.

🎨 Try Google’s Demo App

Google provides an interactive demo app called “Ask the Manual” to let you
test File Search Tool right away without coding!

Ask the Manual demo app interface

Figure 9: Ask the Manual demo app interface (including API key selection)

Testing with Demo App:

  1. Select/enter your API key in the Settings field
  2. Upload PDF file or DOCX to the app
  3. Wait for processing (usually < 1 minute)
  4. Chat and ask questions about the PDF file content
  5. View answers returned from PDF data with citations
  6. Click on citations to verify sources

Files uploaded in demo app

Figure 10: Files uploaded in demo app

Query and response with citations

Figure 11: Query and response with citations in demo app

✅ Demo Summary According to Requirements

We have completed all requirements:

  • Introduce features: Introduced 4 main features at the beginning
  • Check behavior by demo app: Tested directly with “Ask the Manual” Demo App
  • Introduce getting started: Provided detailed 5-step installation guide
  • Make sample code: Created our own code and verified actual operation

Through the demo, we see that File Search Tool works very well with automatic chunking,
embedding, semantic search, and accurate results with citations!

💻 Complete Code Examples

Below are official code examples from Google Gemini API Documentation
that you can copy and use directly:

Example 1: Upload Directly to File Search Store

The fastest way: upload a file directly to the store in one step:

from google import genai
from google.genai import types
import time

client = genai.Client()

# Create the file search store with an optional display name
file_search_store = client.file_search_stores.create(
    config={'display_name': 'your-fileSearchStore-name'}
)

# Upload and import a file into the file search store
operation = client.file_search_stores.upload_to_file_search_store(
    file='sample.txt',
    file_search_store_name=file_search_store.name,
    config={
        'display_name': 'display-file-name',
    }
)

# Wait until import is complete
while not operation.done:
    time.sleep(5)
    operation = client.operations.get(operation)

# Ask a question about the file
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="""Can you tell me about Robert Graves""",
    config=types.GenerateContentConfig(
        tools=[
            types.Tool(
                file_search=types.FileSearch(
                    file_search_store_names=[file_search_store.name]
                )
            )
        ]
    )
)

print(response.text)

Example 2: Upload then Import File (2 Separate Steps)

If you want to upload the file first and then import it into the store:

from google import genai
from google.genai import types
import time

client = genai.Client()

# Upload the file using the Files API
sample_file = client.files.upload(
    file='sample.txt',
    config={'name': 'display_file_name'}
)

# Create the file search store
file_search_store = client.file_search_stores.create(
    config={'display_name': 'your-fileSearchStore-name'}
)

# Import the file into the file search store
operation = client.file_search_stores.import_file(
    file_search_store_name=file_search_store.name,
    file_name=sample_file.name
)

# Wait until import is complete
while not operation.done:
    time.sleep(5)
    operation = client.operations.get(operation)

# Ask a question about the file
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="""Can you tell me about Robert Graves""",
    config=types.GenerateContentConfig(
        tools=[
            types.Tool(
                file_search=types.FileSearch(
                    file_search_store_names=[file_search_store.name]
                )
            )
        ]
    )
)

print(response.text)
📚 Source: Code examples are taken from the Gemini API Official Documentation – File Search.

🎯 Real-World Applications

1. 📚 Document Q&A System

Use Case: Company Documentation Chatbot

Problem: New employees need to look up information from hundreds of pages of internal documents

Solution:

  • Upload all HR documents, policies, and guidelines to File Search Store
  • Create chatbot interface for employees to ask questions
  • System provides accurate answers with citations from original documents
  • Employees can verify information through citations

Benefits: Saves search time, reduces burden on HR team

2. 🔬 Research Assistant

Use Case: Scientific Paper Synthesis

Problem: Researchers need to read and synthesize dozens of papers

Solution:

  • Upload PDF files of research papers
  • Query to find studies related to specific topics
  • Request comparisons of methodologies between papers
  • Automatically create literature reviews with citations

Benefits: Accelerates research process, discovers new insights

3. 🎧 Customer Support Enhancement

Use Case: Automated Support System

Problem: Customers have many product questions, need 24/7 support

Solution:

  • Upload product documentation, FAQs, troubleshooting guides
  • Integrate into website chat widget
  • Automatically answer customer questions
  • Escalate to human agent if information not found

Benefits: Reduces basic tickets by 60-70% and improves customer satisfaction

4. 💻 Code Documentation Navigator

Use Case: Developer Onboarding Support

Problem: New developers need to quickly understand large codebase

Solution:

  • Upload API docs, architecture diagrams, code comments
  • Developers ask about implementing specific features
  • System points to correct files and functions to review
  • Explains design decisions with context

Benefits: Reduces onboarding time from weeks to days

📊 Comparison with Other Solutions

Criteria        | File Search Tool  | Self-hosted RAG           | Traditional Search
----------------|-------------------|---------------------------|---------------------------
Setup Time      | ✅ < 5 minutes    | ⚠️ 1-2 days               | ✅ < 1 hour
Infrastructure  | ✅ Not needed     | ❌ Requires vector DB     | ⚠️ Requires search engine
Semantic Search | ✅ Built-in       | ✅ Customizable           | ❌ Keyword only
Citations       | ✅ Automatic      | ⚠️ Must build yourself    | ⚠️ Basic highlighting
Maintenance     | ✅ Google handles | ❌ Self-maintain          | ⚠️ Moderate
Cost            | 💰 Pay per use    | 💰💰 Infrastructure + Dev | 💰 Hosting

🌟 Best Practices

📄 File Preparation

✅ Do’s

  • Use well-structured files
  • Add headings and sections
  • Use descriptive file names
  • Split large files into parts
  • Use OCR for scanned PDFs

❌ Don’ts

  • Files too large (>50MB)
  • Complex formats with many images
  • Poor quality scanned files
  • Mixed languages in one file
  • Corrupted or password-protected files

🗂️ Store Management

📋 Efficient Store Organization

  • By topic: Create separate stores for each domain (HR, Tech, Sales…)
  • By language: Separate stores for each language to optimize search
  • By time: Archive old stores, create new ones for updated content
  • Naming convention: Use meaningful names: hr-policies-2025-q1

🔍 Query Optimization

# ❌ Poor query
"info"  # Too general

# ✅ Good query
"What is the employee onboarding process in the first month?"

# ❌ Poor query
"python"  # Single keyword

# ✅ Good query
"How to implement error handling in Python API?"

# ✅ Query with context
"""
I need information about the deployment process.
Specifically the steps to deploy to production environment
and checklist to verify before deployment.
"""

⚡ Performance Tips

Speed Up Processing

  1. Batch upload: Upload multiple files at once instead of one by one
  2. Async processing: No need to wait for each file to complete
  3. Cache results: Cache answers for common queries (see the sketch after this list)
  4. Optimize file size: Compress PDFs, remove unnecessary images
  5. Monitor API limits: Track usage to avoid hitting rate limits
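As an example of tip 3, a query cache can be as small as a dictionary. query_file_search() below is a hypothetical wrapper around your generate_content call:

# Sketch of tip 3: cache answers for repeated queries.
# query_file_search() is a hypothetical wrapper around client.models.generate_content.
_answer_cache = {}

def cached_answer(question):
    key = question.strip().lower()        # normalize so trivial variants share a cache entry
    if key not in _answer_cache:
        _answer_cache[key] = query_file_search(question)   # only new questions cost API calls
    return _answer_cache[key]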

🔒 Security

Security Checklist

  • ☑️ API keys must not be committed to Git
  • ☑️ Use environment variables or secret management
  • ☑️ Implement rate limiting at application layer
  • ☑️ Validate and sanitize user input before querying
  • ☑️ Don’t upload files with sensitive data if not necessary
  • ☑️ Rotate API keys periodically
  • ☑️ Monitor usage logs for abnormal patterns
  • ☑️ Implement authentication for end users

💰 Cost Optimization

Strategy           | Description                         | Savings
-------------------|-------------------------------------|---------
Cache responses    | Cache answers for identical queries | ~30-50%
Batch processing   | Process multiple files at once      | ~20%
Smart indexing     | Only index necessary content        | ~15-25%
Archive old stores | Delete unused stores                | Variable

🎊 Conclusion

File Search Tool in Gemini API provides a simple yet powerful RAG solution for integrating data into AI.
This blog has fully completed all requirements: Introducing features, demonstrating with “Ask the Manual” app, detailed installation guide,
and creating sample code with 11 illustrative screenshots.

🚀 Quick Setup • 🔍 Automatic Vector Search • 📚 Accurate Citations • 💰 Pay-per-use

🔗 Official Resources

 

Codex CLI vs Gemini CLI vs Claude Code

1. Codex CLI – Capabilities and New Features

According to OpenAI’s official announcement (“Introducing upgrades to Codex”), Codex CLI has been rebuilt on top of GPT-5-Codex, turning it into an agentic programming assistant — a developer AI that can autonomously plan, reason, and execute tasks across coding environments.

🌟 Core Abilities

  • Handles both small and large tasks: From writing a single function to refactoring entire projects.
  • Cross-platform integration: Works seamlessly across terminal (CLI), IDE (extension), and cloud environments.
  • Task reasoning and autonomy: Can track progress, decompose goals, and manage multi-step operations independently.
  • Secure by design: Runs in a sandbox with explicit permission requests for risky operations.

📈 Performance Highlights

  • Uses 93.7% fewer reasoning tokens for simple tasks, but invests 2× more computation on complex ones.
  • Successfully ran over 7 hours autonomously on long software tasks during testing.
  • Produces more precise code reviews than older Codex versions.

🟢 In short: Codex CLI 2025 is not just a code generator — it’s an intelligent coding agent capable of reasoning, multitasking, and working securely across terminal, IDE, and cloud environments.

2. Codex CLI vs Gemini CLI vs Claude Code: The New Era of AI in the Terminal

The command line has quietly become the next frontier for artificial intelligence.
While graphical AI tools dominate headlines, the real evolution is unfolding inside the terminal — where AI coding assistants now operate directly beside you, as part of your shell workflow.

Three major players define this new space: Codex CLI, Gemini CLI, and Claude Code.
Each represents a different philosophy of how AI should collaborate with developers — from speed and connectivity to reasoning depth. Let’s break down what makes each contender unique, and where they shine.


🧩 Codex CLI — OpenAI’s Code-Focused Terminal Companion

Codex CLI acts as a conversational layer over your terminal.
It listens to natural language commands, interprets your intent, and translates it into executable code or shell operations.
Now powered by OpenAI’s Codex5-Medium, it builds on the strengths of the o4-mini generation while adding adaptive reasoning and a larger 256K-token context window.

Once installed, Codex CLI integrates seamlessly with your local filesystem.
You can type:

“Create a Python script that fetches GitHub issues and logs them daily,”
and watch it instantly scaffold the files, import the right modules, and generate functional code.

Codex CLI supports multiple languages — Python, JavaScript, Go, Rust, and more — and is particularly strong at rapid prototyping and bug fixing.
Its defining trait is speed: responses feel immediate, making it perfect for fast iteration cycles.

Best for: developers who want quick, high-quality code generation and real-time debugging without leaving the terminal.


🌤️ Gemini CLI — Google’s Adaptive Terminal Intelligence

Gemini CLI embodies Google’s broader vision for connected AI development — blending reasoning, utility, and live data access.
Built on Gemini 2.5 Pro, this CLI isn’t just a coding bot — it’s a true multitool for developers and power users alike.

Beyond writing code, Gemini CLI can run shell commands, retrieve live web data, or interface with Google Cloud services.
It’s ideal for workflows that merge coding with external context — for example:

  • fetching live API responses,

  • monitoring real-time metrics,

  • or updating deployment configurations on-the-fly.

Tight integration with VS Code, Google Cloud SDK, and Workspace tools turns Gemini CLI into a full-spectrum AI companion rather than a mere code generator.

Best for: developers seeking a versatile assistant that combines coding intelligence with live, connected utility inside the terminal.


🧠 Claude Code — Anthropic’s Deep Code Reasoner

If Codex is about speed, and Gemini is about connectivity, Claude Code represents depth.
Built on Claude Sonnet 4.5, Anthropic’s upgraded reasoning model, Claude Code is designed to operate as a true engineering collaborator.

It excels at understanding, refactoring, and maintaining large-scale codebases.
Claude Code can read entire repositories, preserve logic across files, and even generate complete pull requests with human-like commit messages.
Its upgraded 250K-token context window allows it to track dependencies, explain architectural patterns, and ensure code consistency over time.

Claude’s replies are more analytical — often including explanations, design alternatives, and justifications for each change.
It trades a bit of speed for a lot more insight and reliability.

Best for: professional engineers or teams managing complex, multi-file projects that demand reasoning, consistency, and full-codebase awareness.

3. Codex CLI vs Gemini CLI vs Claude Code: Hands-on With Two Real Projects

While benchmarks and specs are useful, nothing beats actually putting AI coding agents to work.
To see how they perform on real, practical front-end tasks, I tested three leading terminal assistants — Codex CLI (Codex5-Medium), Gemini CLI (Gemini 2.5 Pro), and Claude Code (Sonnet 4.5) — by asking each to build two classic web projects using only HTML, CSS, and JavaScript.

  • 🎮 Project 1: Snake Game — canvas-based, pixel-style, smooth movement, responsive.

  • Project 2: Todo App — CRUD features, inline editing, filters, localStorage, dark theme, accessibility + keyboard support.

🎮 Task 1 — Snake Game

Goal

Create a playable 2D Snake Game using HTML, CSS, and JavaScript.
Display a grid-based canvas with a moving snake that grows when it eats food.
The snake should move continuously and respond to arrow-key inputs.
The game ends when the snake hits the wall or itself.
Include a score counter and a restart button with pixel-style graphics and responsive design.

Prompt

Create a playable 2D Snake Game using HTML, CSS, and JavaScript.
  The game should display a grid-based canvas with a moving snake that grows when it eats food.
  The snake should move continuously and respond to keyboard arrow keys for direction changes.
  The game ends when the snake hits the wall or itself.
  Show a score counter and a restart button.
  Use smooth movement, pixel-style graphics, and responsive design for different screen sizes.

Observations

Codex CLI — Generated the basic canvas scaffold in seconds. Game loop, input, and scoring worked out of the box, but it required minor tuning for smoother turning and anti-reverse logic.

Gemini CLI — Delivered well-structured, commented code and used requestAnimationFrame properly. Gameplay worked fine, though the UI looked plain — more functional than fun.

Claude Code — Produced modular, production-ready code with solid collision handling, restart logic, and a polished HUD. Slightly slower response but the most complete result overall.

✅ Task 2 — Todo App

Goal

Build a complete, user-friendly Todo List App using only HTML, CSS, and JavaScript (no frameworks).
Features: add/edit/delete tasks, mark complete/incomplete, filter All / Active / Completed, clear completed, persist via localStorage, live counter, dark responsive UI, and full keyboard accessibility (Enter/Space/Delete).
Deliverables: index.html, style.css, app.js — clean, modular, commented, semantic HTML + ARIA.

Prompt

Develop a complete and user-friendly Todo List App using only HTML, CSS, and JavaScript (no frameworks). The app should include the following functionality and design requirements:

    1. Input field and ‘Add’ button to create new tasks.
    2. Ability to mark tasks as complete/incomplete via checkboxes.
    3. Inline editing of tasks by double-clicking — pressing Enter saves changes and Esc cancels.
    4. Delete buttons to remove tasks individually.
    5. Filter controls for All, Active, and Completed tasks.
    6. A ‘Clear Completed’ button to remove all completed tasks at once.
    7. Automatic saving and loading of todos using localStorage.
    8. A live counter showing the number of active (incomplete) tasks.
    9. A modern, responsive dark theme UI using CSS variables, rounded corners, and hover effects.
    10. Keyboard accessibility — Enter to add, Space to toggle, Delete to remove tasks.
      Ensure the project is well structured with three separate files:
    • index.html
    • style.css
    • app.js
      Code should be clean, modular, and commented, with semantic HTML and appropriate ARIA attributes for accessibility.

Observations

Codex CLI — Created a functional 3-file structure with working CRUD, filters, and persistence. Fast, but accessibility and keyboard flows needed manual reminders.

Gemini CLI — Balanced logic and UI nicely. Used CSS variables for a simple dark theme and implemented localStorage properly.
Performance was impressive — Gemini was the fastest overall, but its default design felt utilitarian, almost as if it “just wanted to get the job done.”
Gemini focuses on correctness and functionality rather than visual finesse.

Claude Code — Implemented inline editing, keyboard shortcuts, ARIA live counters, and semantic roles perfectly. The result was polished, responsive, and highly maintainable.

4. Codex CLI vs Gemini CLI vs Claude Code — Real-World Comparison

When testing AI coding assistants, speed isn’t everything — clarity, structure, and the quality of generated code all matter. To see how today’s top command-line tools compare, I ran the same set of projects across Claude Code, Gemini CLI, and Codex CLI, including a 2D Snake Game and a Todo List App.
Here’s how they performed.


Claude Code: Polished and Reliable

Claude Code consistently produced the most professional and complete results.
Its generated code came with clear structure, organized logic, and well-commented sections.
In the Snake Game test, Claude built the best-looking user interface, with a balanced layout, responsive design, and smooth movement logic.
Errors were handled cleanly, and the overall experience felt refined — something you could hand over to a production team with confidence.
Although it wasn’t the fastest, Claude made up for it with code quality, structure, and ease of prompt engineering.
If your workflow values polish, maintainability, and readability, Claude Code is the most dependable choice.


Gemini CLI: Fastest but Basic

Gemini CLI clearly took the top spot for speed.
It executed quickly, generated files almost instantly, and made iteration cycles shorter.
However, the output itself felt minimal and unrefined — both the UI and the underlying logic were quite basic compared to Claude or Codex.
In the Snake Game task, Gemini produced a playable result but lacked visual polish and consistent structure.
Documentation and comments were also limited.
In short, Gemini is great for rapid prototyping or testing ideas quickly, but not for projects where you need beautiful UI, advanced logic, or long-term maintainability.


Codex CLI: Flexible but Slower

Codex CLI offered good flexibility and handled diverse prompts reasonably well.
It could generate functional UIs with decent styling, somewhere between Gemini’s simplicity and Claude’s refinement.
However, its main drawback was speed — responses were slower, and sometimes additional manual intervention was needed to correct or complete the code.
Codex is still a solid option when you need to tweak results manually or explore multiple implementation approaches, but it doesn’t match Claude’s polish or Gemini’s speed.


Overall Impression

After testing multiple projects, the overall ranking became clear:

  • Gemini CLI is the fastest but produces simple and unpolished code.

  • Claude Code delivers the most reliable, structured, and visually refined results.

  • Codex CLI sits in between — flexible but slower and less cohesive.

Each tool has its strengths. Gemini is ideal for quick builds, Codex for experimentation, and Claude Code for professional, trust-ready outputs.

In short:

Gemini wins on speed. Claude wins on quality. Codex stands in between — flexible but slower.

Ask Questions about Your PDFs with Cohere Embeddings + Gemini LLM

🔍 Experimenting with Image Embedding Using Large AI Models

Recently, I experimented with embedding images using major AI models to build a multimodal semantic search system, where users can search images with text (and vice versa).

🧐 A Surprising Discovery

I was surprised to find that as of 2025, Cohere is the only provider that supports direct image embedding via API.
Other major models like OpenAI and Gemini (by Google) do support image input in general, but do not clearly provide a direct embedding API for images.


Reason for Choosing Cohere

I chose to try Cohere’s embed-v4.0 because:

  • It supports embedding text, images, and even PDF documents (converted to images) into the same vector space.

  • You can choose the embedding size (the default is 1536; I set 1024 in the code below).

  • It returns normalized embeddings that are ready to use for search and classification tasks.


⚙️ How I Built the System

I used Python for implementation. The system has two main flows:

1️⃣ Document Preparation Flow

  • Load documents, images, or text data that I want to store.

  • Use the Cohere API to embed them into vector representations.

  • Save these vectors in a database or vector store for future search queries.

2️⃣ User Query Flow

  • When a user asks a question or types a query:

    • Use Cohere to embed the query into a vector.

    • Search for the most similar documents in the vector space.

    • Return results to the user using an LLM (Large Language Model) such as Google's Gemini.


🔑 How to Get API Keys

You need two keys: a Cohere API key from the Cohere dashboard (https://dashboard.cohere.com) and a Gemini API key from Google AI Studio (https://aistudio.google.com/), as covered in Step 3 of the File Search guide above.

🔧 Flow 1: Setting Up Cohere and Gemini in Python

✅ Step 1: Install and Set Up Cohere

Run the following command in your terminal to install the Cohere Python SDK:

pip install -q cohere

Then, initialize the Cohere client in your Python script:

import cohere

# Replace <<YOUR_COHERE_KEY>> with your actual Cohere API key
cohere_api_key = "<<YOUR_COHERE_KEY>>"
co = cohere.ClientV2(api_key=cohere_api_key)


✅ Step 2: Install and Set Up Gemini (Google Generative AI)

Install the Gemini client library with:

pip install -q google-genai

Then, initialize the Gemini client in your Python script:

from google import genai

# Replace <<YOUR_GEMINI_KEY>> with your actual Gemini API key
gemini_api_key = "<<YOUR_GEMINI_KEY>>"
client = genai.Client(api_key=gemini_api_key)

📌 Flow 1: Document Preparation and Embedding

We will now walk through the steps to turn a PDF into embedding data with Cohere.


📥 Step 1: Download the PDF

We start by downloading the PDF from a given URL.

import requests

def download_pdf_from_url(url, save_path="downloaded.pdf"):
    response = requests.get(url)
    if response.status_code == 200:
        with open(save_path, "wb") as f:
            f.write(response.content)
        print("PDF downloaded successfully.")
        return save_path
    else:
        raise Exception(f"PDF download failed. Error code: {response.status_code}")

# Example usage
pdf_url = "https://sgp.fas.org/crs/misc/IF10244.pdf"
local_pdf_path = download_pdf_from_url(pdf_url)


🖼️ Step 2: Convert PDF Pages to Text + Image

We extract both text and image for each page using PyMuPDF.

import fitz  # PyMuPDF
import base64
import io
from PIL import Image

def extract_page_data(pdf_path):
    doc = fitz.open(pdf_path)
    pages_data = []
    img_paths = []

    for i, page in enumerate(doc):
        text = page.get_text()

        # Render the page to a PNG image
        pix = page.get_pixmap()
        image = Image.open(io.BytesIO(pix.tobytes("png")))

        # Encode the image as a base64 data URL
        buffered = io.BytesIO()
        image.save(buffered, format="PNG")
        encoded_img = base64.b64encode(buffered.getvalue()).decode("utf-8")
        data_url = f"data:image/png;base64,{encoded_img}"

        # Fused text + image content for multimodal embedding
        content = [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": data_url}},
        ]

        pages_data.append({"content": content})
        img_paths.append({"data_url": data_url})

    return pages_data, img_paths

# Example usage
pages, img_paths = extract_page_data(local_pdf_path)


📤 Step 3: Embed Using Cohere

Now, send the fused text + image inputs to Cohere’s embed-v4.0 model.

res = co.embed(
    model="embed-v4.0",
    inputs=pages,  # fused inputs
    input_type="search_document",
    embedding_types=["float"],
    output_dimension=1024,
)

embeddings = res.embeddings.float_
print(f"Number of embedded pages: {len(embeddings)}")


Flow 1 complete: You now have the embedded vector representations of your PDF pages.

👉 Proceed to Flow 2 (e.g., storing, indexing, or querying the embeddings).
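If you want to skip re-embedding on the next run, you can persist the Flow 1 vectors first; a simple option is a NumPy file (the path is illustrative):

import numpy as np

# Persist the Flow 1 embeddings so the PDF does not need re-embedding next time
emb_matrix = np.asarray(embeddings)         # shape: (num_pages, 1024)
np.save("page_embeddings.npy", emb_matrix)
# later: emb_matrix = np.load("page_embeddings.npy")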

🔍 Flow 2: Ask a Question and Retrieve the Answer Using Image + LLM

This flow allows the user to ask a natural language question, find the most relevant image using Cohere Embed v4, and then answer the question using Gemini 2.5 Vision LLM.


💬 Step 1: Ask the Question

We define the user query in plain English.

question = "What was the total number of wildfires in the United States from 2007 to 2015?"

🧠 Step 2: Convert the Question to Embedding & Find Relevant Image

We use embed-v4.0 with input type search_query, then calculate cosine similarity between the question embedding and previously embedded document images.

import numpy as np
from IPython.display import display  # for showing the matched image in a notebook

def search(question, max_img_size=800):
    # Get embedding for the query
    api_response = co.embed(
        model="embed-v4.0",
        input_type="search_query",
        embedding_types=["float"],
        texts=[question],
        output_dimension=1024,
    )

    query_emb = np.asarray(api_response.embeddings.float_[0])

    # Cosine similarity reduces to a dot product because the embeddings are normalized
    cos_sim_scores = np.dot(embeddings, query_emb)
    top_idx = np.argmax(cos_sim_scores)  # Most relevant image

    hit_img_path = img_paths[top_idx]
    base64url = hit_img_path["data_url"]

    print("Question:", question)
    print("Most relevant image:", hit_img_path)

    # Display the matched image
    if base64url.startswith("data:image"):
        base64_str = base64url.split(",")[1]
    else:
        base64_str = base64url

    image_data = base64.b64decode(base64_str)
    image = Image.open(io.BytesIO(image_data))

    image.thumbnail((max_img_size, max_img_size))
    display(image)

    return base64url


🤖 Step 3: Use Vision-LLM (Gemini 2.5) to Answer

We use Gemini 2.5 Flash to answer the question based on the most relevant image.

def answer(question, base64_img_str):
    if base64_img_str.startswith("data:image"):
        base64_img_str = base64_img_str.split(",")[1]

    image_bytes = base64.b64decode(base64_img_str)
    image = Image.open(io.BytesIO(image_bytes))

    prompt = [
        f"""Answer the question based on the following image.
Don't use markdown.
Please provide enough context for your answer.

Question: {question}""",
        image,
    ]

    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-04-17",
        contents=prompt,
    )

    answer = response.text
    print("LLM Answer:", answer)


▶️ Step 4: Run the Full Flow

top_image_path = search(question)
answer(question, top_image_path)

🧪 Example Usage:

question = "What was the total number of wildfires in the United States from 2007 to 2015?"

# Step 1: Find the best-matching image
top_image_path = search(question)

# Step 2: Use the image to answer the question
answer(question, top_image_path)

🧾 Output:

Question: What was the total number of wildfires in the United States from 2007 to 2015?

Most relevant image:

 

LLM Answer: Based on the provided image, to find the total number of wildfires in the United States from 2007 to 2015, we need to sum the number of wildfires for each year in this period. Figure 1 shows the annual number of fires in thousands from 1993 to 2022, which covers the requested period. Figure 2 provides the specific number of fires for 2007 and 2015 among other years. Using the specific values from Figure 2 for 2007 and 2015, and estimating the number of fires for the years from 2008 to 2014 from Figure 1, we can calculate the total.

 

The number of wildfires in 2007 was 67.8 thousand (from Figure 2).

Estimating from Figure 1:

2008 was approximately 75 thousand fires.

2009 was approximately 75 thousand fires.

2010 was approximately 67 thousand fires.

2011 was approximately 74 thousand fires.

2012 was approximately 68 thousand fires.

2013 was approximately 47 thousand fires.

2014 was approximately 64 thousand fires.

The number of wildfires in 2015 was 68.2 thousand (from Figure 2).

 

Summing these values:

Total = 67.8 + 75 + 75 + 67 + 74 + 68 + 47 + 64 + 68.2 = 606 thousand fires.

 

Therefore, the total number of wildfires in the United States from 2007 to 2015 was approximately 606,000. This number is based on the sum of the annual number of fires obtained from Figure 2 for 2007 and 2015, and estimates from Figure 1 for the years 2008 through 2014.

Try this full pipeline on Google Colab: https://colab.research.google.com/drive/1kdIO-Xi0MnB1c8JrtF26Do3T54dij8Sf

🧩 Final Thoughts

This simple yet powerful two-step pipeline demonstrates how you can combine Cohere’s Embed v4 with Gemini’s Vision-Language capabilities to build a system that understands both text and images. By embedding documents (including large images) and using semantic similarity to retrieve relevant content, we can create a more intuitive, multimodal question-answering experience.

This approach is especially useful in scenarios where information is stored in visual formats like financial reports, dashboards, or charts — allowing LLMs to not just “see” the image but reason over it in context.

Multimodal retrieval-augmented generation (RAG) is no longer just theoretical — it’s practical, fast, and deployable today.