🔍 Experimenting with Image Embedding Using Large AI Models
Recently, I experimented with embedding images using major AI models to build a multimodal semantic search system, where users can search images with text (and vice versa).
🧐 A Surprising Discovery
I was surprised to find that, as of 2025, Cohere was the only major provider I could find that supports direct image embedding through its embeddings API.
Other major providers such as OpenAI and Google (Gemini) accept image input in their generation APIs, but do not clearly offer a dedicated endpoint for embedding images.
Reason for Choosing Cohere
I chose to try Cohere's embed-v4.0 because:
- It supports embedding text, images, and even PDF documents (converted to images) into the same vector space.
- You can choose the embedding size (the default is 1536; I used 1024 in the code below).
- It returns normalized embeddings that are ready to use for search and classification tasks (see the small illustration below).
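Because the vectors come back normalized, a plain dot product already equals cosine similarity, which is why the search code later in this post can rank pages with a single np.dot call. Here is a toy illustration with made-up 2-D vectors (real embeddings are 1024- or 1536-dimensional):
import numpy as np

# For unit-length (normalized) vectors, dot product == cosine similarity
a = np.array([0.6, 0.8])  # hypothetical normalized embedding
b = np.array([0.8, 0.6])  # hypothetical normalized embedding
print(np.dot(a, b))       # 0.96, no extra normalization step needed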
⚙️ How I Built the System
I used Python for the implementation. The system has two main flows:
1️⃣ Document Preparation Flow
- Load the documents, images, or text data I want to store.
- Use the Cohere API to embed them into vector representations.
- Save these vectors in a database or vector store for future search queries.
2️⃣ User Query Flow
When a user asks a question or types a query:
- Use Cohere to embed the query into a vector.
- Search for the most similar documents in the vector space.
- Return results to the user using an LLM (Large Language Model) such as Google's Gemini.
🔑 How to Get API Keys
- To use Cohere, go to https://cohere.com, sign up, and get your API key. (Cohere currently offers a free tier; see details at docs.cohere.com/docs/rate-limits.)
- To use Gemini (Google), go to https://aistudio.google.com, sign up, and get your API key. (Gemini also has a free tier; see details at ai.google.dev/gemini-api/docs/rate-limits.)
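A small aside: rather than pasting the keys directly into your script (as the snippets below do for simplicity), you can keep them in environment variables; the variable names here are just my own choice:
import os

# Read the API keys from environment variables (names are arbitrary)
cohere_api_key = os.environ["COHERE_API_KEY"]
gemini_api_key = os.environ["GEMINI_API_KEY"]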
🔧 Setup: Installing Cohere and Gemini in Python
✅ Step 1: Install and Set Up Cohere
Run the following command in your terminal to install the Cohere Python SDK:
pip install -q cohere
Then, initialize the Cohere client in your Python script:
import cohere
# Replace <<YOUR_COHERE_KEY>> with your actual Cohere API key
cohere_api_key = "<<YOUR_COHERE_KEY>>"
co = cohere.ClientV2(api_key=cohere_api_key)
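As an optional sanity check (my own habit, not required), you can embed a short text right away and confirm that a vector comes back:
# Optional: embed one short string and inspect the vector's length
test = co.embed(
    model="embed-v4.0",
    texts=["hello world"],
    input_type="search_document",
    embedding_types=["float"],
)
print(len(test.embeddings.float_[0]))  # embedding dimension (1536 by default)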
✅ Step 2: Install and Set Up Gemini (Google Generative AI)
Install the Gemini client library with:
pip install -q google-genai
Then, initialize the Gemini client in your Python script:
from google import genai
# Replace <<YOUR_GEMINI_KEY>> with your actual Gemini API key
gemini_api_key = "<<YOUR_GEMINI_KEY>>"
client = genai.Client(api_key=gemini_api_key)
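Likewise, a quick text-only request is an easy way to confirm the Gemini key works (optional; this uses the same model that answers questions later in this post):
# Optional: a quick text-only test call
reply = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="Say hello in one short sentence.",
)
print(reply.text)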
📌 Flow 1: Document Preparation and Embedding
We will walk through the steps to turn a PDF into embedding data using Cohere.
📥 Step 1: Download the PDF
We start by downloading the PDF from a given URL.
import requests

def download_pdf_from_url(url, save_path="downloaded.pdf"):
    response = requests.get(url)
    if response.status_code == 200:
        with open(save_path, "wb") as f:
            f.write(response.content)
        print("PDF downloaded successfully.")
        return save_path
    else:
        raise Exception(f"PDF download failed. Error code: {response.status_code}")

# Example usage
pdf_url = "https://sgp.fas.org/crs/misc/IF10244.pdf"
local_pdf_path = download_pdf_from_url(pdf_url)
🖼️ Step 2: Convert PDF Pages to Text + Image
We extract both text and image for each page using PyMuPDF.
import fitz  # PyMuPDF
import base64
from PIL import Image
import io

def extract_page_data(pdf_path):
    doc = fitz.open(pdf_path)
    pages_data = []
    img_paths = []
    for i, page in enumerate(doc):
        text = page.get_text()
        pix = page.get_pixmap()
        image = Image.open(io.BytesIO(pix.tobytes("png")))
        buffered = io.BytesIO()
        image.save(buffered, format="PNG")
        encoded_img = base64.b64encode(buffered.getvalue()).decode("utf-8")
        data_url = f"data:image/png;base64,{encoded_img}"
        # Fused input: each page is represented by both its text and its rendered image
        content = [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": data_url}},
        ]
        pages_data.append({"content": content})
        img_paths.append({"data_url": data_url})
    return pages_data, img_paths

# Example usage
pages, img_paths = extract_page_data(local_pdf_path)
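Before embedding, it can help to peek at what was extracted (a small sanity check I added; it is not required by the flow):
# Quick sanity check on the extracted pages
print(f"Extracted {len(pages)} pages")
first_page_text = pages[0]["content"][0]["text"]
print(first_page_text[:200])  # first 200 characters of page 1's text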
📤 Step 3: Embed Using Cohere
Now, send the fused text + image inputs to Cohere's embed-v4.0 model.
res = co.embed(
    model="embed-v4.0",
    inputs=pages,  # fused text + image inputs
    input_type="search_document",
    embedding_types=["float"],
    output_dimension=1024,
)
embeddings = res.embeddings.float_
print(f"Number of embedded pages: {len(embeddings)}")
✅ Flow 1 complete: You now have the embedded vector representations of your PDF pages.
👉 Next, store or index these embeddings so they can be queried in Flow 2.
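For this small experiment I simply keep everything in memory, but if you want to persist the results between sessions, a minimal option is to dump them with NumPy and JSON (a sketch with placeholder file names; a production system would use a proper vector store instead):
import json
import numpy as np

# Save the page embeddings and the matching page images so they can be reloaded later
np.save("page_embeddings.npy", np.asarray(embeddings))
with open("page_images.json", "w") as f:
    json.dump(img_paths, f)

# ...and reload them in a later session
embeddings = np.load("page_embeddings.npy")
with open("page_images.json") as f:
    img_paths = json.load(f)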
🔍 Flow 2: Ask a Question and Retrieve the Answer Using Image + LLM
This flow lets the user ask a natural-language question, finds the most relevant page image using Cohere Embed v4, and then answers the question with the vision-capable Gemini 2.5 Flash model.
💬 Step 1: Ask the Question
We define the user query in plain English.
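For example, the query used in the example run at the end of this post:
question = "What was the total number of wildfires in the United States from 2007 to 2015?"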
🧠 Step 2: Convert the Question to Embedding & Find Relevant Image
We use embed-v4.0 with input type search_query, then calculate cosine similarity between the question embedding and the previously embedded document images.
import numpy as np
from IPython.display import display

def search(question, max_img_size=800):
    # Get embedding for the query
    api_response = co.embed(
        model="embed-v4.0",
        input_type="search_query",
        embedding_types=["float"],
        texts=[question],
        output_dimension=1024,
    )
    query_emb = np.asarray(api_response.embeddings.float_[0])

    # Compute cosine similarity with all document embeddings
    cos_sim_scores = np.dot(embeddings, query_emb)
    top_idx = np.argmax(cos_sim_scores)  # Most relevant page

    hit_img_path = img_paths[top_idx]
    base64url = hit_img_path["data_url"]
    print("Question:", question)
    print("Most relevant image:", hit_img_path)

    # Display the matched image
    if base64url.startswith("data:image"):
        base64_str = base64url.split(",")[1]
    else:
        base64_str = base64url
    image_data = base64.b64decode(base64_str)
    image = Image.open(io.BytesIO(image_data))
    image.thumbnail((max_img_size, max_img_size))
    display(image)
    return base64url
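The search function above keeps only the single best match. If you would rather retrieve several candidate pages (for example, to give the LLM more context), a small top-k variant using np.argsort works; this is a sketch under the same assumptions, not part of the original flow:
def search_top_k(question, k=3):
    # Embed the query exactly as in search()
    api_response = co.embed(
        model="embed-v4.0",
        input_type="search_query",
        embedding_types=["float"],
        texts=[question],
        output_dimension=1024,
    )
    query_emb = np.asarray(api_response.embeddings.float_[0])
    # Rank all pages by cosine similarity and keep the k best
    cos_sim_scores = np.dot(embeddings, query_emb)
    top_indices = np.argsort(cos_sim_scores)[::-1][:k]
    return [img_paths[i]["data_url"] for i in top_indices]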
🤖 Step 3: Use Vision-LLM (Gemini 2.5) to Answer
We use Gemini 2.5 Flash to answer the question based on the most relevant image.
def answer(question, base64_img_str):
    if base64_img_str.startswith("data:image"):
        base64_img_str = base64_img_str.split(",")[1]
    image_bytes = base64.b64decode(base64_img_str)
    image = Image.open(io.BytesIO(image_bytes))

    prompt = [
        f"""Answer the question based on the following image.
Don't use markdown.
Please provide enough context for your answer.
Question: {question}""",
        image,
    ]

    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-04-17",
        contents=prompt,
    )

    answer = response.text
    print("LLM Answer:", answer)
▶️ Step 4: Run the Full Flow
🧪 Example Usage:
question = "What was the total number of wildfires in the United States from 2007 to 2015?"

# Step 1: Find the best-matching image
top_image_path = search(question)

# Step 2: Use the image to answer the question
answer(question, top_image_path)
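If you expect to run this repeatedly, the two calls can be wrapped into a tiny helper (just a convenience wrapper around the search and answer functions defined above, not part of the original flow):
def ask(question):
    # Retrieve the most relevant page image, then answer from it
    top_image = search(question)
    answer(question, top_image)

ask("What was the total number of wildfires in the United States from 2007 to 2015?")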
🧾 Output:
Question: What was the total number of wildfires in the United States from 2007 to 2015?
Most relevant image:
LLM Answer: Based on the provided image, to find the total number of wildfires in the United States from 2007 to 2015, we need to sum the number of wildfires for each year in this period. Figure 1 shows the annual number of fires in thousands from 1993 to 2022, which covers the requested period. Figure 2 provides the specific number of fires for 2007 and 2015 among other years. Using the specific values from Figure 2 for 2007 and 2015, and estimating the number of fires for the years from 2008 to 2014 from Figure 1, we can calculate the total.
The number of wildfires in 2007 was 67.8 thousand (from Figure 2).
Estimating from Figure 1:
2008 was approximately 75 thousand fires.
2009 was approximately 75 thousand fires.
2010 was approximately 67 thousand fires.
2011 was approximately 74 thousand fires.
2012 was approximately 68 thousand fires.
2013 was approximately 47 thousand fires.
2014 was approximately 64 thousand fires.
The number of wildfires in 2015 was 68.2 thousand (from Figure 2).
Summing these values:
Total = 67.8 + 75 + 75 + 67 + 74 + 68 + 47 + 64 + 68.2 = 606 thousand fires.
Therefore, the total number of wildfires in the United States from 2007 to 2015 was approximately 606,000. This number is based on the sum of the annual number of fires obtained from Figure 2 for 2007 and 2015, and estimates from Figure 1 for the years 2008 through 2014.
Try this full pipeline on Google Colab: https://colab.research.google.com/drive/1kdIO-Xi0MnB1c8JrtF26Do3T54dij8Sf
🧩 Final Thoughts
This simple yet powerful two-step pipeline demonstrates how you can combine Cohere’s Embed v4 with Gemini’s Vision-Language capabilities to build a system that understands both text and images. By embedding documents (including large images) and using semantic similarity to retrieve relevant content, we can create a more intuitive, multimodal question-answering experience.
This approach is especially useful in scenarios where information is stored in visual formats like financial reports, dashboards, or charts — allowing LLMs to not just “see” the image but reason over it in context.
Multimodal retrieval-augmented generation (RAG) is no longer just theoretical — it’s practical, fast, and deployable today.