🔍 Experimenting with Image Embedding Using Large AI Models
Recently, I experimented with embedding images using major AI models to build a multimodal semantic search system, where users can search images with text (and vice versa).
🧐 A Surprising Discovery
I was surprised to find that, as of 2025, Cohere was the only one of the major providers I looked at that offers a direct image-embedding API.
Other major providers, like OpenAI and Google (Gemini), accept image input in their models, but do not expose a dedicated embedding API for images.
Reasons for Choosing Cohere
I chose to try Cohere's embed-v4.0 because:
- It supports embedding text, images, and even PDF documents (converted to images) into the same vector space (see the sketch below).
- You can choose the embedding size (I used the default, 1536 dimensions).
- It returns normalized embeddings that are ready to use for search and classification tasks.
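To make the shared vector space concrete, here is a minimal sketch of embedding one text snippet and one image with embed-v4.0. It assumes the cohere Python SDK's v2 client, a key in a COHERE_API_KEY environment variable, and a local file photo.jpg (all placeholder names of mine):

```python
import base64
import os

import cohere

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

# Embed a text snippet.
text_resp = co.embed(
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["float"],
    texts=["Quarterly revenue grew 12% year over year."],
)

# Embed an image; images are passed as base64-encoded data URIs.
with open("photo.jpg", "rb") as f:
    data_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("utf-8")

image_resp = co.embed(
    model="embed-v4.0",
    input_type="image",
    embedding_types=["float"],
    images=[data_uri],
)

# Both vectors live in the same 1536-dimensional space, so their
# dot product is a meaningful (cosine) similarity score.
text_vec = text_resp.embeddings.float_[0]
image_vec = image_resp.embeddings.float_[0]
```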
⚙️ How I Built the System
I used Python for implementation. The system has two main flows:
1️⃣ Document Preparation Flow
- Load the documents, images, or text data you want to store.
- Use the Cohere API to embed them into vector representations.
- Save these vectors in a database or vector store for future search queries (a minimal sketch follows below).
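Here is a sketch of this preparation flow. The helper names and the in-memory NumPy matrix standing in for a real vector store are my own choices:

```python
import base64
import os

import cohere
import numpy as np

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])

def to_data_uri(path: str) -> str:
    """Cohere's embed endpoint takes images as base64 data URIs."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("utf-8")

def index_images(paths: list[str]) -> np.ndarray:
    """Embed each image and stack the vectors into one matrix."""
    vectors = []
    for path in paths:
        resp = co.embed(
            model="embed-v4.0",
            input_type="image",
            embedding_types=["float"],
            images=[to_data_uri(path)],
        )
        vectors.append(resp.embeddings.float_[0])
    return np.array(vectors)  # shape: (num_images, 1536)

doc_paths = ["report_page1.jpg", "chart_q3.jpg"]
doc_vectors = index_images(doc_paths)  # persist these for querying later
```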
2️⃣ User Query Flow
When a user asks a question or types a query:
- Use Cohere to embed the query into a vector.
- Search for the most similar documents in the vector space.
- Return the results to the user through an LLM (Large Language Model) such as Google's Gemini (a sketch of this flow follows below).
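Here is a sketch of the query flow, continuing from the doc_paths and doc_vectors built above. It assumes the google-genai SDK for the Gemini call, and gemini-2.5-flash is just an example model name:

```python
import os

import cohere
import numpy as np
from google import genai
from google.genai import types

co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])
gemini = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

def answer(query: str, doc_vectors: np.ndarray, doc_paths: list[str]) -> str:
    # 1. Embed the user query into the same vector space as the documents.
    resp = co.embed(
        model="embed-v4.0",
        input_type="search_query",
        embedding_types=["float"],
        texts=[query],
    )
    q = np.array(resp.embeddings.float_[0])

    # 2. embed-v4.0 vectors are normalized, so a dot product gives
    #    cosine similarity; pick the best-matching image.
    best_path = doc_paths[int(np.argmax(doc_vectors @ q))]

    # 3. Let Gemini answer the question grounded in that image.
    with open(best_path, "rb") as f:
        image_bytes = f.read()
    result = gemini.models.generate_content(
        model="gemini-2.5-flash",  # example model name
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
            query,
        ],
    )
    return result.text

print(answer("What was Q3 revenue?", doc_vectors, doc_paths))
```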
🔑 How to Get API Keys
- To use Cohere, go to https://cohere.com, sign up, and get your API key. (Cohere currently offers a free tier; see details at docs.cohere.com/docs/rate-limits.)
- To use Gemini (Google), go to https://aistudio.google.com, sign up, and get your API key. (Gemini also has a free tier; see details at ai.google.dev/gemini-api/docs/rate-limits.)
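Throughout the code in this post I assume both keys are exported as environment variables (the variable names are my own convention):

```bash
export COHERE_API_KEY="your-cohere-key"
export GEMINI_API_KEY="your-gemini-key"
```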
🔧 Flow 1: Setting Up Cohere and Gemini in Python
✅ Step 1: Install and Set Up Cohere
Run the following command in your terminal to install the Cohere Python SDK:
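```bash
pip install cohere
```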
Then, initialize the Cohere client in your Python script:
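A minimal setup, assuming the v2 client and the COHERE_API_KEY environment variable from above:

```python
import os

import cohere

# Reads the key exported earlier; avoid hard-coding it in scripts.
co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])
```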
✅ Step 2: Install and Set Up Gemini (Google Generative AI)
Install the Gemini client library with:
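At the time of writing, Google's current Python SDK for Gemini is the google-genai package:

```bash
pip install google-genai
```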
Then, initialize the Gemini client in your Python script:
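```python
import os

from google import genai

# Uses the GEMINI_API_KEY exported earlier.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
```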
🧩 Final Thoughts
This simple yet powerful two-step pipeline demonstrates how you can combine Cohere’s Embed v4 with Gemini’s Vision-Language capabilities to build a system that understands both text and images. By embedding documents (including large images) and using semantic similarity to retrieve relevant content, we can create a more intuitive, multimodal question-answering experience.
This approach is especially useful in scenarios where information is stored in visual formats like financial reports, dashboards, or charts — allowing LLMs to not just “see” the image but reason over it in context.
Multimodal retrieval-augmented generation (RAG) is no longer just theoretical — it’s practical, fast, and deployable today.