Exploring Claude Code Subagents: A Demo Setup for a RAG-Based Website Project

Posted on August 1, 2025 by hello@scuti

1. Introduction

Recently, Anthropic released an incredible new feature for its product Claude: subagents — secondary agents with specific tasks for different purposes within a user’s project.

2. Main Content

a. How to Set It Up:
First, install Claude using the following command in your Terminal window:

npm i @anthropic-ai/claude-code

If Claude is already installed but it’s an older version, it won’t have the subagent feature.

to update claude, command : claude update

Launch Claude Code in your working directory, then run the command:
/agents

Press Enter, and a management screen for agents will appear, allowing you to start creating agents with specific purposes for your project.

Here, I will set it up following Claude’s recommendation.

After the setup, I have the following subagents:

I will ask Claude to help me build a website using RAG with the following prompt:

The first subagents have started working.

The setup of the RAG project has been completed.

However, I noticed that the subagent ‘production-code-reviewer (Review RAG system code)’ didn’t function after the coding was completed. It might be an issue with my prompt, so I will ask Claude to review the code for me

After the whole working process, Claude Code will deliver an excellent final product.
Link: https://github.com/mhieupham1/claudecode-subagent

3. Conclusion

Through the entire setup process and practical use in a project, it’s clear how powerful and beneficial the Sub-agents feature introduced by Anthropic for Claude Code is. It enables us to have AI “teammates” with specialized skills and roles that operate independently without interfering with each other — allowing projects to be organized, easy to understand, and efficient.

Gemini CLI vs. Claude Code CLI: A Comprehensive Comparison for Developers

Posted on July 24, 2025July 24, 2025 by hello@scuti

1. Introduction to the Launch of Gemini CLI

Recently, Google launched Gemini CLI – an open-source AI agent that can be directly integrated into the terminal for work. In previous articles about Claude Code CLI, we already saw its powerful features. Now, with the interesting arrival of Gemini CLI, users have even more options when choosing which agent to use. In this article, we’ll explore and compare the different criteria between Claude Code CLI and Gemini CLI to see which agent might best suit your needs.

2. Comparison Criteria Between the Two CLI Agents

a. Platform Support

Claude Code CLI: This tool has certain limitations when it comes to operating system support. It works well on MacOS and Ubuntu, but for Windows users, it requires extra steps such as installing an Ubuntu WSL virtual machine. Even then, there are still some restrictions and a less-than-ideal user experience.
Gemini CLI: Google’s new tool supports all operating systems, allowing users on any platform to set up and use it quickly and easily.

b. Open Source

Claude Code CLI: This is a closed-source tool, so its development is entirely controlled by Anthropic.
Gemini CLI: Google’s tool is open source, licensed under Apache 2.0, which enables the user community to access and collaborate on making the tool more robust and faster.

c. AI Model

Claude Code CLI: Utilizes powerful Anthropic models such as Claude Opus 4 and Claude Sonnet 3.7, both highly effective for coding tasks.
Gemini CLI: Gives access to Gemini 2.5 Pro and Gemini 2.5 Flash, each useful for different needs.

d. Context Limitations

Claude Code CLI: This is a paid tool. Users can access it through their Claude account with various tiers, each offering different token limits (from 250K to 1M tokens per model). Users can also use Claude’s API key to pay based on token usage.
Gemini CLI: Google’s tool provides a free version, which allows access to Gemini 2.5 Pro, but can quickly hit the limit and drop down to Gemini 2.5 Flash.

e. Community and Extensibility

Claude Code CLI: As a closed-source tool, only the developer (Anthropic) can improve and maintain it.
Gemini CLI: Being open source, it has a large and vibrant community contributing to its rapid improvement and greater capabilities.

3. Gemini CLI

Link: https://github.com/mhieupham1/Flashcard_GeminiCLI
Prompt Example:
- Please make for me a website about using flashcard for learning English with HTML, CSS, Javascript, do the best for UI/UX
- A flashcard set can archive many words, user can add more word to a new set or existed set
- Function for folder that can add existed flashcard sets or remove it
- Function for flashcard set that can edit transfer user to a web to practice in this flashcard set
- Dashboard need to have more eye-catching, good layout
- And many prompts to ask Gemini CLI to fix their own bugs
- Make the web has layout, functions like an official website with better CSS, JS, HTML
Strengths:
- Can handle large token requests and good at reading context
- Cost: Free version can access Gemini 2.5 Pro, but may quickly hit limits and fall back to Gemini 2.5 Flash. Sometimes, after logging out and back in, it works normally again with Gemini 2.5 Flash. A pro account offers a one-month free trial, after which users can cancel or continue with the stated price.

Weaknesses:
- Requires a very large number of tokens (1M tokens for pro, 11M for flash) to build the website (even when incomplete)

- Prone to repeated error loops, wasting tokens
- Codebase is still weak and doesn’t always fully understand user intentions or basic web concepts, so prompts need to be very detailed

4. Claude Code CLI

Link: https://github.com/mhieupham1/Flashcard_ClaudeCodeCLI
Prompt Example:
- Please make for me a website about using flashcard for learning English with HTML, CSS, Javascript, do the best for UI/UX
- A flashcard set can archive many words, user can add more word to a new set or existed set
- Function for folder that can add existed flashcard sets or remove it
- Function for flashcard set that can edit transfer user to a web to practice in this flashcard set
- Dashboard need to have more eye-catching, good layout
Strengths:
- Understands user ideas very well, outputs high-quality, efficient, and minimal code without missing features
- Only required 30K tokens for the flashcard web demo
- Good, user-friendly UI/UX
- Produced the demo with a single request (using only a pro account, not the max tier)
Weaknesses:
- Requires a paid account or API key (tokens = dollars), but the code quality is worth the price

5. Conclusion

With the comparison above, it’s clear that Gemini CLI is currently much stronger than Claude Code CLI. However, a deeper dive into their practical efficiency and benefits for different use cases is still needed.

a. Gemini CLI

Strengths:
- Free to use with high token limits, suitable for large projects needing a large context window
- Highly compatible across platforms and easy to set up
- Open source, ensuring rapid improvement through community contributions
- Fast code reading and generation
Weaknesses:
- Can randomly hit usage limits, dropping from Gemini Pro 2.5 to Gemini Flash 2.5, reducing effectiveness
- Prone to repeated errors/loops, which can be difficult to escape’

- Codebase may not be as efficient, often needing very detailed prompts

b. Claude Code CLI:

Strengths:
- High-quality, thoughtful, and efficient codebase generation
- Highly suitable for commercial projects thanks to token optimization
Weaknesses:
- Requires a paid account, with different tiers for different performance levels; top tier is expensive
- Limited cross-platform compatibility, making it less accessible or offering a poorer experience for some users

6. Which Should You Use? Summary of Best Use Cases

When is Claude Code CLI most convenient?
Claude Code CLI is the better choice if you prioritize high-quality, efficient, and minimal code output, especially for commercial projects that require clean UI/UX and robust functionality. It is also ideal when you want to achieve your result in a single, well-phrased prompt. However, you need to be willing to pay for a subscription or API access, and set up the tool on a supported platform.

When is Gemini CLI more convenient?
Gemini CLI is perfect if you need a free, open-source tool that works across all major operating systems and is easy to install. It’s best for large projects that require handling a lot of data or context, and for those who want to benefit from fast community-driven improvements. Gemini CLI is especially suitable for personal, experimental, or learning projects, or when you need flexibility and cross-platform compatibility—even though it might sometimes require more detailed prompts or troubleshooting.

Combining tmux and Claude to Build an Automated AI Agent System (for Mac & Linux)

Posted on June 29, 2025June 29, 2025 by hello@scuti

1. Introduction

With the rapid growth of AI, multi-agent systems are attracting more attention due to their ability to coordinate, split tasks, and handle complex automation. An “agent” can be an independent AI responsible for a specific role or task.

In this article, I’ll show you how to combine tmux (a powerful terminal multiplexer) with Claude (Anthropic’s AI model) to build a virtual organization. Here, AI agents can communicate, collaborate, and work together automatically via the terminal.

2. What is tmux?

tmux lets you split your terminal into multiple windows or sessions, each running its own process independently. Even if you disconnect, these sessions stay alive. This is super useful when you want to run several agents in parallel, each in their own terminal, without interfering with each other.

3. What is Claude?

Claude is an advanced language AI model developed by Anthropic. It can understand and respond to text requests, and it’s easy to integrate into automated systems—acting as a “virtual employee” taking on part of your workflow.

4. Why combine tmux and Claude?

Parallel & Distributed: Each agent is an independent Claude instance running in its own tmux session.

Workflow Automation: Easily simulate complex workflows between virtual departments or roles.

Easy Debug & Management: You can observe each agent’s logs in separate panes or sessions.

5. System Architecture

Let’s imagine a simple company structure:

PRESIDENT: Project Director (sets direction, gives instructions)

boss1: Team Leader (splits up tasks)

worker1, worker2, worker3: Team members (do the work)

Each agent has its own instruction file so it knows its role when starting up.

Agents communicate using a script:

./agent-send.sh [recipient] “[message]”

Workflow:

PRESIDENT → boss1 → workers → boss1 → PRESIDENT

6. Installation

Since the code is a bit long, I’ll just share the GitHub link to keep things short.

tmux:
Install guide: tmux Installing Guide

Claude:
Install guide: Claude Setup Guide

Git:
Install guide: Git Download

Clone the project:

bash
git clone https://github.com/mhieupham1/claudecliagent

Inside, you’ll find the main folders and files:

CLAUDE.md: Describes the agent architecture, communication, and workflows.

instructions/: Contains guidance for each role.

.claude/: JSON files to manage permissions for bash scripts.

setup.sh: Launches tmux sessions for PRESIDENT, boss1, worker1, worker2, worker3 so agents can talk to each other.

agent-send.sh: Script for sending messages between agents.

7. Deployment

Run the setup script:

bash
./setup.sh
This will create tmux sessions for PRESIDENT and the agents (boss1, worker1, worker2, worker3) in the background.

To access the PRESIDENT session:

bash
tmux attach-session -t president

To access the multiagent session:

bash
tmux attach-session -t multiagent

In the PRESIDENT session, run the claude command to set up the Claude CLI.

Do the same for the other agents.

Now, in the PRESIDENT window, try entering a request like:

you are president. create a todo list website now
PRESIDENT will start the to-do list. PRESIDENT will send instructions to boss1, boss1 will assign tasks to worker1, worker2, and worker3.

You can watch boss1 and the workers do their jobs, approve commands to create code files, and wait for them to finish.

Result:

8. Conclusion

Combining tmux and Claude lets you create a multi-agent AI system that simulates a real company: communicating, collaborating, and automating complex workflows. Having each agent in its own session makes it easy to manage, track progress, and debug.

This system is great for AI research, testing, or even real-world workflow automation, virtual team assistants, or teamwork simulations.

If you’re interested in developing multi-agent AI systems, try deploying this model, customize roles and workflows to your needs, and feel free to contribute or suggest improvements to the original repo!

A Step-by-Step Guide to Integrating and Using Claude Code Action on GitHub

Posted on June 11, 2025 by hello@scuti

Investigate how Claude Code Action is great. Just create an issue and put a mention to Claude like @claude, Claude can write the code automatically

Introduction

In the current era of rapidly evolving technology, artificial intelligence (AI)

stands out as one of the most significant and transformative breakthroughs on a global scale. Among the various AI-driven tools, Claude — particularly the Claude Action Code — represents a powerful integration that can be embedded into user’s GitHub repositories to address raised issues with remarkable accuracy and efficiency. This paper aims to explore the capabilities and applications of Claude Action Code in modern software development workflows.

Body content

Claude Code Action is a extension categorized as a “Action” and made available on the GitHub Marketplace by Anthropic. Users can search for and utilize it by following the provided setup instructions outlined in the README documentation. Below is a summary of the basic setup steps for integrating Claude Code Action into user’s GitHub repository:

1.Create a workflow folder:

On GitHub: In user’s GitHub repository, click “Add file”:

insert the configuration into the path:“.git/workflows/[file_name].yml”. For instance:

Next, insert the appropriate workflow configuration for this extension, depending on your intended use:

For example:

name: Claude PR Assistant

on:

issue_comment:

types: [created]

pull_request_review_comment:

types: [created]

issues:

types: [opened, assigned]

pull_request_review:

types: [submitted]

jobs:

claude-code-action:

if: |

(github.event_name == ‘issue_comment’ &&

contains(github.event.comment.body, ‘@claude’)) ||

(github.event_name == ‘pull_request_review_comment’ && contains(github.event.comment.body, ‘@claude’)) ||

(github.event_name == ‘pull_request_review’ &&

contains(github.event.review.body, ‘@claude’)) ||

(github.event_name == ‘issues’ && contains(github.event.issue.body, ‘@claude’))

runs-on: ubuntu-latest

permissions:

contents: write

pull-requests: read

issues: read

id-token: write

steps:

– name: Checkout repository

uses: actions/checkout@v4

with:

fetch-depth: 1

– name: Run Claude PR Action

uses: anthropics/claude-code-action@beta

with:

anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}

timeout_minutes: “60”

Then, click “Commit changes” to successfully add the configuration to your repository.

On the user’s local machine: If a folder in VScode has already been connected to the GitHub repository, the user can manually create a workflow directory and a .yml file to store the Claude configuration. Then, file can be pushed to the GitHub repository

2.API key:

Claude API is not free, and you need to visit https://console.anthropic.com/dashboard to add credit and obtain an API key for your personal Anthropic account.

After that, the API key should be added to the repository’s Secrets under the Setting tab, rather than being hard-coded directly into workflow file to prevent unauthorized access

Find Action in Secret and variables

Create a new repository secret

Add your API key to Secret’s description

Name secret as key’s name in the workflow file

✅Correct

❌Never do it

3. Using Claude Code Action:

User creates a new issue within repository where Claude is intended to be used:

The user describes the issue to be resolved – such as feature creation, bug fixing, code review, … – in the issue’s description. You can tag “@claude” directly in the description or in a comment after the issue is created, in order trigger Claude to process the request

Ex: Ask Claude to generate complete login and registration pages based on the initial files in the repo

Claude is invoked via API to address the issue described, with the response time depending on the complexity of the request. It uses the token associated with your API key to read the issue content as well as to create or modify code within the repository

Claude’s response will appear in the comments section of the issue.

Here, Claude generates additional files, for example register.html and dashboard.html, as part of the requested implementation and show what changes are made to each file — including which parts are added, modified, or deleted.

At this point, Claude has created a separate branch in the repository containing the proposed changes. The user can then review and consider merging these updates into the main branch via a pull request.

After successfully merging into the main branch

Following a successful merge, the issue may be closed. At this point, Claude has been effectively utilized to generate complete, functional demo pages for user login and registration.

4.Result:

Registration page

Login screen

Dashboard screen

In summary, Claude Code Action proves to be a highly effective tool for streamlining development tasks, making it easier for both individuals and teams to enhance productivity.

Buổi học nội bộ! Tìm hiểu “MCP” – giao thức không thể thiếu trong ứng dụng AI tạo sinh

Posted on May 30, 2025June 19, 2025 by hello@scuti

Internal seminar about MCP

Xin chào, tôi là Kakeya, Giám đốc điều hành của Scuti.

Công ty chúng tôi chuyên phát triển offshore tại Việt Nam với thế mạnh về AI tạo sinh. Chúng tôi cung cấp các dịch vụ như Tư vấn AI tạo sinh và AI-OCR, và gần đây rất vui mừng khi nhận được nhiều yêu cầu phát triển hệ thống tích hợp cùng AI tạo sinh.

Gần đây, Scuti đã tổ chức một buổi học nội bộ để nâng cao hiểu biết về “MCP (Model Connection Protocol)”.

MCP là một giao thức dùng để kết nối AI – đặc biệt là LLM (Mô hình ngôn ngữ lớn) – với các dịch vụ bên ngoài. Mặc dù nghe có vẻ kỹ thuật, nhưng thực tế lại rất hữu ích ngay cả đối với những người không phải kỹ sư. Ví dụ, khi sử dụng các công cụ như Claude, MCP giúp tích hợp hiệu quả với các dịch vụ khác, nâng cao năng suất đáng kể.

Buổi học này hướng đến cả kỹ sư và những người không chuyên kỹ thuật. Nội dung bao gồm khái niệm cơ bản về MCP, các ví dụ ứng dụng thực tế, cũng như cách triển khai vào công việc hàng ngày. Một điểm nổi bật là việc sử dụng MCP kết hợp với các công cụ phát triển như Cursor để kết nối với các dịch vụ bên ngoài, từ đó tăng tốc độ phát triển và nâng cao chất lượng sản phẩm. Đây là kỹ năng gần như bắt buộc đối với kỹ sư.

Tại Scuti, chúng tôi luôn nỗ lực tạo ra môi trường giúp mọi thành viên cập nhật các công nghệ mới nhất. Ngoài các buổi hội thảo nội bộ thường xuyên, chúng tôi còn có chính sách thưởng cho việc nghiên cứu và chia sẻ kết quả, cũng như hỗ trợ nhân viên lấy các chứng chỉ kỹ thuật.

Trong thời đại AI tạo sinh ngày càng gắn liền với sự phát triển kinh doanh, saldo5d việc toàn bộ nhân viên có cùng nền tảng kiến thức và khả năng áp dụng thực tiễn là điều vô cùng quan trọng. Thông qua các buổi học như thế này, Scuti tiếp tục củng cố năng lực công nghệ và khả năng phối hợp nội bộ một cách vững chắc.

Internal Study Session Held! Learning the Essential “MCP” for Generative AI Utilization

Posted on May 30, 2025 by hello@scuti

Hello, this is Kakeya, CEO of Scuti.

Our company specializes in offshore development in Vietnam with a strong focus on generative AI. We provide services such as Generative AI Consulting and Generative AI-OCR, and we are grateful to have received a growing number of system development requests integrated with generative AI.

Recently, we held an internal study session at Scuti to deepen our understanding of “MCP (Model Connection Protocol).”

MCP is a protocol that connects AI—particularly LLMs (Large Language Models)—with external services. While it may sound technical, it is actually quite useful even for non-engineers. For example, when using tools like Claude, MCP enables seamless integration with other services, greatly enhancing efficiency.

The study session was designed to benefit both engineers and non-engineers. It covered the fundamentals of MCP, practical use cases, and how this protocol can be applied to everyday operations. One of the key highlights was how MCP can be utilized with development tools such as Cursor to connect with various external services, thus boosting development speed and product quality. For engineers, this is becoming an essential skill.

At Scuti, we are committed to fostering an environment where every team member can stay up-to-date with the latest technologies. We hold regular internal seminars, offer incentive programs for research and output sharing, and actively support the acquisition of technical certifications.

As generative AI becomes increasingly integral to business growth, it is crucial for all team members to have a common understanding of the technologies involved and be able to apply them effectively. Through sessions like this, Scuti continues to strengthen its technological capabilities and collaborative potential across teams.

Ask Questions about Your PDFs with Cohere Embeddings + Gemini LLM

Posted on May 14, 2025May 23, 2025 by hello@scuti

🔍 Experimenting with Image Embedding Using Large AI Models

Recently, I experimented with embedding images using major AI models to build a multimodal semantic search system, where users can search images with text (and vice versa).

🧐 A Surprising Discovery

I was surprised to find that as of 2025, Cohere is the only provider that supports direct image embedding via API.
Other major models like OpenAI and Gemini (by Google) do support image input in general, but do not clearly provide a direct embedding API for images.

Reason for Choosing Cohere

I chose to try Cohere’s embed-v4.0 because:

It supports embedding text, images, and even PDF documents (converted to images) into the same vector space.
You can choose the embedding size (I used the default, 1536).
It returns normalized embeddings that are ready to use for search and classification tasks.

⚙️ How I Built the System

I used Python for implementation. The system has two main flows:

1️⃣ Document Preparation Flow

Load documents, images, or text data that I want to store.
Use the Cohere API to embed them into vector representations.
Save these vectors in a database or vector store for future search queries.

2️⃣ User Query Flow

When a user asks a question or types a query:
- Use Cohere to embed the query into a vector.
- Search for the most similar documents in the vector space.
- Return results to the user using a LLM (Large Language Model) like Gemini by Google.

🔑 How to Get API Keys

To use Cohere, go to: https://cohere.com, sign up, and get your API key.
(Cohere currently offers a free tier – see details here: docs.cohere.com/docs/rate-limits)
To use Gemini (Google), go to: https://aistudio.google.com, sign up, and get your API key.
(Gemini also has a free tier – see details here: ai.google.dev/gemini-api/docs/rate-limits)

🔧 Flow 1: Setting Up Cohere and Gemini in Python

✅ Step 1: Install and Set Up Cohere

Run the following command in your terminal to install the Cohere Python SDK:

Then, initialize the Cohere client in your Python script:

✅ Step 2: Install and Set Up Gemini (Google Generative AI)

Install the Gemini client library with:

Then, initialize the Gemini client in your Python script:

from google import genai

# Replace <<YOUR_GEMINI_KEY>> with your actual Gemini API key
gemini_api_key = “<<YOUR_GEMINI_KEY>>”
client = genai.Client(api_key=gemini_api_key)

📌 Flow 1: Document Preparation and Embedding

Chúng ta sẽ thực hiện các bước để chuyển PDF thành dữ liệu embedding bằng Cohere.

📥 Step 1: Download the PDF

We start by downloading the PDF from a given URL.

🖼️ Step 2: Convert PDF Pages to Text + Image

We extract both text and image for each page using PyMuPDF.

python

import fitz # PyMuPDF
import base64
from PIL import Image
import io

def extract_page_data(pdf_path):
doc = fitz.open(pdf_path)
pages_data = []
img_paths = []

for i, page in enumerate(doc):
text = page.get_text()

pix = page.get_pixmap()
image = Image.open(io.BytesIO(pix.tobytes(“png”)))

buffered = io.BytesIO()
image.save(buffered, format=”PNG”)
encoded_img = base64.b64encode(buffered.getvalue()).decode(“utf-8″)
data_url = f”data:image/png;base64,{encoded_img}”

content = [
{“type”: “text”, “text”: text},
{“type”: “image_url”, “image_url”: {“url”: data_url}},
]

pages_data.append({“content”: content})
img_paths.append({“data_url”: data_url})

return pages_data, img_paths

# Example usage
pages, img_paths = extract_page_data(local_pdf_path)

📤 Step 3: Embed Using Cohere

Now, send the fused text + image inputs to Cohere’s embed-v4.0 model.

✅ Flow 1 complete: You now have the embedded vector representations of your PDF pages.

👉 Proceed to Flow 2 (e.g., storing, indexing, or querying the embeddings).

🔍 Flow 2: Ask a Question and Retrieve the Answer Using Image + LLM

This flow allows the user to ask a natural language question, find the most relevant image using Cohere Embed v4, and then answer the question using Gemini 2.5 Vision LLM.

💬 Step 1: Ask the Question

We define the user query in plain English.

🧠 Step 2: Convert the Question to Embedding & Find Relevant Image

We use embed-v4.0 with input type search_query, then calculate cosine similarity between the question embedding and previously embedded document images.

python

def search(question, max_img_size=800):
# Get embedding for the query
api_response = co.embed(
model=”embed-v4.0″,
input_type=”search_query”,
embedding_types=[“float”],
texts=[question],
output_dimension=1024,
)

query_emb = np.asarray(api_response.embeddings.float[0])

# Compute cosine similarity with all document embeddings
cos_sim_scores = np.dot(embeddings, query_emb)
top_idx = np.argmax(cos_sim_scores) # Most relevant image

hit_img_path = img_paths[top_idx]
base64url = hit_img_path[“data_url”]

print(“Question:”, question)
print(“Most relevant image:”, hit_img_path)

# Display the matched image
if base64url.startswith(“data:image”):
base64_str = base64url.split(“,”)[1]
else:
base64_str = base64url

image_data = base64.b64decode(base64_str)
image = Image.open(io.BytesIO(image_data))

image.thumbnail((max_img_size, max_img_size))
display(image)

return base64url

🤖 Step 3: Use Vision-LLM (Gemini 2.5) to Answer

We use Gemini 2.5 Flash to answer the question based on the most relevant image.

python

def answer(question, base64_img_str):
if base64_img_str.startswith(“data:image”):
base64_img_str = base64_img_str.split(“,”)[1]

image_bytes = base64.b64decode(base64_img_str)
image = Image.open(io.BytesIO(image_bytes))

prompt = [
f”””Answer the question based on the following image.
Don’t use markdown.
Please provide enough context for your answer.

Question: {question}”””,
image
]

response = client.models.generate_content(
model=”gemini-2.5-flash-preview-04-17″,
contents=prompt
)

answer = response.text
print(“LLM Answer:”, answer)

▶️ Step 4: Run the Full Flow

🧪 Example Usage:

question = “What was the total number of wildfires in the United States from 2007 to 2015?”

# Step 1: Find the best-matching image
top_image_path = search(question)

# Step 2: Use the image to answer the question
answer(question, top_image_path)

🧾 Output:

Question: What was the total number of wildfires in the United States from 2007 to 2015?

Most relevant image:

LLM Answer: Based on the provided image, to find the total number of wildfires in the United States from 2007 to 2015, we need to sum the number of wildfires for each year in this period. Figure 1 shows the annual number of fires in thousands from 1993 to 2022, which covers the requested period. Figure 2 provides the specific number of fires for 2007 and 2015 among other years. Using the specific values from Figure 2 for 2007 and 2015, and estimating the number of fires for the years from 2008 to 2014 from Figure 1, we can calculate the total.

The number of wildfires in 2007 was 67.8 thousand (from Figure 2).

Estimating from Figure 1:

2008 was approximately 75 thousand fires.

2009 was approximately 75 thousand fires.

2010 was approximately 67 thousand fires.

2011 was approximately 74 thousand fires.

2012 was approximately 68 thousand fires.

2013 was approximately 47 thousand fires.

2014 was approximately 64 thousand fires.

The number of wildfires in 2015 was 68.2 thousand (from Figure 2).

Summing these values:

Total = 67.8 + 75 + 75 + 67 + 74 + 68 + 47 + 64 + 68.2 = 606 thousand fires.

Therefore, the total number of wildfires in the United States from 2007 to 2015 was approximately 606,000. This number is based on the sum of the annual number of fires obtained from Figure 2 for 2007 and 2015, and estimates from Figure 1 for the years 2008 through 2014.

Try this full pipeline on Google Colab: https://colab.research.google.com/drive/1kdIO-Xi0MnB1c8JrtF26Do3T54dij8Sf

🧩 Final Thoughts

This simple yet powerful two-step pipeline demonstrates how you can combine Cohere’s Embed v4 with Gemini’s Vision-Language capabilities to build a system that understands both text and images. By embedding documents (including large images) and using semantic similarity to retrieve relevant content, we can create a more intuitive, multimodal question-answering experience.

This approach is especially useful in scenarios where information is stored in visual formats like financial reports, dashboards, or charts — allowing LLMs to not just “see” the image but reason over it in context.

Multimodal retrieval-augmented generation (RAG) is no longer just theoretical — it’s practical, fast, and deployable today.

7 Bí Quyết Giúp Nâng Cao Độ Chính Xác Của OCR Biểu Mẫu Phi Tiêu Chuẩn

Posted on May 7, 2025 by hello@scuti

Xin chào, tôi là Kakeya, đại diện của công ty Scuti.

Công ty chúng tôi chuyên cung cấp các dịch vụ như Phát triển phần mềm offshore và phát triển theo hình thức Labo tại Việt Nam, cũng như Cung cấp giải pháp AI tạo sinh. Gần đây, chúng tôi rất vinh dự khi nhận được nhiều yêu cầu phát triển hệ thống kết hợp với AI tạo sinh.

Nhiều người có thể đang cân nhắc triển khai OCR biểu mẫu phi tiêu chuẩn nhưng vẫn do dự do lo ngại về độ chính xác. OCR (Nhận dạng ký tự quang học) là một công nghệ cực kỳ hữu ích giúp chuyển đổi tài liệu giấy thành dữ liệu số. Tuy nhiên, đối với các biểu mẫu phi tiêu chuẩn có bố cục linh hoạt và định dạng không cố định, vấn đề độ chính xác thường trở nên nghiêm trọng hơn.

Ngay cả khi đã áp dụng OCR, nếu độ chính xác nhận dạng thấp, khối lượng công việc chỉnh sửa thủ công sẽ tăng lên. Kết quả là, lợi ích kỳ vọng về nâng cao hiệu suất làm việc và cắt giảm chi phí có thể không đạt được như mong đợi. Do đó, trong bài viết này, chúng tôi sẽ giới thiệu 7 phương pháp cụ thể để cải thiện đáng kể độ chính xác của OCR biểu mẫu phi tiêu chuẩn.

Bằng cách áp dụng những bí quyết này, bạn có thể nâng cao đáng kể độ chính xác của OCR, tối ưu hóa quy trình làm việc và giảm chi phí. Hãy đọc đến cuối bài viết và áp dụng những kiến thức này vào doanh nghiệp của bạn.

Bài viết này sẽ cung cấp cái nhìn toàn diện, bắt đầu từ những kiến thức cơ bản về OCR biểu mẫu phi tiêu chuẩn, tiếp đến là các kỹ thuật cải thiện độ chính xác và cuối cùng là khám phá tương lai của OCR với sự hỗ trợ của các công nghệ tiên tiến nhất.

OCR Biểu Mẫu Phi Tiêu Chuẩn Là ì?

Nếu bạn muốn tìm hiểu thêm về AI OCR, hãy xem trước bài viết này.
Bài viết liên quan: Các Ứng Dụng Đột Phá Của AI OCR Tạo Sinh Và 5 Phương Pháp Chính

Kiến thức cơ bản về OCR: Cơ chế và các loại hình

OCR (Nhận dạng ký tự quang học) là công nghệ giúp trích xuất thông tin văn bản từ dữ liệu hình ảnh. Cụ thể, nó cho phép máy tính đọc chữ từ các tài liệu giấy đã được quét hoặc chụp ảnh. OCR được ứng dụng rộng rãi trong việc số hóa tài liệu, giúp nâng cao hiệu suất làm việc và cắt giảm chi phí đáng kể. Có bốn loại OCR chính:

OCR truyền thống (Nhận dạng mẫu truyền thống): Phương pháp này nhận dạng chữ bằng cách so khớp dữ liệu hình ảnh với các mẫu ký tự đã được định nghĩa trước. Nó mang lại độ chính xác cao đối với các phông chữ và bố cục tiêu chuẩn nhưng thiếu tính linh hoạt.
Nhận dạng dấu quang học (OMR): Phương pháp này nhận diện các mẫu cụ thể, chẳng hạn như ô kiểm tra và phiếu trắc nghiệm. Nó được sử dụng phổ biến trong khảo sát và chấm điểm bài thi.
Nhận dạng ký tự thông minh (ICR): Phương pháp này nhận diện chữ viết tay, vốn không tuân theo một mẫu cố định. Nó phù hợp để nhận diện biểu mẫu viết tay và chữ ký.
Nhận dạng mã vạch: Phương pháp này đọc các ký hiệu như mã vạch và mã QR. Nó được ứng dụng nhiều trong quản lý sản phẩm và kho hàng.

Thách thức và giải pháp của OCR biểu mẫu phi tiêu chuẩn

Biểu mẫu phi tiêu chuẩn là các tài liệu không tuân theo một định dạng cố định. Ví dụ, hóa đơn và đơn đặt hàng có bố cục khác nhau tùy theo từng doanh nghiệp. OCR biểu mẫu phi tiêu chuẩn là công nghệ giúp trích xuất thông tin văn bản từ các bố cục linh hoạt như vậy, nhưng OCR truyền thống thường gặp vấn đề về độ chính xác. Các công nghệ OCR truyền thống thường không đủ khả năng xử lý sự đa dạng của các định dạng và bố cục này.

Để giải quyết vấn đề này, công nghệ OCR tiên tiến ứng dụng AI và máy học đã ra đời. AI OCR học hỏi đặc điểm của văn bản từ một lượng lớn dữ liệu, giúp nhận diện ký tự với độ chính xác cao ngay cả đối với các biểu mẫu phi tiêu chuẩn. Các thuật toán máy học có khả năng tự động nhận diện và học tập các mẫu, từ đó tăng tính linh hoạt trong xử lý nhiều loại bố cục và định dạng khác nhau.

7 Bí quyết giúp nâng cao độ chính xác của OCR biểu mẫu phi tiêu chuẩn

1. Sử dụng hình ảnh chất lượng cao: Tầm quan trọng của máy quét và độ phân giải

Độ chính xác của OCR bị ảnh hưởng đáng kể bởi chất lượng hình ảnh đầu vào. Việc sử dụng hình ảnh chất lượng cao giúp nâng cao độ chính xác của OCR. Cụ thể, cần sử dụng máy quét hiệu suất cao và quét với độ phân giải phù hợp.

Lựa chọn máy quét: Hãy chọn máy quét dựa trên các yếu tố như độ phân giải, tốc độ quét và định dạng tệp hỗ trợ. Máy quét hiệu suất cao giúp tạo ra hình ảnh rõ nét hơn, góp phần nâng cao độ chính xác của OCR.
Cài đặt độ phân giải: Thông thường, độ phân giải tối thiểu 300 dpi được khuyến nghị. Độ phân giải càng cao, ký tự càng rõ nét, giúp giảm nguy cơ nhận dạng sai. Tuy nhiên, nếu độ phân giải quá cao, kích thước tệp sẽ tăng và có thể làm chậm tốc độ xử lý, do đó cần cân nhắc sự cân bằng hợp lý.

2. Tiền xử lý hình ảnh: Loại bỏ nhiễu và điều chỉnh độ tương phản

Hình ảnh quét có thể chứa nhiễu và bụi bẩn. Những yếu tố này có thể làm giảm độ chính xác của OCR, vì vậy việc loại bỏ chúng thông qua tiền xử lý là rất quan trọng.

Loại bỏ nhiễu: Sử dụng phần mềm chỉnh sửa hình ảnh hoặc công cụ tiền xử lý chuyên dụng để loại bỏ nhiễu và bụi bẩn trong hình ảnh. Điều này giúp OCR nhận diện ký tự chính xác hơn.
Điều chỉnh độ tương phản: Bằng cách điều chỉnh độ tương phản giữa ký tự và nền, các ký tự sẽ nổi bật rõ ràng hơn, giúp cải thiện độ chính xác khi nhận diện. Đặc biệt, nếu nền có vết bẩn hoặc bóng đổ, việc tăng độ tương phản sẽ làm tăng khả năng nhìn thấy ký tự.

3. Lựa chọn phần mềm OCR phù hợp: Tận dụng AI OCR

Có nhiều loại phần mềm OCR khác nhau. Đối với các biểu mẫu không chuẩn, việc chọn phần mềm phù hợp, như AI OCR, phù hợp với loại hình và mục đích của biểu mẫu là rất quan trọng, thay vì sử dụng phần mềm OCR đơn giản.

AI OCR: AI OCR có thể nhận diện ký tự với độ chính xác cao ngay cả đối với các biểu mẫu không chuẩn, vì nó học các đặc điểm của ký tự từ lượng dữ liệu lớn. Bằng cách tận dụng các thuật toán học máy, nó có thể linh hoạt xử lý các bố cục phức tạp và các định dạng đa dạng.
OCR dựa trên đám mây: Các dịch vụ OCR dựa trên đám mây cũng là một lựa chọn. Vì có thể truy cập qua internet, chúng giúp giảm chi phí triển khai và cung cấp khả năng mở rộng tốt. Tuy nhiên, việc kiểm tra độ tin cậy của dịch vụ từ góc độ bảo mật và quyền riêng tư dữ liệu là rất quan trọng trước khi sử dụng.

4. Sử dụng chức năng nhận diện bố cục: Cấu trúc hóa văn bản
Phần mềm OCR hiện đại có tích hợp chức năng nhận diện bố cục. Việc sử dụng chức năng này giúp nhận diện cấu trúc của văn bản, từ đó cải thiện độ chính xác khi trích xuất dữ liệu.

Trích xuất dữ liệu bảng biểu:
Bằng cách sử dụng chức năng nhận diện bố cục, dữ liệu dạng bảng có thể được trích xuất chính xác. Điều này cho phép sử dụng dữ liệu bảng dưới dạng dữ liệu số mà vẫn giữ nguyên mối quan hệ và cấu trúc vị trí của các dữ liệu trong bảng.
Liên kết tên mục và giá trị:
Việc tự động liên kết tên mục và giá trị của chúng giúp giảm đáng kể công sức nhập liệu. Điều này giúp duy trì tính toàn vẹn của dữ liệu và đảm bảo quá trình xử lý dữ liệu sau này được suôn sẻ.

5. Đăng ký từ điển: Hỗ trợ các thuật ngữ chuyên ngành
Một số phần mềm OCR có tính năng đăng ký từ điển. Bằng cách đăng ký các thuật ngữ chuyên ngành hoặc thuật ngữ đặc thù của ngành vào từ điển, có thể cải thiện độ chính xác trong việc nhận diện.

Ngăn ngừa nhận diện sai:
Việc đăng ký từ điển giúp phần mềm OCR nhận diện chính xác các thuật ngữ chuyên ngành. Điều này giúp giảm thiểu sự nhận diện sai và cải thiện độ chính xác của dữ liệu.
Cải thiện tỷ lệ nhận diện:
Việc đăng ký thuật ngữ chuyên ngành vào từ điển giúp cải thiện tỷ lệ nhận diện tổng thể của phần mềm OCR. Điều này đặc biệt hiệu quả đối với các mẫu biểu có nhiều thuật ngữ chuyên ngành.

6. Tạo mẫu: Tối ưu hóa theo loại biểu mẫu
Khi xử lý nhiều lần các biểu mẫu cùng loại bằng OCR, việc tạo mẫu sẽ giúp cải thiện độ chính xác và giảm thời gian xử lý.

Cố định vị trí mục:
Bằng cách sử dụng mẫu, các vị trí của từng mục có thể được cố định. Điều này giúp phần mềm OCR nhận diện văn bản dễ dàng hơn và giảm thiểu việc nhận diện sai.
Giảm thời gian xử lý:
Việc sử dụng mẫu giúp giảm đáng kể thời gian xử lý OCR. Với một bố cục cố định, phần mềm có thể trích xuất dữ liệu một cách hiệu quả, từ đó nâng cao tốc độ xử lý tổng thể.

7. Kiểm tra bởi con người: Tầm quan trọng của việc xác nhận cuối cùng

Sau khi xử lý OCR, việc thực hiện kiểm tra bởi con người là rất quan trọng. Phần mềm OCR có độ chính xác cao, nhưng không phải lúc nào cũng hoàn hảo, và có thể xảy ra nhận diện sai.

Sửa chữa nhận diện sai:
Bằng cách thực hiện kiểm tra bởi con người, những sai sót trong nhận diện của phần mềm OCR có thể được sửa chữa. Điều này giúp cải thiện độ chính xác của dữ liệu.
Cải thiện độ chính xác của dữ liệu:
Thông qua việc xác nhận cuối cùng, độ chính xác của dữ liệu được nâng cao hơn nữa. Đối với các dữ liệu kinh doanh quan trọng, quá trình kiểm tra này là không thể thiếu để đảm bảo độ tin cậy.

Tương lai của OCR trên mẫu không chuẩn: Sự tiến hóa qua công nghệ AI

Với sự tiến hóa của công nghệ AI, độ chính xác của OCR dự kiến sẽ tiếp tục được cải thiện trong tương lai. Đặc biệt, công nghệ học sâu (deep learning) đã đóng góp lớn trong việc nâng cao độ chính xác của OCR.

Sự tiến bộ trong nhận diện chữ viết tay:
Học sâu đã cải thiện đáng kể độ chính xác trong nhận diện chữ viết tay. Nhờ đó, việc số hóa các biểu mẫu viết tay và chữ ký giờ đây có thể được thực hiện chính xác và hiệu quả hơn.
Tăng cường hỗ trợ đa ngôn ngữ:
Phần mềm OCR sử dụng học sâu đã được cải thiện khả năng hỗ trợ đa ngôn ngữ. Điều này giúp các doanh nghiệp mở rộng toàn cầu xử lý tài liệu đa ngôn ngữ dễ dàng hơn, dự kiến sẽ nâng cao hiệu quả công việc quốc tế.

Kết Luận: Tối đa hóa việc sử dụng OCR trên mẫu không chuẩn

OCR trên mẫu không chuẩn là một công cụ mạnh mẽ giúp đạt được hiệu quả công việc và giảm chi phí. Bằng cách thực hiện 7 mẹo để cải thiện độ chính xác, bạn có thể tối đa hóa hiệu quả của OCR. Với sự tiến hóa của công nghệ AI, OCR sẽ tiếp tục phát triển và được kỳ vọng sẽ mang lại độ chính xác và tính linh hoạt cao hơn trong tương lai. Hãy áp dụng những chiến lược này để thúc đẩy quá trình số hóa trong doanh nghiệp của bạn.

7 Secrets To Improving The Accuracy Of Non-Standard Form OCR

Posted on May 7, 2025 by hello@scuti

Hello, I am Kakeya, the representative of Scuti.

Our company specializes in services such as Offshore Development And Lab-type Development in Vietnam, as well as Generative AI Consulting. Recently, we have been fortunate to receive numerous requests for system development in collaboration with generative AI.

Many people may be considering implementing non-standard form OCR but hesitate due to concerns about accuracy. OCR (Optical Character Recognition) is a highly useful technology that converts paper documents into digital data. However, when dealing with non-standard forms that have flexible layouts and inconsistent formats, accuracy issues tend to become more pronounced.

Even if OCR is introduced, low recognition accuracy may lead to an increased need for manual corrections. As a result, the expected improvements in operational efficiency and cost reduction may not be fully realized. Therefore, in this article, we introduce seven specific methods to dramatically enhance the accuracy of non-standard form OCR.

By applying these techniques, you can significantly improve OCR accuracy, streamline operations, and reduce costs. We encourage you to read through to the end and apply these insights to your business.

This article provides a comprehensive explanation, starting with the fundamentals of non-standard form OCR, followed by specific techniques for improving accuracy, and finally exploring the future of OCR through the utilization of the latest technologies.

What Is Non-Standard Form OCR?

If you want to learn more about AI OCR, be sure to check out this article first.
Related article: Innovative Applications Of Generative AI OCR And Five Key Methods

Basic Knowledge of OCR: Mechanism and Types

OCR (Optical Character Recognition) is a technology that extracts text information from image data. Specifically, it enables computers to read text from scanned or photographed paper documents. OCR is widely used for digitizing various types of documents, significantly contributing to operational efficiency and cost reduction. There are four main types of OCR:

Traditional OCR (Traditional Pattern Recognition): This method recognizes text by matching image data with predefined character templates. It delivers high accuracy for standardized fonts and layouts but lacks flexibility.
Optical Mark Recognition (OMR): This method identifies specific patterns, such as checkboxes and mark sheets. It is widely used for surveys and test scoring.
Intelligent Character Recognition (ICR): This method recognizes handwritten characters, which do not follow a fixed pattern. It is suitable for recognizing handwritten forms and signatures.
Barcode Recognition: This method reads symbols such as barcodes and QR codes. It is commonly used for product and inventory management.

Challenges and Solutions of Non-Standard Form OCR

Non-standard forms refer to documents that do not follow a fixed format. Examples include invoices and purchase orders, which vary in layout depending on the company. Non-standard form OCR is a technology that extracts text information from such flexible layouts, but traditional OCR often struggles with accuracy. Conventional OCR technologies alone are often insufficient to handle the wide variety of formats and layouts.

To address this challenge, advanced OCR technologies utilizing AI and machine learning have emerged. AI OCR learns text characteristics from large datasets, enabling high-accuracy recognition even for non-standard forms. Machine learning algorithms automatically identify and learn patterns, allowing for greater adaptability to diverse layouts and formats.

7 Secrets to Improving the Accuracy of Non-Standard Form OCR

1. Use High-Quality Images: The Importance of Scanners and Resolution

The accuracy of OCR is greatly influenced by the quality of the input images. Using high-quality images improves OCR recognition accuracy. Specifically, it is important to use a high-performance scanner and scan at an appropriate resolution.

Choosing a Scanner: Select a scanner by considering factors such as resolution, scanning speed, and supported file formats. A high-performance scanner provides clearer images, contributing to improved OCR accuracy.
Setting the Resolution: A resolution of at least 300 dpi is generally recommended. Higher resolution results in clearer character recognition and reduces the risk of misinterpretation. However, excessively high resolution increases file size and may slow down processing speed, so it is necessary to find a balance.

2. Image Preprocessing: Noise Removal and Contrast Adjustment

Scanned images may contain noise and dirt. These noise factors can reduce the accuracy of OCR, so it is important to remove them through preprocessing.

Noise Removal: Using image editing software or dedicated preprocessing tools, noise and dirt within the image are removed. This makes it easier for OCR to recognize characters accurately.
Contrast Adjustment: By adjusting the contrast between the characters and the background, the characters stand out more clearly, improving recognition accuracy. In particular, if there are spots or shadows on the background, increasing the contrast can improve the visibility of the characters.Choosing the Right OCR Software: Leveraging AI OCR

3. There Are Various Types of OCR Software

For non-standard forms, it is important to select the appropriate software, such as AI OCR, tailored to the type and purpose of the form, rather than using a simple OCR software.

AI OCR: AI OCR can achieve high-accuracy character recognition even for non-standard forms, as it learns the characteristics of characters from large volumes of data. By leveraging machine learning algorithms, it can flexibly handle complex layouts and various formats.
Cloud-based OCR: Cloud-based OCR services are also an option. Since they can be accessed via the internet, they help reduce implementation costs and offer good scalability. However, it is important to check the reliability of the service from the perspective of security and data privacy before using it.

4. Utilizing Layout Recognition Features: Structuring Text

Modern OCR software includes layout recognition features. By utilizing this feature, the structure of the text can be recognized, enabling more accurate data extraction.

Extracting Tabular Data:
By using the layout recognition feature, tabular data can be extracted accurately. This allows the data in the table to be utilized as digital data while maintaining its positional relationships and structure.
Linking Item Names and Values:
By automatically linking item names and their values, the effort required for data entry is significantly reduced. This maintains data integrity and ensures smooth processing of subsequent data.

5. Dictionary Registration: Supporting Technical Terms

Some OCR software includes a dictionary registration feature. By registering technical terms or industry-specific terminology in the dictionary, the recognition accuracy can be improved.

Preventing Misrecognition:
With dictionary registration, OCR software can accurately recognize technical terms. This reduces misrecognition and improves data accuracy.
Improving Recognition Rate:
Registering technical terms in the dictionary improves the overall recognition rate of the OCR software. This is particularly effective for forms that use many industry-specific terms.

6. Creating Templates: Optimization for Form Types

When repeatedly processing the same type of form with OCR, creating templates leads to improved accuracy and reduced processing time.

Fixing Item Positions:
By using templates, the positions of each item can be fixed. This makes it easier for OCR software to recognize the text and reduces misrecognition.
Reducing Processing Time:
Using templates significantly reduces OCR processing time. With a fixed layout, the software can efficiently extract data, improving overall processing speed.

7. Human Review: The Importance of Final Confirmation

After OCR processing, it is crucial to perform a human check. While OCR software is highly accurate, it is not perfect, and there is always a possibility of misrecognition.

Correcting Misrecognition:
By performing a human check, any misrecognition made by the OCR software can be corrected. This improves the accuracy of the data.
Improving Data Accuracy:
Through final confirmation, the accuracy of the data is further enhanced. For important business data, this review process is essential to ensure reliability.

The Future of Unconventional Form OCR: Evolution Through AI Technology

With the evolution of AI technology, the accuracy of OCR is expected to continue improving in the future. In particular, deep learning technology has made a significant contribution to improving the accuracy of OCR.

Advancements in Handwritten Character Recognition:
Deep learning has dramatically improved the accuracy of handwritten character recognition. As a result, the digitization of handwritten forms and signatures can now be done more accurately and efficiently.
Enhanced Multilingual Support:
OCR software using deep learning has strengthened multilingual support. This makes it easier for globally expanding businesses to process multilingual documents, which is expected to improve the efficiency of international operations.

Conclusion: Maximizing the Use of Unconventional Form OCR

Unconventional form OCR is a powerful tool for achieving business efficiency and cost reduction. By implementing the 7 tips for improving accuracy, you can maximize the effectiveness of OCR. With the evolution of AI technology, OCR will continue to evolve, and it is expected to offer even higher accuracy and flexibility in the future. By adopting these strategies, accelerate the digitalization of your business.

AI OCR: Tăng Hiệu Quả Công Việc Một Cách Đáng Kể Trong Việc Trích Xuất Dữ Liệu Từ Các Tài Liệu Không Chuẩn! Hướng Dẫn Chi Tiết Các Phương Pháp Cụ Thể

Posted on May 6, 2025May 6, 2025 by hello@scuti

Xin chào, tôi là Kakeya, đại diện của công ty Scuti.

Dành cho những ai gặp khó khăn trong việc trích xuất dữ liệu từ các tài liệu không chuẩn, sự tiến bộ của công nghệ AI OCR đã giúp việc trích xuất dữ liệu một cách chính xác và hiệu quả từ các bố cục phức tạp và chữ viết tay trở nên khả thi. Việc tự động hóa các công việc nhập liệu và kiểm tra dữ liệu, vốn trước đây được thực hiện thủ công, giúp giảm đáng kể thời gian và chi phí, đồng thời ngăn ngừa sai sót do con người gây ra.

Bài viết này sẽ giải thích chi tiết cách AI OCR đơn giản hóa việc trích xuất dữ liệu từ các tài liệu không chuẩn và đóng góp vào việc nâng cao hiệu quả công việc. Nó sẽ trình bày các bước cụ thể, các ví dụ ứng dụng và những điểm cần lưu ý khi triển khai công nghệ này. Việc áp dụng AI OCR có thể giúp công việc của bạn tiến triển một cách mạnh mẽ

Kiến Thức Cơ Bản Về AI OCR Và Ứng Dụng Của Nó Đối Với Các Tài Liệu Không Chuẩn

Nếu bạn muốn tìm hiểu thêm về AI OCR, hãy xem trước bài viết này.
Bài viết liên quan: AI OCR là gì? Giải thích chi tiết về công nghệ mới nhất và các trường hợp ứng dụng trong ngành.

AI OCR Là Gì? Hiểu Về Công Nghệ Và Cơ Chế Của Nó

AI OCR (Nhận dạng ký tự quang học) là một công nghệ tự động nhận dạng thông tin văn bản từ các tài liệu kỹ thuật số như hình ảnh quét và PDF, sau đó chuyển đổi chúng thành dữ liệu văn bản. OCR truyền thống chỉ giới hạn đối với các tài liệu có phông chữ và bố cục chuẩn, nhưng nhờ sự tiến bộ của công nghệ AI, việc nhận dạng ký tự chính xác cao giờ đây có thể thực hiện được ngay cả với các tài liệu không chuẩn, bao gồm chữ viết tay hoặc bố cục phức tạp.

Bằng cách kết hợp công nghệ xử lý hình ảnh, xử lý ngôn ngữ tự nhiên và học máy, AI OCR hiểu nội dung của tài liệu và trích xuất thông tin cần thiết. Đặc biệt, AI OCR sử dụng học sâu (deep learning) đã cải thiện đáng kể khả năng xử lý các tài liệu không chuẩn nhờ việc học từ một lượng lớn dữ liệu.

Lợi Ích Của AI OCR Trong Việc Xử Lý Tài Liệu Không Chuẩn

AI OCR mang lại nhiều lợi ích trong việc xử lý các tài liệu không chuẩn.

Tăng hiệu quả công việc: Tự động hóa việc nhập liệu dữ liệu vốn trước đây được thực hiện thủ công giúp tiết kiệm thời gian và giảm chi phí đáng kể.
Cải thiện độ chính xác: Ngăn ngừa sai sót do con người giúp cải thiện độ chính xác của việc nhập liệu dữ liệu.
Thúc đẩy việc sử dụng dữ liệu: Dữ liệu đã được trích xuất có thể được phân tích để góp phần vào việc cải tiến công việc và ra quyết định.

Những Ví Dụ Cụ Thể Về Việc Ứng Dụng AI OCR

Cải Thiện Hiệu Quả Công Việc Thông Qua Tự Động Hóa Việc Xử Lý Hóa Đơn

AI OCR rất hiệu quả trong việc tự động hóa xử lý hóa đơn. Các công ty nhận được rất nhiều hóa đơn hàng ngày, nhưng việc xử lý chúng thủ công tốn rất nhiều thời gian và công sức. Bằng cách triển khai AI OCR, có thể tự động trích xuất các thông tin cần thiết từ hóa đơn (chẳng hạn như số hóa đơn, ngày hóa đơn, tên nhà cung cấp, số tiền hóa đơn, và số tiền thuế giá trị gia tăng) và tích hợp vào hệ thống kế toán.

Ví dụ, phần mềm AI OCR như Docsumo có khả năng trích xuất dữ liệu chính xác cao, giúp việc xử lý hóa đơn diễn ra một cách suôn sẻ. Điều này giúp ngăn ngừa các lỗi nhập liệu thủ công và cải thiện hiệu quả công việc

Trích Xuất Dữ Liệu Tự Động Để Tối Ưu Hóa Quản Lý Hợp Đồng

Quản lý hợp đồng cũng là một lĩnh vực có thể áp dụng AI OCR. Các hợp đồng chứa những thông tin quan trọng như ngày hết hạn hợp đồng, ngày gia hạn, các bên tham gia và số tiền hợp đồng, nhưng việc quản lý thủ công là rất khó khăn. Bằng cách sử dụng AI OCR, có thể tự động trích xuất thông tin cần thiết từ hợp đồng và lưu trữ vào cơ sở dữ liệu.

Điều này cho phép xây dựng một hệ thống tự động thông báo thời gian gia hạn hợp đồng. Kết quả là, hiệu quả và độ chính xác trong quản lý hợp đồng sẽ được cải thiện đáng kể.

Trích Xuất Tự Động Dữ Liệu Hồ Sơ Y Tế Và Báo Cáo Chuẩn Đoán Trong Lĩnh Vực Y Tế

Việc sử dụng AI OCR cũng đang phát triển trong lĩnh vực y tế. Các tài liệu y tế như hồ sơ bệnh án và báo cáo chẩn đoán thường chứa nhiều chữ viết tay và thuật ngữ chuyên ngành, khiến việc số hóa chúng trở nên khó khăn. Bằng cách áp dụng AI OCR, có thể tự động trích xuất các thông tin cần thiết như tên bệnh nhân, ngày sinh, chẩn đoán và đơn thuốc từ các tài liệu này và tích hợp chúng vào hệ thống hồ sơ y tế điện tử.

Điều này giúp giảm bớt gánh nặng công việc cho các nhân viên y tế và việc chia sẻ thông tin y tế trở nên thuận tiện hơn. Việc triển khai AI OCR đóng góp lớn vào việc nâng cao hiệu quả và độ chính xác trong các cơ sở y tế.

Các Bước Cụ Thể Để Triển Khai AI OCR

Các Bước Làm Rõ Mục Tiêu Và Yêu Cầu

Trước khi triển khai AI OCR, việc làm rõ mục tiêu muốn đạt được là rất quan trọng. Ví dụ, đặt ra các mục tiêu cụ thể như “Giảm 50% thời gian xử lý hóa đơn” hoặc “Loại bỏ tình trạng bỏ sót gia hạn hợp đồng.”

Ngoài ra, yêu cầu đối với AI OCR cũng cần được làm rõ. Điều này bao gồm việc xác định loại tài liệu cần xử lý, các trường dữ liệu cần thiết, mục tiêu độ chính xác, và yêu cầu tích hợp hệ thống, nhằm xây dựng nền tảng cho việc vận hành suôn sẻ sau khi triển khai.

Cách Chọn Phần Mềm AI OCR Phù Hợp

Phần mềm AI OCR có nhiều loại khác nhau, mỗi sản phẩm có các tính năng và đặc điểm khác nhau. Việc chọn sản phẩm phù hợp với mục tiêu và yêu cầu của bạn là rất quan trọng. Ví dụ, Docsumo hỗ trợ nhiều loại tài liệu không chuẩn như hóa đơn, hợp đồng và biên lai, cung cấp khả năng trích xuất dữ liệu chính xác cao và giao diện dễ sử dụng.

Ngoài ra, nó còn có khả năng tích hợp mạnh mẽ với các hệ thống hiện có, giúp việc vận hành sau khi triển khai diễn ra suôn sẻ. Việc so sánh các tính năng của từng sản phẩm và chọn phần mềm phù hợp nhất với nhu cầu của công ty bạn là chìa khóa thành công.

Chuẩn Bị Dữ Liệu Và Quy Trình Huấn Luyện Mô Hình AI OCR

Để cải thiện độ chính xác của AI OCR, việc chuẩn bị dữ liệu phù hợp và huấn luyện mô hình là rất cần thiết. Đầu tiên, thu thập dữ liệu mẫu của các tài liệu cần xử lý và huấn luyện mô hình AI OCR. Càng có nhiều dữ liệu huấn luyện, độ chính xác nhận diện của mô hình sẽ càng cao.

Đặc biệt, việc chuẩn bị dữ liệu đa dạng, bao gồm cả chữ viết tay và tài liệu có bố cục phức tạp là rất quan trọng. Điều này giúp mô hình AI OCR có thể xử lý các mẫu tài liệu đa dạng và trích xuất dữ liệu với độ chính xác cao trong quá trình vận hành thực tế.

Cách Đạt Được Sự Tích Hợp Suôn Sẻ với Các Hệ Thống Hiện Có

Để tận dụng hiệu quả dữ liệu được trích xuất bằng AI OCR, việc tích hợp với các hệ thống kế toán và hệ thống nghiệp vụ hiện có là điều không thể thiếu. Ví dụ, dữ liệu trích xuất từ hóa đơn có thể được tự động nhập vào hệ thống kế toán, hoặc thông tin từ hợp đồng có thể được đăng ký vào hệ thống quản lý hợp đồng.

Khi chọn phần mềm AI OCR, việc kiểm tra khả năng tích hợp với các hệ thống hiện có là rất quan trọng. Điều này mở rộng phạm vi sử dụng dữ liệu và giúp nâng cao hiệu quả công việc tổng thể.

Những Lưu Ý Và Giải Pháp Cho Các Vấn Đề Khi Triển Khai AI OCR

Các Thách Thức Trong Việc Cải Thiện Độ Chính Xác Đối Với Chữ Viết Tay Và Bố Cục Phức Tạp

AI OCR có thể gặp khó khăn trong việc nhận dạng chữ viết tay và các tài liệu có bố cục phức tạp. Đặc biệt, khi ký tự không rõ ràng hoặc bố cục bị sai lệch, độ chính xác nhận dạng có thể bị giảm. Để nâng cao độ chính xác, việc sử dụng máy quét chất lượng cao và thực hiện xử lý hình ảnh trước là rất hiệu quả.

Ngoài ra, việc huấn luyện mô hình AI OCR với dữ liệu đa dạng có thể cải thiện độ chính xác nhận dạng. Việc cải tiến mô hình liên tục và tăng cường dữ liệu là chìa khóa để nâng cao độ chính xác.

Cách Cân Bằng Giữa Chi Phí Triển Khai Và Chi Phí Vận Hành

Việc triển khai phần mềm AI OCR phát sinh chi phí ban đầu và chi phí vận hành. Cần xem xét các khoản chi như phí bản quyền, chi phí máy chủ và chi phí bảo trì, đồng thời chú trọng đến hiệu quả chi phí.

Để giảm thiểu chi phí, có thể sử dụng dịch vụ AI OCR dựa trên nền tảng đám mây hoặc tận dụng phần mềm AI OCR mã nguồn mở. Việc lựa chọn giải pháp phù hợp với ngân sách và nhu cầu của doanh nghiệp là rất quan trọng, hướng tới việc giảm chi phí trong dài hạn.

Tầm Quan Trọng Của Việc Bảo Vệ Thông Tin Mật Và Thực Hiện Các Biện Pháp Bảo Mật

Các tài liệu được xử lý bằng AI OCR có thể chứa thông tin cá nhân hoặc thông tin mật. Do đó, việc thực hiện các biện pháp bảo mật là vô cùng quan trọng. Khi lựa chọn phần mềm AI OCR, cần ưu tiên các sản phẩm có tính năng bảo mật mạnh mẽ.

Cần thiết lập hợp lý nơi lưu trữ dữ liệu và quyền truy cập để ngăn chặn rò rỉ thông tin. Bằng cách thực hiện những biện pháp này, doanh nghiệp có thể yên tâm ứng dụng AI OCR và thúc đẩy hiệu quả công việc.

Tổng Kết: Trích Xuất Dữ Liệu Từ Tài Liệu Phi Cấu Trúc Một Cách Hiệu Quả Bằng AI OCR

AI OCR là một công cụ mạnh mẽ giúp tối ưu hóa việc trích xuất dữ liệu từ các tài liệu phi cấu trúc. Công nghệ này mang lại nhiều lợi ích như nâng cao hiệu quả công việc, tăng độ chính xác và tận dụng tốt dữ liệu. Khi triển khai, cần làm rõ mục tiêu và yêu cầu, đồng thời lựa chọn phần mềm AI OCR phù hợp.

Ngoài ra, cần chú ý đầy đủ đến các yếu tố như độ chính xác, chi phí và bảo mật. Việc ứng dụng hiệu quả AI OCR sẽ giúp giải quyết các thách thức trong xử lý tài liệu phi cấu trúc và nâng cao hiệu suất công việc.

1. Introduction

2. Main Content

3. Conclusion

1. Introduction to the Launch of Gemini CLI

2. Comparison Criteria Between the Two CLI Agents

3. Gemini CLI

4. Claude Code CLI

5. Conclusion

6. Which Should You Use? Summary of Best Use Cases

1. Introduction

2. What is tmux?

3. What is Claude?

4. Why combine tmux and Claude?

5. System Architecture

6. Installation

7. Deployment

8. Conclusion

1.Create a workflow folder:

2.API key:

3. Using Claude Code Action:

4.Result:

🔍 Experimenting with Image Embedding Using Large AI Models

🧐 A Surprising Discovery

Reason for Choosing Cohere

⚙️ How I Built the System

1️⃣ Document Preparation Flow

2️⃣ User Query Flow

🔑 How to Get API Keys

🔧 Flow 1: Setting Up Cohere and Gemini in Python

✅ Step 1: Install and Set Up Cohere

✅ Step 2: Install and Set Up Gemini (Google Generative AI)

📌 Flow 1: Document Preparation and Embedding

📥 Step 1: Download the PDF

🖼️ Step 2: Convert PDF Pages to Text + Image

📤 Step 3: Embed Using Cohere

🔍 Flow 2: Ask a Question and Retrieve the Answer Using Image + LLM

💬 Step 1: Ask the Question

🧠 Step 2: Convert the Question to Embedding & Find Relevant Image

🤖 Step 3: Use Vision-LLM (Gemini 2.5) to Answer

▶️ Step 4: Run the Full Flow

🧪 Example Usage:

🧾 Output:

🧩 Final Thoughts

OCR Biểu Mẫu Phi Tiêu Chuẩn Là ì?

Kiến thức cơ bản về OCR: Cơ chế và các loại hình

Thách thức và giải pháp của OCR biểu mẫu phi tiêu chuẩn

7 Bí quyết giúp nâng cao độ chính xác của OCR biểu mẫu phi tiêu chuẩn

1. Sử dụng hình ảnh chất lượng cao: Tầm quan trọng của máy quét và độ phân giải

2. Tiền xử lý hình ảnh: Loại bỏ nhiễu và điều chỉnh độ tương phản

3. Lựa chọn phần mềm OCR phù hợp: Tận dụng AI OCR

7. Kiểm tra bởi con người: Tầm quan trọng của việc xác nhận cuối cùng

Tương lai của OCR trên mẫu không chuẩn: Sự tiến hóa qua công nghệ AI

Kết Luận: Tối đa hóa việc sử dụng OCR trên mẫu không chuẩn

What Is Non-Standard Form OCR?

Basic Knowledge of OCR: Mechanism and Types

Challenges and Solutions of Non-Standard Form OCR

7 Secrets to Improving the Accuracy of Non-Standard Form OCR

1. Use High-Quality Images: The Importance of Scanners and Resolution

2. Image Preprocessing: Noise Removal and Contrast Adjustment

3. There Are Various Types of OCR Software

4. Utilizing Layout Recognition Features: Structuring Text

5. Dictionary Registration: Supporting Technical Terms

6. Creating Templates: Optimization for Form Types

7. Human Review: The Importance of Final Confirmation

The Future of Unconventional Form OCR: Evolution Through AI Technology

Conclusion: Maximizing the Use of Unconventional Form OCR

Kiến Thức Cơ Bản Về AI OCR Và Ứng Dụng Của Nó Đối Với Các Tài Liệu Không Chuẩn

AI OCR Là Gì? Hiểu Về Công Nghệ Và Cơ Chế Của Nó​

Lợi Ích Của AI OCR Trong Việc Xử Lý Tài Liệu Không Chuẩn

Những Ví Dụ Cụ Thể Về Việc Ứng Dụng AI OCR

Cải Thiện Hiệu Quả Công Việc Thông Qua Tự Động Hóa Việc Xử Lý Hóa Đơn

Trích Xuất Dữ Liệu Tự Động Để Tối Ưu Hóa Quản Lý Hợp Đồng

Trích Xuất Tự Động Dữ Liệu Hồ Sơ Y Tế Và Báo Cáo Chuẩn Đoán Trong Lĩnh Vực Y Tế

Các Bước Cụ Thể Để Triển Khai AI OCR

Các Bước Làm Rõ Mục Tiêu Và Yêu Cầu

Cách Chọn Phần Mềm AI OCR Phù Hợp

Chuẩn Bị Dữ Liệu Và Quy Trình Huấn Luyện Mô Hình AI OCR

Cách Đạt Được Sự Tích Hợp Suôn Sẻ với Các Hệ Thống Hiện Có

Những Lưu Ý Và Giải Pháp Cho Các Vấn Đề Khi Triển Khai AI OCR

Các Thách Thức Trong Việc Cải Thiện Độ Chính Xác Đối Với Chữ Viết Tay Và Bố Cục Phức Tạp

AI OCR Là Gì? Hiểu Về Công Nghệ Và Cơ Chế Của Nó