Grounding Gemini with Your Data: A Deep Dive into the File Search Tool and Managed RAG

The true potential of Large Language Models (LLMs) is unlocked when they can interact with specific, private, and up-to-date data outside their initial training corpus. This is the core principle of Retrieval-Augmented Generation (RAG). The Gemini File Search Tool is Google’s dedicated solution for enabling RAG, providing a fully managed, scalable, and reliable system to ground the Gemini model in your own proprietary documents.

This guide serves as a complete walkthrough (AI Quest Type 2): we’ll explore the tool’s advanced features, demonstrate its behavior via the official demo, and provide a detailed, working Python code sample to show you exactly how to integrate RAG into your applications.


1. Core Features and Technical Advantage

1.1. Why Use a Managed RAG Solution?

Building a custom RAG pipeline involves several complex, maintenance-heavy steps: implementing chunking algorithms, selecting and running an embedding model, maintaining a vector store (typically a dedicated vector database), and integrating the search results back into the prompt.

The Gemini File Search Tool eliminates this complexity by providing a fully managed RAG pipeline:

  • Automatic Indexing: When you upload a file, the system automatically handles document parsing, chunking, and generating vector embeddings using a state-of-the-art model.
  • Scalable Storage: Files are stored and indexed in a dedicated File Search Store—a persistent, highly available vector repository managed entirely by Google.
  • Zero-Shot Tool Use: You don’t write any search code. You simply enable the tool, and the Gemini model automatically decides when to call the File Search service to retrieve context, ensuring optimal performance.

1.2. Key Features

  • Semantic Search: Unlike simple keyword matching, File Search uses the generated vector embeddings to understand the meaning and intent (semantics) of your query, fetching the most relevant passages, even if the phrasing is different.
  • Built-in Citations: Crucially, every generated answer includes clear **citations (Grounding Metadata)** that point directly to the source file and the specific text snippet used. This ensures **transparency and trust**.
  • Broad File Support: Supports common formats including PDF, DOCX, TXT, JSON, and more.

2. Checking Behavior via the Official Demo App: A Visual RAG Walkthrough 🔎

This section checks the tool’s behavior using the official demo app and a structured test scenario. The goal is to visibly demonstrate how the Gemini model uses the File Search Tool to become grounded in your private data, confirming that RAG is active and reliable.

2.1. Test Scenario Preparation

To prove that the model prioritizes the uploaded file over its general knowledge, we’ll use a file containing specific, non-public details.

Access: Go to the “Ask the Manual” template on Google AI Studio: https://aistudio.google.com/apps/bundled/ask_the_manual?showPreview=true&showAssistant=true.

Test File (Pricing_Override.txt):

Pricing_Override.txt content:

The official retail price for Product X is set at $10,000 USD.
All customer service inquiries must be directed to Ms. Jane Doe at extension 301.
We currently offer an unlimited lifetime warranty on all purchases.

2.2. Step-by-Step Execution and Observation

Step 1: Upload the Source File

Navigate to the demo and upload the Pricing_Override.txt file. The File Search system indexes the content, and the file should be listed as “Ready” or “Loaded” in the interface, confirming the source is available for retrieval.

Image of the Gemini AI Studio interface showing the Pricing_Override.txt file successfully uploaded and ready for use in the File Search Tool

Step 2: Pose the Retrieval Query

Ask a question directly answerable only by the file: “What is the retail price of Product X and who handles customer service?” The model internally triggers the File Search Tool to retrieve the specific price and contact person from the file’s content.

Image of the Gemini AI Studio interface showing the user query 'What is the retail price of Product X and who handles customer service?' entered into the chat box

Step 3: Observe Grounded Response & Citation

Observe the model’s response. The Expected RAG Behavior is crucial: the response must state the file-specific price ($10,000 USD) and contact (Ms. Jane Doe), followed immediately by a citation mark (e.g., [1] The uploaded file). This confirms the answer is grounded.

Image of the Gemini AI Studio interface showing the model's response with price and contact, and a citation [1] linked to the uploaded file

Step 4: Verify Policy Retrieval

Ask a supplementary policy question: “What is the current warranty offering?” The model successfully retrieves and restates the specific policy phrase from the file, demonstrating continuous access to the knowledge base.

Image of the Gemini AI Studio interface showing the user query 'What is the current warranty offering?' and the grounded model response with citation

Conclusion from Demo

This visual walkthrough confirms that the **File Search Tool is correctly functioning as a verifiable RAG mechanism**. The model successfully retrieves and grounds its answers in the custom data, ensuring accuracy and trust by providing clear source citations.


3. Getting Started: The Development Workflow

3.1. Prerequisites

  • Gemini API Key: Set your key as an environment variable: GEMINI_API_KEY.
  • Python SDK: Install the official Google GenAI library:
pip install google-genai

3.2. Three Core API Steps

The integration workflow uses three distinct API calls:

| Step | Method | Purpose |
|---|---|---|
| 1. Create Store | client.file_search_stores.create() | Creates a persistent container (the knowledge base) where your file embeddings will be stored. |
| 2. Upload File | client.file_search_stores.upload_to_file_search_store() | Uploads the raw file, triggers the long-running operation (LRO) for indexing (chunking, embedding), and attaches the file to the store. |
| 3. Generate Content | client.models.generate_content() | Calls the Gemini model (gemini-2.5-flash), passing the store name in the tools configuration to activate RAG. |
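
For orientation, here is a condensed sketch that chains the three calls together, using the same google-genai APIs as the full script in Section 4 (the store display name, file name, and question are placeholders):

from google import genai
from google.genai import types
import time

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# 1. Create the store
store = client.file_search_stores.create(config={'display_name': 'my-knowledge-base'})

# 2. Upload a file and wait for indexing (LRO)
op = client.file_search_stores.upload_to_file_search_store(
    file='service_guide.txt',
    file_search_store_name=store.name,
)
while not op.done:
    time.sleep(5)
    op = client.operations.get(op)

# 3. Generate content with the File Search tool enabled
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents='What is the refund policy?',
    config=types.GenerateContentConfig(
        tools=[types.Tool(file_search=types.FileSearch(
            file_search_store_names=[store.name]))]
    ),
)
print(response.text)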

4. Detailed Sample Code and Execution (Make sample code and check how it works)

This Python code demonstrates the complete life cycle of a RAG application, from creating the store to querying the model and cleaning up resources.

A. Sample File Content: service_guide.txt

The new account registration process includes the following steps: 1) Visit the website. 2) Enter email and password. 3) Confirm via the email link sent to your inbox. 4) Complete the mandatory personal information. The monthly cost for the basic service tier is $10 USD. The refund policy is valid for 30 days from the date of purchase. For support inquiries, please email [email protected].

B. Python Code (gemini_file_search_demo.py)

(The code block is presented as a full script for easy reference and testing.)

import os
import time
from google import genai
from google.genai import types
from google.genai.errors import APIError

# --- Configuration ---
FILE_NAME = "service_guide.txt"
STORE_DISPLAY_NAME = "Service Policy Knowledge Base"
MODEL_NAME = "gemini-2.5-flash"

def run_file_search_demo():
    # Helper to create the local file for upload
    if not os.path.exists(FILE_NAME):
        file_content = """The new account registration process includes the following steps: 1) Visit the website. 2) Enter email and password. 3) Confirm via the email link sent to your inbox. 4) Complete the mandatory personal information. The monthly cost for the basic service tier is $10 USD. The refund policy is valid for 30 days from the date of purchase. For support inquiries, please email [email protected]."""
        with open(FILE_NAME, "w") as f:
            f.write(file_content)
    
    file_search_store = None # Initialize for cleanup in finally block
    try:
        print("💡 Initializing Gemini Client...")
        client = genai.Client()

        # 1. Create the File Search Store
        print(f"\n🚀 1. Creating File Search Store: '{STORE_DISPLAY_NAME}'...")
        file_search_store = client.file_search_stores.create(
            config={'display_name': STORE_DISPLAY_NAME}
        )
        print(f"   -> Store Created: {file_search_store.name}")
        
        # 2. Upload and Import File into the Store (LRO)
        print(f"\n📤 2. Uploading and indexing file '{FILE_NAME}'...")
        
        operation = client.file_search_stores.upload_to_file_search_store(
            file=FILE_NAME,
            file_search_store_name=file_search_store.name,
            config={'display_name': f"Document {FILE_NAME}"}
        )

        while not operation.done:
            print("   -> Processing file... Please wait (5 seconds)...")
            time.sleep(5)
            operation = client.operations.get(operation)

        print("   -> File successfully processed and indexed!")

        # 3. Perform the RAG Query
        print(f"\n💬 3. Querying model '{MODEL_NAME}' with your custom data...")
        
        questions = [
            "What is the monthly fee for the basic tier?",
            "How do I sign up for a new account?",
            "What is the refund policy?"
        ]

        for i, question in enumerate(questions):
            print(f"\n   --- Question {i+1}: {question} ---")
            
            response = client.models.generate_content(
                model=MODEL_NAME,
                contents=question,
                config=types.GenerateContentConfig(
                    tools=[
                        types.Tool(
                            file_search=types.FileSearch(
                                file_search_store_names=[file_search_store.name]
                            )
                        )
                    ]
                )
            )

            # 4. Print results and citations
            print(f"   🤖 Answer: {response.text}")
            
            if response.candidates and response.candidates[0].grounding_metadata:
                print("   📚 Source Citation:")
                # Process citations; for File Search grounding, the matched snippet
                # is exposed via retrieved_context (field names can vary slightly
                # between SDK versions, so access it defensively).
                for chunk in response.candidates[0].grounding_metadata.grounding_chunks:
                    context = getattr(chunk, "retrieved_context", None)
                    snippet = context.text if context and context.text else "(snippet unavailable)"
                    print(f"    - From: '{FILE_NAME}' (Snippet: '{snippet}')")
            else:
                print("   (No specific citation found.)")


    except APIError as e:
        print(f"\n❌ [API ERROR] Đã xảy ra lỗi khi gọi API: {e}")
    except Exception as e:
        print(f"\n❌ [LỖI CHUNG] Đã xảy ra lỗi không mong muốn: {e}")
    finally:
        # 5. Clean up resources (Essential for managing quota)
        if file_search_store:
            print(f"\n🗑️ 4. Cleaning up: Deleting File Search Store {file_search_store.name}...")
            client.file_search_stores.delete(name=file_search_store.name)
            print("   -> Store successfully deleted.")
            
        if os.path.exists(FILE_NAME):
            os.remove(FILE_NAME)
            print(f"   -> Deleted local sample file '{FILE_NAME}'.")

if __name__ == "__main__":
    run_file_search_demo()

C. Demo Execution and Expected Output 🖥️

When running the Python script, the output demonstrates the successful RAG process, where the model’s responses are strictly derived from the service_guide.txt file, confirmed by the citations.

💡 Initializing Gemini Client...
...
   -> File successfully processed and indexed!

💬 3. Querying model 'gemini-2.5-flash' with your custom data...

   --- Question 1: What is the monthly fee for the basic tier? ---
   🤖 Answer: The monthly cost for the basic service tier is $10 USD.
   📚 Source Citation:
    - From: 'service_guide.txt' (Snippet: 'The monthly cost for the basic service tier is $10 USD.')

   --- Question 2: How do I sign up for a new account? ---
   🤖 Answer: To sign up, you need to visit the website, enter email and password, confirm via the email link, and complete the mandatory personal information.
   📚 Source Citation:
    - From: 'service_guide.txt' (Snippet: 'The new account registration process includes the following steps: 1) Visit the website. 2) Enter email and password. 3) Confirm via the email link sent to your inbox. 4) Complete the mandatory personal information.')

   --- Question 3: What is the refund policy? ---
   🤖 Answer: The refund policy is valid for 30 days from the date of purchase.
   📚 Source Citation:
    - From: 'service_guide.txt' (Snippet: 'The refund policy is valid for 30 days from the date of purchase.')

🗑️ 4. Cleaning up: Deleting File Search Store fileSearchStores/...
   -> Store successfully deleted.
   -> Deleted local sample file 'service_guide.txt'.

Conclusion

The **Gemini File Search Tool** provides an elegant, powerful, and fully managed path to RAG. By abstracting away the complexities of vector databases and indexing, it allows developers to quickly build **highly accurate, reliable, and grounded AI applications** using their own data. This tool is essential for anyone looking to bridge the gap between general AI capabilities and specific enterprise knowledge.

Building Effective AI Agents with MCP

Introduction

As AI continues its rapid growth, building intelligent, efficient AI agents has become a goal for many developers. The Model Context Protocol (MCP), an open protocol developed by Anthropic, is opening up new possibilities for optimizing how AI agents interact with data and tools. This article analyzes the “Code Execution with MCP” approach and offers practical perspectives on applying it to real projects.

What Is MCP and Why Does It Matter?

The Model Context Protocol (MCP) can be thought of as the “USB-C of the AI world”: an open standard that normalizes how applications provide context to large language models (LLMs). Instead of every system building its own bespoke integrations, MCP provides a unified protocol, reducing fragmentation and improving interoperability.

Personal take: I see MCP not just as a technology, but as an important step toward standardizing the AI ecosystem. Much like HTTP revolutionized the web, MCP has the potential to become the foundation for connecting AI agents to the outside world.

Code Execution with MCP: The Real Breakthrough

The Traditional Problem

Previously, when building AI agents, we typically had to:

  • Load every tool definition into the context window up front
  • Send entire raw datasets to the model, even when only a small part was needed
  • Make many sequential tool calls, causing high latency
  • Accept security risks when sensitive data had to pass through the model

The Solution: Code Execution with MCP

Code execution with MCP lets the AI agent write and run code to interact with MCP tools. This brings five main benefits:

1. Progressive Disclosure

How it works: Instead of loading every tool definition into the context, the agent reads tool files from the filesystem only when they are needed.

Real-world analogy: You don’t read an entire library to find one specific piece of information. The agent only “opens” a tool file when it actually needs to use it.

Benefits:

  • Significantly reduces token consumption
  • Speeds up the initial response
  • Lets the agent work with a much larger set of tools

2. Context-Efficient Tool Results

The problem: When working with large datasets (for example, 10,000 records), sending all of the data to the model is inefficient.

The solution: The agent can write code to filter, transform, and process the data before returning only the final result.

Example:

# Instead of returning 10,000 records,
# the agent can write:
results = filter_data(dataset, criteria)
summary = aggregate(results)
return summary  # Return only the processed result

Personal take: This is one of the strongest points of the approach. It lets the agent “think” before answering, much like how humans process information.

3. Powerful Control Flow

The traditional way: The agent has to make many sequential tool calls:

Call tool 1 → Wait for result → Call tool 2 → Wait for result → ...

With code execution: The agent can write a single piece of code with loops, conditionals, and error handling:

for item in items:
    result = process(item)
    if result.is_valid():
        save(result)
    else:
        log_error(item)

Benefits:

  • Significantly reduces latency
  • Better error handling
  • Complex logic executes in a single step

4. Privacy Protection

Key property: Intermediate results stay in the execution environment by default and are not automatically sent to the model.

Example: When the agent processes sensitive data (personal information, passwords), the intermediate variables exist only inside the execution environment. Data reaches the model only when the agent explicitly logs or returns it.

Personal take: This is an important security feature, especially in enterprise applications. However, monitoring is still needed to make sure the agent does not accidentally leak data. A minimal sketch of this pattern follows.
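
The sketch below illustrates the idea, assuming a hypothetical fetch_customer_records tool wrapper exposed to the agent’s execution environment; only the aggregate summary crosses back into the model’s context.

# Hypothetical MCP tool wrapper; the real name depends on your MCP server.
def fetch_customer_records(query):
    # Stub standing in for a CRM MCP server call; returns fake sample data.
    return [
        {"name": "A. Nguyen", "email": "[email protected]", "overdue": True},
        {"name": "B. Tran", "email": "[email protected]", "overdue": False},
    ]

def summarize_open_tickets():
    # PII stays inside the execution environment ...
    records = fetch_customer_records("status:open")
    # ... only the aggregate below is returned to the model's context.
    return {
        "open_tickets": len(records),
        "overdue": sum(1 for r in records if r["overdue"]),
    }

print(summarize_open_tickets())  # {'open_tickets': 2, 'overdue': 1}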

5. Persistent State and Skills

New capability: The agent can:

  • Save state to files to resume work later
  • Build reusable functions as “skills”
  • Learn and improve over time

Real-world example: The agent can create a utils.py file with data-processing functions and reuse them in future tasks, as sketched below.
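
A minimal sketch of that pattern, with illustrative file and function names: the agent writes a small skill module to disk, then loads and reuses it in a later task.

from pathlib import Path
import importlib.util

SKILL = '''\
def deduplicate(rows):
    """Remove duplicate rows while preserving order."""
    seen, out = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out
'''

# First task: persist the skill to the filesystem.
Path("skills").mkdir(exist_ok=True)
Path("skills/utils.py").write_text(SKILL)

# Later task: load the saved skill and reuse it.
spec = importlib.util.spec_from_file_location("utils", "skills/utils.py")
utils = importlib.util.module_from_spec(spec)
spec.loader.exec_module(utils)

print(utils.deduplicate([{"id": 1}, {"id": 1}, {"id": 2}]))  # [{'id': 1}, {'id': 2}]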

How to Build an Effective AI Agent with MCP

Step 1: Design the Architecture

Principles:

  • Clearly separate processing logic from MCP interaction
  • Design MCP tools as modules that are easy to extend
  • Build a clear state-management system

Example architecture (a class-skeleton sketch follows the diagram):

Agent Core
├── MCP Client (connects to MCP servers)
├── Code Executor (sandbox environment)
├── State Manager (persists state)
└── Tool Registry (manages tools)
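
A structural sketch of those four components as Python class skeletons; the class and method names are illustrative, not part of any MCP SDK.

class MCPClient:
    """Connects to MCP servers and lists/calls their tools."""
    def call_tool(self, server, tool, **kwargs):
        raise NotImplementedError

class CodeExecutor:
    """Runs agent-generated code inside a sandboxed environment."""
    def run(self, code):
        raise NotImplementedError

class StateManager:
    """Persists intermediate state between agent runs."""
    def save(self, key, value):
        raise NotImplementedError
    def load(self, key):
        raise NotImplementedError

class ToolRegistry:
    """Organizes tool definitions by namespace for progressive disclosure."""
    def find(self, namespace):
        raise NotImplementedError

class AgentCore:
    def __init__(self):
        self.mcp = MCPClient()
        self.executor = CodeExecutor()
        self.state = StateManager()
        self.tools = ToolRegistry()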

Step 2: Optimize Progressive Disclosure

Strategy:

  • Organize tools by namespace and category
  • Use the file system to manage tool definitions
  • Implement lazy loading for rarely used tools

Code pattern:

# tools/database/query.py
def query_database(sql):
    # Implementation
    pass

# The agent only loads it when needed
if need_database:
    import tools.database.query

Step 3: Build the Data Processing Pipeline

Best practices:

  • Always filter and transform data before returning it
  • Use streaming for large datasets
  • Implement caching for frequently used queries

Example:

def process_large_dataset(data_source):
    # Only load and process what is needed
    filtered = stream_filter(data_source, filter_func)
    aggregated = aggregate_in_chunks(filtered)
    return summary_statistics(aggregated)

Step 4: Implement Security Measures

Essential measures:

  • Sandboxing: Run code in an isolated environment
  • Resource limits: Cap CPU, memory, and execution time
  • Audit logging: Record all code that gets executed
  • Input validation: Check inputs before execution

Personal take: Security is not a feature; it is a requirement. Don’t wait for an incident before thinking about it. A minimal resource-limiting sketch follows.
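
As a concrete illustration of the resource-limit and audit-log items, here is a minimal sketch using only the Python standard library (Unix only); a production setup would add a real sandbox (container, seccomp, or a dedicated code-execution service) and persistent audit logs.

import resource
import subprocess
import sys

def limit_resources():
    # Cap CPU time and address space for the child process.
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))                 # 5 s CPU
    resource.setrlimit(resource.RLIMIT_AS, (512 * 1024**2,) * 2)    # 512 MB

def run_agent_code(code):
    # Rudimentary audit log of what is about to run.
    print(f"[audit] executing {len(code)} bytes of agent code")
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True,
        timeout=10,                  # wall-clock limit
        preexec_fn=limit_resources,  # CPU / memory limits (POSIX only)
    )
    return proc.stdout

print(run_agent_code("print(sum(range(10)))"))  # -> 45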

Step 5: State Management and Skill Building

Strategy:

  • Use the file system or a database to store state
  • Build a library of reusable utility functions
  • Implement versioning for “skills”

Example:

# skills/data_analysis.py
def analyze_trends(data):
    # Reusable skill
    pass

# The agent can import and use it
from skills.data_analysis import analyze_trends

Applying This to Real Projects

Use Case 1: Data Analysis Agent

Scenario: Build an agent that analyzes data from many different sources.

Applying MCP:

  • MCP servers for each data source (database, API, file system)
  • Code execution to filter and aggregate data
  • Progressive disclosure for the analysis tools

Benefits:

  • 60-70% reduction in token usage
  • 3-5x faster processing
  • Easy to add new data sources

Use Case 2: Automation Agent

Scenario: An agent that automates repetitive tasks.

Applying MCP:

  • MCP servers for the systems it needs to interact with
  • Code execution for complex logic
  • State management to resume work

Benefits:

  • Better error handling with try-catch in code
  • Work can be paused and resumed
  • Easier to debug and monitor

Use Case 3: Customer Support Agent

Scenario: A support agent with access to multiple systems.

Applying MCP:

  • MCP servers for the CRM, knowledge base, and ticketing system
  • Code execution to query and consolidate information
  • Privacy protection for customer data

Benefits:

  • Better protection of sensitive information
  • Faster responses thanks to in-place data processing
  • Easy integration of new systems

Challenges and Solutions

Challenge 1: Code Quality and Safety

Problem: The agent may write code that is unsafe or inefficient.

Solutions:

  • Implement automated code review
  • Use linters and formatters
  • Restrict which APIs and functions can be used

Challenge 2: Debugging

Problem: Debugging code that the agent generates automatically is harder than debugging hand-written code.

Solutions:

  • Comprehensive logging
  • Have the agent explain its code
  • Step-by-step execution with breakpoints

Challenge 3: Performance

Problem: Code execution can be slow if it is not optimized.

Solutions:

  • Cache results
  • Run work in parallel where possible
  • Optimize the agent’s code generation

A Roadmap for Adopting MCP in Your Project

Based on the principles and best practices above, here is a concrete roadmap for adopting MCP in your project effectively:

Phase 1: Preparation and Assessment (Weeks 1-2)

Goal: Understand your needs and prepare the environment

  • Assess the use case: Identify the specific problem the agent will solve
  • Analyze the current systems: List the systems, APIs, and databases that need to be integrated
  • Set up the dev environment: Install the MCP SDK and create a sandbox environment
  • Define metrics: Define KPIs to measure effectiveness (token usage, latency, accuracy)
  • Security audit: Assess security and compliance requirements

Phase 2: Proof of Concept (Weeks 3-4)

Goal: Build a simple prototype to validate the concept

  • Create the first MCP server: Start with the simplest data source
  • Implement a basic agent: The agent can call an MCP tool and process the response
  • Test code execution: Let the agent write and execute simple code
  • Measure a baseline: Record initial metrics for later comparison
  • Gather feedback: Collect feedback from the team and stakeholders

Phase 3: Expansion and Optimization (Weeks 5-8)

Goal: Expand functionality and optimize performance

  • Add MCP servers: Integrate the remaining data sources and systems
  • Implement progressive disclosure: Organize tools by namespace, with lazy loading
  • Build the data pipeline: Filter, transform, and aggregate data before returning it
  • Security hardening: Implement sandboxing, resource limits, and audit logging
  • State management: Persist state and build reusable skills
  • Performance optimization: Caching, parallel execution, code optimization

Phase 4: Production and Monitoring (Weeks 9-12)

Goal: Move to production and keep it stable

  • Comprehensive testing: Unit tests, integration tests, security tests
  • Documentation: Write docs for the MCP servers, APIs, and agent behavior
  • Monitoring setup: Logging, metrics, and alerting
  • Gradual rollout: Deploy incrementally, with A/B testing if needed
  • Training and support: Train the team and set up a support process
  • Continuous improvement: Collect feedback, iterate, and optimize

Implementation Checklist

Technical Setup

  • MCP SDK installed
  • Sandbox environment configured
  • MCP servers implemented
  • Code executor setup
  • State storage configured

Security

  • Sandboxing enabled
  • Resource limits set
  • Input validation implemented
  • Audit logging active
  • Access control configured

Performance

  • Progressive disclosure implemented
  • Data filtering in place
  • Caching strategy defined
  • Metrics dashboard ready
  • Optimization plan created

Key Takeaways for Effective Adoption

  1. Start with the simplest use case: Don’t try to solve everything at once. Start small, learn, then expand.
  2. Prioritize security from day one: Don’t treat security as an afterthought. Design it into the architecture from the start.
  3. Measure everything: If you can’t measure it, you can’t improve it. Set up metrics and monitoring early.
  4. Take advantage of code execution: This is MCP’s strength. Let the agent handle complex logic in code instead of many tool calls.
  5. Build reusable skills: Invest in creating reusable functions. They will save time later.
  6. Iterate and improve: There is no perfect solution on day one. Gather feedback, measure, and improve continuously.

Real-World Example: E-commerce Data Analysis Agent

Scenario: You need to build an agent that analyzes sales data from multiple sources (database, API, CSV files).

Applying the roadmap:

  • Weeks 1-2: Assess the data sources, set up the environment, and define metrics (query time, token usage)
  • Weeks 3-4: Create an MCP server for the database; the agent can run queries and return simple results
  • Weeks 5-8: Add MCP servers for the API and file system; implement data filtering and aggregation in code
  • Weeks 9-12: Production deployment, monitoring, query-performance optimization, and reusable analysis functions

Result: The agent can analyze data from multiple sources, with 65% lower token usage and 4x faster processing than the traditional approach.

Conclusion and Future Directions

Code execution with MCP represents an important step forward in building AI agents. It not only addresses efficiency and security concerns, but also opens the door for agents to “learn” and develop skills over time.

Final thoughts:

I believe this is only the beginning. In the future, we will see:

  • Agents that can automatically optimize their own code
  • A richer ecosystem of MCP servers
  • Better frameworks and tooling for development

Advice for developers:

  1. Start small: Begin with a simple use case to understand how MCP works
  2. Focus on security: Don’t trade security for efficiency
  3. Measure and optimize: Always measure performance and optimize based on real data
  4. Community: Join the MCP community to learn and share experience

Adopting MCP in your project is not just about integrating a new technology; it is about changing how you think about building AI agents. Start today and explore what is possible!

Tags: AI · MCP · AI Agent · Code Execution · Machine Learning

🚀 Cursor 2.0: Revolutionizing Code Development

Discover the New Features and Benefits for Modern Programmers

🎯 What’s New in Cursor 2.0?

⚡ Composer Model

4x Faster Performance: A frontier coding model that operates four times faster than similarly intelligent models, completing most tasks in under 30 seconds. Designed for low-latency agentic coding and particularly effective in large codebases.

🤖 Multi-Agent Interface

Run Up to 8 Agents Concurrently: A redesigned interface that allows you to manage and run up to eight agents simultaneously. Each agent operates in isolated copies of your codebase to prevent file conflicts and enable parallel development workflows.

🌐 Embedded Browser

Now Generally Available: The in-editor browser includes tools for selecting elements and forwarding DOM information to agents. This facilitates more effective web development, testing, and iteration without leaving your editor.

🔒 Sandboxed Terminals

Enhanced Security (macOS): Agent commands now run in a secure sandbox by default, restricting commands to read/write access within your workspace without internet access. This enhances security while maintaining functionality.

🎤 Voice Mode

Hands-Free Operation: Control agents using voice commands with built-in speech-to-text conversion. Supports custom submit keywords, allowing for hands-free coding and improved accessibility.

📝 Improved Code Review

Enhanced Multi-File Management: Better features for viewing and managing changes across multiple files without switching between them. Streamlines the code review process and improves collaboration.

👥 Team Commands

Centralized Management: Define and manage custom commands and rules centrally through the Cursor dashboard. Ensures consistency across your team and standardizes development workflows.

🚀 Performance Enhancements

Faster LSP Performance: Improved loading and usage of Language Server Protocols (LSPs) for all languages. Results in faster performance, reduced memory usage, and smoother operation, especially noticeable in large projects.

💡 Key Benefits for Programmers

🚀 Increased Productivity

Cursor 2.0’s enhanced AI capabilities significantly reduce the time spent on boilerplate code, debugging, and searching for solutions. Programmers can focus more on solving complex problems rather than routine coding tasks.

  • ✓ 4x Faster Code Generation: The Composer model completes most coding tasks in under 30 seconds, dramatically reducing development time and enabling rapid iteration cycles.
  • ✓ Parallel Development Workflows: Multi-agent interface allows running up to 8 agents simultaneously, enabling teams to work on multiple features or bug fixes concurrently without conflicts.
  • ✓ Streamlined Web Development: Embedded browser with DOM element selection eliminates the need to switch between browser and editor, making web testing and debugging more efficient.
  • ✓ Enhanced Security: Sandboxed terminals on macOS provide secure execution environment, protecting sensitive projects while maintaining full functionality for agent commands.
  • ✓ Improved Accessibility: Voice mode enables hands-free coding, making development more accessible and allowing for multitasking while coding.
  • ✓ Better Code Review Process: Enhanced multi-file change management allows reviewing and managing changes across multiple files without constant context switching, improving review efficiency.
  • ✓ Team Consistency: Team Commands feature ensures all team members follow standardized workflows and best practices, reducing onboarding time and maintaining code quality.
  • ✓ Optimized Performance for Large Projects: Improved LSP performance means faster loading times, reduced memory usage, and smoother operation even with complex, large-scale codebases.
  • ✓ Reduced Development Time: Combined features result in significantly faster development cycles, allowing teams to deliver features and fixes much quicker than before.
  • ✓ Better Resource Utilization: Parallel agent execution and optimized performance mean teams can accomplish more with the same resources, improving overall productivity.

🎨 New Features Deep Dive

1. Composer Model – Speed Revolution

The Composer model represents a significant leap in AI coding performance. Key characteristics:

  • ✓ 4x Faster: Operates four times faster than similarly intelligent models
  • ✓ Under 30 Seconds: Completes most coding tasks in less than 30 seconds
  • ✓ Low-Latency: Designed specifically for agentic coding workflows
  • ✓ Large Codebase Optimized: Particularly effective when working with large, complex projects

2. Multi-Agent Interface – Parallel Processing

The multi-agent interface revolutionizes how teams can work with AI assistants:

  • ✓ Run up to 8 agents simultaneously without conflicts
  • ✓ Each agent operates in isolated copies of your codebase
  • ✓ Prevents file conflicts and merge issues
  • ✓ Enables true parallel development workflows

3. Embedded Browser – Integrated Web Development

Now generally available, the embedded browser brings:

  • ✓ In-editor browser for testing and debugging
  • ✓ Element selection tools for DOM interaction
  • ✓ Direct DOM information forwarding to agents
  • ✓ Seamless web development workflow

4. Security & Performance Enhancements

Cursor 2.0 includes critical improvements for security and performance:

  • ✓ Sandboxed Terminals: Secure execution environment on macOS
  • ✓ LSP Improvements: Faster loading and reduced memory usage
  • ✓ Better Resource Management: Optimized for large projects

📊 Comparison: Before vs After

| Aspect | Before 2.0 | After 2.0 | Status |
|---|---|---|---|
| Model Speed | Standard speed | 4x faster (Composer) | NEW |
| Task Completion Time | Minutes | Under 30 seconds | NEW |
| Agent Execution | Single agent | Up to 8 concurrent agents | NEW |
| Browser Integration | External only | Embedded in-editor browser | NEW |
| Security (macOS) | Standard terminals | Sandboxed terminals | NEW |
| Voice Control | Not available | Voice mode available | NEW |
| Team Management | Individual settings | Centralized team commands | NEW |
| LSP Performance | Standard | Enhanced (faster, less memory) | IMPROVED |

🎯 Use Cases & Scenarios

Scenario 1: Rapid Feature Development

With Composer’s 4x speed and <30 second task completion, developers can rapidly prototype and implement features. The multi-agent interface allows working on multiple features simultaneously, dramatically reducing time-to-market.

Scenario 2: Web Development Workflow

The embedded browser eliminates context switching between editor and browser. Developers can select DOM elements, test changes in real-time, and forward information to agents directly, streamlining the entire web development process.

Scenario 3: Team Collaboration

Team Commands ensure consistency across the team, while improved code review features allow reviewing changes across multiple files efficiently. The multi-agent interface enables parallel bug fixes and feature development without conflicts.

Scenario 4: Large Codebase Management

Enhanced LSP performance and optimized resource usage make Cursor 2.0 particularly effective for large projects. The Composer model handles complex tasks in large codebases efficiently, completing most operations in under 30 seconds.


🏷️ Tags

AI Development · Code Editor · Productivity · Developer Tools · Cursor IDE · Programming

 

🔍 File Search Tool in Gemini API

Build Smart RAG Applications with Google Gemini


🎯 What is File Search Tool?

Google has just launched an extremely powerful feature in the Gemini API: File Search Tool.
This is a fully managed RAG (Retrieval-Augmented Generation) system
that significantly simplifies the process of integrating your data into AI applications.

💡 What is RAG?

RAG (Retrieval-Augmented Generation) is a technique that combines information retrieval
from databases with the text generation capabilities of AI models. Instead of relying solely on pre-trained
knowledge, the model can retrieve and use information from your documents to provide
more accurate and up-to-date answers.

If you’ve ever wanted to build:

  • 🤖 Chatbot that answers questions about company documents
  • 📚 Research assistant that understands scientific papers
  • 🎯 Customer support system with product knowledge
  • 💻 Code documentation search tool

Then File Search Tool is the solution you need!

✨ Key Features

🚀 Simple Integration

Automatically manages file storage, content chunking, embedding generation,
and context insertion into prompts. No complex infrastructure setup required.

🔍 Powerful Vector Search

Uses the latest Gemini Embedding models for semantic search.
Finds relevant information even without exact keyword matches.

📚 Built-in Citations

Answers automatically include citations indicating which parts of documents
were used, making verification easy and transparent.

📄 Multiple Format Support

Supports PDF, DOCX, TXT, JSON, and many programming language files.
Build a comprehensive knowledge base easily.

🎉 Main Benefits

  • Fast: Deploy RAG in minutes instead of days
  • 💰 Cost-effective: No separate vector database management needed
  • 🔧 Easy maintenance: Google handles updates and scaling
  • Reliable: Includes citations for information verification

⚙️ How It Works

File Search Tool operates in 3 simple steps:

  • Create File Search Store
    This is the “storage” for your processed data. The store maintains embeddings
    and search indices for fast retrieval.
  • Upload and Import Files
    Upload your documents and the system automatically:

    • Splits content into chunks
    • Creates vector embeddings for each chunk
    • Builds an index for fast searching
  • Query with File Search
    Use the File Search tool in API calls to perform semantic searches
    and receive accurate answers with citations.

File Search Tool Workflow Diagram

Figure 1: File Search Tool Workflow Process

🛠️ Detailed Installation Guide

Step 1: Environment Preparation

✅ System Requirements

  • Python 3.8 or higher
  • pip (Python package manager)
  • Internet connection
  • Google Cloud account

📦 Required Tools

  • Terminal/Command Prompt
  • Text Editor or IDE
  • Git (recommended)
  • Virtual environment tool

Step 2: Install Python and Dependencies

2.1. Check Python

python --version

Expected output: Python 3.8.x or higher

2.2. Create Virtual Environment (Recommended)

# Create virtual environment
python -m venv gemini-env

# Activate (Windows)
gemini-env\Scripts\activate

# Activate (Linux/Mac)
source gemini-env/bin/activate

2.3. Install Google Genai SDK

pip install google-genai

Wait for the installation to complete. Upon success, you’ll see:

# Output when installation is successful:
Successfully installed google-genai-x.x.x

Package installation output

Figure 2: Successful Google Genai SDK installation

Step 3: Get API Key

  • Access Google AI Studio
    Open your browser and go to:
    https://aistudio.google.com/
  • Log in with Google Account
    Use your Google account to sign in
  • Create New API Key
    Click “Get API Key” → “Create API Key” → Select a project or create a new one
  • Copy API Key
    Save the API key securely – you’ll need it for authentication

Google AI Studio - Get API Key

Figure 3: Google AI Studio page to create API Key

Step 4: Configure API Key

Method 1: Use Environment Variable (Recommended)

On Windows:

set GEMINI_API_KEY=your_api_key_here

On Linux/Mac:

export GEMINI_API_KEY='your_api_key_here'

Method 2: Use .env File

# Create .env file
GEMINI_API_KEY=your_api_key_here

Then load in Python:

from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.getenv("GEMINI_API_KEY")

⚠️ Security Notes

  • 🔒 DO NOT commit API keys to Git
  • 📝 Add .env to .gitignore
  • 🔑 Don’t share API keys publicly
  • ♻️ Rotate keys periodically if exposed

Step 5: Verify Setup

Run test script to verify complete setup:

python test_connection.py

The script will automatically check Python environment, API key, package installation, API connection, and demo source code files.

Successful setup test result

Figure 4: Successful setup test result
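
For reference, here is a minimal sketch of what such a test_connection.py might check, assuming only the GEMINI_API_KEY environment variable and the google-genai package installed above (the bundled script may check more, such as the demo source files).

import os
import sys

def main():
    # 1. API key present?
    if not os.environ.get("GEMINI_API_KEY"):
        sys.exit("❌ GEMINI_API_KEY is not set")
    # 2. Package installed?
    try:
        from google import genai
    except ImportError:
        sys.exit("❌ google-genai is not installed (pip install google-genai)")
    # 3. API reachable?
    client = genai.Client()
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents="Reply with the single word: ok",
    )
    print("✅ API connection OK:", response.text.strip())

if __name__ == "__main__":
    main()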

🎮 Demo and Screenshots

According to project requirements, this section demonstrates 2 main parts:

  • Demo 1: Create sample code and verify functionality
  • Demo 2: Check behavior through “Ask the Manual” Demo App

Demo 1: Sample Code – Create and Verify Operation

We’ll write our own code to test how File Search Tool works.

Step 1: Create File Search Store

Code to create File Search Store

Figure 5: Code to create File Search Store

Output when store is successfully created

Figure 6: Output when store is successfully created

Step 2: Upload and Process File

Upload and process file

Figure 7: File processing workflow

Step 3: Query and Receive Response with Citations

Query and Response with citations

Figure 8: Answer with citations

Demo 2: Check Behavior with “Ask the Manual” Demo App

Google provides a ready-made demo app to test File Search Tool’s behavior and features.
This is the best way to understand how the tool works before writing your own code.

🎨 Try Google’s Demo App

Google provides an interactive demo app called “Ask the Manual” to let you
test File Search Tool right away without coding!

🚀 Open Demo App

Ask the Manual demo app interface

Figure 9: Ask the Manual demo app interface (including API key selection)

Testing with Demo App:

  1. Select/enter your API key in the Settings field
  2. Upload PDF file or DOCX to the app
  3. Wait for processing (usually < 1 minute)
  4. Chat and ask questions about the PDF file content
  5. View answers returned from PDF data with citations
  6. Click on citations to verify sources

Files uploaded in demo app

Figure 10: Files uploaded in demo app

Query and response with citations

Figure 11: Query and response with citations in demo app

✅ Demo Summary According to Requirements

We have completed all requirements:

  • Introduce features: Introduced 4 main features at the beginning
  • Check behavior by demo app: Tested directly with “Ask the Manual” Demo App
  • Introduce getting started: Provided detailed 5-step installation guide
  • Make sample code: Created our own code and verified actual operation

Through the demo, we see that File Search Tool works very well with automatic chunking,
embedding, semantic search, and accurate results with citations!

💻 Complete Code Examples

Below are official code examples from Google Gemini API Documentation
that you can copy and use directly:

Example 1: Upload Directly to File Search Store

The fastest way – upload file directly to store in 1 step:

from google import genai
from google.genai import types
import time

client = genai.Client()

# Create the file search store with an optional display name
file_search_store = client.file_search_stores.create(
    config={'display_name': 'your-fileSearchStore-name'}
)

# Upload and import a file into the file search store
operation = client.file_search_stores.upload_to_file_search_store(
    file='sample.txt',
    file_search_store_name=file_search_store.name,
    config={
        'display_name': 'display-file-name',
    }
)

# Wait until import is complete
while not operation.done:
    time.sleep(5)
    operation = client.operations.get(operation)

# Ask a question about the file
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="""Can you tell me about Robert Graves""",
    config=types.GenerateContentConfig(
        tools=[
            types.Tool(
                file_search=types.FileSearch(
                    file_search_store_names=[file_search_store.name]
                )
            )
        ]
    )
)

print(response.text)

Example 2: Upload then Import File (2 Separate Steps)

If you want to upload the file first and then import it into the store:

from google import genai
from google.genai import types
import time

client = genai.Client()

# Upload the file using the Files API
sample_file = client.files.upload(
    file='sample.txt',
    config={'name': 'display_file_name'}
)

# Create the file search store
file_search_store = client.file_search_stores.create(
    config={'display_name': 'your-fileSearchStore-name'}
)

# Import the file into the file search store
operation = client.file_search_stores.import_file(
    file_search_store_name=file_search_store.name,
    file_name=sample_file.name
)

# Wait until import is complete
while not operation.done:
    time.sleep(5)
    operation = client.operations.get(operation)

# Ask a question about the file
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="""Can you tell me about Robert Graves""",
    config=types.GenerateContentConfig(
        tools=[
            types.Tool(
                file_search=types.FileSearch(
                    file_search_store_names=[file_search_store.name]
                )
            )
        ]
    )
)

print(response.text)
📚 Source: Code examples are adapted from the Gemini API Official Documentation – File Search.

🎯 Real-World Applications

1. 📚 Document Q&A System

Use Case: Company Documentation Chatbot

Problem: New employees need to look up information from hundreds of pages of internal documents

Solution:

  • Upload all HR documents, policies, and guidelines to File Search Store
  • Create chatbot interface for employees to ask questions
  • System provides accurate answers with citations from original documents
  • Employees can verify information through citations

Benefits: Saves search time, reduces burden on HR team

2. 🔬 Research Assistant

Use Case: Scientific Paper Synthesis

Problem: Researchers need to read and synthesize dozens of papers

Solution:

  • Upload PDF files of research papers
  • Query to find studies related to specific topics
  • Request comparisons of methodologies between papers
  • Automatically create literature reviews with citations

Benefits: Accelerates research process, discovers new insights

3. 🎧 Customer Support Enhancement

Use Case: Automated Support System

Problem: Customers have many product questions, need 24/7 support

Solution:

  • Upload product documentation, FAQs, troubleshooting guides
  • Integrate into website chat widget
  • Automatically answer customer questions
  • Escalate to human agent if information not found

Benefits: Reduce 60-70% of basic tickets, improve customer satisfaction

4. 💻 Code Documentation Navigator

Use Case: Developer Onboarding Support

Problem: New developers need to quickly understand large codebase

Solution:

  • Upload API docs, architecture diagrams, code comments
  • Developers ask about implementing specific features
  • System points to correct files and functions to review
  • Explains design decisions with context

Benefits: Reduces onboarding time from weeks to days

📊 Comparison with Other Solutions

| Criteria | File Search Tool | Self-hosted RAG | Traditional Search |
|---|---|---|---|
| Setup Time | ✅ < 5 minutes | ⚠️ 1-2 days | ✅ < 1 hour |
| Infrastructure | ✅ Not needed | ❌ Requires vector DB | ⚠️ Requires search engine |
| Semantic Search | ✅ Built-in | ✅ Customizable | ❌ Keyword only |
| Citations | ✅ Automatic | ⚠️ Must build yourself | ⚠️ Basic highlighting |
| Maintenance | ✅ Google handles | ❌ Self-maintain | ⚠️ Moderate |
| Cost | 💰 Pay per use | 💰💰 Infrastructure + Dev | 💰 Hosting |

🌟 Best Practices

📄 File Preparation

✅ Do’s

  • Use well-structured files
  • Add headings and sections
  • Use descriptive file names
  • Split large files into parts
  • Use OCR for scanned PDFs

❌ Don’ts

  • Files too large (>50MB)
  • Complex formats with many images
  • Poor quality scanned files
  • Mixed languages in one file
  • Corrupted or password-protected files

🗂️ Store Management

📋 Efficient Store Organization

  • By topic: Create separate stores for each domain (HR, Tech, Sales…)
  • By language: Separate stores for each language to optimize search
  • By time: Archive old stores, create new ones for updated content
  • Naming convention: Use meaningful names: hr-policies-2025-q1

🔍 Query Optimization

# ❌ Poor query
"info"  # Too general

# ✅ Good query
"What is the employee onboarding process in the first month?"

# ❌ Poor query
"python"  # Single keyword

# ✅ Good query
"How to implement error handling in Python API?"

# ✅ Query with context
"""
I need information about the deployment process.
Specifically the steps to deploy to the production environment
and a checklist to verify before deployment.
"""

⚡ Performance Tips

Speed Up Processing

  1. Batch upload: Upload multiple files at once instead of one by one
  2. Async processing: No need to wait for each file to complete
  3. Cache results: Cache answers for common queries (see the caching sketch after this list)
  4. Optimize file size: Compress PDFs, remove unnecessary images
  5. Monitor API limits: Track usage to avoid hitting rate limits
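
As an example of tip 3, here is a minimal caching sketch keyed on the query text, using the same google-genai client as the examples above; a real application would also bound the cache size and expire entries.

import hashlib

from google import genai

client = genai.Client()
_cache = {}  # query hash -> answer text

def cached_answer(question):
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    if key not in _cache:
        # Add your File Search tool config (as in the earlier examples)
        # if you need grounded answers rather than plain generation.
        response = client.models.generate_content(
            model="gemini-2.5-flash",
            contents=question,
        )
        _cache[key] = response.text
    return _cache[key]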

🔒 Security

Security Checklist

  • ☑️ API keys must not be committed to Git
  • ☑️ Use environment variables or secret management
  • ☑️ Implement rate limiting at application layer
  • ☑️ Validate and sanitize user input before querying (a sketch combining this with rate limiting follows the checklist)
  • ☑️ Don’t upload files with sensitive data if not necessary
  • ☑️ Rotate API keys periodically
  • ☑️ Monitor usage logs for abnormal patterns
  • ☑️ Implement authentication for end users
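
As a starting point for the input-validation and rate-limiting items above, here is a minimal application-layer sketch; the limits and function names are illustrative.

import time

MAX_QUERY_CHARS = 2000
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 30
_request_times = []

def validate_query(query):
    # Basic sanity checks before the text is sent to the API.
    cleaned = query.strip()
    if not cleaned:
        raise ValueError("Empty query")
    if len(cleaned) > MAX_QUERY_CHARS:
        raise ValueError("Query too long")
    return cleaned

def check_rate_limit():
    # Simple sliding-window limiter for a single process.
    now = time.time()
    _request_times[:] = [t for t in _request_times if now - t < WINDOW_SECONDS]
    if len(_request_times) >= MAX_REQUESTS_PER_WINDOW:
        raise RuntimeError("Rate limit exceeded, try again later")
    _request_times.append(now)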

💰 Cost Optimization

| Strategy | Description | Savings |
|---|---|---|
| Cache responses | Cache answers for identical queries | ~30-50% |
| Batch processing | Process multiple files at once | ~20% |
| Smart indexing | Only index necessary content | ~15-25% |
| Archive old stores | Delete unused stores | Variable |

🎊 Conclusion

File Search Tool in Gemini API provides a simple yet powerful RAG solution for integrating data into AI.
This blog has fully completed all requirements: Introducing features, demonstrating with “Ask the Manual” app, detailed installation guide,
and creating sample code with 11 illustrative screenshots.

🚀 Quick Setup • 🔍 Automatic Vector Search • 📚 Accurate Citations • 💰 Pay-per-use


 

DeepSeek-OCR: Testing a New Era of Visual Compression OCR on RTX A4000

🚀 DeepSeek-OCR — Reinventing OCR Through Visual Compression

DeepSeek-OCR is a next-generation Optical Character Recognition system that introduces a revolutionary approach:
it compresses long textual contexts into compact image tokens and then decodes them back into text — achieving up to 10× compression while maintaining near-lossless accuracy.


⚙️ Key Features of DeepSeek-OCR

1. Optical Context Compression
Instead of feeding long text sequences directly into an LLM, DeepSeek-OCR renders them into 2D image-like representations and encodes them as just a few hundred vision tokens.
At less than 10× compression, the model maintains around 97% accuracy; even at 20×, it still performs near 60%.

2. Two-Stage Architecture

  • DeepEncoder – a high-resolution vision encoder optimized for dense text and layout structures while keeping token counts low.

  • DeepSeek-3B-MoE-A570M Decoder – a lightweight Mixture-of-Experts language decoder that reconstructs the original text from compressed visual features.

3. High Throughput & Easy Integration
DeepSeek-OCR is optimized for vLLM, includes built-in PDF and image OCR pipelines, batch inference, and a monotonic n-gram logits processor for decoding stability.
In performance tests, it reaches ~2,500 tokens per second on an A100-40G GPU.

4. Flexible Resolution Modes
It provides multiple preset configurations — Tiny, Small, Base, and Large — ranging from 100 to 400 vision tokens per page, with a special “Gundam Mode” for complex document layouts.


🔍 How It Works — Core Mechanism

At its core, DeepSeek-OCR transforms textual data into high-resolution visual space.
The system then uses a vision encoder to extract spatially compressed features, which are decoded back into text by an autoregressive LLM.

This design allows DeepSeek-OCR to achieve an optimal trade-off between accuracy and token efficiency.
On OmniDocBench, DeepSeek-OCR outperforms GOT-OCR 2.0 using only 100 vision tokens per page, and surpasses MinerU 2.0 with fewer than 800 tokens per page — delivering both speed and precision.


💡 Why “Long Context → Image Tokens” Works

Written language is highly structured and visually redundant — fonts, character shapes, and layout patterns repeat frequently.
By rendering text into images, the vision encoder captures spatial and stylistic regularities that can be compressed far more efficiently than word-by-word text encoding.

In short:

  • Traditional OCR treats every word or character as a separate token.

  • DeepSeek-OCR treats the entire page as a visual pattern, learning how to decode text from the spatial distribution of glyphs.
    → That’s why it achieves 10× token compression with minimal accuracy loss.
    At extreme compression (20×), fine details fade, and accuracy naturally declines.


📊 Major OCR Benchmarks

1. OmniDocBench (CVPR 2025)

A comprehensive benchmark for PDF and document parsing, covering nine real-world document types — papers, textbooks, slides, exams, financial reports, magazines, newspapers, handwritten notes, and books.

It provides:

  • End-to-end evaluations (from image → structured text: Markdown, HTML, LaTeX)

  • Task-specific evaluations: layout detection, OCR recognition, table/figure/formula parsing

  • Attribute-based analysis: rotation, color background, multi-language, complexity, etc.

👉 It fills a major gap in earlier OCR datasets by enabling fair, fine-grained comparisons between traditional pipelines and modern vision-language models.

2. FOx (Focus Anywhere)

FOx is a fine-grained, focus-aware benchmark designed to test models’ ability to read or reason within specific document regions.

It includes tasks such as:

  • Region, line, or color-guided OCR (e.g., “Read the text in the red box”)

  • Region-level translation or summarization

  • Multi-page document reasoning and cross-page OCR
    It also demonstrates efficient compression — for instance, encoding a 1024×1024 document into only ~256 image tokens.


🧭 Common Evaluation Criteria for OCR Systems

| Category | What It Measures |
|---|---|
| Text Accuracy | Character/Word Error Rate (CER/WER), Edit Distance, BLEU, or structure-aware metrics (e.g., TEDS for HTML or LaTeX). |
| Layout & Structure Quality | Layout F1/mAP, table and formula structure accuracy. |
| Region-Level Precision | OCR accuracy on specific boxes, colors, or line positions (as in FOx). |
| Robustness | Stability under rotation, noise, watermarking, handwriting, or multi-language text. |
| Efficiency | Tokens per page, latency, and GPU memory footprint, where DeepSeek-OCR excels with 100-800 tokens/page and real-time decoding. |


🔧 My Local Setup & First Results (RTX A4000)

I ran DeepSeek-OCR locally on a workstation with an NVIDIA RTX A4000 (16 GB, Ampere) using a clean Conda environment. Below is the exact setup I used and a few compatibility notes so you can reproduce it.

Hardware & OS

  • GPU: NVIDIA RTX A4000 (16 GB VRAM, Ampere, ~140 W TDP) — a great balance of cost, power, and inference throughput for document OCR.

  • Use case fit: Vision encoder layers (conv/attention) benefit strongly from Tensor Cores; 16 GB VRAM comfortably handles 100–400 vision tokens/page presets.

Environment (Conda + PyTorch + vLLM)

# 1) Clone
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR

# 2) Conda env (Python 3.12)
conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr

# 3) PyTorch (CUDA 11.8 build)
# Tip: keep torch, torchvision, torchaudio on matching versions & CUDA build
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
  --index-url https://download.pytorch.org/whl/cu118

# 4) vLLM 0.8.5 (CUDA 11.8 wheel)
# Use the official wheel file that matches your CUDA build
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl

# 5) Project deps
pip install -r requirements.txt

# 6) Optional: FlashAttention (speeds up attention ops)
# If you're on CUDA 11.8 and hit build errors, skip this or switch to CUDA 12.x wheels
pip install flash-attn==2.7.3 --no-build-isolation

Run the script

cd DeepSeek-OCR-hf
python run_dpsk_ocr.py

Sample outputs (3 images): I published my first three OCR attempts here:
👉 https://github.com/mhieupham1/test-deepseek-ocr/tree/main/results

I’ll keep iterating and will add token-throughput (tokens/s), per-page latency, and accuracy notes as I expand the test set on the A4000.

🧩 Review & Observations After Testing

After running several document samples through DeepSeek-OCR on the RTX A4000, I was genuinely impressed by the model’s speed, visual compression quality, and clean text decoding. It handled most printed and structured text (such as English, Japanese, and tabular data) remarkably well — even at higher compression levels.

However, during testing I also noticed a few limitations that are worth mentioning:

  • 🔸 Occasional Missing Text:
    In some pages, especially those with dense layouts, overlapping elements, or colored backgrounds, DeepSeek-OCR tended to drop small text fragments or subscript characters. This seems to happen when the compression ratio is too aggressive (e.g., >10×), or when the region’s text contrast is low.

  • 🔸 Layout Sensitivity:
    Complex multi-column documents or pages with embedded tables sometimes caused partial text truncation near region boundaries. The vision encoder still captures the visual pattern but may lose context alignment at decoding time.

  • 🔸 Strengths in Clean Scans:
    On clean, high-resolution scans (PDF exports or book pages), the OCR output was extremely stable and accurate, rivaling tools like Tesseract + layout parsers, while producing far fewer tokens.

  • 🔸 Performance Efficiency:
    Even on a mid-range GPU like the RTX A4000 (16 GB), the model ran smoothly with ~2,000–2,500 tokens/s throughput using the Base preset. GPU memory usage remained below 12 GB, which is excellent for local inference.

In short:

DeepSeek-OCR delivers a new balance between accuracy and efficiency.
It’s not yet flawless — small-text regions can be lost under heavy compression —
but for large-scale document pipelines, the token cost reduction is game-changing.

Figma Make – When Design Can Actually Run

🚀 Figma Make – The Next Generation of Design and Development

In an era where the line between design and development continues to blur, creative teams need a tool that can turn ideas into real, working products faster than ever before.
Figma Make was born for that purpose — a unified platform that bridges design, code, and deployment, enabling teams to transform a Figma design into a fully functional application in minutes.


🌟 Overview: From Design to Real Product

Figma Make is a groundbreaking evolution in the Figma ecosystem.
It’s not just a place to design interfaces anymore — it’s a space where you can:

  • Design visually as usual in Figma

  • Add logic, data, and interactivity using AI or code blocks

  • Convert designs directly into React/Tailwind apps

  • And finally, deploy your app with a single click

The magic lies in its AI-assisted design-to-code capability. You can simply describe your idea — for example,

“Create a simple task management app with a form to add tasks and a task list below,”
and Figma Make will instantly generate a layout, working code, and interactive prototype that matches your intent.


💡 Key Features

1. AI Chat & Prompt-to-App

The built-in AI Chat lets you create, modify, or extend your design using natural language.
You might say:

“Add a revenue chart to the dashboard page.”
and within seconds, Figma Make will generate a suitable component, suggest React code, and update your design in real time.
It’s the fastest way to go from idea to interactive prototype.


2. Import & Reuse Designs

You don’t need to start from scratch. Figma Make allows you to:

  • Import existing Figma files

  • Automatically detect layouts, colors, and text styles

  • Apply Design Tokens or Components from your Design System

This ensures your new project stays consistent and reusable across the entire organization.


3. From Interactive Prototype → Real Web App

Instead of static mockups, you can now:

  • Attach event handlers (onClick, onChange, etc.)

  • Connect to sample data or live APIs

  • Preview everything in the browser as a real web application

Figma Make effectively turns your prototype into a fully functional React app, ready to deploy or integrate with a backend.


4. Visual and Code Editing in Parallel

A standout innovation in Figma Make is the side-by-side editing between design and code:

  • Edit the UI → code updates instantly

  • Edit the code → UI changes in real time

Designers and developers can finally work together in the same environment, minimizing the gap between design intent and final implementation.


5. Templates & Starter Kits

Figma Make includes a library of smart starter templates for:

  • Analytics dashboards

  • Landing pages

  • CRUD admin panels

  • Form-based apps

Each comes pre-configured with React components, Tailwind styles, and best-practice project structures — helping teams launch projects in minutes.


6. Sharing & Publishing

Once your prototype is ready, you can:

  • Publish it as a live web app

  • Share preview links with clients or teammates

  • Connect to GitHub for version control and collaboration

Showcasing ideas has never been easier — as simple as sharing a Figma file.


7. Design System Integration

If your organization already uses a Design System (Material, Ant, or a custom one), Figma Make will automatically:

  • Map your existing components

  • Preserve color tokens, typography, and spacing

  • Sync code and style guides

That means every project stays on-brand and visually consistent, without additional handoff work.

🧩 Hands-On Example: From Design → Code → Web Demo

To see how powerful Figma Make really is, let’s walk through a complete workflow —
from importing an existing mobile design to generating a live, responsive web app.

🪄 Step 1 – Prepare Your Design

Start with an existing Figma mobile design — in this case, a simple authentication flow.
Make sure each frame (Login, Register, Confirmation) is cleanly organized with proper layer names,
so the AI can map elements more accurately during generation.

Figma mobile design
A clean mobile layout with consistent spacing and components will give Make more context to work with.

⚙️ Step 2 – Import into Figma Make

Inside Figma, create a new Make File.
Then simply type your prompt in natural language — for example:

“Implement this design”

Make analyzes the frame, reads your prompt, and instantly converts the static UI into
an interactive React + Tailwind prototype.
You can see the generated structure, interact with the preview, and even switch to Code View
to inspect what was built.

Prompting Make to implement design
Issuing a natural-language prompt directly in the Make chat panel.
Initial generated result
The first generated prototype — ready for testing and iteration.

Occasionally, you may see minor layout or logic errors.
These can be fixed instantly using follow-up prompts such as:

“Fix overlapping elements on small screens.”
“Adjust padding between form fields.”
“Center the logo horizontally.”

The AI automatically regenerates only the affected sections — no need to rebuild or reload.

Fixing errors
Iterative refinement through quick AI prompts.
Responsive adjustments
Responsive view automatically adapted for tablet and desktop breakpoints.

🧱 Step 3 – Add More Screens and Logic

Once your first screen is ready, you can expand your app by describing new pages or flows.
For example:

“Add a registration page similar to the login screen.”
“After successful sign up, show a confirmation page with the user’s email.”
“Link the navigation buttons between screens.”
Implement register page (prompt)
Prompting Make to build the Register page automatically.
Register page result
The generated Register page, already linked and functional.

Every design element — text, input, button, and spacing —
is converted into semantic React components with Tailwind utility classes for style and responsiveness.

Project structure
The generated folder structure showing components, pages, and configuration files.

🚀 Step 4 – Publish Your Web App

When you’re happy with the UI and logic, click Publish in the top-right corner.
Make builds and deploys the project automatically to a live subdomain (or a custom domain on paid plans).
Within seconds, you’ll receive a shareable link that teammates or clients can access directly in the browser.

Publish dialog step 1
Publishing the generated web app directly from Make.
Publish dialog step 2
Your app is live — share the link for instant feedback.
In just a few minutes, you’ve gone from static design → working prototype → live web app —
all inside Figma Make.

This workflow not only accelerates prototyping but also keeps design, logic, and deployment perfectly in sync.

✅ Conclusion

Figma Make dramatically shortens the path from idea to live product.
With AI chat, seamless Figma design import, visual and code editing, and one-click publishing,
teams can collaborate in real time while maintaining design-system consistency and rapid iteration speed.

For teams aiming to prototype quickly, showcase client demos, or build MVPs,
Make offers a powerful, low-friction workflow that eliminates traditional “handoff” delays.
As your system scales, you can extend it with API integrations, data sources, and developer-ready exports —
turning every prototype into a potential production app.

Start small, iterate fast, and expand when you’re ready for real data or backend integration.

Serverless generative AI architectural patterns – Part 2

Generative AI is rapidly reshaping how we build intelligent systems — from text-to-image applications to multi-agent orchestration. But behind all that creativity lies a serious engineering challenge: how to design scalable, cost-efficient backends that handle unpredictable, compute-heavy AI workloads.

In Part 1: https://scuti.asia/serverless-generative-ai-architectural-patterns-part-1/

In Part 2 of AWS’s series “Serverless Generative AI Architectural Patterns,” the authors introduce three non-real-time patterns for running generative AI at scale — where workloads can be asynchronous, parallelized, or scheduled in bulk.


🧩 Pattern 4: Buffered Asynchronous Request–Response

When to Use

This pattern is perfect for tasks that take time — such as:

  • Text-to-video or text-to-music generation

  • Complex data analysis or simulations

  • AI-assisted design, art, or high-resolution image rendering

Instead of waiting for immediate results, the system processes requests in the background and notifies users once done.

Architecture Flow

  1. Amazon API Gateway (REST / WebSocket) receives incoming requests.

  2. Amazon SQS queues the requests to decouple frontend and backend.

  3. A compute backend (AWS Lambda, Fargate, or EC2) pulls messages, calls the model (via Amazon Bedrock or custom inference), and stores results in DynamoDB or S3.

  4. The client polls or listens via WebSocket for completion.
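
As a concrete illustration of steps 2–3, here is a minimal Python sketch of the queue-consuming worker, written as an AWS Lambda handler with boto3. The SQS event source mapping, the Bedrock model ID, and the DynamoDB table name are placeholder assumptions for illustration, not details from the AWS article.

import json
import boto3

bedrock = boto3.client("bedrock-runtime")
table = boto3.resource("dynamodb").Table("GenerationJobs")   # placeholder table name

def handler(event, context):
    # Each record is one buffered request delivered by the SQS event source mapping
    for record in event["Records"]:
        job = json.loads(record["body"])                      # e.g. {"job_id": ..., "prompt": ...}

        # The long-running model call happens here, outside the user's request path
        response = bedrock.invoke_model(
            modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": job["prompt"]}],
            }),
        )
        output = json.loads(response["body"].read())

        # Persist the result; the client polls (or receives a WebSocket push) for this item
        table.put_item(Item={"job_id": job["job_id"], "status": "DONE", "result": json.dumps(output)})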

Benefits

  • Highly scalable and resilient to spikes.

  • Reduces load on real-time systems.

  • Ideal for workflows where a few minutes of delay is acceptable.


🔀 Pattern 5: Multimodal Parallel Fan-Out

When to Use

For multi-model or multi-agent workloads — for example:

  • Combining text, image, and audio generation

  • Running multiple LLMs for different subtasks

  • Parallel pipelines that merge into one consolidated output

Architecture Flow

  1. An event (API call, S3 upload, etc.) publishes to Amazon SNS or EventBridge.

  2. The message fans out to multiple targets — queues or Lambda functions.

  3. Each target performs a separate inference or operation.

  4. AWS Step Functions or EventBridge Pipes aggregate results when all sub-tasks finish.
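
To make step 1 of the fan-out concrete, the snippet below publishes a single generation event to an SNS topic; each subscribed queue or Lambda function then handles its own modality independently. The topic ARN and message attributes are placeholders for illustration.

import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:generation-requests"  # placeholder ARN

def fan_out(job_id: str, prompt: str) -> None:
    # One publish; SNS delivers a copy to every subscriber (text, image, audio pipelines, ...)
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps({"job_id": job_id, "prompt": prompt}),
        MessageAttributes={
            # Subscribers can filter on this attribute to pick up only their modality
            "modalities": {"DataType": "String", "StringValue": "text,image,audio"},
        },
    )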

Benefits

  • Enables concurrent processing for faster results.

  • Fault isolation between sub-tasks.

  • Scales elastically with demand.

This pattern is especially useful in multi-agent AI systems, where independent reasoning units run in parallel before combining their insights.


🕒 Pattern 6: Non-Interactive Batch Processing

When to Use

Use this pattern for large-scale or scheduled workloads that don’t involve user interaction — such as:

  • Generating embeddings for millions of records

  • Offline document summarization or translation

  • Periodic content refreshes or nightly analytics jobs

Architecture Flow

  1. A scheduled event (via Amazon EventBridge Scheduler or CloudWatch Events) triggers the batch workflow.

  2. AWS Step Functions, Glue, or Lambda orchestrate the sequence of tasks.

  3. Data is read from S3, processed through generative or analytical models, and written back to storage or a database.

  4. Optional post-processing (indexing, notifications, reports) completes the cycle.
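
The sketch below shows the shape of steps 2–3 as a single scheduled Lambda job that walks an S3 prefix, generates an embedding per document, and writes the vectors back to S3. The bucket name, prefix, and embedding model ID are illustrative assumptions; swap in whichever embedding model and storage layout your pipeline uses.

import json
import boto3

s3 = boto3.client("s3")
bedrock = boto3.client("bedrock-runtime")
BUCKET = "my-docs-bucket"        # placeholder bucket

def handler(event, context):
    # Triggered on a schedule by EventBridge Scheduler; no user is waiting on this
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix="raw/")
    for obj in listing.get("Contents", []):
        text = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read().decode("utf-8")

        # Embedding call (placeholder model ID and request shape)
        resp = bedrock.invoke_model(
            modelId="amazon.titan-embed-text-v2:0",
            body=json.dumps({"inputText": text[:8000]}),
        )
        embedding = json.loads(resp["body"].read())["embedding"]

        # Write results next to the source so downstream RAG indexing can pick them up
        s3.put_object(
            Bucket=BUCKET,
            Key=obj["Key"].replace("raw/", "embeddings/") + ".json",
            Body=json.dumps({"key": obj["Key"], "embedding": embedding}),
        )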

Benefits

  • Handles high-volume workloads without human interaction.

  • Scales automatically with AWS’s serverless services.

  • Cost-efficient since resources run only during job execution.

This pattern is common in data pipelines, RAG preprocessing, or periodic AI content generation where timing, not interactivity, matters.


⚙️ Key Takeaways

  • Serverless + Generative AI provides elasticity, scalability, and simplicity — letting teams focus on creativity instead of infrastructure.

  • Event-driven architectures (SQS, SNS, EventBridge) keep systems modular, fault-tolerant, and reactive.

  • With building blocks like Lambda, Fargate, Step Functions, DynamoDB, Bedrock, and S3, developers can move from experiments to production-grade systems seamlessly.

  • These patterns make it easier to build cost-efficient, always-available AI pipelines — from real-time chatbots to scheduled large-scale content generation.


💡 Final Thoughts

Generative AI isn’t just about model power — it’s about the architecture that delivers it reliably at scale.
AWS’s serverless ecosystem offers a powerful foundation for building asynchronous, parallel, and batch AI workflows that adapt to user and business needs alike.

👉 Explore the full article here: Serverless Generative AI Architectural Patterns – Part 2

Built a Real-Time Translator Web App Running a Local LLM on My Mac M1

🧠 I Built a Real-Time Translator Web App Running a Local LLM on My Mac M1

Recently, I had a small idea: to create a real-time speech translation tool for meetings, but instead of relying on online APIs, I wanted everything to run completely local on my Mac M1.
The result is a web demo that lets users speak into the mic → transcribe speech → translate in real-time → display bilingual subtitles on screen.
The average response time is about 1 second, which is fast enough for real-time conversations or meetings.


🎙️ How the App Works

The app follows a simple pipeline:

  1. SpeechRecognition in the browser converts voice into text.

  2. The text is then sent to a local LLM hosted via LM Studio for translation (e.g., English ↔ Vietnamese).

  3. The translated text is displayed instantly as subtitles on the screen.

My goal was to experiment with real-time translation for live meetings — for example, when someone speaks English, the listener can instantly see the Vietnamese subtitle (and vice versa).


⚙️ My Setup and Model Choice

I’m using a Mac mini M1 with 16 GB of unified memory, roughly 12 GB of which is available to the GPU via Metal.
After testing many small models — from 1B to 7B — I found that google/gemma-3-4b provides the best balance between speed, accuracy, and context awareness.

Key highlights of google/gemma-3-4b:

  • Average response time: ~1 second on Mac M1

  • 🧩 Context length: up to 131,072 tokens — allowing it to handle long conversations or paragraphs in a single prompt

  • 💬 Translation quality: natural and faithful to meaning

  • 🎯 Prompt obedience: follows structured prompts well, unlike smaller models that tend to drift off topic

I host the model using LM Studio, which makes running and managing local LLMs extremely simple.
With Metal GPU acceleration, the model runs smoothly without lag, even while the browser is processing audio in parallel.

🧰 LM Studio – Local LLMs Made Simple

One thing I really like about LM Studio is how simple it makes running local LLMs.
It’s a desktop app for macOS, Windows, and Linux that lets you download, run, and manage models without writing code, while still giving you powerful developer features.

Key features that made it perfect for my setup:

  • Easy installation: download the .dmg (for macOS) or installer for Windows/Linux and you’re ready in minutes.

  • Built-in model browser: browse models from sources like Hugging Face, choose quantization levels, and download directly inside the app.

  • Local & public API: LM Studio can launch a local REST API server with OpenAI-compatible endpoints (/v1/chat/completions, /v1/embeddings, etc.), which you can call from any app — including my translator web client.

  • Logs and performance monitoring: it displays live logs, token counts, generation speed, and resource usage (RAM, GPU VRAM, context window occupancy).

  • No coding required: once the model is loaded, you can interact through the built-in console or external scripts using the API — perfect for prototyping.

  • Ideal for local prototyping: for quick experiments like mine, LM Studio removes all setup friction — no Docker, no backend framework — just plug in your model and start testing.

Thanks to LM Studio, setting up the local LLM was nearly effortless.
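
To give a feel for step 2 of the pipeline, here is roughly how the translation request goes through LM Studio's OpenAI-compatible endpoint, shown in Python for brevity (the web client does essentially the same call from the browser). The port is LM Studio's default local server port; the prompt wording and temperature are from my setup and may differ in yours.

import requests

LM_STUDIO_URL = "http://localhost:1234/v1/chat/completions"  # LM Studio's default local server

def translate(text: str, target_lang: str = "Vietnamese") -> str:
    # A structured system prompt keeps the small model focused on translation only
    payload = {
        "model": "google/gemma-3-4b",
        "messages": [
            {"role": "system", "content": f"Translate the user's text into {target_lang}. Output only the translation."},
            {"role": "user", "content": text},
        ],
        "temperature": 0.2,
    }
    resp = requests.post(LM_STUDIO_URL, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()

print(translate("Let's start the meeting with a quick status update."))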


🌐 About SpeechRecognition – It’s Still Cloud-Based

At first, I thought the SpeechRecognition API in browsers could work offline.
But in reality, it doesn’t:

On browsers like Chrome, SpeechRecognition (or webkitSpeechRecognition) sends the recorded audio to Google’s servers for processing.
As a result:

  • It can’t work offline

  • It depends on an internet connection

  • You don’t have control over the recognition engine

This means that while the translation part of my app runs entirely local, the speech recognition part still relies on an external service.

🧪 Real-World Test

To test the pipeline, I read a short passage from a fairy tale aloud.
The results were surprisingly good:

  • Subtitles appeared clearly, preserving the storytelling tone and rhythm of the original text.

  • No missing words as long as I spoke clearly and maintained a steady pace.

  • When I intentionally spoke too fast or slurred words, the system still kept up — but occasionally missed punctuation or merged phrases, something that could be improved with punctuation post-processing or a small buffering delay before sending text to the LLM.

Tips for smoother results:

  • Maintain a steady speaking rhythm, pausing naturally every 5–10 words.

  • Add punctuation normalization before rendering (or enable auto-punctuation when using Whisper).

  • Process short chunks (~2–3 seconds) and merge them for low latency and better context retention.

🧩 Some Demo Screenshots

📷 Image 1 – Web Interface:
User speaks into the microphone; subtitles appear in real time below, showing both the original and translated text.

📷 Image 2 – LM Studio:
google/gemma-3-4b running locally on Metal GPU inside LM Studio, showing logs and average response time.


🔭 Final Thoughts

This project is still a small experiment, but I’m truly impressed that a 4B parameter model running locally can handle real-time translation this well — especially with a 131K token context window, which allows it to keep track of long, coherent discussions.
With Whisper integrated locally, I believe it’s possible to build a fully offline real-time translation tool — useful for meetings, presentations, or any situation where data privacy matters.


✳️ In short:
If you’re looking for a small yet smart model that runs smoothly on a Mac M1 without a discrete GPU, I highly recommend trying google/gemma-3-4b with LM Studio.
Sometimes, a small but well-behaved model — with a huge context window — is all you need to unlock big ideas 🚀

Comparing the D-ID API and HeyGen API – AI Avatar Solutions for Businesses

As AI-generated video takes off, D-ID and HeyGen are leading the way in talking virtual avatars for training, marketing, and customer support. Both provide APIs that can be integrated directly into products, websites, or internal systems.

Overview of the Two Platforms

D-ID: Focused on real-time interactive avatars

  • Talks API: generates video from a photo plus text or audio.
  • Realtime/Streaming: real-time conversational avatars (WebRTC).
  • Knowledge/Agent: integrates knowledge sources (RAG) so the avatar answers from your own data.
  • Use cases: virtual assistants, in-app guidance, internal training.

HeyGen: Strong in marketing video & localization

  • Video generation API: from a photo or a stock avatar.
  • Streaming Avatar API: live conversation.
  • Multilingual translation & lip-sync: localizes video for multiple markets.
  • Use cases: promotional videos, product walkthroughs, multilingual training.

Quick Comparison

| Criterion | D-ID API | HeyGen API |
| --- | --- | --- |
| Primary focus | Real-time interactive AI avatars grounded in internal knowledge | AI video for marketing, training, localization |
| Streaming/Realtime | Yes (WebRTC/Realtime) | Yes (Interactive/Streaming) |
| Multilingual & lip-sync | Good, conversation-focused | Very strong, optimized for translation & dubbing |
| Avatar customization | Upload your own photos, basic emotion control | Large library of stock avatars, quick to pick from |
| Knowledge Base / Agent | Yes, supports RAG/agents | Not a core focus |
| Docs & SDK | Complete; the streaming part requires WebRTC knowledge | Complete; templates/workflows for marketers |
| Pricing | Usage-based; detailed quotes usually require contacting sales | Transparent credit-based tiers (Free/Pro/Scale) |
| Best fit | Video chatbots, internal virtual assistants | Marketing, training, multilingual content |

Pros and Cons

D-ID API

Pros:

  • Stable real-time avatars, well suited to chatbots and live support.
  • Internal knowledge integration (RAG) for building “virtual employees”.
  • Personalization from photos of real people.

Cons:

  • Streaming setup requires WebRTC knowledge (SDP/ICE).
  • Not as specialized in bulk translation/lip-sync as HeyGen.
  • Pricing can be less transparent (depends on plan and enterprise agreement).

HeyGen API

Pros:

  • Very strong multilingual translation & lip-sync, with many templates.
  • Easy to use and fast for building an MVP; clear Free/Pro/Scale plans.
  • Well suited to high-volume marketing and training video production.

Cons:

  • No native agent/internal-knowledge support.
  • Costs can grow quickly for long videos or high volumes.
  • Less flexible avatar customization from user-supplied data.

Choosing by Goal

  • Live conversational avatars (support, consulting, onboarding): prefer the D-ID API.
  • Multilingual video translation/lip-sync and marketing content production: prefer the HeyGen API.
  • Virtual employees grounded in your own data (RAG/agent): D-ID API.
  • Multilingual internal training and bulk publishing: HeyGen API.
  • Combined approach: D-ID for real-time chat; HeyGen for training and marketing videos.

Technical Deployment Recommendations

  1. Identify the primary flow: real-time (WebRTC) or batch (video rendering); a batch render-then-poll sketch follows this list.
  2. Plan for cost: estimate video length, number of languages, and concurrent traffic.
  3. Integration architecture: split rendering into its own microservice with a video queue; serve exported files through a CDN.
  4. Security & privacy: encrypt data, control API keys/secrets, keep access logs.
  5. Measure quality: set KPIs for lip-sync accuracy, real-time latency, and render success rate.
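
For recommendation 1, the batch path usually boils down to "submit a render job, then poll until it is done." The sketch below shows that generic flow in Python; the endpoint paths, payload fields, and auth header are deliberately generic placeholders rather than D-ID's or HeyGen's actual API schemas, so map them onto whichever provider you choose.

import time
import requests

API_BASE = "https://api.example-avatar-vendor.com"   # placeholder, not a real vendor endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}    # placeholder auth scheme

def render_and_wait(script_text: str, avatar_id: str, poll_seconds: int = 5) -> str:
    # 1) Submit the render job (batch path: no user is waiting synchronously)
    job = requests.post(
        f"{API_BASE}/videos",
        headers=HEADERS,
        json={"avatar_id": avatar_id, "script": script_text},
        timeout=30,
    ).json()

    # 2) Poll until the vendor reports the video is ready, then return its URL
    while True:
        status = requests.get(f"{API_BASE}/videos/{job['id']}", headers=HEADERS, timeout=30).json()
        if status["status"] == "done":
            return status["video_url"]
        if status["status"] == "failed":
            raise RuntimeError(f"Render failed: {status}")
        time.sleep(poll_seconds)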

Fine-Tuning GPT-OSS-20B on Google Colab Using Unsloth and LoRA

1. Introduction

In today’s rapidly advancing field of AI, using AI models, and in particular running them on personal computers, has become more common than ever.
However, many of these models are increasingly hard to run locally because they are massive, often containing billions of parameters.
This makes it nearly impossible for low-end computers to use them effectively for work or projects.

Therefore, in this article, we will explore Google Colab together with Unsloth’s fine-tuning tool, combined with LoRA, to fine-tune and use gpt-oss-20b according to our own needs.


2. Main Content

a. What is Unsloth?

  • Unsloth is a modern Python library designed to speed up and optimize the fine-tuning of large language models (LLMs) such as LLaMA, Mistral, Mixtral, and others.
    It makes model training and fine-tuning extremely fast, memory-efficient, and easy — even on limited hardware like a single GPU or consumer-grade machines.

b. What is Colab?

  • Colab is a hosted Jupyter Notebook service that requires no setup and provides free access to computing resources, including GPUs and TPUs.
    It is particularly well-suited for machine learning, data science, and education purposes.

c. What is LoRA?

  • Low-Rank Adaptation (LoRA) is a technique for quickly adapting machine learning models to new contexts.
    LoRA helps make large and complex models more suitable for specific tasks. It works by adding lightweight layers to the original model rather than modifying the entire architecture.
    This allows developers to quickly expand and specialize machine learning models for various applications.
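
To make the "lightweight layers" idea concrete, here is a tiny NumPy sketch (not Unsloth's internals) of what a LoRA update looks like: instead of retraining the full weight matrix W, you learn two small matrices A and B and add their scaled product on top of the frozen W. The rank r = 8 and alpha = 16 mirror the LoRA config used later in this article.

import numpy as np

d, r, alpha = 1024, 8, 16          # hidden size, LoRA rank, scaling factor

W = np.random.randn(d, d)          # frozen pretrained weight: d*d parameters, never updated
A = np.random.randn(r, d) * 0.01   # trainable: r*d parameters
B = np.zeros((d, r))               # trainable: d*r parameters, zero-initialized so training starts at W

# Effective weight used in the forward pass; only A and B receive gradients
W_eff = W + (alpha / r) * (B @ A)

trainable = A.size + B.size
print(f"Trainable params: {trainable:,} vs frozen: {W.size:,} ({trainable / W.size:.2%})")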

3. Using Colab to Train gpt-oss-20b

– Installing the Libraries

!pip install --upgrade -qqq uv

try:
    import numpy
    install_numpy = f"numpy=={numpy.__version__}"
except:
    install_numpy = "numpy"

!uv pip install -qqq \
  "torch>=2.8.0" "triton>=3.4.0" {install_numpy} \
  "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
  "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
  torchvision bitsandbytes \
  git+https://github.com/huggingface/[email protected] \
  git+https://github.com/triton-lang/triton.git@05b2c186c1b6c9a08375389d5efe9cb4c401c075#subdirectory=python/triton_kernels

– After completing the installation, load the gpt-oss-20b model from Unsloth:

from unsloth import FastLanguageModel
import torch

max_seq_length = 1024
dtype = None
model_name = "unsloth/gpt-oss-20b"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    dtype = dtype,                 # None for auto detection
    max_seq_length = max_seq_length,  # Choose any for long context!
    load_in_4bit = True,           # 4 bit quantization to reduce memory
    full_finetuning = False,       # [NEW!] We have full finetuning now!
    # token = "hf_...",            # use one if using gated models
)
Colab install output

– Adding LoRA for Fine-Tuning

model = FastLanguageModel.get_peft_model(
    model,
    r = 8,  # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,              # Optimized fast path
    bias = "none",                 # Optimized fast path
    # "unsloth" uses less VRAM, fits larger batches
    use_gradient_checkpointing = "unsloth",  # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
Tip: If you hit out-of-memory (OOM), reduce max_seq_length, set a smaller r, or increase gradient_accumulation_steps.

– Testing the Model Before Fine-Tuning

Now, let’s test how the model responds before fine-tuning:

messages = [
    {"role": "system", "content": "Bạn là Shark B, một nhà đầu tư nổi tiếng, thẳng thắn và thực tế", "thinking": None},
    {"role": "user", "content": "Bạn hãy giới thiệu bản thân"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "low",
).to(model.device)

from transformers import TextStreamer
_ = model.generate(**inputs, max_new_tokens = 512, streamer = TextStreamer(tokenizer))
Generation preview

– Load the data for fine-tuning the model

Dataset sample

Dataset preview
def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

from datasets import load_dataset
dataset = load_dataset("json", data_files="data.jsonl", split="train")
dataset
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True)

– Train model

The following code snippet defines the configuration and setup for the fine-tuning process.
Here, we use SFTTrainer and SFTConfig from the trl library to perform Supervised Fine-Tuning (SFT) on our model.
The configuration specifies parameters such as batch size, learning rate, optimizer type, and number of training epochs.

from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1,  # Set this for 1 full training run.
        # max_steps = 30,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",  # Use this for WandB etc.
    ),
)

trainer_stats = trainer.train()
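
Before reloading, save the trained LoRA adapters and tokenizer so the path used in the next step actually exists. A minimal sketch: save_pretrained on the PEFT-wrapped model writes only the small adapter weights; if you need a fully merged checkpoint, Unsloth also offers merge-and-save helpers, so check its docs for the exact method on your version.

# Save the LoRA adapters and tokenizer to the directory referenced as "finetuned_model" below
model.save_pretrained("finetuned_model")
tokenizer.save_pretrained("finetuned_model")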

– After training, try the fine-tuned model

# Example reload (set to True to run)
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "finetuned_model",  # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 1024,
        dtype = None,
        load_in_4bit = True,
    )

    messages = [
        {"role": "system", "content": "Bạn là Shark B, một nhà đầu tư nổi tiếng, thẳng thắn và thực tế", "thinking": None},
        {"role": "user", "content": "Bạn hãy giới thiệu bản thân"},
    ]

    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt = True,
        return_tensors = "pt",
        return_dict = True,
        reasoning_effort = "low",
    ).to(model.device)

    from transformers import TextStreamer
    _ = model.generate(**inputs, max_new_tokens = 512, streamer = TextStreamer(tokenizer))
Note: Replace finetuned_model with your actual model path (e.g., outputs or the directory you saved/merged adapters to).

Colab notebook: Open your Colab here.


4. Conclusion & Next Steps

By combining Unsloth (for speed and memory efficiency), LoRA (for lightweight adaptation), and Google Colab (for accessible compute), you can fine-tune gpt-oss-20b even on modest hardware. The workflow above helps you:

  • Install a reproducible environment with optimized kernels.
  • Load gpt-oss-20b in 4-bit to reduce VRAM usage.
  • Attach LoRA adapters to train only a small set of parameters.
  • Prepare chat-style datasets and run supervised fine-tuning with TRL’s SFTTrainer.
  • Evaluate before/after to confirm your improvements.
Open the Colab
Clone the notebook, plug in your dataset, and fine-tune your own assistant in minutes.