DeepSeek-OCR: Testing a New Era of Visual Compression OCR on RTX A4000

🚀 DeepSeek-OCR — Reinventing OCR Through Visual Compression

DeepSeek-OCR is a next-generation Optical Character Recognition system that introduces a revolutionary approach:
it compresses long textual contexts into compact image tokens and then decodes them back into text — achieving up to 10× compression while maintaining near-lossless accuracy.


⚙️ Key Features of DeepSeek-OCR

1. Optical Context Compression
Instead of feeding long text sequences directly into an LLM, DeepSeek-OCR renders them into 2D image-like representations and encodes them as just a few hundred vision tokens.
At compression ratios below 10×, the model maintains around 97% decoding accuracy; even at 20× compression, it still reaches roughly 60%.

2. Two-Stage Architecture

  • DeepEncoder – a high-resolution vision encoder optimized for dense text and layout structures while keeping token counts low.

  • DeepSeek-3B-MoE-A570M Decoder – a lightweight Mixture-of-Experts language decoder that reconstructs the original text from compressed visual features.

3. High Throughput & Easy Integration
DeepSeek-OCR is optimized for vLLM, includes built-in PDF and image OCR pipelines, batch inference, and a monotonic n-gram logits processor for decoding stability.
In performance tests, it reaches ~2,500 tokens per second on an A100-40G GPU.

4. Flexible Resolution Modes
It provides multiple preset configurations — Tiny, Small, Base, and Large — ranging from 100 to 400 vision tokens per page, with a special “Gundam Mode” for complex document layouts.


🔍 How It Works — Core Mechanism

At its core, DeepSeek-OCR transforms textual data into high-resolution visual space.
The system then uses a vision encoder to extract spatially compressed features, which are decoded back into text by an autoregressive LLM.

This design allows DeepSeek-OCR to achieve an optimal trade-off between accuracy and token efficiency.
On OmniDocBench, DeepSeek-OCR outperforms GOT-OCR 2.0 using only 100 vision tokens per page, and surpasses MinerU 2.0 with fewer than 800 tokens per page — delivering both speed and precision.
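
To make the token-efficiency claim concrete, here is a small back-of-the-envelope sketch. The assumption of roughly 4 characters per text token and the 100-vision-token budget are illustrative only, not measurements from the model:

# Rough check of the compression idea: compare how many text tokens a page
# would normally cost against a fixed vision-token budget.
def compression_ratio(page_text: str, vision_tokens: int = 100,
                      chars_per_text_token: float = 4.0) -> float:
    """Estimated (text tokens needed) / (vision tokens used) for one page."""
    est_text_tokens = len(page_text) / chars_per_text_token
    return est_text_tokens / vision_tokens

page = "word " * 700  # a dense page with roughly 700 words
print(f"~{compression_ratio(page):.1f}x compression at 100 vision tokens")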


💡 Why “Long Context → Image Tokens” Works

Written language is highly structured and visually redundant — fonts, character shapes, and layout patterns repeat frequently.
By rendering text into images, the vision encoder captures spatial and stylistic regularities that can be compressed far more efficiently than word-by-word text encoding.

In short:

  • Traditional OCR treats every word or character as a separate token.

  • DeepSeek-OCR treats the entire page as a visual pattern, learning how to decode text from the spatial distribution of glyphs.
    → That’s why it achieves 10× token compression with minimal accuracy loss.
    At extreme compression (20×), fine details fade, and accuracy naturally declines.


📊 Major OCR Benchmarks

1. OmniDocBench (CVPR 2025)

A comprehensive benchmark for PDF and document parsing, covering nine real-world document types — papers, textbooks, slides, exams, financial reports, magazines, newspapers, handwritten notes, and books.

It provides:

  • End-to-end evaluations (from image → structured text: Markdown, HTML, LaTeX)

  • Task-specific evaluations: layout detection, OCR recognition, table/figure/formula parsing

  • Attribute-based analysis: rotation, color background, multi-language, complexity, etc.

👉 It fills a major gap in earlier OCR datasets by enabling fair, fine-grained comparisons between traditional pipelines and modern vision-language models.

2. FOx (Focus Anywhere)

FOx is a fine-grained, focus-aware benchmark designed to test models’ ability to read or reason within specific document regions.

It includes tasks such as:

  • Region, line, or color-guided OCR (e.g., “Read the text in the red box”)

  • Region-level translation or summarization

  • Multi-page document reasoning and cross-page OCR

FOx also demonstrates efficient compression — for instance, encoding a 1024×1024 document page into only ~256 image tokens.


🧭 Common Evaluation Criteria for OCR Systems

  • Text Accuracy: Character/Word Error Rate (CER/WER), edit distance, BLEU, or structure-aware metrics (e.g., TEDS for HTML or LaTeX).

  • Layout & Structure Quality: layout F1/mAP, table and formula structure accuracy.

  • Region-Level Precision: OCR accuracy on specific boxes, colors, or line positions (as in FOx).

  • Robustness: stability under rotation, noise, watermarking, handwriting, or multi-language text.

  • Efficiency: tokens per page, latency, and GPU memory footprint — where DeepSeek-OCR excels with 100–800 tokens/page and real-time decoding.
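
As a concrete illustration of the first row, a minimal character error rate (CER) computation in plain Python (no external dependencies) might look like this:

# Character error rate via plain edit distance -- the metric the
# "Text Accuracy" row refers to.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

print(cer("DeepSeek-OCR", "DeepSeek OCR"))  # one substitution -> ~0.083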


🔧 My Local Setup & First Results (RTX A4000)

I ran DeepSeek-OCR locally on a workstation with an NVIDIA RTX A4000 (16 GB, Ampere) using a clean Conda environment. Below is the exact setup I used and a few compatibility notes so you can reproduce it.

Hardware & OS

  • GPU: NVIDIA RTX A4000 (16 GB VRAM, Ampere, ~140 W TDP) — a great balance of cost, power, and inference throughput for document OCR.

  • Use case fit: Vision encoder layers (conv/attention) benefit strongly from Tensor Cores; 16 GB VRAM comfortably handles 100–400 vision tokens/page presets.

Environment (Conda + PyTorch + vLLM)

# 1) Clone
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR
# 2) Conda env (Python 3.12)
conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr
# 3) PyTorch (CUDA 11.8 build)
# Tip: keep torch, torchvision, torchaudio on matching versions & CUDA build
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
  --index-url https://download.pytorch.org/whl/cu118
# 4) vLLM 0.8.5 (CUDA 11.8 wheel)
# Use the official wheel file that matches your CUDA build
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
# 5) Project deps
pip install -r requirements.txt
# 6) Optional: FlashAttention (speeds up attention ops)
# If you’re on CUDA 11.8 and hit build errors, skip this or switch to CUDA 12.x wheels (see Gotchas)
pip install flash-attn==2.7.3 --no-build-isolation
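
Before running the OCR script, a quick sanity check that the CUDA build of PyTorch can actually see the A4000 is worthwhile; the expected values below simply reflect the versions installed above:

import torch
print(torch.__version__, torch.version.cuda)   # expect 2.6.0 and 11.8
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))       # expect NVIDIA RTX A4000
else:
    print("CUDA not visible -- check the driver / CUDA build")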

Run the script

cd DeepSeek-OCR-hf
python run_dpsk_ocr.py

Sample outputs (3 images): I published my first three OCR attempts here:
👉 https://github.com/mhieupham1/test-deepseek-ocr/tree/main/results

I’ll keep iterating and will add token-throughput (tokens/s), per-page latency, and accuracy notes as I expand the test set on the A4000.

🧩 Review & Observations After Testing

After running several document samples through DeepSeek-OCR on the RTX A4000, I was genuinely impressed by the model’s speed, visual compression quality, and clean text decoding. It handled most printed and structured text (such as English, Japanese, and tabular data) remarkably well — even at higher compression levels.

However, during testing I also noticed a few limitations that are worth mentioning:

  • 🔸 Occasional Missing Text:
    In some pages, especially those with dense layouts, overlapping elements, or colored backgrounds, DeepSeek-OCR tended to drop small text fragments or subscript characters. This seems to happen when the compression ratio is too aggressive (e.g., >10×), or when the region’s text contrast is low.

  • 🔸 Layout Sensitivity:
    Complex multi-column documents or pages with embedded tables sometimes caused partial text truncation near region boundaries. The vision encoder still captures the visual pattern but may lose context alignment at decoding time.

  • 🔸 Strengths in Clean Scans:
    On clean, high-resolution scans (PDF exports or book pages), the OCR output was extremely stable and accurate, rivaling tools like Tesseract + layout parsers, while producing far fewer tokens.

  • 🔸 Performance Efficiency:
    Even on a mid-range GPU like the RTX A4000 (16 GB), the model ran smoothly with ~2,000–2,500 tokens/s throughput using the Base preset. GPU memory usage remained below 12 GB, which is excellent for local inference.

In short:

DeepSeek-OCR delivers a new balance between accuracy and efficiency.
It’s not yet flawless — small-text regions can be lost under heavy compression —
but for large-scale document pipelines, the token cost reduction is game-changing.

Figma Make – When Design Can Actually Run

🚀 Figma Make – The Next Generation of Design and Development

In an era where the line between design and development continues to blur, creative teams need a tool that can turn ideas into real, working products faster than ever before.
Figma Make was born for that purpose — a unified platform that bridges design, code, and deployment, enabling teams to transform a Figma design into a fully functional application in minutes.


🌟 Overview: From Design to Real Product

Figma Make is a groundbreaking evolution in the Figma ecosystem.
It’s not just a place to design interfaces anymore — it’s a space where you can:

  • Design visually as usual in Figma

  • Add logic, data, and interactivity using AI or code blocks

  • Convert designs directly into React/Tailwind apps

  • And finally, deploy your app with a single click

The magic lies in its AI-assisted design-to-code capability. You can simply describe your idea — for example,

“Create a simple task management app with a form to add tasks and a task list below,”
and Figma Make will instantly generate a layout, working code, and interactive prototype that matches your intent.


💡 Key Features

1. AI Chat & Prompt-to-App

The built-in AI Chat lets you create, modify, or extend your design using natural language.
You might say:

“Add a revenue chart to the dashboard page.”
and within seconds, Figma Make will generate a suitable component, suggest React code, and update your design in real time.
It’s the fastest way to go from idea to interactive prototype.


2. Import & Reuse Designs

You don’t need to start from scratch. Figma Make allows you to:

  • Import existing Figma files

  • Automatically detect layouts, colors, and text styles

  • Apply Design Tokens or Components from your Design System

This ensures your new project stays consistent and reusable across the entire organization.


3. From Interactive Prototype → Real Web App

Instead of static mockups, you can now:

  • Attach event handlers (onClick, onChange, etc.)

  • Connect to sample data or live APIs

  • Preview everything in the browser as a real web application

Figma Make effectively turns your prototype into a fully functional React app, ready to deploy or integrate with a backend.


4. Visual and Code Editing in Parallel

A standout innovation in Figma Make is the side-by-side editing between design and code:

  • Edit the UI → code updates instantly

  • Edit the code → UI changes in real time

Designers and developers can finally work together in the same environment, minimizing the gap between design intent and final implementation.


5. Templates & Starter Kits

Figma Make includes a library of smart starter templates for:

  • Analytics dashboards

  • Landing pages

  • CRUD admin panels

  • Form-based apps

Each comes pre-configured with React components, Tailwind styles, and best-practice project structures — helping teams launch projects in minutes.


6. Sharing & Publishing

Once your prototype is ready, you can:

  • Publish it as a live web app

  • Share preview links with clients or teammates

  • Connect to GitHub for version control and collaboration

Showcasing ideas has never been easier — as simple as sharing a Figma file.


7. Design System Integration

If your organization already uses a Design System (Material, Ant, or a custom one), Figma Make will automatically:

  • Map your existing components

  • Preserve color tokens, typography, and spacing

  • Sync code and style guides

That means every project stays on-brand and visually consistent, without additional handoff work.

🧩 Hands-On Example: From Design → Code → Web Demo

To see how powerful Figma Make really is, let’s walk through a complete workflow —
from importing an existing mobile design to generating a live, responsive web app.

🪄 Step 1 – Prepare Your Design

Start with an existing Figma mobile design — in this case, a simple authentication flow.
Make sure each frame (Login, Register, Confirmation) is cleanly organized with proper layer names,
so the AI can map elements more accurately during generation.

Figma mobile design
A clean mobile layout with consistent spacing and components will give Make more context to work with.

⚙️ Step 2 – Import into Figma Make

Inside Figma, create a new Make File.
Then simply type your prompt in natural language — for example:

“Implement this design”

Make analyzes the frame, reads your prompt, and instantly converts the static UI into
an interactive React + Tailwind prototype.
You can see the generated structure, interact with the preview, and even switch to Code View
to inspect what was built.

Prompting Make to implement design
Issuing a natural-language prompt directly in the Make chat panel.
Initial generated result
The first generated prototype — ready for testing and iteration.

Occasionally, you may see minor layout or logic errors.
These can be fixed instantly using follow-up prompts such as:

“Fix overlapping elements on small screens.”
“Adjust padding between form fields.”
“Center the logo horizontally.”

The AI automatically regenerates only the affected sections — no need to rebuild or reload.

Fixing errors
Iterative refinement through quick AI prompts.
Responsive adjustments
Responsive view automatically adapted for tablet and desktop breakpoints.

🧱 Step 3 – Add More Screens and Logic

Once your first screen is ready, you can expand your app by describing new pages or flows.
For example:

“Add a registration page similar to the login screen.”
“After successful sign up, show a confirmation page with the user’s email.”
“Link the navigation buttons between screens.”
Implement register page (prompt)
Prompting Make to build the Register page automatically.
Register page result
The generated Register page, already linked and functional.

Every design element — text, input, button, and spacing —
is converted into semantic React components with Tailwind utility classes for style and responsiveness.

Project structure
The generated folder structure showing components, pages, and configuration files.

🚀 Step 4 – Publish Your Web App

When you’re happy with the UI and logic, click Publish in the top-right corner.
Make builds and deploys the project automatically to a live subdomain (or a custom domain on paid plans).
Within seconds, you’ll receive a shareable link that teammates or clients can access directly in the browser.

Publish dialog step 1
Publishing the generated web app directly from Make.
Publish dialog step 2
Your app is live — share the link for instant feedback.
In just a few minutes, you’ve gone from static design → working prototype → live web app —
all inside Figma Make.

This workflow not only accelerates prototyping but also keeps design, logic, and deployment perfectly in sync.

✅ Conclusion

Figma Make dramatically shortens the path from idea to live product.
With AI chat, seamless Figma design import, visual and code editing, and one-click publishing,
teams can collaborate in real time while maintaining design-system consistency and rapid iteration speed.

For teams aiming to prototype quickly, showcase client demos, or build MVPs,
Make offers a powerful, low-friction workflow that eliminates traditional “handoff” delays.
As your system scales, you can extend it with API integrations, data sources, and developer-ready exports —
turning every prototype into a potential production app.

Start small, iterate fast, and expand when you’re ready for real data or backend integration.

Built a Real-Time Translator Web App Running a Local LLM on My Mac M1

🧠 I Built a Real-Time Translator Web App Running a Local LLM on My Mac M1

Recently, I had a small idea: to create a real-time speech translation tool for meetings, but instead of relying on online APIs, I wanted everything to run completely local on my Mac M1.
The result is a web demo that lets users speak into the mic → transcribe speech → translate in real-time → display bilingual subtitles on screen.
The average response time is about 1 second, which is fast enough for real-time conversations or meetings.


🎙️ How the App Works

The app follows a simple pipeline:

  1. SpeechRecognition in the browser converts voice into text.

  2. The text is then sent to a local LLM hosted via LM Studio for translation (e.g., English ↔ Vietnamese).

  3. The translated text is displayed instantly as subtitles on the screen.

My goal was to experiment with real-time translation for live meetings — for example, when someone speaks English, the listener can instantly see the Vietnamese subtitle (and vice versa).
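
On the web-app side, the translation step is just a POST to the local server. A minimal sketch, assuming LM Studio's OpenAI-compatible endpoint on its default local port and the model id described below, might look like:

# Send the browser's transcript to the local LM Studio server for translation.
# The port (1234) and exact model id are assumptions -- use whatever LM Studio
# shows when you start its local server.
import requests

def translate(text: str, target_lang: str = "Vietnamese") -> str:
    resp = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json={
            "model": "google/gemma-3-4b",
            "messages": [
                {"role": "system",
                 "content": f"Translate the user's sentence into {target_lang}. "
                            "Return only the translation."},
                {"role": "user", "content": text},
            ],
            "temperature": 0.2,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()

print(translate("The meeting starts in five minutes."))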


⚙️ My Setup and Model Choice

I’m using a Mac mini M1 with 16GB RAM and 12GB of available VRAM via Metal GPU.
After testing many small models — from 1B to 7B — I found that google/gemma-3-4b provides the best balance between speed, accuracy, and context awareness.

Key highlights of google/gemma-3-4b:

  • Average response time: ~1 second on Mac M1

  • 🧩 Context length: up to 131,072 tokens — allowing it to handle long conversations or paragraphs in a single prompt

  • 💬 Translation quality: natural and faithful to meaning

  • 🎯 Prompt obedience: follows structured prompts well, unlike smaller models that tend to drift off topic

I host the model using LM Studio, which makes running and managing local LLMs extremely simple.
With Metal GPU acceleration, the model runs smoothly without lag, even while the browser is processing audio in parallel.

🧰 LM Studio – Local LLMs Made Simple

One thing I really like about LM Studio is how simple it makes running local LLMs.
It’s a desktop app for macOS, Windows, and Linux that lets you download, run, and manage models without writing code, while still giving you powerful developer features.

Key features that made it perfect for my setup:

  • Easy installation: download the .dmg (for macOS) or installer for Windows/Linux and you’re ready in minutes.

  • Built-in model browser: browse models from sources like Hugging Face, choose quantization levels, and download directly inside the app.

  • Local & public API: LM Studio can launch a local REST API server with OpenAI-compatible endpoints (/v1/chat/completions, /v1/embeddings, etc.), which you can call from any app — including my translator web client.

  • Logs and performance monitoring: it displays live logs, token counts, generation speed, and resource usage (RAM, GPU VRAM, context window occupancy).

  • No coding required: once the model is loaded, you can interact through the built-in console or external scripts using the API — perfect for prototyping.

  • Ideal for local prototyping: for quick experiments like mine, LM Studio removes all setup friction — no Docker, no backend framework — just plug in your model and start testing.

Thanks to LM Studio, setting up the local LLM was nearly effortless.


🌐 About SpeechRecognition – It’s Still Cloud-Based

At first, I thought the SpeechRecognition API in browsers could work offline.
But in reality, it doesn’t:

On browsers like Chrome, SpeechRecognition (or webkitSpeechRecognition) sends the recorded audio to Google’s servers for processing.
As a result:

  • It can’t work offline

  • It depends on an internet connection

  • You don’t have control over the recognition engine

This means that while the translation part of my app runs entirely local, the speech recognition part still relies on an external service.

🧪 Real-World Test

To test the pipeline, I read a short passage from a fairy tale aloud.
The results were surprisingly good:

  • Subtitles appeared clearly, preserving the storytelling tone and rhythm of the original text.

  • No missing words as long as I spoke clearly and maintained a steady pace.

  • When I intentionally spoke too fast or slurred words, the system still kept up — but occasionally missed punctuation or merged phrases, something that could be improved with punctuation post-processing or a small buffering delay before sending text to the LLM.

Tips for smoother results:

  • Maintain a steady speaking rhythm, pausing naturally every 5–10 words.

  • Add punctuation normalization before rendering (or enable auto-punctuation when using Whisper).

  • Process short chunks (~2–3 seconds) and merge them for low latency and better context retention.

🧩 Some Demo Screenshots

📷 Image 1 – Web Interface:
User speaks into the microphone; subtitles appear in real time below, showing both the original and translated text.

📷 Image 2 – LM Studio:
google/gemma-3-4b running locally on Metal GPU inside LM Studio, showing logs and average response time.


🔭 Final Thoughts

This project is still a small experiment, but I’m truly impressed that a 4B parameter model running locally can handle real-time translation this well — especially with a 131K token context window, which allows it to keep track of long, coherent discussions.
With Whisper integrated locally, I believe it’s possible to build a fully offline real-time translation tool — useful for meetings, presentations, or any situation where data privacy matters.


✳️ In short:
If you’re looking for a small yet smart model that runs smoothly on a Mac M1 without a discrete GPU, I highly recommend trying google/gemma-3-4b with LM Studio.
Sometimes, a small but well-behaved model — with a huge context window — is all you need to unlock big ideas 🚀

Fine-Tuning GPT-OSS-20B on Google Colab Using Unsloth and LoRA

1. Introduction

In today’s rapidly advancing field of AI, the use of AI models — and, more specifically, running them on personal computers — has become more common than ever.
However, many models have become difficult to use locally because they are enormous, often containing tens of billions of parameters and requiring correspondingly large training resources.
This makes it nearly impossible for low-end computers to run them effectively for work or projects.

Therefore, in this article, we will explore Google Colab together with Unsloth’s fine-tuning tool, combined with LoRA, to fine-tune and use gpt-oss-20b according to our own needs.


2. Main Content

a. What is Unsloth?

  • Unsloth is a modern Python library designed to speed up and optimize the fine-tuning of large language models (LLMs) such as LLaMA, Mistral, Mixtral, and others.
    It makes model training and fine-tuning extremely fast, memory-efficient, and easy — even on limited hardware like a single GPU or consumer-grade machines.

b. What is Colab?

  • Colab is a hosted Jupyter Notebook service that requires no setup and provides free access to computing resources, including GPUs and TPUs.
    It is particularly well-suited for machine learning, data science, and education purposes.

c. What is LoRA?

  • Low-Rank Adaptation (LoRA) is a technique for quickly adapting machine learning models to new contexts.
    LoRA helps make large and complex models more suitable for specific tasks. It works by adding lightweight layers to the original model rather than modifying the entire architecture.
    This allows developers to quickly expand and specialize machine learning models for various applications.
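
As a rough numerical sketch of that idea (shapes chosen only for illustration), the adapted layer computes the frozen weight plus a scaled low-rank product, and only the two small matrices are trained:

# Tiny LoRA sketch: instead of updating the full weight matrix W, train two
# small matrices A (r x d_in) and B (d_out x r) and add their product.
import torch

d_out, d_in, r, alpha = 1024, 1024, 8, 16
W = torch.randn(d_out, d_in)              # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01           # trainable, small random init
B = torch.zeros(d_out, r)                 # trainable, starts at zero

W_effective = W + (alpha / r) * (B @ A)   # what the adapted layer computes

full = W.numel()
lora = A.numel() + B.numel()
print(f"trainable params: {lora} vs {full} ({100 * lora / full:.2f}%)")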

3. Using Colab to Train gpt-oss-20b

– Installing the Libraries

!pip install --upgrade -qqq uv

try:
    import numpy
    install_numpy = f"numpy=={numpy.__version__}"
except ImportError:
    install_numpy = "numpy"

!uv pip install -qqq \
  "torch>=2.8.0" "triton>=3.4.0" {install_numpy} \
  "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
  "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
  torchvision bitsandbytes \
  git+https://github.com/huggingface/[email protected] \
  git+https://github.com/triton-lang/triton.git@05b2c186c1b6c9a08375389d5efe9cb4c401c075#subdirectory=python/triton_kernels

– After completing the installation, load the gpt-oss-20b model from Unsloth:

from unsloth import FastLanguageModel
import torch

max_seq_length = 1024
dtype = None
model_name = "unsloth/gpt-oss-20b"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    dtype = dtype,                 # None for auto detection
    max_seq_length = max_seq_length,  # Choose any for long context!
    load_in_4bit = True,           # 4 bit quantization to reduce memory
    full_finetuning = False,       # [NEW!] We have full finetuning now!
    # token = "hf_...",            # use one if using gated models
)
Colab install output

– Adding LoRA for Fine-Tuning

model = FastLanguageModel.get_peft_model(
    model,
    r = 8,  # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,              # Optimized fast path
    bias = "none",                 # Optimized fast path
    # "unsloth" uses less VRAM, fits larger batches
    use_gradient_checkpointing = "unsloth",  # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
Tip: If you hit out-of-memory (OOM), reduce max_seq_length, set a smaller r, or increase gradient_accumulation_steps.

– Testing the Model Before Fine-Tuning

Now, let’s test how the model responds before fine-tuning:

messages = [
    {"role": "system", "content": "Bạn là Shark B, một nhà đầu tư nổi tiếng, thẳng thắn và thực tế", "thinking": None},
    {"role": "user", "content": "Bạn hãy giới thiệu bản thân"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "low",
).to(model.device)

from transformers import TextStreamer
_ = model.generate(**inputs, max_new_tokens = 512, streamer = TextStreamer(tokenizer))
Generation preview

– Loading the Data for Fine-Tuning

Dataset sample

Dataset preview

def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

from datasets import load_dataset
dataset = load_dataset("json", data_files="data.jsonl", split="train")
dataset
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True)
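
For reference, a single line of data.jsonl in the messages format that standardize_sharegpt and the chat template expect could look like the following (the assistant reply here is invented purely for illustration):

{"messages": [{"role": "system", "content": "Bạn là Shark B, một nhà đầu tư nổi tiếng, thẳng thắn và thực tế"}, {"role": "user", "content": "Bạn hãy giới thiệu bản thân"}, {"role": "assistant", "content": "Chào bạn, tôi là Shark B..."}]}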

– Training the Model

The following code snippet defines the configuration and setup for the fine-tuning process.
Here, we use SFTTrainer and SFTConfig from the trl library to perform Supervised Fine-Tuning (SFT) on our model.
The configuration specifies parameters such as batch size, learning rate, optimizer type, and number of training epochs.

from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1,  # Set this for 1 full training run.
        # max_steps = 30,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",  # Use this for WandB etc.
    ),
)

trainer_stats = trainer.train()
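
Before reloading, save the LoRA adapters. A minimal sketch using the standard save_pretrained methods, with the directory name matching the reload example below:

model.save_pretrained("finetuned_model")      # saves the LoRA adapter weights
tokenizer.save_pretrained("finetuned_model")  # keeps the chat template alongside them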

– After training, try the fine-tuned model

# Example reload (set to True to run)
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "finetuned_model",  # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = 1024,
        dtype = None,
        load_in_4bit = True,
    )

    messages = [
        {"role": "system", "content": "Bạn là Shark B, một nhà đầu tư nổi tiếng, thẳng thắn và thực tế", "thinking": None},
        {"role": "user", "content": "Bạn hãy giới thiệu bản thân"},
    ]

    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt = True,
        return_tensors = "pt",
        return_dict = True,
        reasoning_effort = "low",
    ).to(model.device)

    from transformers import TextStreamer
    _ = model.generate(**inputs, max_new_tokens = 512, streamer = TextStreamer(tokenizer))
Note: Replace finetuned_model with your actual model path (e.g., outputs or the directory you saved/merged adapters to).

Colab notebook: Open your Colab here.


4. Conclusion & Next Steps

By combining Unsloth (for speed and memory efficiency), LoRA (for lightweight adaptation), and Google Colab (for accessible compute), you can fine-tune gpt-oss-20b even on modest hardware. The workflow above helps you:

  • Install a reproducible environment with optimized kernels.
  • Load gpt-oss-20b in 4-bit to reduce VRAM usage.
  • Attach LoRA adapters to train only a small set of parameters.
  • Prepare chat-style datasets and run supervised fine-tuning with TRL’s SFTTrainer.
  • Evaluate before/after to confirm your improvements.
Open the Colab
Clone the notebook, plug in your dataset, and fine-tune your own assistant in minutes.

Codex CLI vs Gemini CLI vs Claude Code

1. Codex CLI – Capabilities and New Features

According to OpenAI’s official announcement (“Introducing upgrades to Codex”), Codex CLI has been rebuilt on top of GPT-5-Codex, turning it into an agentic programming assistant — a developer AI that can autonomously plan, reason, and execute tasks across coding environments.

🌟 Core Abilities

  • Handles both small and large tasks: From writing a single function to refactoring entire projects.
  • Cross-platform integration: Works seamlessly across terminal (CLI), IDE (extension), and cloud environments.
  • Task reasoning and autonomy: Can track progress, decompose goals, and manage multi-step operations independently.
  • Secure by design: Runs in a sandbox with explicit permission requests for risky operations.

📈 Performance Highlights

  • Uses 93.7% fewer reasoning tokens for simple tasks, but invests 2× more computation on complex ones.
  • Successfully ran over 7 hours autonomously on long software tasks during testing.
  • Produces more precise code reviews than older Codex versions.

🟢 In short: Codex CLI 2025 is not just a code generator — it’s an intelligent coding agent capable of reasoning, multitasking, and working securely across terminal, IDE, and cloud environments.

2. Codex CLI vs Gemini CLI vs Claude Code: The New Era of AI in the Terminal

The command line has quietly become the next frontier for artificial intelligence.
While graphical AI tools dominate headlines, the real evolution is unfolding inside the terminal — where AI coding assistants now operate directly beside you, as part of your shell workflow.

Three major players define this new space: Codex CLI, Gemini CLI, and Claude Code.
Each represents a different philosophy of how AI should collaborate with developers — from speed and connectivity to reasoning depth. Let’s break down what makes each contender unique, and where they shine.


🧩 Codex CLI — OpenAI’s Code-Focused Terminal Companion

Codex CLI acts as a conversational layer over your terminal.
It listens to natural language commands, interprets your intent, and translates it into executable code or shell operations.
Now powered by OpenAI’s Codex5-Medium, it builds on the strengths of the o4-mini generation while adding adaptive reasoning and a larger 256K-token context window.

Once installed, Codex CLI integrates seamlessly with your local filesystem.
You can type:

“Create a Python script that fetches GitHub issues and logs them daily,”
and watch it instantly scaffold the files, import the right modules, and generate functional code.

Codex CLI supports multiple languages — Python, JavaScript, Go, Rust, and more — and is particularly strong at rapid prototyping and bug fixing.
Its defining trait is speed: responses feel immediate, making it perfect for fast iteration cycles.

Best for: developers who want quick, high-quality code generation and real-time debugging without leaving the terminal.


🌤️ Gemini CLI — Google’s Adaptive Terminal Intelligence

Gemini CLI embodies Google’s broader vision for connected AI development — blending reasoning, utility, and live data access.
Built on Gemini 2.5 Pro, this CLI isn’t just a coding bot — it’s a true multitool for developers and power users alike.

Beyond writing code, Gemini CLI can run shell commands, retrieve live web data, or interface with Google Cloud services.
It’s ideal for workflows that merge coding with external context — for example:

  • fetching live API responses,

  • monitoring real-time metrics,

  • or updating deployment configurations on-the-fly.

Tight integration with VS Code, Google Cloud SDK, and Workspace tools turns Gemini CLI into a full-spectrum AI companion rather than a mere code generator.

Best for: developers seeking a versatile assistant that combines coding intelligence with live, connected utility inside the terminal.


🧠 Claude Code — Anthropic’s Deep Code Reasoner

If Codex is about speed, and Gemini is about connectivity, Claude Code represents depth.
Built on Claude Sonnet 4.5, Anthropic’s upgraded reasoning model, Claude Code is designed to operate as a true engineering collaborator.

It excels at understanding, refactoring, and maintaining large-scale codebases.
Claude Code can read entire repositories, preserve logic across files, and even generate complete pull requests with human-like commit messages.
Its upgraded 250K-token context window allows it to track dependencies, explain architectural patterns, and ensure code consistency over time.

Claude’s replies are more analytical — often including explanations, design alternatives, and justifications for each change.
It trades a bit of speed for a lot more insight and reliability.

Best for: professional engineers or teams managing complex, multi-file projects that demand reasoning, consistency, and full-codebase awareness.

3. Codex CLI vs Gemini CLI vs Claude Code: Hands-on With Two Real Projects

While benchmarks and specs are useful, nothing beats actually putting AI coding agents to work.
To see how they perform on real, practical front-end tasks, I tested three leading terminal assistants — Codex CLI (Codex5-Medium), Gemini CLI (Gemini 2.5 Pro), and Claude Code (Sonnet 4.5) — by asking each to build two classic web projects using only HTML, CSS, and JavaScript.

  • 🎮 Project 1: Snake Game — canvas-based, pixel-style, smooth movement, responsive.

  • Project 2: Todo App — CRUD features, inline editing, filters, localStorage, dark theme, accessibility + keyboard support.

🎮 Task 1 — Snake Game

Goal

Create a playable 2D Snake Game using HTML, CSS, and JavaScript.
Display a grid-based canvas with a moving snake that grows when it eats food.
The snake should move continuously and respond to arrow-key inputs.
The game ends when the snake hits the wall or itself.
Include a score counter and a restart button with pixel-style graphics and responsive design.

Prompt

Create a playable 2D Snake Game using HTML, CSS, and JavaScript.
  The game should display a grid-based canvas with a moving snake that grows when it eats food.
  The snake should move continuously and respond to keyboard arrow keys for direction changes.
  The game ends when the snake hits the wall or itself.
  Show a score counter and a restart button.
  Use smooth movement, pixel-style graphics, and responsive design for different screen sizes.

Observations

Codex CLI — Generated the basic canvas scaffold in seconds. Game loop, input, and scoring worked out of the box, but it required minor tuning for smoother turning and anti-reverse logic.

Gemini CLI — Delivered well-structured, commented code and used requestAnimationFrame properly. Gameplay worked fine, though the UI looked plain — more functional than fun.

Claude Code — Produced modular, production-ready code with solid collision handling, restart logic, and a polished HUD. Slightly slower response but the most complete result overall.

✅ Task 2 — Todo App

Goal

Build a complete, user-friendly Todo List App using only HTML, CSS, and JavaScript (no frameworks).
Features: add/edit/delete tasks, mark complete/incomplete, filter All / Active / Completed, clear completed, persist via localStorage, live counter, dark responsive UI, and full keyboard accessibility (Enter/Space/Delete).
Deliverables: index.html, style.css, app.js — clean, modular, commented, semantic HTML + ARIA.

Prompt

Develop a complete and user-friendly Todo List App using only HTML, CSS, and JavaScript (no frameworks). The app should include the following functionality and design requirements:

    1. Input field and ‘Add’ button to create new tasks.
    2. Ability to mark tasks as complete/incomplete via checkboxes.
    3. Inline editing of tasks by double-clicking — pressing Enter saves changes and Esc cancels.
    4. Delete buttons to remove tasks individually.
    5. Filter controls for All, Active, and Completed tasks.
    6. A ‘Clear Completed’ button to remove all completed tasks at once.
    7. Automatic saving and loading of todos using localStorage.
    8. A live counter showing the number of active (incomplete) tasks.
    9. A modern, responsive dark theme UI using CSS variables, rounded corners, and hover effects.
    10. Keyboard accessibility — Enter to add, Space to toggle, Delete to remove tasks.
      Ensure the project is well structured with three separate files:
    • index.html
    • style.css
    • app.js
      Code should be clean, modular, and commented, with semantic HTML and appropriate ARIA attributes for accessibility.

Observations

Codex CLI — Created a functional 3-file structure with working CRUD, filters, and persistence. Fast, but accessibility and keyboard flows needed manual reminders.

Gemini CLI — Balanced logic and UI nicely. Used CSS variables for a simple dark theme and implemented localStorage properly.
Performance was impressive — Gemini was the fastest overall, but its default design felt utilitarian, almost as if it “just wanted to get the job done.”
Gemini focuses on correctness and functionality rather than visual finesse.

Claude Code — Implemented inline editing, keyboard shortcuts, ARIA live counters, and semantic roles perfectly. The result was polished, responsive, and highly maintainable.

4. Codex CLI vs Gemini CLI vs Claude Code — Real-World Comparison

When testing AI coding assistants, speed isn’t everything — clarity, structure, and the quality of generated code all matter. To see how today’s top command-line tools compare, I ran the same set of projects across Claude Code, Gemini CLI, and Codex CLI, including a 2D Snake Game and a Todo List App.
Here’s how they performed.


Claude Code: Polished and Reliable

Claude Code consistently produced the most professional and complete results.
Its generated code came with clear structure, organized logic, and well-commented sections.
In the Snake Game test, Claude built the best-looking user interface, with a balanced layout, responsive design, and smooth movement logic.
Error handling was handled cleanly, and the overall experience felt refined — something you could hand over to a production team with confidence.
Although it wasn’t the fastest, Claude made up for it with code quality, structure, and ease of prompt engineering.
If your workflow values polish, maintainability, and readability, Claude Code is the most dependable choice.


Gemini CLI: Fastest but Basic

Gemini CLI clearly took the top spot for speed.
It executed quickly, generated files almost instantly, and made iteration cycles shorter.
However, the output itself felt minimal and unrefined — both the UI and the underlying logic were quite basic compared to Claude or Codex.
In the Snake Game task, Gemini produced a playable result but lacked visual polish and consistent structure.
Documentation and comments were also limited.
In short, Gemini is great for rapid prototyping or testing ideas quickly, but not for projects where you need beautiful UI, advanced logic, or long-term maintainability.


Codex CLI: Flexible but Slower

Codex CLI offered good flexibility and handled diverse prompts reasonably well.
It could generate functional UIs with decent styling, somewhere between Gemini’s simplicity and Claude’s refinement.
However, its main drawback was speed — responses were slower, and sometimes additional manual intervention was needed to correct or complete the code.
Codex is still a solid option when you need to tweak results manually or explore multiple implementation approaches, but it doesn’t match Claude’s polish or Gemini’s speed.


Overall Impression

After testing multiple projects, the overall ranking became clear:

  • Gemini CLI is the fastest but produces simple and unpolished code.

  • Claude Code delivers the most reliable, structured, and visually refined results.

  • Codex CLI sits in between — flexible but slower and less cohesive.

Each tool has its strengths. Gemini is ideal for quick builds, Codex for experimentation, and Claude Code for professional, trust-ready outputs.

In short:

Gemini wins on speed. Claude wins on quality. Codex stands in between — flexible but slower.

Automatically Generate Google Slides with an AI Prompt

I came across a very interesting idea from the author まじん (Majin) on note.com:

Majin used Gemini to turn a single prompt into a complete Google Slides presentation, but I tried customizing it to run with ChatGPT (or Google AI Studio), and the results were quite exciting.


1. 🔍 Structure of Majin’s Prompt

Through analysis, Majin’s prompt has the following main components:

  • Role assignment for AI: The AI is not just a chatbot, but acts as a Data Scientist and Presentation Designer.

  • Absolute mission: From a piece of input text, the AI must output a JavaScript object array (slideData) that defines the entire slide structure.

  • Fixed blueprint: The prompt includes a Google Apps Script (GAS) framework, where you only need to replace slideData to generate Google Slides.

  • slideData includes:

    • title: Slide title

    • content: Content (in the form of bullets, text, or tables)

    • pattern: Display type (Title, List, TwoColumn, Image, …)

    • speakerNote: Presenter’s notes

👉 The important point: The prompt does not directly create slides, but outputs JSON-like data that the GAS script uses to build the slides.


2. 🛠️ Customizing & Running with ChatGPT

First, download system prompt.md

🔧 Steps inside the Project

  1. Create a new Project → give it a name (e.g., Google Slide Auto Generator).

  2. Add a file → upload system prompt.md.

  3. In the Project Settings / Overview screen, you will find the Instructions (System message) box → paste the line:
    添付した「system prompt.md」に忠実に従ってください。

  4. (Meaning: Please strictly follow the attached system prompt.md).
    If you don’t see a dedicated “Instructions” field, ChatGPT may show “Description” or “Setup” instead → paste it there. It works as the System Prompt for the whole Project.

  5. From now on, in that Project, whenever you input raw text, ChatGPT will immediately reference system prompt.md and output both slideData + GAS code.

👉 In other words: the Project’s Instruction acts as a dedicated System Message for that workspace. The UI may look different, but it’s always under “Overview/Setup” when you open the Project.

Next, type a prompt about the content you want to make slides from → and it will generate slideData.

I tested it with content about LLMs and got the following result:

const slideData = [
  {
    type: 'title',
    title: "The Busy Person's Intro to Large Language Models",
    date: '2025.09.21',
    notes: 'Re-recorded overview of LLMs focusing on what they are, how to run them, how they are trained, and why next-word prediction works'
  },
  {
    type: 'content',
    title: 'Agenda',
    points: [
      'What is an LLM',
      'Running a model locally',
      'Training at scale',
      'Why next-word prediction is powerful',
      'Key takeaways'
    ],
    notes: 'Keep it practical and concrete. We start with the simplest mental model, then show inference vs training, then the intuition for why LLMs encode knowledge'
  },
  {
    type: 'section',
    title: '1. What is an LLM',
    notes: 'Define an LLM at the simplest level and frame the talk around artifacts you can hold on disk'
  },
  {
    type: 'content',
    title: 'LLM = Two Files',
    subhead: 'Weights + Runner',
    points: [
      'Parameters file holds learned weights',
      'Runner implements forward pass of the network',
      'Self contained package runs offline'
    ],
    notes: 'Stress that inference can be simple and local. The interesting part lives in the parameters learned during training'
  },
  {
    type: 'content',
    title: 'Example: Llama 2 • 70B',
    subhead: 'Open weights from Meta',
    points: [
      'Sizes include 7B 13B 34B 70B',
      'Open weights and paper enable local study',
      'Closed APIs trade control for convenience'
    ],
    notes: 'Contrast open weights with closed APIs like hosted assistants without exposing internals'
  },
  {
    type: 'section',
    title: '2. Running a Model',
    notes: 'Give a feel for local inference and why the demo used a smaller model for speed'
  },
  {
    type: 'process',
    title: 'Local Inference Flow',
    subhead: 'From files to generations',
    steps: [
      'Get weights file on disk',
      'Compile or install a runner',
      'Point runner to weights',
      'Send a prompt',
      'Stream generated tokens'
    ],
    notes: 'Mention that a 7B model is fast enough on a laptop while 70B is roughly an order of magnitude slower'
  },
  {
    type: 'compare',
    title: 'Open Weights vs Closed API',
    leftTitle: 'Open Weights',
    rightTitle: 'Closed API',
    leftItems: [
      'Full local control',
      'Offline capable',
      'Tunable and inspectable'
    ],
    rightItems: [
      'Best quality access',
      'No infra to manage',
      'Faster to integrate'
    ],
    notes: 'There is no universal winner. Choose based on control privacy latency and capability needs'
  },
  {
    type: 'section',
    title: '3. Training at Scale',
    notes: 'Explain where weights come from and why training is expensive compared to inference'
  },
  {
    type: 'content',
    title: 'Training Ingredients',
    subhead: 'Data Compute Cost',
    points: [
      'Large curated text corpus',
      'Massive GPU clusters for days',
      'Significant budget for runs'
    ],
    notes: 'Use order of magnitude thinking rather than exact numbers. Emphasize engineering and data work beyond raw compute'
  },
  {
    type: 'timeline',
    title: 'From Corpus to Weights',
    subhead: 'Very high level lifecycle',
    milestones: [
      { label: 'Collect and filter data', date: 'Step 1', state: 'done' },
      { label: 'Train on GPU cluster', date: 'Step 2', state: 'done' },
      { label: 'Evaluate and align', date: 'Step 3', state: 'done' },
      { label: 'Release weights and runner', date: 'Step 4', state: 'next' }
    ],
    notes: "Keep the audience at the bird's eye view. The output is a weights artifact you can reuse everywhere for inference"
  },
  {
    type: 'section',
    title: '4. Why It Works',
    notes: 'Bridge from mechanics to intuition about capability'
  },
  {
    type: 'content',
    title: 'Next Word Prediction Objective',
    subhead: 'Simple objective rich consequences',
    points: [
      'Model predicts a distribution over next token',
      'Good prediction forces world knowledge',
      'Patterns styles and relations get compressed'
    ],
    notes: 'Relate prediction to lossy compression intuition. To predict the next token the model must encode structure facts and relationships'
  },
  {
    type: 'content',
    title: 'Common Misconceptions',
    points: [
      'It is not copy paste of the web',
      'Bigger is not always better',
      'Closed is not always better than open'
    ],
    notes: 'Clarify generalization vs memorization and the role of data quality and algorithms along with scale'
  },
  {
    type: 'section',
    title: '5. Takeaways',
    notes: 'Land the plane with pragmatic guidance'
  },
  {
    type: 'content',
    title: 'Practical Guidance',
    points: [
      'Choose model size for latency and cost',
      'Use open weights for control',
      'Use hosted APIs for frontier quality'
    ],
    notes: 'Map choices to product constraints such as privacy cost latency and maintainability'
  },
  {
    type: 'closing',
    notes: 'Thanks for watching. Invite questions and link resources in the description'
  }
];

Next steps

  1. Open the .md file on your computer.

  2. Copy the code from the .md file.

  3. Open Google Slides → create a new slide deck.

  4. Click Extensions → Apps Script.

  5. Paste the copied code into the Apps Script editor.

  6. Paste your slideData.

  7. Press Run


3. ✅ Experience & Results

 

  • Works well on ChatGPT: No need for Gemini, GPT-5 is enough.

  • Advantage: The prompt standardizes the output into a JSON-like object, making it easy to control.

  • Reference implementation:


📌 Conclusion

  • Majin’s prompt is a great framework to turn AI into an automatic slide design tool.

  • It doesn’t have to be Gemini — ChatGPT (GPT-5) also works well.

  • You just need to customize the input → and you can generate Google Slides for any topic (training, pitching, learning…).

👉 This article was written with reference to blogs by まじん (Majin):

Trying the Realtime Prompting Guide for GPT-Realtime: Experiments with Vietnamese Voice Input

Try Realtime Prompting Guide for GPT-Realtime

1. Introduction

OpenAI’s Realtime API enables the creation of interactive voice experiences with ultra-low latency. Instead of waiting for a full text input, the model can “listen” to a user while they are still speaking and respond almost instantly. This makes it a powerful foundation for building voice assistants, audio chatbots, automated customer support, or multimodal creative applications.

To get the best results, writing a clear and well-structured prompt is essential. OpenAI published the Realtime Prompting Guide as a playbook for controlling model behavior in spoken conversations.

References:


2. What is GPT-Realtime

GPT-Realtime is a model/API designed to handle continuous audio input and provide rapid responses. Its key features include:

  • Real-time speech-to-text recognition.

  • Robust handling of noisy, cut-off, or unclear audio.

  • Customizable reactions to imperfect audio, such as asking for repetition, clarifying, or continuing in the user’s language.

  • Support for detailed prompting to ensure safe, natural, and reliable responses.


3. Overview of the Prompting Guide

The Realtime Prompting Guide outlines seven best practices for writing system prompts for voice agents:

1. Be precise, avoid conflicts.
Instructions must be specific and consistent. For example, if you say “ask again when unclear,” don’t also instruct the model to “guess when unsure.”

2. Use bullet points instead of paragraphs.
Models handle lists better than long prose.

3. Handle unclear audio.
Explicitly instruct what to do when input is noisy or incomplete: politely ask the user to repeat and only respond when confident.

4. Pin the language when needed.
If you want the entire conversation in one language (e.g., English only), state it clearly. Otherwise, the model may switch to mirror the user.

5. Provide sample phrases.
Include example greetings, clarifications, or closing lines to teach the model your desired style.

6. Avoid robotic repetition.
Encourage varied phrasing for greetings, confirmations, and closings to keep interactions natural.

7. Use capitalization for emphasis.
For example: “IF AUDIO IS UNCLEAR, ASK THE USER TO REPEAT.”
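
To show where such a system prompt actually lives in code, here is a minimal, unofficial sketch of configuring a Realtime session over WebSocket. The endpoint, model id, headers, and session.update event shape are assumptions drawn from OpenAI's Realtime API documentation, so verify them against the current reference before relying on them:

# Attach a system prompt to a Realtime session via the "session.update" event.
import asyncio
import json
import os

import websockets  # pip install websockets (>=14 uses additional_headers; older versions use extra_headers)

SYSTEM_PROMPT = (
    "- If the audio is not clear, ask the user to repeat.\n"
    "- Only respond when you are confident you understood the request.\n"
    "- Continue the conversation in the same language as the user.\n"
    "- IF AUDIO IS UNCLEAR, ASK THE USER TO REPEAT."
)

async def configure_session() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime"  # assumed model id
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",  # beta header; may no longer be required
    }
    async with websockets.connect(url, additional_headers=headers) as ws:
        # session.update carries the instructions (system prompt) for the voice agent
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"instructions": SYSTEM_PROMPT},
        }))
        print(json.loads(await ws.recv()))  # server replies with session events

if __name__ == "__main__":
    asyncio.run(configure_session())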

4. Prompt Examples

Sample Prompt A – Avoid conflicts, be clear

SYSTEM:
- Always speak clearly and respond in the same language the user is speaking.
- If the user’s audio is unclear, noisy, partial, or unintelligible, politely ask them to repeat.
- Never assume what was said if you did not hear it clearly.
- Use short, varied phrases; avoid repeating the same sentence.
- Provide helpful, concise responses.

USER: “Hello, can you help me with my internet issue?”

ASSISTANT: (responds according to the prompt, asks for clarification if needed)


Sample Prompt B – Handling unclear audio

SYSTEM:
- If the audio is not clear, ask: “I’m sorry, I didn’t catch that. Could you please repeat?”
- Only respond when you are confident you understood the user’s request.
- Continue conversation in the same language as the user.
- Use bullet points for clarity.

USER: “Um… internet…” (noisy, unclear audio)

ASSISTANT: (follows the system instructions)


Sample Prompt C – Keep a natural style, avoid repetition

SYSTEM:
- Do not respond with the same phrase twice.
- Vary greetings, confirmations, closings.
- Aim for a warm, helpful tone.
- If the user ends the conversation, say something like: “If there’s anything else you need, feel free to ask.”

USER: “Thank you, that’s all.”

ASSISTANT: “You’re welcome! Glad I could help. Take care!” (or another variation)

5. Experiments

For my testing, I deliberately used Vietnamese speech to see how the model would react in different situations.

First Test: Speaking Unclear Without a Prompt

To begin, I tested what would happen if I spoke unclearly in Vietnamese without providing any system prompt.

For example, I said:

USER: “Ư… mệnh của mình không tỏa lại” (unclear Vietnamese speech)

The model responded in Indonesian, saying:

ASSISTANT: “Tentu, aku di sini untuk membantu. Coba ceritakan dulu apa yang lagi kamu rasakan atau alami. Kita bisa cari solusinya bareng-bareng.” (Roughly: “Of course, I’m here to help. Tell me what you’re feeling or going through. We can find a solution together.”)

This shows that when no system prompt is defined, the model may guess or switch languages unpredictably, instead of asking for clarification.

Second Test: Adding a System Prompt

Next, I added a system prompt to guide the model’s behavior when the audio is unclear:

SYSTEM:
- If the audio is not clear, ask: “I’m sorry, I didn’t catch that. Could you please repeat?”
- Only respond when you are confident you understood the user’s request.
- Continue conversation in the same language as user.
- Use bullet points for clarity.

Then I spoke unclearly in Vietnamese again, for example:

USER: “Um… internet…” (spoken quietly, unclear audio)

This time, the model followed the system instructions and politely asked me to repeat. Sometimes, it even suggested that I try saying a simple test sentence so it could better check whether my voice was coming through clearly.

This shows how a well-written system prompt can prevent the model from making random guesses or switching languages, ensuring a more reliable and natural conversation flow.

Third Test: Singing to the Model

Finally, I experimented by singing in Vietnamese to see how the model would react.

The model listened carefully and was able to understand the lyrics and emotional tone of my singing. However, when I asked it to repeat the lyrics back to me, it refused due to copyright restrictions.

This shows that while GPT-Realtime can analyze and comment on songs — such as summarizing their meaning, describing the mood, or suggesting new lines in a similar style — it cannot reproduce lyrics verbatim. In practice, this means you should not expect the model to sing or echo back copyrighted content.

6. Conclusion

GPT-Realtime provides smooth and natural voice interactions with minimal latency. However, its effectiveness depends heavily on the prompt.

Key takeaways:

  • Always write a clear, bullet-pointed system prompt.
  • Define explicit behavior for unclear audio.
  • Control language use and discourage robotic repetition.
  • Respect copyright limitations: the model will not repeat lyrics verbatim but can summarize or create new content.

The Realtime Prompting Guide is a practical resource for building high-quality voice agents that are both natural and safe.

Installing and Using GPT-OSS 20B Locally with Ollama

In this document, we will explore how to install and run GPT-OSS 20B — a powerful open-weight language model released by OpenAI — locally, with detailed instructions for using it on a Tesla P40 GPU.

1. Quick Introduction to GPT-OSS 20B

  • GPT-OSS 20B is an open-weight language model from OpenAI, released in August 2025—the first since GPT-2—under the Apache 2.0 license, allowing free download, execution, and modification.

  • The model has about 21 billion parameters and can run efficiently on consumer machines with at least 16 GB of RAM or GPU VRAM.

  • GPT-OSS 20B uses a Mixture-of-Experts (MoE) architecture, activating only a subset of parameters (~3.6B) at each step, saving resources and energy.

  • The model supports chain-of-thought reasoning, enabling it to understand and explain reasoning processes step by step.


2. Hardware & Software Preparation

Hardware requirements:

  • RAM or VRAM: minimum 16 GB (can be system RAM or GPU VRAM).

  • Storage: around 12–20 GB for the model and data.

  • Operating system: macOS 11+, Windows, or Ubuntu are supported.

  • GPU (if available): Nvidia or AMD for acceleration. Without a GPU, the model still runs on CPU but very slowly.

Software options:

  • Ollama: the simplest method; quick installation with a convenient CLI.

  • LM Studio: a graphical interface, suitable for beginners.

  • Transformers + vLLM (Python): flexible for developers, integrates well into open-source pipelines.


3. How to Run GPT-OSS 20B with Ollama (GPU Tesla P40)

3.1 Goal and Timeline

  • Goal: successfully run GPT-OSS 20B locally using Ollama, leveraging the Tesla P40 GPU (24GB VRAM).

  • Timeline: the first setup takes about 15–20 minutes to download the model. After that, launching the model takes only a few seconds.

3.2 Environment Preparation

  • GPU: Tesla P40 with 24GB VRAM, sufficient for GPT-OSS 20B.

  • NVIDIA Driver: version 525 or higher recommended. In the sample logs, CUDA 12.0 works fine.

  • RAM: minimum 16GB.

  • Storage: at least 20GB free space; the model itself takes ~13GB plus cache.

  • Operating system: Linux (Ubuntu), macOS, or Windows. The following example uses Ubuntu.

3.3 Install Ollama

The fastest way:

curl -fsSL https://ollama.com/install.sh | sh

Or manually (Linux):

curl -LO https://ollama.com/download/ollama-linux-amd64.tgz
sudo tar -C /usr -xzf ollama-linux-amd64.tgz

Start the Ollama service:

OLLAMA_HOST=0.0.0.0:8888 ollama serve

When the log shows listening on [::]:8888, the server is ready.

3.4 Download GPT-OSS 20B

Open a new terminal and run:

OLLAMA_HOST=0.0.0.0:8888 ollama pull gpt-oss:20b

The first download is about 13GB. When the log shows success, the model is ready.

3.5 Run the Model

Start the model and try chatting:

OLLAMA_HOST=0.0.0.0:8888 ollama run gpt-oss:20b

Example:

>>> hi
Hello! 👋 How can I help you today?

3.6 Verify GPU Usage

Run:

nvidia-smi

Result: the Tesla P40 (24GB) consumes around 12–13GB of VRAM for the /usr/bin/ollama process. The Ollama log also shows “offloading output layer to GPU” and “llama runner started in 8.05 seconds”, confirming that the model is running on the GPU rather than the CPU.

3.7 Monitor API and Performance

Ollama exposes a REST API at http://127.0.0.1:8888.
Common endpoints include /api/chat and /api/generate.
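
For example, here is a minimal Python sketch of calling the chat endpoint (assuming the requests library is installed and the server started earlier is listening on port 8888):

import requests

# Call Ollama's /api/chat endpoint for the locally pulled gpt-oss:20b model.
resp = requests.post(
    "http://127.0.0.1:8888/api/chat",
    json={
        "model": "gpt-oss:20b",
        "messages": [{"role": "user", "content": "Explain Mixture-of-Experts in one sentence."}],
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])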

Response times:

  • Short prompts: about 2–10 seconds.

  • Long or complex prompts: may take tens of seconds to a few minutes.


4. Conclusion

You have successfully run GPT-OSS 20B on a Tesla P40. The initial model download takes some time, but afterward it launches quickly and runs stably. With 24GB VRAM, the GPU can handle the large model without overload. While long prompts may still be slow, it is fully usable for real-world experiments and local project integration.

Running Latent Diffusion Model on Regular Computers via Google Colab and Ngrok

Introduction

In recent years, diffusion models such as the Latent Diffusion Model (LDM) have become the gold standard for text-to-image generation thanks to their high image quality, fast inference speed, and flexible fine-tuning capabilities. However, the biggest barrier for beginners is often the expensive GPU hardware requirement. This article will guide you on how to run LDM on a regular computer by taking advantage of Google Colab—a cloud environment that provides free/affordable GPU access, allowing you to focus on your ideas instead of hardware setup.

Main Content

What is Colab?

Colab is a hosted Jupyter Notebook service that requires no setup to use and provides free access to computing resources, including GPUs and TPUs. Colab is especially well suited to machine learning, data science, and education.

What You Need to Prepare

  1. A Colab account – Getting one is very easy; just search for it and sign up. Colab can be used for free but with limits on memory, GPU, etc. You can also subscribe to a paid plan depending on your needs.

  2. An Ngrok account – Just like the first step, sign up for an account, then get your authtoken, which will be used later.

  3. An example LDM setup for this tutorial – In this article, we’ll use the Stable Diffusion Pipeline in Python. Hugging Face provides a list of text-to-image models ranging from weaker to stronger ones at: https://huggingface.co/models?pipeline_tag=text-to-image&sort=trending.
    Example repo for this article: https://github.com/thangnch/MiAI_StableDiffusion_Flask_Text2Image


Running on a Personal Computer

First, clone the above GitHub repo to your local machine. Then, using pip, install the required libraries such as PyTorch, diffusers, and Flask.

Next, run the svr_model.py file.

Depending on whether your personal computer has GPU support (CUDA or MPS), the model can run on GPU; otherwise, it defaults to CPU—which is much slower.
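
For context, here is a minimal sketch of that device-selection and serving logic, roughly what an svr_model.py-style Flask server does, assuming the diffusers StableDiffusionPipeline; the repo’s actual checkpoint, port, and route names may differ:

import torch
from diffusers import StableDiffusionPipeline
from flask import Flask, request, send_file

# Pick the best available device: NVIDIA CUDA, Apple MPS, or plain CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

# Hypothetical checkpoint; the repo may use a lighter or heavier model.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

app = Flask(__name__)

@app.route("/generate")
def generate():
    prompt = request.args.get("prompt", "Cat and dog")
    image = pipe(prompt).images[0]  # slow on CPU, much faster on a GPU
    image.save("output.png")
    return send_file("output.png", mimetype="image/png")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)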

Since my GPU doesn’t support NVIDIA CUDA, I had to run it on CPU.

After starting the server, the demo web page URL appears in the terminal.

Now we can generate an image from a prompt.


  • Speed: quite slow at 4.14s/it

  • Consumes a lot of CPU power

  • But still managed to generate an image successfully with the weakest model

This shows that although it is possible to run locally on your own machine, it is very slow and CPU-intensive, even with the smallest model.


Using Colab with Ngrok

  1. Visit Google Colab: https://colab.research.google.com/

  2. Visit Ngrok: https://dashboard.ngrok.com/get-started/setup/windows

In Colab, create a new notebook and switch the runtime to a GPU (Runtime → Change runtime type). Then:

  • Run a code cell to install all required libraries (already listed in the repo)

  • Copy the code from text2image_model.py to start running the model

  • Next, copy the code from svr_model.py

Before running svr_model.py, install Ngrok in the Colab environment by running another code cell (see the sketch after the steps below).

After installation:

  • Go to your Ngrok dashboard, copy your personal authtoken

  • Back in Colab, paste it into the Secrets section on the left sidebar, name it, and save
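
A minimal sketch of that Ngrok cell, assuming the pyngrok wrapper and a Flask server on port 5000 (the repo may instead invoke the ngrok binary directly, and the secret name here is hypothetical):

from google.colab import userdata  # reads values saved in the Secrets sidebar
from pyngrok import ngrok          # pip install pyngrok

# Authenticate with the token stored in Colab Secrets (hypothetical secret name).
ngrok.set_auth_token(userdata.get("NGROK_AUTHTOKEN"))

# Open a tunnel to the local Flask server and print the public URL.
public_url = ngrok.connect(5000)
print("Public URL:", public_url)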

Now run svr_model.py.


Ngrok will provide a temporary public URL (the tunnel) that forwards requests to your server running on the Colab GPU.

Visit the link, and you’ll get the text-to-image web interface.

Time to generate images!

Example:

  • Prompt: “Cat and dog” – With the lowest model

    • Very fast at 9.9s

    • GPU used effectively

Even with higher-level models, the process still runs smoothly.

  • Prompt: “City landscape” – Model level 6

    • Still stable and responsive


Conclusion

Through these experiments, we have seen how to run a Latent Diffusion Model from an ordinary personal computer, and how combining Colab’s GPU with an Ngrok tunnel turns a slow, CPU-bound local setup into a smooth and fast experience. Hopefully, this article will be helpful to readers.

Thank you for reading!

Exploring Claude Code Subagents: A Demo Setup for a RAG-Based Website Project

1. Introduction

Recently, Anthropic released an impressive new feature for Claude Code: subagents, secondary agents that each take on a specific task within a user’s project.

2. Main Content

a. How to Set It Up:
First, install Claude Code using the following command in your Terminal window:

npm install -g @anthropic-ai/claude-code

If Claude Code is already installed but is an older version, it won’t have the subagent feature. To update it, run:

claude update

Launch Claude Code in your working directory, then run the command:
/agents

Press Enter, and a management screen for agents will appear, allowing you to start creating agents with specific purposes for your project.
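
Under the hood, each subagent you create here is saved as a Markdown file with YAML frontmatter under .claude/agents/ in your project. A hypothetical example of such a file (the name matches the reviewer agent used later; the fields and tool list are illustrative):

---
name: production-code-reviewer
description: Review RAG system code for bugs, security issues, and style problems.
tools: Read, Grep, Glob
---
You are a senior code reviewer. Inspect the RAG website's code, flag bugs and
security issues, and suggest concrete fixes before changes are accepted.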

Here, I will set it up following Claude’s recommendation.

After the setup, I have the following subagents:

I will ask Claude to help me build a website using RAG with the following prompt:

The first subagents have started working.

The setup of the RAG project has been completed.

However, I noticed that the subagent ‘production-code-reviewer (Review RAG system code)’ didn’t run after the coding was completed. It might be an issue with my prompt, so I will ask Claude to review the code for me.

After the whole working process, Claude Code will deliver an excellent final product.
Link: https://github.com/mhieupham1/claudecode-subagent

3. Conclusion

Through the entire setup process and practical use in a project, it’s clear how powerful and beneficial the subagents feature Anthropic introduced for Claude Code is. It gives us AI “teammates” with specialized skills and roles that operate independently without interfering with each other, keeping projects organized, easy to understand, and efficient.