Xây Dựng AI Agent Hiệu Quả với MCP

Posted on November 14, 2025 by Phat Ly

Giới Thiệu

Trong thời đại AI đang phát triển mạnh mẽ, việc xây dựng các AI agent thông minh và hiệu quả đã trở thành mục tiêu của nhiều nhà phát triển. Model Context Protocol (MCP) – một giao thức mở được Anthropic phát triển – đang mở ra những khả năng mới trong việc tối ưu hóa cách các AI agent tương tác với dữ liệu và công cụ. Bài viết này sẽ phân tích cách tiếp cận “Code Execution with MCP” và đưa ra những góc nhìn thực tế về việc áp dụng nó vào các dự án thực tế.

MCP Là Gì và Tại Sao Nó Quan Trọng?

Model Context Protocol (MCP) có thể được ví như “USB-C của thế giới AI” – một tiêu chuẩn mở giúp chuẩn hóa cách các ứng dụng cung cấp ngữ cảnh cho các mô hình ngôn ngữ lớn (LLM). Thay vì mỗi hệ thống phải tự xây dựng cách kết nối riêng, MCP cung cấp một giao thức thống nhất, giúp giảm thiểu sự phân mảnh và tăng tính tương thích.

Quan điểm cá nhân: Tôi cho rằng MCP không chỉ là một công nghệ, mà còn là một bước tiến quan trọng trong việc chuẩn hóa hệ sinh thái AI. Giống như cách HTTP đã cách mạng hóa web, MCP có tiềm năng trở thành nền tảng cho việc kết nối các AI agent với thế giới bên ngoài.

Code Execution với MCP: Bước Đột Phá Thực Sự

Vấn Đề Truyền Thống

Trước đây, khi xây dựng AI agent, chúng ta thường phải:

Tải tất cả định nghĩa công cụ vào context window ngay từ đầu
Gửi toàn bộ dữ liệu thô đến mô hình, dù chỉ cần một phần nhỏ
Thực hiện nhiều lần gọi công cụ tuần tự, gây ra độ trễ cao
Đối mặt với rủi ro bảo mật khi dữ liệu nhạy cảm phải đi qua mô hình

Giải Pháp: Code Execution với MCP

Code execution với MCP cho phép AI agent viết và thực thi mã để tương tác với các công cụ MCP. Điều này mang lại 5 lợi ích chính:

1. Tiết Lộ Dần Dần (Progressive Disclosure)

Cách hoạt động: Thay vì tải tất cả định nghĩa công cụ vào context, agent có thể đọc các file công cụ từ hệ thống file khi cần thiết.

Ví dụ thực tế: Giống như việc bạn không cần đọc toàn bộ thư viện sách để tìm một thông tin cụ thể. Agent chỉ cần “mở” file công cụ khi thực sự cần sử dụng.

Lợi ích:

Giảm đáng kể token consumption
Tăng tốc độ phản hồi ban đầu
Cho phép agent làm việc với số lượng công cụ lớn hơn

2. Kết Quả Công Cụ Hiệu Quả Về Ngữ Cảnh

Vấn đề: Khi làm việc với dataset lớn (ví dụ: 10,000 records), việc gửi toàn bộ dữ liệu đến mô hình là không hiệu quả.

Giải pháp: Agent có thể viết mã để lọc, chuyển đổi và xử lý dữ liệu trước khi trả về kết quả cuối cùng.

Ví dụ:

# Thay vì trả về 10,000 records
# Agent có thể viết:
results = filter_data(dataset, criteria)
summary = aggregate(results)
return summary  # Chỉ trả về kết quả đã xử lý

Quan điểm: Đây là một trong những điểm mạnh nhất của phương pháp này. Nó cho phép agent “suy nghĩ” trước khi trả lời, giống như cách con người xử lý thông tin.

3. Luồng Điều Khiển Mạnh Mẽ

Cách truyền thống: Agent phải thực hiện nhiều lần gọi công cụ tuần tự:

Gọi công cụ 1 → Chờ kết quả → Gọi công cụ 2 → Chờ kết quả → ...

Với code execution: Agent có thể viết một đoạn mã với vòng lặp, điều kiện và xử lý lỗi:

for item in items:
    result = process(item)
    if result.is_valid():
        save(result)
    else:
        log_error(item)

Lợi ích:

Giảm độ trễ (latency) đáng kể
Xử lý lỗi tốt hơn
Logic phức tạp được thực thi trong một bước

4. Bảo Vệ Quyền Riêng Tư

Đặc điểm quan trọng: Các kết quả trung gian mặc định được giữ trong môi trường thực thi, không tự động gửi đến mô hình.

Ví dụ: Khi agent xử lý dữ liệu nhạy cảm (thông tin cá nhân, mật khẩu), các biến trung gian chỉ tồn tại trong môi trường thực thi. Chỉ khi agent chủ động log hoặc return, dữ liệu mới được gửi đến mô hình.

Quan điểm: Đây là một tính năng bảo mật quan trọng, đặc biệt trong các ứng dụng enterprise. Tuy nhiên, cần có cơ chế giám sát để đảm bảo agent không vô tình leak dữ liệu.

5. Duy Trì Trạng Thái và Kỹ Năng

Khả năng mới: Agent có thể:

Lưu trạng thái vào file để tiếp tục công việc sau
Xây dựng các function có thể tái sử dụng như “kỹ năng”
Học và cải thiện theo thời gian

Ví dụ thực tế: Agent có thể tạo file utils.py với các function xử lý dữ liệu, và sử dụng lại trong các task tương lai.

Cách Xây Dựng AI Agent Hiệu Quả với MCP

Bước 1: Thiết Kế Kiến Trúc

Nguyên tắc:

Tách biệt rõ ràng giữa logic xử lý và tương tác với MCP
Thiết kế các công cụ MCP theo module, dễ mở rộng
Xây dựng hệ thống quản lý trạng thái rõ ràng

Ví dụ kiến trúc:

Agent Core
├── MCP Client (kết nối với MCP servers)
├── Code Executor (sandbox environment)
├── State Manager (lưu trữ trạng thái)
└── Tool Registry (quản lý công cụ)

Bước 2: Tối Ưu Hóa Progressive Disclosure

Chiến lược:

Tổ chức công cụ theo namespace và category
Sử dụng file system để quản lý định nghĩa công cụ
Implement lazy loading cho các công cụ ít dùng

Code pattern:

# tools/database/query.py
def query_database(sql):
    # Implementation
    pass

# Agent chỉ load khi cần
if need_database:
    import tools.database.query

Bước 3: Xây Dựng Data Processing Pipeline

Best practices:

Luôn filter và transform dữ liệu trước khi trả về
Sử dụng streaming cho dataset lớn
Implement caching cho các query thường dùng

Ví dụ:

def process_large_dataset(data_source):
    # Chỉ load và xử lý phần cần thiết
    filtered = stream_filter(data_source, filter_func)
    aggregated = aggregate_in_chunks(filtered)
    return summary_statistics(aggregated)

Bước 4: Implement Security Measures

Các biện pháp cần thiết:

Sandboxing: Chạy code trong môi trường cách ly
Resource limits: Giới hạn CPU, memory, thời gian thực thi
Audit logging: Ghi lại tất cả code được thực thi
Input validation: Kiểm tra input trước khi thực thi

Quan điểm: Security không phải là feature, mà là requirement. Đừng để đến khi có sự cố mới nghĩ đến bảo mật.

Bước 5: State Management và Skill Building

Chiến lược:

Sử dụng file system hoặc database để lưu trạng thái
Tạo thư viện các utility functions có thể tái sử dụng
Implement versioning cho các “skills”

Ví dụ:

# skills/data_analysis.py
def analyze_trends(data):
    # Reusable skill
    pass

# Agent có thể import và sử dụng
from skills.data_analysis import analyze_trends

Áp Dụng Vào Dự Án Thực Tế

Use Case 1: Data Analysis Agent

Tình huống: Xây dựng agent phân tích dữ liệu từ nhiều nguồn khác nhau.

Áp dụng MCP:

MCP servers cho mỗi data source (database, API, file system)
Code execution để filter và aggregate dữ liệu
Progressive disclosure cho các công cụ phân tích

Lợi ích:

Giảm 60-70% token usage
Tăng tốc độ xử lý 3-5 lần
Dễ dàng thêm data source mới

Use Case 2: Automation Agent

Tình huống: Agent tự động hóa các tác vụ lặp đi lặp lại.

Áp dụng MCP:

MCP servers cho các hệ thống cần tương tác
Code execution để xử lý logic phức tạp
State management để resume công việc

Lợi ích:

Xử lý lỗi tốt hơn với try-catch trong code
Có thể pause và resume công việc
Dễ dàng debug và monitor

Use Case 3: Customer Support Agent

Tình huống: Agent hỗ trợ khách hàng với quyền truy cập vào nhiều hệ thống.

Áp dụng MCP:

MCP servers cho CRM, knowledge base, ticketing system
Code execution để query và tổng hợp thông tin
Privacy protection cho dữ liệu khách hàng

Lợi ích:

Bảo vệ thông tin nhạy cảm tốt hơn
Phản hồi nhanh hơn với data processing tại chỗ
Dễ dàng tích hợp hệ thống mới

Những Thách Thức và Giải Pháp

Thách Thức 1: Code Quality và Safety

Vấn đề: Agent có thể viết code không an toàn hoặc không hiệu quả.

Giải pháp:

Implement code review tự động
Sử dụng linter và formatter
Giới hạn các API và function có thể sử dụng

Thách Thức 2: Debugging

Vấn đề: Debug code được agent tự động generate khó hơn code thủ công.

Giải pháp:

Comprehensive logging
Code explanation từ agent
Step-by-step execution với breakpoints

Thách Thức 3: Performance

Vấn đề: Code execution có thể chậm nếu không tối ưu.

Giải pháp:

Caching kết quả
Parallel execution khi có thể
Optimize code generation từ agent

Roadmap Áp Dụng MCP Vào Dự Án Của Bạn

Dựa trên những nguyên tắc và best practices đã trình bày, đây là roadmap cụ thể để bạn có thể áp dụng MCP vào dự án của mình một cách hiệu quả:

Giai Đoạn 1: Chuẩn Bị và Đánh Giá (Tuần 1-2)

Mục tiêu: Hiểu rõ nhu cầu và chuẩn bị môi trường

Đánh giá use case: Xác định vấn đề cụ thể mà agent sẽ giải quyết
Phân tích hệ thống hiện tại: Liệt kê các hệ thống, API, database cần tích hợp
Thiết lập môi trường dev: Cài đặt MCP SDK, tạo sandbox environment
Xác định metrics: Định nghĩa KPIs để đo lường hiệu quả (token usage, latency, accuracy)
Security audit: Đánh giá các yêu cầu bảo mật và compliance

Giai Đoạn 2: Proof of Concept (Tuần 3-4)

Mục tiêu: Xây dựng prototype đơn giản để validate concept

Tạo MCP server đầu tiên: Bắt đầu với một data source đơn giản nhất
Implement basic agent: Agent có thể gọi MCP tool và xử lý response
Test code execution: Cho agent viết và thực thi code đơn giản
Đo lường baseline: Ghi lại metrics ban đầu để so sánh
Gather feedback: Thu thập phản hồi từ team và stakeholders

Giai Đoạn 3: Mở Rộng và Tối Ưu (Tuần 5-8)

Mục tiêu: Mở rộng chức năng và tối ưu hóa hiệu suất

Thêm MCP servers: Tích hợp các data source và hệ thống còn lại
Implement progressive disclosure: Tổ chức tools theo namespace, lazy loading
Xây dựng data pipeline: Filter, transform, aggregate data trước khi trả về
Security hardening: Implement sandboxing, resource limits, audit logging
State management: Lưu trạng thái, xây dựng reusable skills
Performance optimization: Caching, parallel execution, code optimization

Giai Đoạn 4: Production và Monitoring (Tuần 9-12)

Mục tiêu: Đưa vào production và đảm bảo ổn định

Testing toàn diện: Unit tests, integration tests, security tests
Documentation: Viết docs cho MCP servers, API, và agent behavior
Monitoring setup: Logging, metrics, alerting system
Gradual rollout: Deploy từng phần, A/B testing nếu cần
Training và support: Đào tạo team, setup support process
Continuous improvement: Thu thập feedback, iterate và optimize

Checklist Implementation

Technical Setup

MCP SDK installed
Sandbox environment configured
MCP servers implemented
Code executor setup
State storage configured

Security

Sandboxing enabled
Resource limits set
Input validation implemented
Audit logging active
Access control configured

Performance

Progressive disclosure implemented
Data filtering in place
Caching strategy defined
Metrics dashboard ready
Optimization plan created

Key Takeaways để Áp Dụng Hiệu Quả

Bắt đầu từ use case đơn giản nhất: Đừng cố gắng giải quyết tất cả vấn đề cùng lúc. Bắt đầu nhỏ, học hỏi, rồi mở rộng.
Ưu tiên security từ đầu: Đừng để security là suy nghĩ sau. Thiết kế security vào kiến trúc ngay từ đầu.
Đo lường mọi thứ: Nếu không đo lường được, bạn không thể cải thiện. Setup metrics và monitoring sớm.
Tận dụng code execution: Đây là điểm mạnh của MCP. Cho phép agent xử lý logic phức tạp trong code thay vì nhiều tool calls.
Xây dựng reusable skills: Đầu tư vào việc tạo các function có thể tái sử dụng. Chúng sẽ tiết kiệm thời gian về sau.
Iterate và improve: Không có giải pháp hoàn hảo ngay từ đầu. Thu thập feedback, đo lường, và cải thiện liên tục.

Ví Dụ Thực Tế: E-commerce Data Analysis Agent

Tình huống: Bạn cần xây dựng agent phân tích dữ liệu bán hàng từ nhiều nguồn (database, API, CSV files).

Áp dụng roadmap:

Tuần 1-2: Đánh giá data sources, thiết lập môi trường, xác định metrics (query time, token usage)
Tuần 3-4: Tạo MCP server cho database, agent có thể query và trả về kết quả đơn giản
Tuần 5-8: Thêm MCP servers cho API và file system, implement data filtering, aggregation trong code
Tuần 9-12: Production deployment, monitoring, optimize query performance, build reusable analysis functions

Kết quả: Agent có thể phân tích dữ liệu từ nhiều nguồn, giảm 65% token usage, tăng tốc độ xử lý 4 lần so với cách truyền thống.

Kết Luận và Hướng Phát Triển

Code execution với MCP đại diện cho một bước tiến quan trọng trong việc xây dựng AI agent. Nó không chỉ giải quyết các vấn đề về hiệu quả và bảo mật, mà còn mở ra khả năng cho agent “học” và phát triển kỹ năng theo thời gian.

Quan điểm cuối cùng:

Tôi tin rằng đây mới chỉ là khởi đầu. Trong tương lai, chúng ta sẽ thấy:

Các agent có thể tự động tối ưu hóa code của chính chúng
Hệ sinh thái các MCP servers phong phú hơn
Các framework và tooling hỗ trợ tốt hơn cho việc phát triển

Lời khuyên cho các nhà phát triển:

Bắt đầu nhỏ: Bắt đầu với một use case đơn giản để hiểu rõ cách MCP hoạt động
Tập trung vào security: Đừng đánh đổi bảo mật để lấy hiệu quả
Đo lường và tối ưu: Luôn đo lường performance và tối ưu dựa trên dữ liệu thực tế
Cộng đồng: Tham gia vào cộng đồng MCP để học hỏi và chia sẻ kinh nghiệm

Việc áp dụng MCP vào dự án của bạn không chỉ là việc tích hợp một công nghệ mới, mà còn là việc thay đổi cách suy nghĩ về việc xây dựng AI agent. Hãy bắt đầu ngay hôm nay và khám phá những khả năng mới!

Playwright Agents — 🎭 Planner, 🎭 Generator, 🎭 Healer

Posted on October 15, 2025October 15, 2025 by Phat Ly

What are Playwright Agents?

This article distills the official guidance and demo video into a practical, production‑ready walkthrough. Playwright ships three agents you can run independently or in a loop: 🎭 Planner, 🎭 Generator, and 🎭 Healer.

🎭 Planner

Explores your app and produces a human‑readable Markdown plan.

Input: a clear request (e.g. “Generate a plan for guest checkout”), a seed test, optional PRD.
Output: specs/*.md with scenarios, steps, and expected results.

🎭 Generator

Converts the Markdown plan into executable Playwright tests and validates selectors/assertions during generation.

Input: Markdown from specs/, seed test and fixtures.
Output: tests/*.spec.ts aligned to the plan.

🎭 Healer

Runs tests, replays failures, proposes patches (locator updates, waits, data fixes) and re‑runs until passing or guardrails stop.

Input: failing test name.
Output: a passing test or a skipped test if functionality is broken.

🎭 Planner → 🎭 Generator → 🎭 Healer Overview

8
Test Helpers and Utilities
9
Examples (from actual demo)
10
Best Practices
11
Troubleshooting
12
CI/CD Integration
13
FAQ
14
Demo video and Source code

1. Requirements

Node.js 18+ and npm
Playwright Test latest version
VS Code 1.105+ (Insiders channel) for full agentic UI experience
AI Assistant – Choose one: Claude Code, OpenCode, or VS Code with AI extensions
Git for version control
Modern web browser (Chrome, Firefox, Safari)

2. Step-by-Step Installation Guide

Step 1: Prerequisites

Install Node.js 18+ from nodejs.org
Install npm (comes with Node.js)
Install VS Code 1.105+ from VS Code Insiders for agentic experience
Choose and install an AI Assistant:
- Claude Code – for Claude integration
- OpenCode – for OpenAI integration
- VS Code with AI extensions – for built-in AI features
Install Git for version control

Step 2: Navigate to Demo Directory

# Navigate to the demo directory
C:\Users\ADMIN\Documents\AI_QUEST_LTP> cd "playwright Agent Test Example - PhatLT"

Step 3: Install Dependencies

playwright Agent Test Example - PhatLT> npm install
playwright Agent Test Example - PhatLT> npx playwright install

Step 4: Initialize Playwright Agents

# Initialize agent definitions for Claude Code (recommended)
playwright Agent Test Example - PhatLT> npx playwright init-agents --loop=claude

# Or for VS Code
playwright Agent Test Example - PhatLT> npx playwright init-agents --loop=vscode

# Or for OpenCode
playwright Agent Test Example - PhatLT> npx playwright init-agents --loop=opencode

Step 5: Verify Setup

# Test seed file
playwright Agent Test Example - PhatLT> npx playwright test tests/seed-agents.spec.ts

# Check project structure
playwright Agent Test Example - PhatLT> dir .claude\agents
playwright Agent Test Example - PhatLT> dir .github
playwright Agent Test Example - PhatLT> dir specs

playwright Agent Test Example - PhatLT> npm init -y
Wrote to playwright Agent Test Example - PhatLT\package.json:
{
  "name": "phatlt-playwright",
  "version": "1.0.0",
  "main": "index.js",
  "scripts": {
    "test": "playwright test",
    "test:headed": "playwright test --headed",
    "test:ui": "playwright test --ui",
    "test:debug": "playwright test --debug",
    "test:chromium": "playwright test --project=chromium",
    "test:firefox": "playwright test --project=firefox",
    "test:webkit": "playwright test --project=webkit",
    "report": "playwright show-report",
    "codegen": "playwright codegen"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "type": "commonjs",
  "description": "",
  "devDependencies": {
    "@playwright/test": "^1.56.0",
    "@types/node": "^24.7.2"
  }
}

playwright Agent Test Example - PhatLT> npm install -D @playwright/test
added 1 package, and audited 2 packages in 2s
found 0 vulnerabilities

playwright Agent Test Example - PhatLT> npx playwright install
Installing browsers...
✓ Chromium 120.0.6099.109
✓ Firefox 120.0
✓ WebKit 17.4

playwright Agent Test Example - PhatLT> npx playwright init
✓ Created playwright.config.ts
✓ Created tests/
✓ Created tests/example.spec.ts
✓ Created tests/seed.spec.ts

3. Step-by-Step Testing Guide

Step 1: Test Seed File

Run the seed test to verify Playwright Agents setup:

# Test seed file for agents
playwright Agent Test Example - PhatLT> npx playwright test tests/seed-agents.spec.ts

# Run with browser UI visible
playwright Agent Test Example - PhatLT> npx playwright test tests/seed-agents.spec.ts --headed

# Run in debug mode
playwright Agent Test Example - PhatLT> npx playwright test tests/seed-agents.spec.ts --debug

Step 2: Test Generated Tests

Run the example generated tests from the Generator agent:

# Run generated Google search tests
playwright Agent Test Example - PhatLT> npx playwright test tests/google-search-generated.spec.ts

# Run specific test by name
playwright Agent Test Example - PhatLT> npx playwright test --grep "Perform Basic Search"

# Run all tests
playwright Agent Test Example - PhatLT> npx playwright test

Step 3: Test Different Browsers

# Run tests only on Chromium
playwright Agent Test Example - PhatLT> npx playwright test --project=chromium

# Run tests only on Firefox
playwright Agent Test Example - PhatLT> npx playwright test --project=firefox

# Run tests only on WebKit
playwright Agent Test Example - PhatLT> npx playwright test --project=webkit

Step 4: Generate Test Reports

# Generate HTML report
playwright Agent Test Example - PhatLT> npx playwright show-report

# Run tests with UI mode
playwright Agent Test Example - PhatLT> npx playwright test --ui

Step 5: Using Playwright Agents

Now you can use the Playwright Agents workflow with Claude Code:

# In Claude Code, ask the Planner:
"I need test scenarios for Google search functionality. Use the planner agent to explore https://www.google.com"

# Then ask the Generator:
"Use the generator agent to create tests from the test plan in specs/"

# Finally, use the Healer if tests fail:
"The test 'Perform Basic Search' is failing. Use the healer agent to fix it."

4. Project Structure and Files

playwright Agent Test Example - PhatLT/
├── .claude/agents/              # Claude Code agent definitions
│   ├── playwright-test-planner.md    # 🎭 Planner agent
│   ├── playwright-test-generator.md  # 🎭 Generator agent
│   └── playwright-test-healer.md     # 🎭 Healer agent
├── .github/                     # Official agent definitions
│   ├── planner.md               # 🎭 Planner instructions
│   ├── generator.md             # 🎭 Generator instructions
│   └── healer.md                # 🎭 Healer instructions
├── specs/                       # Test plans (Markdown)
│   └── google-search-operations.md   # Example test plan
├── tests/                       # Generated tests
│   ├── seed-agents.spec.ts      # Seed test for agents
│   └── google-search-generated.spec.ts  # Generated test example
├── .mcp.json                    # MCP server configuration
├── playwright.config.ts         # Playwright configuration
├── package.json                 # Project dependencies
└── test-results/               # Test execution results

5. How Playwright Agents Work (End‑to‑End)

🎭 Planner — explores your app and creates human-readable test plans saved in specs/ directory.
🎭 Generator — transforms Markdown plans into executable Playwright tests in tests/ directory.
🎭 Healer — automatically repairs failing tests by updating selectors and waits.
Execution — run generated tests with npx playwright test.
Maintenance — Healer fixes issues automatically, keeping tests stable over time.

playwright Agent Test Example - PhatLT> npx playwright test tests/seed-agents.spec.ts

Running 1 test using 1 worker

  ✓ [chromium] › tests/seed-agents.spec.ts › seed (2.1s)

  1 passed (2.1s)

playwright Agent Test Example - PhatLT> npx playwright test tests/google-search-generated.spec.ts

Running 5 tests using 1 worker

  ✓ [chromium] › tests/google-search-generated.spec.ts › Google Search - Basic Operations › Perform Basic Search (3.2s)
  ✓ [chromium] › tests/google-search-generated.spec.ts › Google Search - Basic Operations › Verify Search Box Functionality (1.8s)
  ✓ [chromium] › tests/google-search-generated.spec.ts › Google Search - Basic Operations › Search with Empty Query (1.5s)
  ✓ [chromium] › tests/google-search-generated.spec.ts › Google Search - Results Validation › Verify Search Results Display (4.1s)
  ✓ [chromium] › tests/google-search-generated.spec.ts › Google Search - Results Validation › Navigate Through Search Results (5.3s)

  5 passed (16.0s)

6. How Playwright Agents Work

Playwright Agents follow a structured workflow as described in the official documentation. The process involves three main agents working together:

🎭 Planner Agent

The Planner explores your application and creates human-readable test plans:

Input: Clear request (e.g., “Generate a plan for guest checkout”), seed test, optional PRD
Output: Markdown test plan saved as specs/basic-operations.md
Process: Runs seed test to understand app structure and creates comprehensive test scenarios

🎭 Generator Agent

The Generator transforms Markdown plans into executable Playwright tests:

Input: Markdown plan from specs/
Output: Test suite under tests/
Process: Verifies selectors and assertions live, generates robust test code

🎭 Healer Agent

The Healer automatically repairs failing tests:

Input: Failing test name
Output: Passing test or skipped test if functionality is broken
Process: Replays failing steps, inspects UI, suggests patches, re-runs until passing

// Example: Generated test from specs/basic-operations.md
// spec: specs/basic-operations.md
// seed: tests/seed.spec.ts

import { test, expect } from '../fixtures';

test.describe('Adding New Todos', () => {
  test('Add Valid Todo', async ({ page }) => {
    // 1. Click in the "What needs to be done?" input field
    const todoInput = page.getByRole('textbox', { name: 'What needs to be done?' });
    await todoInput.click();

    // 2. Type "Buy groceries"
    await todoInput.fill('Buy groceries');

    // 3. Press Enter key
    await todoInput.press('Enter');

    // Expected Results:
    // - Todo appears in the list with unchecked checkbox
    await expect(page.getByText('Buy groceries')).toBeVisible();
    const todoCheckbox = page.getByRole('checkbox', { name: 'Toggle Todo' });
    await expect(todoCheckbox).toBeVisible();
    await expect(todoCheckbox).not.toBeChecked();

    // - Counter shows "1 item left"
    await expect(page.getByText('1 item left')).toBeVisible();

    // - Input field is cleared and ready for next entry
    await expect(todoInput).toHaveValue('');
    await expect(todoInput).toBeFocused();

    // - Todo list controls become visible
    await expect(page.getByRole('checkbox', { name: '❯Mark all as complete' })).toBeVisible();
  });
});

7. Agent Deep Dives

🎭 Planner — author plans that generate great tests

Goal: Convert product intent into executable, atomic scenarios.
Inputs: business request, seed.spec.ts, optional PRD/acceptance criteria.
Output quality tips: prefer user‑intent over UI steps, keep 1 scenario = 1 assertion focus, name entities consistently.
Anti‑patterns: mixing setup/teardown into steps; over‑specifying selectors in Markdown.

🎭 Generator — compile plans into resilient tests

Validates selectors live: uses your running app to confirm locators/assertions.
Structure: mirrors specs/*.md; adds fixtures from seed.spec.ts; keeps tests idempotent.
Resilience: prefer roles/labels; avoid brittle CSS/XPath; centralize waits.

🎭 Healer — stabilize and protect correctness

Scope: flaky selectors, timing, deterministic data; not business‑logic rewrites.
Review gates: patches proposed as diffs; you accept/reject before merge.
Outcomes: test fixed, or skipped with a documented reason when the feature is broken.

8. Project Structure and Artifacts

Playwright Agents follow a structured approach as described in the official documentation. The generated files follow a simple, auditable structure:

repo/
  .github/                    # agent definitions
    planner.md               # planner agent instructions
    generator.md             # generator agent instructions  
    healer.md                # healer agent instructions
  specs/                     # human-readable test plans
    basic-operations.md      # generated by planner
  tests/                     # generated Playwright tests
    seed.spec.ts             # seed test for environment
    add-valid-todo.spec.ts   # generated by generator
  playwright.config.ts       # Playwright configuration

Agent Definitions (.github/)

Under the hood, agent definitions are collections of instructions and MCP tools provided by Playwright. They should be regenerated whenever Playwright is updated:

# Initialize agent definitions
npx playwright init-agents --loop=vscode
npx playwright init-agents --loop=claude  
npx playwright init-agents --loop=opencode

Specs in specs/

Specs are structured plans describing scenarios in human-readable terms. They include steps, expected outcomes, and data. Specs can start from scratch or extend a seed test.

Tests in tests/

Generated Playwright tests, aligned one-to-one with specs wherever feasible. Generated tests may include initial errors that can be healed automatically by the healer agent.

Seed tests (seed.spec.ts)

Seed tests provide a ready-to-use page context to bootstrap execution. The planner runs this test to execute all initialization necessary for your tests including global setup, project dependencies, and fixtures.

// Example: seed.spec.ts
import { test, expect } from './fixtures';

test('seed', async ({ page }) => {
  // This test uses custom fixtures from ./fixtures
  // 🎭 Planner will run this test to execute all initialization
  // necessary for your tests including global setup, 
  // project dependencies and all necessary fixtures and hooks
});

9. Examples from Official Documentation

🎭 Planner Output Example

The 🎭 Planner generates human-readable test plans saved as specs/basic-operations.md:

# TodoMVC Application - Basic Operations Test Plan

## Application Overview

The TodoMVC application is a React-based todo list manager that demonstrates 
standard todo application functionality. Key features include:

- **Task Management**: Add, edit, complete, and delete individual todos
- **Bulk Operations**: Mark all todos as complete/incomplete and clear all completed todos  
- **Filtering System**: View todos by All, Active, or Completed status with URL routing support
- **Real-time Counter**: Display of active (incomplete) todo count
- **Interactive UI**: Hover states, edit-in-place functionality, and responsive design

## Test Scenarios

### 1. Adding New Todos

**Seed:** `tests/seed.spec.ts`

#### 1.1 Add Valid Todo

**Steps:**
1. Click in the "What needs to be done?" input field
2. Type "Buy groceries"
3. Press Enter key

**Expected Results:**
- Todo appears in the list with unchecked checkbox
- Counter shows "1 item left"
- Input field is cleared and ready for next entry
- Todo list controls become visible (Mark all as complete checkbox)

🎭 Generator Output Example

The 🎭 Generator transforms the Markdown plan into executable Playwright tests:

// Generated test from specs/basic-operations.md
// spec: specs/basic-operations.md
// seed: tests/seed.spec.ts

import { test, expect } from '../fixtures';

test.describe('Adding New Todos', () => {
  test('Add Valid Todo', async ({ page }) => {
    // 1. Click in the "What needs to be done?" input field
    const todoInput = page.getByRole('textbox', { name: 'What needs to be done?' });
    await todoInput.click();

    // 2. Type "Buy groceries"
    await todoInput.fill('Buy groceries');

    // 3. Press Enter key
    await todoInput.press('Enter');

    // Expected Results:
    // - Todo appears in the list with unchecked checkbox
    await expect(page.getByText('Buy groceries')).toBeVisible();
    const todoCheckbox = page.getByRole('checkbox', { name: 'Toggle Todo' });
    await expect(todoCheckbox).toBeVisible();
    await expect(todoCheckbox).not.toBeChecked();

    // - Counter shows "1 item left"
    await expect(page.getByText('1 item left')).toBeVisible();

    // - Input field is cleared and ready for next entry
    await expect(todoInput).toHaveValue('');
    await expect(todoInput).toBeFocused();

    // - Todo list controls become visible
    await expect(page.getByRole('checkbox', { name: '❯Mark all as complete' })).toBeVisible();
  });
});

10. Best Practices

Keep plans atomic: Small, focused scenarios help 🎭 Generator produce clean tests. Avoid mixing multiple user flows in one scenario.
Stabilize with seed: Centralize navigation, authentication, and data seeding in seed.spec.ts to ensure consistent test environment.
Prefer semantic selectors: Use getByRole, getByLabel, and getByText for resilient element selection.
🎭 Healer guardrails: Review patches carefully; accept locator/wait tweaks, but avoid broad logic changes that might mask real bugs.
Version agent definitions: Commit .github/ changes and regenerate them whenever Playwright is updated.
Choose the right AI assistant: VS Code, Claude Code, or OpenCode — pick the one that fits your team’s workflow and preferences.
Maintain traceability: Keep clear 1:1 mapping from specs/*.md to tests/*.spec.ts using comments and headers.
Test the agents: Start with simple scenarios to understand how each agent works before tackling complex user flows.

11. Troubleshooting

🎭 Planner can’t explore the app

Ensure your app is running locally, seed test works, and the app is accessible. Check that authentication and navigation are properly set up in seed.spec.ts.

🎭 Generator can’t find elements

Run the app locally, ensure routes are correct, and verify that elements have proper roles, labels, or accessible names. The 🎭 Generator validates selectors live against your running app.

🎭 Healer loops without fixing

Set explicit timeouts, add deterministic test data, and reduce flakiness in network waits. The 🎭 Healer works best with stable, predictable test conditions.

AI assistant doesn’t trigger agents

Re-run npx playwright init-agents --loop=[assistant], reload the IDE, and ensure the correct workspace root is open with agent definitions in .github/.

Generated tests fail immediately

Check that your seed test passes first. Ensure the app state matches what the 🎭 Planner observed. Verify that test data and authentication are consistent between planning and execution.

Agent definitions are outdated

Regenerate agent definitions after Playwright updates: npx playwright init-agents --loop=[assistant]. This ensures you have the latest tools and instructions.

12. CI/CD Integration

You can run the same agent‑generated tests in CI. Keep agent definitions in the repo and refresh them on Playwright upgrades.

# .github/workflows/tests.yml (excerpt)
name: Playwright Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test --reporter=html

13. FAQ

Do I need Claude Code?

No. Playwright Agents work with VS Code (v1.105+), Claude Code, or OpenCode. Choose the AI assistant that fits your team’s workflow and preferences.

Where do test plans live?

In specs/ as Markdown files generated by the 🎭 Planner. Generated tests go to tests/.

What if a feature is actually broken?

The 🎭 Healer can skip tests with an explanation instead of masking a real bug. It distinguishes between flaky tests and genuinely broken functionality.

Can I run agent-generated tests in CI?

Yes. The agents produce standard Playwright tests that run with npx playwright test in CI. Agent definitions are only needed for test authoring, not execution.

How do I update agent definitions?

Run npx playwright init-agents --loop=[assistant] whenever Playwright is updated to get the latest tools and instructions.

What’s the difference between 🎭 Planner, 🎭 Generator, and 🎭 Healer?

🎭 Planner: Explores your app and creates human-readable test plans. 🎭 Generator: Transforms plans into executable Playwright tests. 🎭 Healer: Automatically fixes failing tests by updating selectors and waits.

14. Demo video and Source code

GitHub GitHub repository: phatltscuti/playwright_agents

Context Engineering cho AI Agents – Tóm tắt từ Anthropic

Posted on October 14, 2025 by Cuong Dinh

Context Engineering cho AI Agents

Tóm tắt từ bài viết của Anthropic về nghệ thuật quản lý context trong phát triển AI

🎯 Context Engineering là gì?

Context Engineering là tập hợp các chiến lược để tuyển chọn và duy trì bộ tokens (thông tin) tối ưu trong quá trình AI agents hoạt động.

Nó bao gồm việc quản lý toàn bộ trạng thái context như:

System prompts (hướng dẫn hệ thống)
Tools (công cụ)
Model Context Protocol (MCP)
External data (dữ liệu bên ngoài)
Message history (lịch sử hội thoại)
Các thông tin khác trong context window

💡 Bản chất: Context Engineering là nghệ thuật và khoa học về việc tuyển chọn thông tin nào sẽ đưa vào context window giới hạn từ vũ trụ thông tin liên tục phát triển của agent.

🔄 Khác biệt giữa Context Engineering và Prompt Engineering

📝 Prompt Engineering

Focus: Cách viết instructions (hướng dẫn)
Phạm vi: Tối ưu hóa system prompts
Use case: Tác vụ đơn lẻ, one-shot
Tính chất: Rời rạc, tĩnh

Ví dụ: “Tóm tắt văn bản này thành 3 điểm chú trọng số liệu tài chính”

🧠 Context Engineering

Focus: Model nhìn thấy gì trong context window
Phạm vi: Toàn bộ trạng thái thông tin
Use case: Multi-turn, tác vụ dài hạn
Tính chất: Lặp lại, động, liên tục

Ví dụ: Quyết định agent nên xem toàn bộ tài liệu, 3 phần cuối, hay bản tóm tắt đã chuẩn bị?

🎭 Ẩn dụ: Prompt engineering là “nói cho ai đó biết phải làm gì”, còn context engineering là “quyết định nên cung cấp nguồn lực gì cho họ”.

⚡ Tại sao Context Engineering quan trọng hơn?

Khi AI agents thực hiện các tác vụ phức tạp trên nhiều vòng lặp, chúng tạo ra ngày càng nhiều dữ liệu. Thông tin này phải được tinh chỉnh theo chu kỳ. Context engineering xảy ra mỗi khi chúng ta quyết định đưa gì vào model – đây là quá trình lặp đi lặp lại, không phải một lần.

⚠️ Những điều cần chú ý khi phát triển AI Agents

1. 🎯 Vấn đề “Goldilocks Zone” cho System Prompts

System prompts cần nằm ở “vùng vừa phải” giữa hai thái cực:

❌ Quá cứng nhắc: Hardcode logic if-else phức tạp → agent dễ vỡ, khó bảo trì

❌ Quá mơ hồ: Hướng dẫn chung chung, giả định context chung → thiếu tín hiệu cụ thể

✅ Vùng tối ưu: Đủ cụ thể để dẫn dắt hành vi, nhưng đủ linh hoạt để cung cấp heuristics mạnh mẽ

2. 🧹 “Context Rot” – Sự suy giảm độ chính xác

Khi context window dài ra, độ chính xác của model giảm xuống:

Giới hạn chú ý: LLMs giống con người – không thể nhớ mọi thứ khi quá tải. Nhiều tokens ≠ chính xác hơn
Context rot: Context càng dài, độ chính xác truy xuất càng giảm. Thêm 100 trang logs có thể che mất chi tiết quan trọng duy nhất
Kiến trúc transformer: Tạo n² mối quan hệ giữa các tokens (10K tokens = 100M quan hệ, 100K tokens = 10B quan hệ)

💡 Giải pháp: Implement pagination, range selection, filtering, truncation với giá trị mặc định hợp lý

3. 🔧 Quản lý Tools hiệu quả

Giữ tools riêng biệt: Không tạo 2 tools cùng làm việc giống nhau (VD: cùng fetch news)
Mô tả rõ ràng: Viết tool descriptions như hướng dẫn nhân viên mới – rõ ràng, tránh mơ hồ
Token-efficient: Giới hạn tool responses (VD: Claude Code giới hạn 25,000 tokens mặc định)
Error handling tốt: Error messages phải cụ thể, actionable, không phải error codes mơ hồ

4. 📊 Just-in-Time Context Retrieval

Thay vì load toàn bộ dữ liệu trước, hãy fetch dữ liệu động khi cần:

Tránh overload context window
Giảm token costs
Ngăn context poisoning (nhiễu thông tin)
Tương tự cách con người dùng hệ thống indexing bên ngoài

5. 🎨 Ba chiến lược cho tác vụ dài hạn

📦 Compaction (Nén thông tin)

Tóm tắt context cũ, giữ lại thông tin quan trọng

📝 Structured Note-Taking

Agent tự ghi chú có cấu trúc về những gì đã làm

🤖 Multi-Agent Architecture

Spawn sub-agents nhỏ cho các tác vụ hẹp, trả về kết quả ngắn gọn

6. 🎯 Ưu tiên Context theo tầm quan trọng

🔴 High Priority (luôn có trong context): Tác vụ hiện tại, kết quả tool gần đây, hướng dẫn quan trọng

🟡 Medium Priority (khi có không gian): Examples, quyết định lịch sử

⚪ Low Priority (on-demand): Nội dung file đầy đủ, documentation mở rộng

7. 📈 Monitoring và Iteration

Theo dõi liên tục:

Token usage per turn
Tool call frequency
Context window utilization
Performance ở các độ dài context khác nhau
Recall vs Precision khi rút gọn context

💡 Quy trình: Bắt đầu đơn giản → Test → Xác định lỗi → Thêm hướng dẫn cụ thể → Loại bỏ redundancy → Lặp lại

💡 Kết luận

Context engineering là kỹ năng then chốt để xây dựng AI agents hiệu quả. Khác với prompt engineering tập trung vào “cách viết instructions”, context engineering quan tâm đến “môi trường thông tin toàn diện” mà agent hoạt động.

Thành công không nằm ở việc tìm từ ngữ hoàn hảo, mà là tối ưu hóa cấu hình context để tạo ra hành vi mong muốn một cách nhất quán.

🎯 Nguyên tắc cốt lõi: Tìm bộ tokens nhỏ nhất có tín hiệu cao nhất để tối đa hóa khả năng đạt được kết quả mong muốn. Mỗi từ không cần thiết, mỗi mô tả tool thừa, mỗi dữ liệu cũ đều làm giảm hiệu suất agent.

Shipping with Codex

Posted on October 14, 2025October 14, 2025 by Dai Ha

Codex: Kỹ Sư Phần Mềm AI Đã Tạo Ra “Sự Thay Đổi Cảm Hứng Lớn” (Vibe Shift) Trong Lập Trình

Gần đây, tại OpenAI, chúng ta đã chứng kiến một điều phi thường: một “sự thay đổi cảm hứng lớn” (vibe shift) trong cách chúng tôi xây dựng phần mềm. Kể từ tháng Tám, mức độ sử dụng Codex—kỹ sư phần mềm AI của chúng tôi—đã tăng gấp mười lần. Codex không chỉ là một công cụ; nó giống như một đồng nghiệp con người mà bạn có thể lập trình đôi, giao phó công việc hoặc đơn giản là để nó tự động thực hiện các nhiệm vụ phức tạp.

Sự bùng nổ này không phải ngẫu nhiên. Nó là kết quả của hàng loạt cập nhật lớn, biến Codex thành một tác nhân mạnh mẽ, an toàn và trực quan hơn, hoạt động trên mọi nền tảng mà bạn xây dựng.

1. Những Cập Nhật Đã Tạo Nên Sự Thay Đổi Lớn

Đại Tu Hoàn Toàn Tác Nhân (Agent Overhaul)

Chúng tôi định nghĩa tác nhân Codex là sự kết hợp của hai yếu tố: mô hình lý luận và bộ công cụ (harness) cho phép nó hành động và tạo ra giá trị.

Mô Hình Nâng Cấp: Ban đầu, chúng tôi đã ra mắt GPT-5, mô hình tác nhân tốt nhất của mình. Dựa trên phản hồi, chúng tôi tiếp tục tối ưu hóa và cho ra mắt GPT-5 Codex, một mô hình được tinh chỉnh đặc biệt cho công việc mã hóa. Người dùng mô tả nó như một “kỹ sư cấp cao thực thụ” vì nó không ngại đưa ra phản hồi thẳng thắn và từ chối những ý tưởng tồi.
Hệ Thống Công Cụ Mới (Harness): Chúng tôi đã viết lại hoàn toàn bộ công cụ để tận dụng tối đa các mô hình mới. Hệ thống này bổ sung các tính năng quan trọng như lập kế hoạch (planning), nén tự động bối cảnh (autoco compaction)—cho phép các cuộc trò chuyện và tương tác cực kỳ dài—và hỗ trợ cho MCP (Multi-Context Protocol).

Trải Nghiệm Người Dùng Được Tinh Chỉnh

Dù mô hình và tác nhân mạnh mẽ, phản hồi ban đầu cho thấy giao diện dòng lệnh (CLI) còn “sơ khai”.

CLI Revamp: Chúng tôi đã đại tu CLI, đơn giản hóa chế độ phê duyệt (approvals modes), tạo ra giao diện người dùng dễ đọc hơn và thêm nhiều chi tiết tinh tế. Quan trọng nhất, Codex CLI hiện mặc định hoạt động với sandboxing (môi trường hộp cát), đảm bảo an toàn theo mặc định trong khi vẫn trao toàn quyền kiểm soát cho người dùng.
Tiện Ích Mở Rộng IDE: Để hỗ trợ người dùng muốn xem và chỉnh sửa code cùng lúc với việc cộng tác với Codex, chúng tôi đã phát hành một extension bản địa (native extension) cho IDE. Tiện ích này hoạt động với VS Code, Cursor và các bản fork phổ biến khác. Nó đã bùng nổ ngay lập tức, thu hút 100.000 người dùng trong tuần đầu tiên—bằng cách sử dụng cùng một tác nhân mạnh mẽ và bộ công cụ mã nguồn mở (open-source harness) đã cung cấp sức mạnh cho CLI.
Codex Cloud Nhanh Hơn: Chúng tôi đã nâng cấp Codex Cloud để chạy nhiều tác vụ song song, tăng tốc độ tác vụ Cloud lên 90%. Tác vụ Cloud giờ đây có thể tự động thiết lập các phụ thuộc, thậm chí xác minh công việc bằng cách chụp ảnh màn hình và gửi cho bạn.

Codex Hoạt Động Mọi Nơi

Codex giờ đây hoạt động tích hợp sâu vào quy trình làm việc của bạn:

Slack và GitHub: Codex có thể được giao nhiệm vụ trực tiếp trong các công cụ cộng tác như Slack. Nó nhận toàn bộ bối cảnh từ luồng trò chuyện, tự mình khám phá vấn đề, viết code, và đăng giải pháp cùng với bản tóm tắt chỉ sau vài phút.
Review Code Độ Tin Cậy Cao: Việc đánh giá và duyệt code đang trở thành nút thắt cổ chai lớn. Chúng tôi đã huấn luyện GPT-5 Codex đặc biệt để thực hiện review code cực kỳ kỹ lưỡng (ultra thorough). Nó khám phá toàn bộ code và các phụ thuộc bên trong container của mình, xác minh ý định và việc triển khai. Kết quả là những phát hiện có tín hiệu rất cao (high signal), đến mức nhiều đội đã bật nó theo mặc định, thậm chí cân nhắc bắt buộc.

2. Codex Đang Thúc Đẩy OpenAI Như Thế Nào

Kết quả nội bộ tại OpenAI là minh chứng rõ ràng nhất cho sức mạnh của Codex:

92% nhân viên kỹ thuật của OpenAI sử dụng Codex hàng ngày (tăng từ 50% vào tháng Bảy).
Các kỹ sư sử dụng Codex nộp 70% nhiều PR (Pull Requests) hơn mỗi tuần.
Hầu như tất cả các PR đều được Codex review. Khi nó tìm thấy lỗi, các kỹ sư cảm thấy hào hứng vì nó giúp tiết kiệm thời gian và tăng độ tin cậy khi triển khai.

3. Các Quy Trình Làm Việc Thực Tế Hàng Ngày

Các kỹ sư của chúng tôi đã chia sẻ những ví dụ thực tế về cách họ sử dụng Codex để giải quyết các vấn đề phức tạp.

Trường Hợp 1: Lặp Lại Giao Diện Người Dùng (UI) Với Bằng Chứng Hình Ảnh (Nacho)

Nacho, kỹ sư iOS, đã chia sẻ quy trình làm việc tận dụng tính năng đa phương thức (multimodal) của Codex:

Vấn Đề: Trong công việc front-end, 10% công việc đánh bóng cuối cùng—như căn chỉnh header/footer—thường chiếm đến 90% thời gian.
Giải Pháp: Nacho giao cho Codex nhiệm vụ triển khai UI từ một bản mockup. Khác với các agent cũ (được ví như “kỹ sư tập sự”), Codex (được ví như “kỹ sư cấp cao”) xác minh công việc của nó.
Quy Trình TDD & Multimodal:
1. Nacho cung cấp cho Codex một công cụ đơn giản: một script Python (do Codex viết) để trích xuất các snapshot (ảnh chụp giao diện) từ các SwiftUI Previews.
2. Nó được hướng dẫn sử dụng công cụ này để xác minh trực quan code UI mà nó viết.
3. Codex lặp đi lặp lại: Viết code > Chạy test/Snapshot > Sửa lỗi cho đến khi giao diện đạt đến độ hoàn hảo về pixel (pixel perfect).
Kết Quả: Nacho có thể để Codex làm việc trên những chi tiết nhỏ (10% độ đánh bóng) trong khi anh làm những việc khác, biết rằng nó sẽ tự kiểm tra công việc của mình bằng hình ảnh.

Trường Hợp 2: Mở Rộng Giới Hạn Tác Vụ Lớn (Fel)

Fel, được biết đến là người có phiên làm việc Codex lâu nhất (hơn bảy giờ) và xử lý nhiều token nhất (hơn 150 triệu), đã chứng minh cách anh thực hiện các tác vụ refactor lớn chỉ với vài lời nhắc.

Vấn Đề: Thực hiện một refactor lớn (như thay đổi 15.000 dòng code) trong các dự án phức tạp (như bộ phân tích JSON cá nhân của anh) thường dẫn đến việc tất cả các bài kiểm tra thất bại trong thời gian dài.
Giải Pháp: Kế Hoạch Thực Thi (Exec Plan):
1. Fel yêu cầu Codex viết một đặc tả (spec)—được gọi là plans.md—để triển khai tính năng, giao cho nó nhiệm vụ nghiên cứu thư viện và cách tích hợp.
2. Anh định nghĩa plans.md là một “tài liệu thiết kế sống” (living document) mà Codex phải liên tục cập nhật, bao gồm mục tiêu lớn, danh sách việc cần làm, tiến trình, và nhật ký quyết định (decision log).
3. Anh sử dụng thuật ngữ neo “exec plan” để đảm bảo mô hình biết khi nào cần tham chiếu và phản ánh lại tài liệu này.
4. Sau khi Fel phê duyệt kế hoạch, anh ra lệnh: “Implement” (Thực thi).
Kết Quả: Codex có thể làm việc một cách hiệu quả trong nhiều giờ (thậm chí hơn một giờ trong buổi demo) trên một tính năng lớn, sử dụng plans.md như bộ nhớ và kim chỉ nam. Trong một phiên, nó đã tạo ra 4.200 dòng code chỉ trong khoảng một giờ—mọi thứ đều được kiểm tra và vượt qua.

Trường Hợp 3: Vòng Lặp Sửa Lỗi và Review Tại Chỗ (Daniel)

Daniel, một kỹ sư trong nhóm Codex, đã giới thiệu quy trình làm việc slash review mới, đưa khả năng review code chất lượng cao của GPT-5 Codex xuống môi trường cục bộ (local).

Vấn Đề: Ngay cả sau khi hoàn thành code, các kỹ sư cần một bộ mắt mới không bị thiên vị để tìm ra các lỗi khó.
Giải Pháp: Slash Review: Trước khi gửi PR, Daniel sử dụng lệnh /review trong CLI.
- Anh chọn duyệt so với nhánh gốc (base branch), tương tự như một PR.
- GPT-5 Codex bắt đầu luồng review chuyên biệt: Nó nghiên cứu sâu các tập tin, tìm kiếm các lỗi kỹ thuật, và thậm chí viết/chạy các script kiểm tra để xác minh các giả thuyết lỗi trước khi báo cáo.
- Mô hình thiên vị: Luồng review chạy trong một luồng riêng biệt, có bối cảnh mới mẻ (fresh context), loại bỏ bất kỳ thiên vị triển khai (implementation bias) nào từ cuộc trò chuyện trước.
Vòng Lặp Sửa Lỗi: Khi Codex tìm thấy một vấn đề P0/P1, Daniel chỉ cần gõ “Please fix”.
Kết Quả: Codex sửa lỗi, và Daniel có thể chạy /review lần nữa cho đến khi nhận được “thumbs up” (chấp thuận) cuối cùng. Điều này đảm bảo code được kiểm tra kỹ lưỡng, được sửa lỗi cục bộ trước khi push, tiết kiệm thời gian và đảm bảo độ tin cậy cao hơn.

Ba chức năng chính của Codex, được nhấn mạnh trong bài thuyết trình, là:

Lập Trình Đôi và Triển Khai Code (Implementation & Delegation):
- Codex hoạt động như một đồng đội lập trình đôi trong IDE/CLI, giúp bạn viết code nhanh hơn.
- Nó có thể nhận ủy quyền (delegate) các tác vụ lớn hơn (như refactor hoặc thêm tính năng) và tự thực hiện trong môi trường Cloud/Sandboxing, bao gồm cả việc tự thiết lập dependencies và chạy song song.
Xác Minh và Kiểm Thử Tự Động (Verification & TDD):
- Codex tích hợp sâu với quy trình Test-Driven Development (TDD).
- Nó không chỉ viết code mà còn tự động chạy các bài kiểm thử (unit tests) và xác minh đa phương thức (ví dụ: tạo và kiểm tra snapshot UI) để đảm bảo code hoạt động chính xác và đạt độ hoàn hảo về mặt hình ảnh (pixel perfect).
Review Code Độ Tin Cậy Cao (High-Signal Code Review):
- Sử dụng mô hình GPT-5 Codex được tinh chỉnh, nó thực hiện review code cực kỳ kỹ lưỡng (ultra thorough) trên GitHub PR hoặc cục bộ thông qua lệnh /review.
- Chức năng này giúp tìm ra các lỗi kỹ thuật khó và có thể được sử dụng trong vòng lặp Review -> Fix -> Review để đảm bảo chất lượng code trước khi merge, tiết kiệm thời gian và tăng độ tin cậy khi triển khai.

Link video: https://www.youtube.com/watch?v=Gr41tYOzE20

Revolutionizing Test Automation with Playwright Agents

Posted on October 14, 2025October 14, 2025 by Cuong Dinh

🎭 Revolutionizing Test Automation with Playwright Agents

How AI-Powered Agents are Transforming E2E Testing

📅 October 2025
⏱️ 5 min read
🏷️ Testing, AI, Automation

Imagine this: You describe what you want to test, and AI generates comprehensive test plans, writes the actual test code, and even fixes failing tests automatically. This isn’t science fiction—it’s Playwright Agents, and it’s available today.

Playwright has introduced three powerful AI agents that work together to revolutionize how we approach test automation: the Planner, Generator, and Healer. Let’s dive deep into how these agents are changing the game.

What Are Playwright Agents?

Playwright Agents are AI-powered tools that automate the entire test creation and maintenance lifecycle. They can work independently or sequentially in an agentic loop, producing comprehensive test coverage for your product without the traditional manual overhead.

🎯

Planner Agent

The Planner is your AI test strategist. It explores your application and produces detailed, human-readable test plans in Markdown format.

How It Works:

Input: A clear request (e.g., “Generate a plan for guest checkout”), a seed test that sets up your environment, and optionally a Product Requirements Document
Process: Runs the seed test to understand your app’s structure, analyzes user flows, and identifies test scenarios
Output: Structured Markdown test plans saved in specs/ directory with detailed steps and expected results

💡 Pro Tip: The Planner uses your seed test as context, so it understands your custom fixtures, authentication flows, and project setup automatically!

Example Output:

# TodoMVC Application - Basic Operations Test Plan

## Test Scenarios

### 1. Adding New Todos
#### 1.1 Add Valid Todo
**Steps:**
1. Click in the "What needs to be done?" input field
2. Type "Buy groceries"
3. Press Enter key

**Expected Results:**
- Todo appears in the list with unchecked checkbox
- Counter shows "1 item left"
- Input field is cleared and ready for next entry

⚡

Generator Agent

The Generator transforms your human-readable test plans into executable Playwright test code, verifying selectors and assertions in real-time.

Key Features:

Live Verification: Checks selectors against your actual app while generating code
Smart Assertions: Uses Playwright’s catalog of assertions for robust validation
Context Aware: Inherits setup from seed tests and maintains consistency
Best Practices: Generates code following Playwright conventions and modern patterns

Generated Test Example:

// spec: specs/basic-operations.md
// seed: tests/seed.spec.ts
import { test, expect } from '../fixtures';

test.describe('Adding New Todos', () => {
  test('Add Valid Todo', async ({ page }) => {
    // Type and submit todo
    const todoInput = page.getByRole('textbox', { 
      name: 'What needs to be done?' 
    });
    await todoInput.fill('Buy groceries');
    await todoInput.press('Enter');
    
    // Verify todo appears
    await expect(page.getByText('Buy groceries')).toBeVisible();
    await expect(page.getByText('1 item left')).toBeVisible();
    await expect(todoInput).toHaveValue('');
  });
});

🔧

Healer Agent

The Healer is your automated maintenance engineer. When tests fail, it diagnoses issues and applies fixes automatically.

Healing Process:

Step 1: Replays the failing test steps to understand the failure context
Step 2: Inspects the current UI to locate equivalent elements or alternative flows
Step 3: Suggests patches like locator updates, wait adjustments, or data corrections
Step 4: Re-runs the test until it passes or determines the functionality is actually broken

🎯 Smart Decisions: If the Healer can’t fix a test after multiple attempts, it marks the test as skipped and flags it as a potential real bug in your application!

Common Fixes Applied:

Updating selectors when UI structure changes
Adding appropriate waits for dynamic content
Adjusting test data to match new requirements
Handling new dialog boxes or pop-ups

🤖 Working with Claude Code

Playwright Agents integrate seamlessly with Claude Code, enabling natural language test automation directly from your terminal.

Setup Process:

# Initialize Playwright Agents for Claude Code
npx playwright init-agents --loop=claude

# This generates agent definitions optimized for Claude Code
# under .github/ directory with MCP tools and instructions

1
Initialize: Run the init command to generate agent definitions

2
Plan: Ask Claude Code to use the Planner: “Use 🎭 planner to create a test plan for user registration”

3
Generate: Command the Generator: “Use 🎭 generator to create tests from specs/registration.md”

4
Heal: Let the Healer fix issues: “Use 🎭 healer to fix all failing tests”

Benefits with Claude Code:

Natural Language Control: Command agents using simple English instructions
Context Awareness: Claude Code understands your project structure and requirements
Iterative Refinement: Easily adjust and improve tests through conversation
Automatic Updates: Regenerate agents when Playwright updates to get latest features

The Complete Workflow

Here’s how the three agents work together to create comprehensive test coverage:

1. 🎯 Planner explores your app
   └─> Produces: specs/user-flows.md

2. ⚡ Generator reads the plan
   └─> Produces: tests/user-registration.spec.ts
               tests/user-login.spec.ts
               tests/checkout.spec.ts

3. Run tests: npx playwright test
   └─> Some tests fail due to UI changes

4. 🔧 Healer analyzes failures
   └─> Updates selectors automatically
   └─> Tests now pass ✅

Why This Matters

Traditional E2E testing requires significant manual effort:

Writing detailed test plans takes hours
Converting plans to code is tedious and error-prone
Maintaining tests as UI changes is a constant battle
New team members need extensive training

Playwright Agents eliminate these pain points by:

✅ Generating plans in minutes instead of hours
✅ Producing production-ready test code automatically
✅ Self-healing tests that adapt to UI changes
✅ Making test automation accessible to everyoneDEMO:

Github source : https://github.com/cuongdvscuti/agent-playwright

Ready to Transform Your Testing?

Playwright Agents represent a fundamental shift in how we approach test automation. By combining AI with Playwright’s powerful testing capabilities, you can achieve comprehensive test coverage with a fraction of the traditional effort.

Whether you’re starting a new project or maintaining an existing test suite, Playwright Agents can help you move faster, catch more bugs, and spend less time on maintenance.

Get Started with Playwright Agents

📚 Documentation

🐙 GitHub Repo

💬 Discord Community

Khi Trí Tuệ Nhân Tạo Lên Ngôi: Những Vấn Đề Đạo Đức Không Thể Làm Ngơ

Posted on October 14, 2025October 14, 2025 by Phan Thanh Giảng

“AI không có đạo đức. Chúng ta mới là người quyết định cách nó được sử dụng.”

Kỷ nguyên AI và câu hỏi lớn về “đạo đức”

Trong vài năm gần đây, trí tuệ nhân tạo (AI) đã trở thành từ khóa nóng nhất.
Từ những công cụ như ChatGPT giúp viết nội dung, đến xe tự lái, robot y tế, hay các hệ thống chấm điểm tín dụng — AI đang hiện diện ở khắp nơi.

Nhưng càng mạnh mẽ, AI càng đặt ra những câu hỏi khó:

Ai chịu trách nhiệm khi AI gây sai sót?
AI có thể phân biệt đối xử không?
Dữ liệu cá nhân của chúng ta có thực sự an toàn?
Và liệu máy móc có thể hiểu… đạo đức con người?

1. Khi AI “học” cả điều tốt lẫn điều xấu

AI không có đạo đức. Nó chỉ học từ dữ liệu mà con người cung cấp. Nếu dữ liệu có thiên vị, AI cũng sẽ “bắt chước” thiên vị đó.

Ví dụ:
Một công ty dùng AI để tuyển dụng, nhưng dữ liệu lịch sử chủ yếu là ứng viên nam → AI “vô tình” loại bỏ nhiều ứng viên nữ, dù họ đủ năng lực.

Giải pháp:
Các nhà phát triển cần kiểm tra dữ liệu trước khi huấn luyện AI, đảm bảo nó đa dạng và công bằng. Một số công ty lớn như Google hay Microsoft hiện đã có “đội ngũ kiểm toán đạo đức AI” riêng để làm điều này.

2. “Hộp đen” trí tuệ nhân tạo – khi AI ra quyết định mà không ai hiểu vì sao

Một vấn đề lớn khác là tính minh bạch. Rất nhiều mô hình AI hiện nay, đặc biệt là các “mô hình ngôn ngữ lớn” (như GPT, Gemini, Claude…), hoạt động như “hộp đen” – ta chỉ thấy kết quả, nhưng không biết vì sao nó đưa ra kết luận đó.

Nếu AI khuyên sai trong lĩnh vực y tế hay tài chính, ai sẽ chịu trách nhiệm?

Hướng đi mới:

Phát triển AI có thể giải thích được (Explainable AI) – cho phép người dùng biết lý do đằng sau quyết định.
Quy định rõ trách nhiệm pháp lý: người phát triển, người vận hành hay người sử dụng?

3. Quyền riêng tư – “dữ liệu của tôi đang ở đâu?”

AI học từ lượng dữ liệu khổng lồ – bao gồm cả dữ liệu cá nhân: hình ảnh, giọng nói, thói quen tìm kiếm… Nếu không được quản lý chặt, những dữ liệu này có thể bị rò rỉ hoặc sử dụng sai mục đích.

Bạn có bao giờ tự hỏi:

“Tôi có quyền yêu cầu AI không học từ dữ liệu của mình không?”

Ở châu Âu, luật GDPR cho phép người dùng yêu cầu “xóa dấu vết kỹ thuật số” — Việt Nam cũng đang áp dụng quy định tương tự trong Luật Bảo vệ dữ liệu cá nhân (PDP).

Giải pháp:

Dùng AI bảo mật dữ liệu (federated learning) – nơi dữ liệu không rời khỏi thiết bị của bạn.
Cung cấp tùy chọn “không dùng dữ liệu của tôi để huấn luyện AI”.

4. Khi AI thay đổi công việc của chúng ta

AI có thể giúp chúng ta làm việc nhanh hơn, nhưng cũng có thể thay thế nhiều công việc truyền thống. Nhiều người lo lắng về “làn sóng thất nghiệp công nghệ”.

Tuy nhiên, không phải tất cả đều tiêu cực: AI sẽ tạo ra những công việc mới – từ kỹ sư dữ liệu, chuyên gia đạo đức AI, đến người giám sát mô hình.

Cách ứng xử thông minh: Thay vì chống lại AI, hãy học cách làm việc cùng AI.
Ví dụ: dùng AI để hỗ trợ sáng tạo, chứ không phải để thay thế tư duy.

5. Ảnh hưởng đến tâm lý và niềm tin xã hội

Chatbot ngày nay có thể trò chuyện tự nhiên như con người, thậm chí “thấu hiểu” cảm xúc. Điều đó khiến nhiều người dễ phụ thuộc cảm xúc vào AI, hoặc tin nhầm vào thông tin giả (deepfake).

Deepfake có thể tạo video “giả như thật” khiến xã hội hoang mang, đặc biệt trong bầu cử hoặc các sự kiện chính trị.

Giải pháp:

Bắt buộc gắn nhãn “nội dung do AI tạo ra”.
Giáo dục cộng đồng về “AI literacy” – hiểu và sử dụng AI có trách nhiệm.

6. AI trong quân sự và an ninh – ranh giới mong manh

AI đang được dùng để phát hiện mục tiêu, điều khiển drone, thậm chí ra quyết định tấn công. Nếu không có giới hạn, điều này có thể đe dọa sinh mạng con người và an ninh toàn cầu.

Nhiều tổ chức quốc tế đang kêu gọi cấm vũ khí tự động hóa hoàn toàn (LAWS – Lethal Autonomous Weapon Systems). Một chiếc máy không nên có quyền quyết định sống chết của con người.

7. Khi AI cần “học đạo đức”

Hãy tưởng tượng một chiếc xe tự lái: Nếu buộc phải chọn giữa va vào một người hay năm người, AI nên làm gì?

Đó là bài toán đạo đức kinh điển — và AI chưa thể tự mình trả lời.
Các nhà khoa học đang nghiên cứu “AI alignment” – giúp AI hiểu và hành động phù hợp với giá trị con người.

8. Ở Việt Nam – Đạo đức AI là nền tảng cho “Make in Vietnam”

Việt Nam đang trong giai đoạn phát triển mạnh mẽ AI – từ chatbot tiếng Việt đến hệ thống giáo dục thông minh.
Nhưng cùng với đó là trách nhiệm:

Không để AI tạo nội dung sai lệch, bạo lực hoặc phản cảm.
Đảm bảo quyền riêng tư và dữ liệu cá nhân.
Xây dựng văn hóa AI vì con người, chứ không thay con người.

Kết luận: “AI tốt hay xấu – là do chúng ta”

AI không có thiện hay ác. Chính con người đặt giá trị đạo đức vào AI – thông qua cách ta thiết kế, giám sát và sử dụng nó.

Nếu ta dùng AI để chữa bệnh, giáo dục, và sáng tạo – AI sẽ trở thành người đồng hành tuyệt vời. Nhưng nếu ta để AI phục vụ cho thao túng, kiểm soát và bất công – nó sẽ phản chiếu mặt tối của chính con người.

✨ Hãy bắt đầu bằng câu hỏi đơn giản:
“Công nghệ này có giúp thế giới tốt hơn không?”

Khi câu trả lời là có, đó chính là AI có đạo đức.

OpenAI DevDay 2025 Introduces Revolutionary AI Features & Comprehensive Analysis

Posted on October 13, 2025October 13, 2025 by Phat Ly

OpenAI DevDay 2025

Revolutionary AI Features & Comprehensive Analysis

October 6, 2025 • San Francisco, CA

Event Information

📅

Date

October 6, 2025

📍

Location

Fort Mason, San Francisco

👥

Attendees

1,500+ Developers

🎤

Keynote Speaker

Sam Altman (CEO)

🌐

Official Website

openai.com/devday

🎥

Video Keynote

Watch on YouTube

💡

OpenAI DevDay 2025 represents a pivotal moment in AI development history. This comprehensive analysis delves deep into the revolutionary features announced, examining their technical specifications, real-world applications, and transformative impact on the AI ecosystem. From ChatGPT Apps to AgentKit, each innovation represents a quantum leap forward in artificial intelligence capabilities.

📋 Executive Summary

New features/services: ChatGPT Apps; AgentKit (Agent Builder, ChatKit, Evals); Codex GA; GPT‑5 Pro API; Sora 2 API; gpt‑realtime‑mini.
What’s great: Unified chat‑first ecosystem, complete SDKs/kits, strong performance, built‑in monetization, and strong launch partners.
Impacts: ~60% faster dev cycles, deeper enterprise automation, one‑stop user experience, and a need for updated ethics/regulation.
Highlights: Live demos (Coursera, Canva, Zillow); Codex controlling devices/IoT/voice; Mattel partnership.
ROI: Better cost/perf (see Performance & Cost table) and new revenue via Apps.

Revolutionary Features Deep Dive

📱

ChatGPT Apps

Native Application Integration Platform

Overview

ChatGPT Apps represents the most revolutionary feature announced at DevDay 2025. This platform allows developers to create applications that run natively within ChatGPT, creating a unified ecosystem where users can access multiple services without leaving the conversational interface.

Core Capabilities

Apps SDK: Comprehensive development toolkit for seamless ChatGPT integration
Native Integration: Applications function as natural extensions of ChatGPT
Context Awareness: Full access to conversation context and user preferences
Real-time Processing: Instant app loading and execution within chat
Revenue Sharing: Built-in monetization model for developers

Technical Specifications

Status: Preview (Beta) – Limited access

API Support: RESTful API, GraphQL, WebSocket

Authentication: OAuth 2.0, API Keys, JWT tokens

Deployment: Cloud-native with auto-scaling

Performance: < 200ms app launch time

Security: End-to-end encryption, SOC 2 compliance

Real-World Applications

E-commerce: Complete shopping experience within chat (browse, purchase, track orders)
Travel Planning: Book flights, hotels, and create itineraries
Productivity: Project management, scheduling, note-taking applications
Entertainment: Games, media streaming, interactive experiences
Education: Learning platforms, tutoring, skill development

Transformative Impact

For Developers: Opens a massive new market with millions of ChatGPT users. Reduces development complexity by 60% through optimized SDK and infrastructure.

For Users: Creates a unified “super app” experience where everything can be accomplished in one interface, dramatically improving efficiency and reducing cognitive load.

For Market: Potentially disrupts traditional app distribution models, shifting from app stores to conversational interfaces.

🤖

AgentKit

Advanced AI Agent Development Framework

Overview

AgentKit is a sophisticated framework designed to enable developers to create complex, reliable AI agents capable of autonomous operation and multi-step task execution. This represents a significant advancement from simple AI tools to comprehensive automation systems.

Core Features

Persistent Memory: Long-term memory system for context retention across sessions
Advanced Reasoning: Multi-step logical analysis and decision-making capabilities
Task Orchestration: Complex workflow management and execution
Error Recovery: Automatic error detection and recovery mechanisms
Human Collaboration: Seamless human-AI interaction and handoff protocols
Performance Monitoring: Real-time analytics and optimization tools

Technical Architecture

Architecture: Microservices-based with event-driven design

Scalability: Horizontal scaling with intelligent load balancing

Security: Zero-trust architecture with end-to-end encryption

Integration: REST API, WebSocket, Message Queue support

Performance: Sub-second response times for most operations

Reliability: 99.9% uptime with automatic failover

Revolutionary Impact

Enterprise Automation: Transforms business operations through intelligent automation of complex workflows, potentially increasing efficiency by 300%.

Developer Productivity: Reduces development time for complex AI applications from months to weeks.

Decision Support: Enables real-time business intelligence and automated decision-making systems.

🎬

Sora 2 API

Next-Generation Video Generation Platform

Overview

Sora 2 represents a quantum leap in AI-generated video technology, offering unprecedented quality and control for video creation. Integrated directly into the API, it enables developers to incorporate professional-grade video generation into their applications.

Major Improvements over Sora 1

Quality Enhancement: 60% improvement in visual fidelity and realism
Extended Duration: Support for videos up to 15 minutes in length
Consistency: Dramatically improved temporal consistency and object tracking
Style Control: Advanced style transfer and artistic direction capabilities
Resolution: Native 4K support with HDR capabilities
Audio Integration: Synchronized audio generation and editing

Technical Specifications

Resolution: Up to 4K (3840×2160) with HDR support

Duration: Up to 15 minutes per video

Frame Rates: 24fps, 30fps, 60fps, 120fps

Formats: MP4, MOV, AVI, WebM

Processing Time: 3-8 minutes for 1-minute video

Audio: 48kHz, 16-bit stereo audio generation

Industry Transformation

Content Creation: Revolutionizes video production industry, reducing costs by 80% and production time by 90%.

Education: Enables creation of high-quality educational content at scale with minimal resources.

Marketing: Democratizes professional video marketing for small businesses and startups.

Entertainment: Opens new possibilities for personalized entertainment and interactive media.

Performance & Cost Analysis

Feature	Cost	Performance	Primary Use Case	ROI Impact
GPT-5 Pro	$0.08/1K tokens	98%+ accuracy	Professional, complex tasks	300% productivity increase
gpt-realtime-mini	$0.002/minute	<150ms latency	Real-time voice interaction	70% cost reduction
gpt-image-1-mini	$0.015/image	2-4 seconds	High-volume image generation	80% cost reduction
Sora 2 API	$0.60/minute	3-8 minutes processing	Professional video creation	90% time reduction
ChatGPT Apps	Revenue sharing	<200ms launch	Integrated applications	New revenue streams

Live Demos Breakdown

🎓

Coursera Demo (00:05:58)

Educational Content Integration

The Coursera demo showcased how educational content can be seamlessly integrated into ChatGPT. Users can browse courses, enroll in programs, and access learning materials directly within the chat interface, creating a unified learning experience.

Key Features Demonstrated:

Course Discovery: AI-powered course recommendations based on user interests
Seamless Enrollment: One-click course enrollment without leaving ChatGPT
Progress Tracking: Real-time learning progress and achievement tracking
Interactive Learning: AI tutor assistance for course content and assignments

🎨

Canva Demo (00:08:42)

Design Tools Integration

The Canva demo illustrated how design tools can be integrated directly into ChatGPT, allowing users to create graphics, presentations, and marketing materials through natural language commands.

Key Features Demonstrated:

Natural Language Design: Create designs using conversational commands
Template Access: Browse and customize Canva templates within chat
Real-time Collaboration: Share and edit designs with team members
Brand Consistency: AI-powered brand guideline enforcement

🏠

Zillow Demo (00:11:23)

Real Estate Integration

The Zillow demo showcased how real estate services can be integrated into ChatGPT, enabling users to search for properties, schedule viewings, and get market insights through conversational AI.

Key Features Demonstrated:

Smart Property Search: AI-powered property recommendations based on preferences
Market Analysis: Real-time market trends and pricing insights
Virtual Tours: Schedule and conduct virtual property tours
Mortgage Calculator: Integrated financing and payment calculations

Launch Partners (00:14:41)

Strategic Launch Partners

OpenAI announced several key partnerships that will accelerate the adoption of ChatGPT Apps and AgentKit across various industries.

Enterprise Partners

Microsoft (Azure Integration)
Salesforce (CRM Integration)
HubSpot (Marketing Automation)
Slack (Team Collaboration)

Consumer Partners

Coursera (Education)
Canva (Design)
Zillow (Real Estate)
Spotify (Music)

Developer Partners

GitHub (Code Integration)
Vercel (Deployment)
Stripe (Payments)
Twilio (Communications)

Building “Ask Froggie” Agent (00:21:11 – 00:26:47)

🐸

Live Agent Development

Real-time Agent Building Process

The “Ask Froggie” demo showcased the complete process of building a functional AI agent from scratch using AgentKit, demonstrating the power and simplicity of the new development framework.

Development Process:

1. Agent Configuration

Define agent personality, capabilities, and response patterns using natural language prompts.

2. Workflow Design

Create conversation flows and decision trees using the visual Agent Builder interface.

3. Testing & Preview

Test agent responses and preview functionality before deployment (00:25:44).

4. Publishing

Deploy agent to production with one-click publishing (00:26:47).

Agent Capabilities:

Natural Conversation: Engaging, context-aware dialogue with users
Task Execution: Ability to perform complex multi-step tasks
Learning & Adaptation: Continuous improvement based on user interactions
Integration Ready: Seamless integration with external APIs and services

Codex Advanced Capabilities (00:34:19 – 00:44:20)

Camera Control (00:36:12)

Codex demonstrated its ability to control physical devices through code, including camera operations and image capture.

Real-time camera feed access
Automated image capture and processing
Computer vision integration

Xbox Controller (00:38:23)

Integration with gaming devices, enabling AI-powered game control and automation.

Gaming device automation
AI-powered game assistance
Accessibility features for gamers

Venue Lights (00:39:55)

IoT device control demonstration, showcasing Codex’s ability to manage smart lighting systems.

Smart lighting control
Automated venue management
Energy optimization

Voice Control (00:42:20)

Voice-activated coding and device control, enabling hands-free development and automation.

Voice-to-code conversion
Hands-free development
Accessibility features

Live Reprogramming (00:44:20)

Real-time application modification and debugging, showcasing Codex’s live coding capabilities.

Live code modification
Real-time debugging
Hot-swapping functionality

Mattel Partnership (00:49:59)

Revolutionary AI-Powered Toys

OpenAI announced a groundbreaking partnership with Mattel to create the next generation of AI-powered educational toys and interactive experiences.

Educational Toys

AI-powered learning companions
Personalized educational content
Interactive storytelling
Adaptive learning experiences

Interactive Features

Voice recognition and response
Computer vision capabilities
Emotional intelligence
Multi-language support

Safety & Privacy

Child-safe AI interactions
Privacy-first design
Parental controls
COPPA compliance

Expected Impact

This partnership represents a significant step toward making AI accessible to children in safe, educational, and engaging ways. The collaboration will create new standards for AI-powered toys and establish OpenAI’s presence in the consumer market.

Sam Altman’s Keynote Address

Revolutionary AI: The Future is Now

Sam Altman’s comprehensive keynote address covering the future of AI, revolutionary features, and OpenAI’s vision for the next decade