Optimizing PDFs for LLMs

PDF is one of the most widely used document formats, covering everything from reports and contracts to technical documentation. But feeding PDFs directly into large language models (LLMs) for Q&A, RAG (Retrieval-Augmented Generation), or summarization tends to run into several problems:

Scanned PDFs contain only images, so their text cannot be read directly.
Complex layouts (two columns, tables, charts) cause text extraction to return content in the wrong order.
Large files produce many tokens to embed, which drives up cost.
Headers, footers, page numbers, and stray special characters add noise.

So optimizing PDFs before passing them to an LLM is essential if you want accurate results, reasonable cost, and a system that scales.

In this article, I'll walk through building a PDF processing pipeline in Node.js: text extraction, OCR for scanned PDFs, data cleaning, chunking, embedding generation, and indexing with FAISS.

Workflow overview

PDF → Classify (Text-based / Scanned)
├─ Text-based → pdf-parse → Clean → Chunk → Embedding → Index
└─ Scanned → OCR (tesseract.js) → Clean → Chunk → Embedding → Index

Step 1. Install the libraries

npm install pdf-parse openai faiss-node tesseract.js pdf2pic

pdf-parse: extracts text from text-based PDFs.
tesseract.js + pdf2pic: OCR for scanned (image-only) PDFs.
openai: calls the API to create embeddings.
faiss-node: builds a vector index for search.

Step 2. Extract text from a PDF (text-based)

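import fs from "fs";
import pdf from "pdf-parse";

// Read the whole file into memory as a Buffer
const dataBuffer = fs.readFileSync("sample.pdf");

const extractPdf = async () => {
  // pdf-parse returns the extracted text plus metadata such as the page count
  const data = await pdf(dataBuffer);
  console.log("Pages:", data.numpages);
  return data.text;
};

// Top-level await requires an ES module (.mjs or "type": "module")
const rawText = await extractPdf();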
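The classification step from the workflow can be driven by the same pdf-parse output. The sketch below is one possible heuristic, not part of any library: it treats a PDF as scanned when the text layer yields very few visible characters per page. The function name `classifyPdf` and the threshold of 100 characters per page are assumptions to tune on your own documents.

```javascript
// Heuristic sketch: a PDF whose text layer has almost no characters per page
// is probably a scan. The default threshold (100) is an assumed value.
function classifyPdf(text, numPages, minCharsPerPage = 100) {
  // Ignore whitespace so that blank lines do not count as content
  const visibleChars = text.replace(/\s+/g, "").length;
  const avgPerPage = numPages > 0 ? visibleChars / numPages : 0;
  return avgPerPage >= minCharsPerPage ? "text-based" : "scanned";
}

// Inputs shaped like pdf-parse output (data.text, data.numpages)
console.log(classifyPdf("x".repeat(5000), 10)); // text-based
console.log(classifyPdf("\n\n \n", 10)); // scanned
```

A document that mixes scanned and digital pages can fool an average like this, so checking pages individually may be safer where that matters.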

Step 3. Clean the data

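function cleanText(text) {
  return text
    // drop control characters but keep accented and other non-ASCII letters
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, "")
    // drop "Page X of Y" page numbers
    .replace(/Page\s+\d+\s+of\s+\d+/gi, "")
    // collapse redundant whitespace
    .replace(/\s+/g, " ")
    .trim();
}

const cleanedText = cleanText(rawText);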
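Running headers and footers are harder to catch with a fixed regex because their wording differs from document to document. One possible approach, sketched below, is to drop any line that repeats on most pages. It assumes you have the text of each page as a separate string in a `pages` array (for example from a per-page OCR loop); pdf-parse returns a single string by default, so that array is an assumption. The name `stripRepeatedLines` and the 60% threshold are illustrative.

```javascript
// Sketch: remove lines that appear on at least `minShare` of the pages,
// which is typical for running headers and footers.
function stripRepeatedLines(pages, minShare = 0.6) {
  const counts = new Map();
  for (const page of pages) {
    // Count each distinct non-empty line once per page
    const unique = new Set(
      page.split("\n").map((line) => line.trim()).filter(Boolean)
    );
    for (const line of unique) {
      counts.set(line, (counts.get(line) || 0) + 1);
    }
  }
  // Require at least 2 occurrences so single-page documents are untouched
  const threshold = Math.max(2, Math.ceil(pages.length * minShare));
  return pages.map((page) =>
    page
      .split("\n")
      .filter((line) => (counts.get(line.trim()) || 0) < threshold)
      .join("\n")
  );
}

const pages = [
  "ACME Report\nIntro text",
  "ACME Report\nMore text",
  "ACME Report\nFinal text",
];
console.log(stripRepeatedLines(pages));
```

Lines that change on every page, such as "Page 1 of 3", are not caught by this and still need a pattern-based rule.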

Step 4. Chunk the text

Split the text at sentence boundaries and group the sentences into chunks, rather than cutting at a fixed character count.

function chunkText(text, maxWords = 200) {
  // Split after sentence-ending punctuation
  const sentences = text.split(/(?<=[.?!])\s+/);
  const chunks = [];
  let current = [];
  let count = 0;

  for (const sentence of sentences) {
    const words = sentence.split(/\s+/).filter(Boolean).length;
    // Close the current chunk before it would exceed the word budget;
    // the length check avoids pushing an empty chunk for an oversized sentence
    if (current.length && count + words > maxWords) {
      chunks.push(current.join(" "));
      current = [];
      count = 0;
    }
    current.push(sentence);
    count += words;
  }
  if (current.length) chunks.push(current.join(" "));
  return chunks;
}

const chunks = chunkText(cleanedText);
console.log("Chunks:", chunks.length);
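A common refinement, not part of the pipeline above, is to let neighboring chunks overlap slightly so that a sentence split across a boundary keeps some of its context. The sketch below prepends the last few words of the previous chunk to each chunk; `addOverlap` and the default of 20 words are assumptions for illustration.

```javascript
// Sketch: prepend the tail of the previous chunk to each chunk.
function addOverlap(chunks, overlapWords = 20) {
  return chunks.map((chunk, i) => {
    // The first chunk has no predecessor; zero overlap leaves chunks as-is
    if (i === 0 || overlapWords <= 0) return chunk;
    const prevWords = chunks[i - 1].split(/\s+/).filter(Boolean);
    const tail = prevWords.slice(-overlapWords).join(" ");
    return tail ? `${tail} ${chunk}` : chunk;
  });
}

console.log(addOverlap(["a b c d", "e f"], 2)); // [ 'a b c d', 'c d e f' ]
```

Overlap increases the total number of tokens sent for embedding, so it trades some cost for retrieval quality.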

Step 5. OCR for scanned PDFs

If the PDF contains only images, convert each page to an image with pdf2pic and run OCR on it with tesseract.js:

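import { fromPath } from "pdf2pic";
import Tesseract from "tesseract.js";

// Render pages to PNG images; pdf2pic relies on GraphicsMagick and Ghostscript being installed
const convert = fromPath("scanned.pdf", { density: 200, format: "png" });

// Only the first 3 pages are processed in this example
for (let i = 1; i <= 3; i++) {
  const page = await convert(i);
  // "eng" selects the English model; use the language code that matches the document
  const result = await Tesseract.recognize(page.path, "eng");
  console.log("OCR page", i, ":", result.data.text.substring(0, 200));
}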

Step 6. Create embeddings and index them (OpenAI + FAISS)

import OpenAI from "openai";
import faiss from "faiss-node";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function getEmbedding(text) {
  const res = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return res.data[0].embedding;
}

// Embed the chunks one by one
const embeddings = [];
for (const chunk of chunks) {
  embeddings.push(await getEmbedding(chunk));
}

// Build the FAISS index
const dim = embeddings[0].length;
const index = new faiss.IndexFlatL2(dim);
// faiss-node expects one flat array of n * dim numbers, not an array of vectors
index.add(embeddings.flat());
console.log("Indexed:", index.ntotal(), "chunks");
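Calling the API once per chunk gets slow for long documents. The embeddings endpoint also accepts an array of strings as `input`, so chunks can be sent in groups. The sketch below shows one way to do that; `toBatches`, `embedAll`, and the batch size of 100 are assumptions, and `client` is expected to be an OpenAI client created as in the step above.

```javascript
// Split a list into consecutive groups of at most `batchSize` items.
function toBatches(items, batchSize = 100) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Sketch: embed all chunks with one request per batch.
async function embedAll(client, chunks, batchSize = 100) {
  const embeddings = [];
  for (const batch of toBatches(chunks, batchSize)) {
    const res = await client.embeddings.create({
      model: "text-embedding-3-small",
      input: batch, // an array of strings is embedded in one request
    });
    // Sort by the returned index so embeddings line up with the input order
    const sorted = [...res.data].sort((a, b) => a.index - b.index);
    for (const item of sorted) embeddings.push(item.embedding);
  }
  return embeddings;
}

console.log(toBatches([1, 2, 3, 4, 5], 2)); // [ [ 1, 2 ], [ 3, 4 ], [ 5 ] ]
```

The result of `embedAll` has the same shape as the `embeddings` array built in this step, so it can be flattened and added to the index in the same way.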

Step 7. Search and query

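async function searchQuery(query, k = 3) {
  const qEmb = await getEmbedding(query);
  // faiss-node takes a flat query vector and returns flat labels/distances arrays;
  // k must not exceed the number of indexed vectors
  const { labels } = index.search(qEmb, Math.min(k, index.ntotal()));
  return labels.map((i) => chunks[i]);
}

const results = await searchQuery("What is the main content of the report?");
console.log("Top matches:", results);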

You can then put results into the GPT prompt for Q&A or summarization.
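As an illustration of that last step, the retrieved chunks can be assembled into a prompt along these lines. This is a minimal sketch: the function name `buildPrompt` and the instruction wording are assumptions, and the right phrasing depends on your model and use case.

```javascript
// Sketch: combine retrieved chunks and the user's question into one prompt.
function buildPrompt(question, contexts) {
  // Number the excerpts so the model can refer to them
  const numbered = contexts.map((c, i) => `[${i + 1}] ${c}`).join("\n\n");
  return [
    "Answer the question using only the excerpts below.",
    "If the excerpts do not contain the answer, say so.",
    "",
    "Excerpts:",
    numbered,
    "",
    `Question: ${question}`,
  ].join("\n");
}

console.log(buildPrompt("What is the report about?", ["alpha", "beta"]));
```

The resulting string can be sent as the user message of a chat completion request. Numbering the excerpts also makes it easier to ask the model to cite which excerpt it used.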

Best practices

Tables → extract as CSV/JSON instead of plain text to preserve structure.
Figures/charts → OCR plus descriptive alt text.
Metadata → store page, section, and title so answers can be traced back to their source.
Large PDFs → process in batches of 50–100 pages at a time.
Pre-check → classify the PDF (text-based vs scanned) before processing.
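To make the metadata point concrete, here is a minimal sketch of chunk records that carry their page number and source file. It assumes the text is available per page in a `pages` array and that a chunking function is passed in; the names `chunkPagesWithMetadata`, `source`, and `chunkFn` are hypothetical.

```javascript
// Sketch: build chunk records that remember where each chunk came from.
// By default each page becomes one chunk; pass a chunking function to split further.
function chunkPagesWithMetadata(pages, source, chunkFn = (text) => [text]) {
  const records = [];
  pages.forEach((pageText, idx) => {
    for (const text of chunkFn(pageText)) {
      if (!text.trim()) continue; // skip empty chunks
      records.push({
        id: records.length, // insertion order, usable as the vector index label
        source,
        page: idx + 1, // 1-based page number
        text,
      });
    }
  });
  return records;
}

const records = chunkPagesWithMetadata(
  ["First page text.", "Second page text."],
  "sample.pdf"
);
console.log(records);
```

If vectors are added to the index in the same order as the records, a label returned by a search maps directly to `records[label]`, which gives the page and file to cite.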

Conclusion

Without preprocessing, a PDF becomes noise for an LLM: hard to interpret, token-hungry, and prone to producing inaccurate results. With the Node.js pipeline above, you can take a PDF from raw file to clean, structured data that is easy to search and ready for RAG or other AI applications.