Inside the Workshop: Automated Quality Control through Clustering

#Clustering #AI Quality Control #LoRA #Vector Database #Low Cost AI #Translation Pipeline

Intip Dapurnya Translator: Menjaga Kualitas Tanpa Bikin Kantong Jebol

Pernahkah kamu penasaran gimana cara seorang modder solo bisa memoles puluhan ribu baris teks dengan kualitas yang konsisten? Rahasianya bukan ada di kafein yang diminum tiap malam, melainkan di teknik Clustering yang terotomasi. Aku mau sharing hal teknis yang mungkin agak berat, tapi ini penting buat menunjukkan kalau dukungan kalian itu benar-benar digunakan untuk membangun infrastruktur riset yang gokil. Bayangkan, harga API GPT-4 atau model high-end lainnya itu selangit, apalagi buat nerjemahin ribuan dialog RPG. Strategi 'brute-force' (kirim semua teks ke API berbayar) itu bukan cuma nggak cerdas, tapi juga cara cepat menuju kebangkrutan pribadi, WKWKWK.

Solusinya? Aku bikin sistem Quality Control otomatis sendiri. Kunci dari sistem ini adalah jangan pernah menyuapi AI dengan data yang 'kotor' atau ribuan baris data mentah yang nggak diatur. Itu cuma bikin otak AI-nya 'enyek' atau saturasi. AI-nya malah jadi makin goblok karena kehilangan fokus konteks. Melalui teknik Clustering, data teks tadi aku pecah menjadi ribuan fragmen logika yang masing-masing punya identitas matematis tersendiri. Sistem ini membantu aku menemukan 'perwakilan terbaik' dari setiap tipe dialog untuk dipelajari ulang oleh mesin.

Dari sisi matematika, sistem ini mengubah kata-kata kamu jadi angka atau vektor di ruang 768 dimensi. Tapi karena 768 dimensi itu berat banget, aku kompres secara cerdas menjadi 50 dimensi doang biar PC-ku nggak teriak. Ada tiga jalur (route) utama dalam skripku: Jalur HDBSCAN (paling elit, buat datanya padat), Jalur K-Means (buat datanya yang agak mencar), dan Jalur Bucketing (cadangan manual). Proyek *Black Myth Wukong* adalah bukti nyata kesaktian sistem ini; cuma data super elit yang bisa lolos tes akurasi internal skripku.

Optimasi Budget: Mewahnya Terjemahan Harga 1 Dollar

Tahap paling asik itu di ujungnya: dari ratusan kluster tadi, aku cuma perlu benerin secara manual sekitar 300 sampel teks kunci yang benar-benar mewakili gaya bahasa seluruh game. Sampel koreksi manusia yang super-premium inilah yang aku masukin ke proses LoRA Training untuk model AI sebesar 27 miliar parameter. Hasilnya gila, ngab! Aku cuma butuh duit $1 sampe $5 aja per training untuk dapet hasil yang mendekati kerjaan manusia. Meskipun biaya sewa GPU di awal sempet bikin kantong 'boncos' (nangis liat tagihan Cloud GPU!), sekarang 80% prosesnya sudah lancar jaya di mesin offline. Dukungan Trakteer kalian adalah nyawa buat riset begini. Mari kita terus majukan teknologi lokalisasi Indonesia!

Inside the Workshop: Maintaining Quality Without Breaking the Bank

Have you ever wondered how a solo modder can polish tens of thousands of translation lines to a consistent standard? The secret isn't caffeine-fueled stamina; it is an automated clustering pipeline. Today I want to lift the veil on the technical machinery I've been building, because supporters deserve to see exactly how their contributions fund this research infrastructure. Let's be real: brute-forcing a translation by sending every line to the GPT-4 API (or any other high-end paid model) is not only unwise, it's a direct shortcut to personal bankruptcy! Thousands of RPG dialogue lines would cost a fortune at standard API rates.

My solution? An in-house automated Quality Control system. The golden rule is never to feed your AI 'dirty' data or thousands of raw, unorganized lines. When you drown a Large Language Model (LLM) in massive, unorganized data, it hits what I call 'semantic saturation': it gets dumber because it loses contextual focus. With clustering, I break the text into thousands of logical fragments, each with its own mathematical signature (its embedding vector). This lets the system find the 'best representative' of every type of dialogue for the model to relearn from.
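
To make the 'best representative' idea concrete, here is a minimal sketch of one common way to do it: take the line whose vector sits closest to its cluster's centroid. This is an illustration, not my actual script. The function names, the toy 2-D vectors, and the convention that label -1 means noise are all assumptions for the example.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pick_representatives(lines, vectors, labels):
    """Return {cluster_label: line}, choosing the member nearest its centroid.

    Lines labelled -1 are treated as noise and skipped (assumed convention).
    """
    members = defaultdict(list)
    for index, label in enumerate(labels):
        if label != -1:
            members[label].append(index)

    representatives = {}
    for label, indexes in members.items():
        center = centroid([vectors[i] for i in indexes])
        best = min(indexes, key=lambda i: math.dist(vectors[i], center))
        representatives[label] = lines[best]
    return representatives


# Toy data: 2-D vectors stand in for real sentence embeddings.
lines = ["Hello there.", "Hi!", "Greetings, traveler.",
         "Attack now!", "Charge!", "To arms!", "Stray line"]
vectors = [[0.0, 0.0], [1.0, 0.0], [0.4, 0.1],
           [10.0, 10.0], [10.0, 12.0], [10.0, 10.5], [50.0, -50.0]]
labels = [0, 0, 0, 1, 1, 1, -1]

reps = pick_representatives(lines, vectors, labels)
print(reps)
```

Only the lines returned here would go on to manual correction; everything else in the cluster is assumed to share their style.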

Mathematically, the system turns each line of text into a vector in a 768-dimensional space. Since working in 768 dimensions is computationally expensive, I apply dimensionality reduction and squeeze it down to just 50 dimensions, preserving the 'soul' of the language while keeping my PC from screaming. My scripts use three distinct routes: HDBSCAN (the VVIP route, for dense data), K-Means (the VIP fallback, for data that is more spread out), and Manual Bucketing (the safety net). The *Black Myth Wukong* project was the ultimate benchmark for this; only the highest tier of data samples survived the accuracy tests built into the pipeline.
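
The reduce-then-route flow can be sketched roughly as below. Treat it as a sketch under stated assumptions, not the real pipeline: the reducer here is a plain Gaussian random projection that I swapped in to keep the example dependency-free, the acceptance rule in `is_acceptable` is invented for illustration, and the two routes are toy stand-ins. In a real setup the routes could wrap something like scikit-learn's `HDBSCAN` and `KMeans` estimators.

```python
import math
import random


def random_projection(vectors, out_dim=50, seed=0):
    """Reduce dimensionality with a Gaussian random projection (stand-in reducer)."""
    rng = random.Random(seed)
    in_dim = len(vectors[0])
    scale = 1.0 / math.sqrt(out_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
              for _ in range(out_dim)]
    return [[sum(w * x for w, x in zip(row, vec)) for row in matrix]
            for vec in vectors]


def is_acceptable(labels, max_noise=0.3, min_clusters=2):
    """Assumed quality gate: enough clusters and not too many noise points (-1)."""
    if not labels:
        return False
    noise = sum(1 for label in labels if label == -1)
    clusters = {label for label in labels if label != -1}
    return len(clusters) >= min_clusters and noise / len(labels) <= max_noise


def cluster_with_fallback(points, routes):
    """Try each (name, route) in order and return the first acceptable labelling."""
    for name, route in routes:
        labels = route(points)
        if is_acceptable(labels):
            return name, labels
    raise RuntimeError("no route produced an acceptable clustering")


def density_stub(points):
    """Toy stand-in for a density route that finds no dense region at all."""
    return [-1] * len(points)


def bucketing_stub(points, bucket_size=5):
    """Toy stand-in for manual bucketing: fixed-size buckets in input order."""
    return [i // bucket_size for i in range(len(points))]


# Fake 768-dimensional embeddings for 20 lines of dialogue.
rng = random.Random(42)
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(768)] for _ in range(20)]

reduced = random_projection(embeddings, out_dim=50)
route_name, labels = cluster_with_fallback(
    reduced,
    [("density", density_stub), ("bucketing", bucketing_stub)],
)
print(len(reduced[0]), route_name, labels)
```

The point of the ordering is that the cheapest-to-trust route gets first shot, and the pipeline only drops down a tier when the result fails the quality gate.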

Budget Optimization: Premium Translations for One Dollar

The magic happens at the final stage. Out of those hundreds of clusters, I only need to manually correct about 300 key sample lines that truly represent the writing style of the whole game. These hand-corrected 'premium samples' are then fed into LoRA training for a 27B-parameter LLM. The results are phenomenal! Instead of hundreds of dollars, each training run costs me only between $1 and $5, and the output comes close to human work. The initial GPU rental bills hit my wallet hard (RIP budget for a few weeks), but 80% of the process now runs offline on local machines. Your ongoing support is the lifeblood of this research. Let's keep pushing Indonesian localization tech forward together!
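
For anyone wondering why LoRA training on such a big model can be this cheap: LoRA freezes the original weights and trains only two small low-rank matrices per adapted layer. The arithmetic below is a rough illustration of that saving. The layer width and rank are hypothetical numbers picked for the example, not the real dimensions of the 27B model or my training config.

```python
def full_params(d_in, d_out):
    """Parameters in one dense weight matrix of shape (d_out x d_in)."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


d_model = 4096  # hypothetical layer width
rank = 16       # hypothetical LoRA rank

full = full_params(d_model, d_model)
adapter = lora_trainable_params(d_model, d_model, rank)
fraction = adapter / full

print(f"full matrix: {full:,} params")      # full matrix: 16,777,216 params
print(f"LoRA adapter: {adapter:,} params")  # LoRA adapter: 131,072 params
print(f"trainable share: {fraction:.2%}")   # trainable share: 0.78%
```

Under these assumed numbers, less than 1% of each adapted matrix is actually trained, which is why a small set of about 300 high-quality samples and a short GPU session can be enough.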