Evaluation Report: Multi-Agent AI Translation System

Jac Wong

October 20, 2024


Introduction

In today's interconnected world, breaking language barriers is more crucial than ever. OneSky is dedicated to advancing AI-driven localization solutions that are accurate, reliable, and cost-effective. Our mission is to bridge communication gaps across the globe, enabling businesses to reach new markets effortlessly.

Our flagship product, OneSky Localization Agent (OLA), utilizes a sophisticated multi-agent system comprising multiple Large Language Models (LLMs). This innovative architecture ensures robust handling of linguistic nuances and contextual accuracy, delivering high-quality translations that resonate with local audiences.


Who Should Read This?

Stakeholders of Localization Projects
- Adoption Strategies: Learn how to leverage AI translation services to streamline multilingual content creation, reduce costs, and accelerate market expansion.
Technical and Product Development Teams
- Maximizing Efficiency: Discover how AI-powered localization processes can expedite content deployment and contribute to ongoing system enhancements.
AI Enthusiasts and Researchers
- Expanding Agent Capabilities: Explore the potential of multi-agent systems to handle domain-specific aspects and the possibility of autonomous virtual localization teams run by AI agents.


Objectives of Our Evaluation

We conducted an in-depth evaluation to assess the performance and quality of our AI translation service, focusing on:

Translation Accuracy: Measuring the precision and correctness of translations across various language pairs.
Human-Like Quality: Determining how closely AI translations match human translations in fluency and readability.
Efficiency of Post-Editing Processes: Evaluating the extent of human intervention required to refine AI-generated translations.
Role of the Multi-Agent System: Examining how our multi-agent approach enhances translation quality.


OLA's Multi-Agent Approach

Our multi-agent system is designed for excellence:

One L10N Manager Agent: Categorizes each string and provides context for it based on the style and context guide.
Five Translator AI Agents: Each uses a different LLM to produce varied translations, increasing the chances of high-quality outputs.
Five Voter AI Agents: Assess these translations based on accuracy, fluency, and context preservation, selecting the top translation.
One Proofreader AI Agent: Reviews the selected translation to ensure it meets technical formatting standards.
Five Evaluator Agents: Perform thorough quality assurance checks, identifying and addressing any errors or inconsistencies.

Our expectations:

Diverse Outputs: Multiple AI translators provide varied translations, enhancing quality.
Peer Review: Voters maintain high standards by ensuring only the best translations proceed.
Oversight: The proofreader and evaluator agents add layers of refinement, ensuring consistency and accuracy.
Error Reduction: The collaborative system significantly reduces the incidence of poor translations.
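
The flow of agent roles described above can be sketched roughly as follows. This is an illustration only, not OLA's actual implementation: every name here (`run_pipeline`, `PipelineResult`, the callable type aliases) is hypothetical, each agent is stubbed as a plain function, and the plurality vote with lowest-index tie-breaking is an assumption, since the report does not say how the voters' choices are combined.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical stand-ins: each callable represents one LLM-backed agent.
Manager = Callable[[str], str]               # source -> context note
Translator = Callable[[str, str], str]       # (source, context) -> candidate
Voter = Callable[[str, Sequence[str]], int]  # (source, candidates) -> preferred index
Proofreader = Callable[[str], str]           # translation -> format-checked translation
Evaluator = Callable[[str, str], List[str]]  # (source, translation) -> issues found


@dataclass
class PipelineResult:
    translation: str
    issues: List[str]


def run_pipeline(
    source: str,
    manager: Manager,
    translators: Sequence[Translator],
    voters: Sequence[Voter],
    proofreader: Proofreader,
    evaluators: Sequence[Evaluator],
) -> PipelineResult:
    """Run one string through manager, translators, voters, proofreader, evaluators."""
    if not translators or not voters:
        raise ValueError("at least one translator and one voter are required")

    context = manager(source)
    candidates = [translate(source, context) for translate in translators]

    # ASSUMPTION: plurality vote; ties go to the lowest candidate index.
    votes = Counter(vote(source, candidates) for vote in voters)
    winner = min(votes, key=lambda index: (-votes[index], index))

    proofread = proofreader(candidates[winner])
    issues = [
        issue
        for evaluate in evaluators
        for issue in evaluate(source, proofread)
    ]
    return PipelineResult(translation=proofread, issues=issues)


if __name__ == "__main__":
    # Toy stubs in place of real LLM calls.
    result = run_pipeline(
        "Hello world",
        manager=lambda s: "UI label",
        translators=[lambda s, c: "hola", lambda s, c: "hola mundo"],
        voters=[lambda s, cs: 1, lambda s, cs: 1, lambda s, cs: 0],
        proofreader=str.strip,
        evaluators=[lambda s, t: []],
    )
    print(result)
```

Passing the agents in as callables keeps the sketch independent of any particular LLM provider, so the number of translators or voters can be changed without touching the control flow.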


Assessment Methodology

Our evaluation framework integrates both post-editing percentage and blind test assessments, providing a comprehensive understanding of translation quality and operational efficiency.

Linguistic Roles Involved:

Human Post-Editor: Refines Multi-Agent AI translations to meet publication standards.
Human Translator: Provides professional human translations for comparison.
Evaluator: Rates the quality of translations without knowing their origin.

Deliverables:

Post-Edited AI Translation: Refined by a human post-editor.
Direct Human Translation: Performed by a professional human translator.
Grading of Shuffled Translations: Professional evaluators assess translations blindly.


Evaluation System

Similarity Calculation:

We measure the extent of human intervention required to refine Multi-Agent AI translations, offering insights into their practical usability in professional settings. First, we calculate the Utilization Ratio (UR) of each edited translation using the Levenshtein distance, which measures the difference between the original AI translation and its post-edited version. The distance counts insertions, deletions, and substitutions as changes, so a substantially rearranged phrase registers as a series of such edits.

Next, we compute the Average Utilization Ratio (AUR), weighting each translation unit by its word count (WC). We first obtain the UR and WC for each translation unit, then calculate its Effective Word Utilization (EWU), defined as:

EWU = UR × WC

Afterwards, we sum the EWU and the WC across all translation units. To calculate the AUR, we divide the sum of the EWU by the sum of the WC:

AUR = (EWU₁ + EWU₂ + EWU₃ + …) / (WC₁ + WC₂ + WC₃ + …)
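
As a rough illustration of the calculation above, the sketch below computes UR, EWU, and AUR in Python. The report does not state how the Levenshtein distance is turned into a ratio, so the normalization used here (one minus the character-level distance divided by the longer string's length) is an assumption, as are the names `TranslationUnit`, `utilization_ratio`, and `average_utilization_ratio`. Word counts are supplied by the caller rather than derived, because the report does not say how WC is counted.

```python
from dataclasses import dataclass
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance: insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def utilization_ratio(ai_text: str, edited_text: str) -> float:
    """ASSUMPTION: UR = 1 - distance / length of the longer string."""
    longest = max(len(ai_text), len(edited_text))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(ai_text, edited_text) / longest


@dataclass
class TranslationUnit:
    ai_text: str       # multi-agent AI translation
    edited_text: str   # the same unit after human post-editing
    word_count: int    # WC, supplied by the caller


def average_utilization_ratio(units: Iterable[TranslationUnit]) -> float:
    """AUR = sum of EWU over sum of WC, where EWU = UR * WC per unit."""
    total_ewu = 0.0
    total_wc = 0
    for unit in units:
        ur = utilization_ratio(unit.ai_text, unit.edited_text)
        total_ewu += ur * unit.word_count
        total_wc += unit.word_count
    if total_wc == 0:
        raise ValueError("total word count must be positive")
    return total_ewu / total_wc
```

Weighting by word count means a heavily edited long string lowers the AUR more than a heavily edited short label, which matches the intent of using WC as the weighting factor.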


Blind Test Assessment:

Evaluators assess translations based on a structured error identification system, focusing on both objective and subjective errors.

Objective Errors:

Critical: Mistranslations that distort the source meaning.
Moderate: Missing information, grammatical errors, term inconsistencies, ignoring glossaries.
Minor: Typos, punctuation/formatting issues, missing/misplaced placeholders.

Subjective Errors:

Choice of Words: Better word choices that could improve the translation.
Tone: Appropriateness of the tone compared to the source text.
Sentence Flow: Readability and natural flow of sentences.

Grading System:

Good Quality: No objective errors; accurate and flows naturally.
Fair Quality: Maximum of 2 objective errors per string; mostly accurate and readable.
Failed/Poor Quality: More than 2 objective errors; text may sound weird or machine-like.
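
The grading thresholds above can be expressed as a small function. This is a simplified sketch under stated assumptions: it looks only at the number of objective errors per string, treats all severities alike, and does not model the subjective criteria (word choice, tone, sentence flow), which the report leaves to the evaluator's judgment. The names `grade_string` and `grade_distribution` are hypothetical.

```python
from collections import Counter
from typing import Iterable


def grade_string(objective_errors: int) -> str:
    """Map a string's objective error count to a grade label."""
    if objective_errors < 0:
        raise ValueError("error count cannot be negative")
    if objective_errors == 0:
        return "Good"
    if objective_errors <= 2:
        return "Fair"
    return "Failed/Poor"


def grade_distribution(error_counts: Iterable[int]) -> Counter:
    """Count how many strings fall into each grade."""
    return Counter(grade_string(count) for count in error_counts)
```

For example, `grade_distribution([0, 1, 3])` yields one string in each of the three grades.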


Summary

Our AI translation service demonstrates exceptional performance:

Overall Human-Like Quality: Achieved an impressive 93.1%.
Efficiency: Low post-editing requirements underscore the system's practicality for professional applications.
Multi-Agent Advantage: The collaborative approach significantly enhances translation accuracy and fluency, especially in top-performing language pairs.
Professional Refinement: On-demand post-editing by our professional human reviewers ensures 100% human-like, high-quality results.

The evaluation confirms that our AI translation service not only meets but exceeds industry standards, offering a robust and reliable solution for diverse translation needs.


Looking Ahead

We are committed to continuous improvement, targeting lower-performing language pairs and genres through expanded datasets and refined agent collaboration.

Get the Detailed Report

Interested in a deeper dive into our evaluation results on your content? Our detailed report includes:

Budget Plan: Estimate how much you save compared to traditional human translation services.
In-Depth Analysis: Comprehensive data and insights on each of the language pairs evaluated.
Methodology Details: A closer look at our assessment processes and how we measure translation quality.

👉 Leave your contact info below, and we'll send you the full report!


Join the global leaders who are embracing AI-driven localization. Don't miss out on the opportunity to unlock new markets and engage audiences worldwide with ease.