Date: May 6, 2015
Project: TMS_Pilot_Samsung_Project
Content: Samsung Products Manuals
Languages: English to Simplified Chinese
From: Junpeng Qiao, Guanghan Liu, Yu Liang, Xinhui Du
Project Objective
- Quality Goals
Zero tolerance:
1) Terminology Accuracy
Terminology accuracy is the first criterion for measuring the quality of the machine translation of our user manuals. By accuracy we mean both the correctness of each term's translation and the consistency of terms throughout the output of the trained machine translation system. We adopt zero tolerance for terminology errors.
2) Information Completeness
To represent the information of the original texts completely, the output must contain no omissions, additions or untranslated text. This is distinct from the question of “whether contents should be translated or information should be added” for culture-bound texts. Since we are translating mostly technical texts, the errors listed above should not occur.
Tolerance:
3) Register
Under register we measure whether slang or taboo words are used in the translation. We tolerate no such words in the machine translation output.
4) Style
The texts to be translated in the system are user manuals, so in general we require a formal tone, with formal vocabulary and concise sentence structure. Still, it is acceptable if 2-3 paragraphs are not translated in a formal manner.
5) Design
By design we mean factors such as layout, local formatting, graphics and tables. Generally speaking, if post-editors break or fail to maintain the consistency of fonts, indentation or leading, we will not treat these as severe errors and will fix them later.
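The zero-tolerance terminology goal above lends itself to an automated check. The following is only a minimal sketch, assuming a bilingual glossary and aligned source/target segments are available; the function name, the glossary entries and the sample segments are all hypothetical and are not taken from the project data.

```python
def find_terminology_errors(segments, glossary):
    """Flag segments whose source contains a glossary term
    but whose target lacks the approved translation.

    segments: list of (source, target) string pairs
    glossary: dict mapping English term -> approved Chinese term
    Returns a list of (segment_index, source_term, expected_target_term).
    """
    errors = []
    for index, (source, target) in enumerate(segments):
        for source_term, target_term in glossary.items():
            # Case-insensitive match on the English side only.
            if source_term.lower() in source.lower() and target_term not in target:
                errors.append((index, source_term, target_term))
    return errors


# Hypothetical glossary and MT output, for illustration only.
glossary = {"touch screen": "触摸屏", "battery": "电池"}
segments = [
    ("Tap the touch screen.", "点击触摸屏。"),
    ("Remove the battery.", "取出电瓶。"),  # uses a non-approved term
]

errors = find_terminology_errors(segments, glossary)
for index, source_term, target_term in errors:
    print(f"Segment {index}: expected '{target_term}' for '{source_term}'")
```

A plain substring match like this catches both wrong and inconsistent term translations, but it would need refinement for inflected source terms or terms with more than one approved translation.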
- Timing Goal
Our tentative timing goal for this project is to reduce the required translation time by 30% – 50%. However, after conducting the pilot project, we found that the actual post-editing speed is much higher (75% faster).
- Pricing Goal
There are several factors to consider when developing pricing goals for a PEMT project:
- Purpose
Since this project translates product manuals in the form of a large knowledge database, we expect a fast post-edit rather than a full one.
- Language pair
Chinese differs from English in grammar far more than Western languages do, so this language pair places higher demands on the MT output.
- MT system
The content we will translate and the content we import as training data are closely related, so we expect better results within this specific domain.
In light of these factors and a human editing test, we suggest pricing PEMT at around 50% of the human translation rate.
- Tentative PEMT timing and pricing goals VS. human translation:
Method | Daily Workload | Price |
HT | 2,500 source words | $0.20 / word |
PEMT | 10,000 source words (400%) | $0.10 / word (50%) |
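To make the comparison concrete, the table figures can be applied to a single job. This is a rough sketch only: the 100,000-word job size is a hypothetical example, and the calculation assumes one translator or post-editor working at the daily workload listed in the table.

```python
# Daily workload and per-word price, taken from the table above.
HT = {"words_per_day": 2500, "price_per_word": 0.20}
PEMT = {"words_per_day": 10000, "price_per_word": 0.10}


def estimate(word_count, method):
    """Return (working_days, price_in_dollars) for one person."""
    days = word_count / method["words_per_day"]
    price = word_count * method["price_per_word"]
    return days, price


job_words = 100_000  # hypothetical job size, not a project figure

ht_days, ht_price = estimate(job_words, HT)
pemt_days, pemt_price = estimate(job_words, PEMT)

time_saving = 1 - pemt_days / ht_days
price_saving = 1 - pemt_price / ht_price

print(f"HT:   {ht_days:.0f} days, ${ht_price:,.0f}")
print(f"PEMT: {pemt_days:.0f} days, ${pemt_price:,.0f}")
print(f"Time saved: {time_saving:.0%}, price reduced: {price_saving:.0%}")
```

For the hypothetical 100,000-word job this gives 40 working days for HT against 10 for PEMT, which is a 75% reduction in time at half the price.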
Process
- Pilot Process and Timeline
Notes: by “segmentation” we mean dividing our original source files, which are large PDF files of over 150 pages, into smaller files for training, testing and tuning.
The wrapping & conclusion stage also includes a simple human editing test to see how effective PEMT is in light of its costs.
- Costs
We have agreed on an hourly rate of $20. Multiplying this rate by the total project hours and by the 4 team members gives the total fixed cost for this project. The calculation is shown below.
$20 * 8h * 9 days * 4 members = $5,760
Based on this pricing and fixed cost, a client project that uses PEMT needs to exceed 200,000 words for us to break even and also generate enough revenue for company development; otherwise an extra charge will apply.
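The cost figures above can be checked with a short calculation. This sketch makes one simplifying assumption that is not stated in the report: revenue is the word count multiplied by the PEMT price of $0.10 per word, and post-editor pay and other running costs are ignored. How the 200,000-word threshold was actually derived is not documented here, so the margin shown is only one reading of it.

```python
import math

HOURLY_RATE = 20          # dollars per hour
HOURS_PER_DAY = 8
PROJECT_DAYS = 9
TEAM_MEMBERS = 4
PEMT_PRICE_CENTS = 10     # $0.10 per word, kept in cents to avoid float drift
MIN_PROJECT_WORDS = 200_000

# Fixed cost of the pilot, as in the calculation above.
fixed_cost = HOURLY_RATE * HOURS_PER_DAY * PROJECT_DAYS * TEAM_MEMBERS

# Words needed for PEMT revenue alone to cover the fixed cost (assumption).
break_even_words = math.ceil(fixed_cost * 100 / PEMT_PRICE_CENTS)

# Revenue and remaining margin at the 200,000-word threshold.
revenue_at_minimum = MIN_PROJECT_WORDS * PEMT_PRICE_CENTS / 100
margin_at_minimum = revenue_at_minimum - fixed_cost

print(f"Fixed cost:           ${fixed_cost:,}")
print(f"Break-even word count: {break_even_words:,}")
print(f"Revenue at threshold: ${revenue_at_minimum:,.0f}")
print(f"Margin at threshold:  ${margin_at_minimum:,.0f}")
```

Under these assumptions the fixed cost alone is covered at 57,600 words, so the 200,000-word threshold leaves a margin for the company development mentioned above.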
Deliverables
We have seen that adding training data generally increases the BLEU score, and each round of training takes approximately 2.5 hours. After 5 rounds of training, the output quality is quite satisfactory. We therefore expect a fully trained engine to reduce both time and cost substantially.
For further training, we would manually align documents before importing them into the system, and add more specific data for tuning.