Proposal

Date: April 19, 2015
Project: CAT_Pilot_HKMA_Project
From: Jingxiao Wang, Kexin Pei, Tianyi Yang, Xinhui Du

Project Spec

This pilot project is designed to train the Microsoft Translator Hub using the Annual Reports of the Hong Kong Monetary Authority (HKMA) from 2003 to 2013. The reports are available in Traditional Chinese (SL) and English (TL). The testing data consists of the sections of the 2013 report on the Economic and Banking Environment, Monetary Stability, Banking Stability, International Financial Center, Reserves Management, and the Exchange Fund. Files from previous years will be used as training and tuning data.

Objective

The portion of the 2013 Annual Report of the Hong Kong Monetary Authority (HKMA) used for testing totals 87,322 words. Given its difficulty and significance, we estimate that a professional translator specializing in this area can translate 300 Chinese words per hour, so one translator would need about 291 hours to complete the entire project. At a rate of $0.12 per word, the total cost of the project would be $10,478.64.
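The baseline figures above can be reproduced with a few lines of arithmetic. This is only an illustrative sketch: the word count, throughput, and per-word rate come from this proposal, while the function and variable names are hypothetical.

```python
# Human-translation baseline, using the figures quoted in this proposal.
WORD_COUNT = 87_322      # words in the 2013 testing sections
WORDS_PER_HOUR = 300     # estimated throughput of a specialized translator
RATE_PER_WORD = 0.12     # USD per word

def human_baseline(word_count, words_per_hour, rate_per_word):
    """Return (hours, cost in USD) for a fully human translation."""
    hours = word_count / words_per_hour
    cost = round(word_count * rate_per_word, 2)
    return hours, cost

hours, cost = human_baseline(WORD_COUNT, WORDS_PER_HOUR, RATE_PER_WORD)
print(f"{hours:.0f} hours, ${cost:,.2f}")
```

Changing any of the three constants shows how sensitive the baseline is to the assumed throughput and rate.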

The objective of this project is to speed up the translation process and lower its cost without undermining translation quality. In this project, a “good enough” translation should keep its terminology (financial terms) and style (very formal) consistent with those of the previous reports. Consistency is valued most highly, so minor errors in word order or sentence structure can be ignored. By training and tuning our translation system, we hope that Post-Editing Machine Translation (PEMT) can be 10% faster, and thus 10% cheaper, than human translation.

Process

The files downloaded directly from the HKMA website are PDF files. Because of the large volume, the source and target files must be stored in clearly organized folders, arranged chronologically. Microsoft Translator Hub also requires a specific document naming style: <document name>_<language abbreviation>.<file format>. In our case, we also added the year to each file name for distinction, so that each source or target file ends in either “_year_cht.pdf” or “_year_en.pdf”.
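The naming scheme described above can be generated programmatically to avoid typos across eleven years of reports. The following is a minimal sketch: the “en” and “cht” abbreviations and the year suffix come from this proposal, whereas the helper names and the example document name are hypothetical.

```python
# Language abbreviations used in this project's file names.
LANG_CODES = {"cht", "en"}

def hub_filename(document_name, year, lang, ext="pdf"):
    """Build a name of the form <document name>_<year>_<lang>.<ext>."""
    if lang not in LANG_CODES:
        raise ValueError(f"unexpected language abbreviation: {lang!r}")
    return f"{document_name}_{year}_{lang}.{ext}"

def paired_names(document_name, year, ext="pdf"):
    """Return the (source, target) file names for one document."""
    return (hub_filename(document_name, year, "cht", ext),
            hub_filename(document_name, year, "en", ext))

# "MonetaryStability" is a made-up document name for illustration.
source_name, target_name = paired_names("MonetaryStability", 2013)
print(source_name)
print(target_name)
```

Generating both names from the same call keeps each Chinese file and its English counterpart identical apart from the language abbreviation.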

After a few initial tests, we realized that the system could not recognize and correctly extract the text of the Traditional Chinese files, so we had to manually convert all of those PDF files to DOC files. The Translator Hub can only align files of the same format, so the English target files also need to be converted to DOC files.

When all the files are ready, we plan to start training the new system by uploading the 2003 annual reports as training data, the 2004 annual reports as tuning data, and only the sections of the 2013 annual reports mentioned above as testing data, in order to obtain the first BLEU score.

The whole project is planned to go through 10 attempts. Strategies for improving the BLEU score include adding training data, swapping and refining tuning data, and adding manually aligned files. We expect the BLEU score to go up as more files on the same topics as the testing data are added to the tuning data and as more files are used as general training data.
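For readers unfamiliar with the metric, the sketch below shows how a BLEU score is computed in its textbook form: clipped n-gram precisions up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It is not the Translator Hub's own implementation. It assumes pre-tokenized input, a single reference per sentence, no smoothing, and a 0 to 100 scale, and the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU (0 to 100) with one reference per candidate."""
    matches = [0] * max_n
    totals = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            ref_counts = ngrams(ref, n)
            # Clip each candidate n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g])
                                  for g, c in cand_counts.items())
            totals[n - 1] += sum(cand_counts.values())
    # Without smoothing, any zero precision makes the whole score zero.
    if cand_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return 100 * brevity_penalty * math.exp(log_precision)

reference = "the exchange fund recorded an investment income".split()
print(corpus_bleu([reference], [reference]))
```

Because the score depends on n-gram overlap with the reference, consistent terminology across training, tuning, and testing data is what the strategies above are expected to improve.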

Timeline

The project is divided into five phases: Planning, Setup, Training of MT, Finalization, and Delivery. The detailed schedule for each phase is outlined as follows:

Task                                                             Schedule
Planning                                                         April 16 – April 18
Setup (including file preparation)                               April 20 – April 22
Training of MT (including training, adjustment and re-training)  April 23 – May 1
Finalization & human editing test                                May 4 – May 5
Delivery                                                         May 6

The training will take approximately a week, during which we will do the initial training, adjust the tuning and testing data, and then run several rounds of re-training to improve the BLEU score and the performance of the machine translation. We will then perform a simple human editing test to evaluate the efficiency of PEMT relative to its cost.

Cost

The costs involved in this project are as follows:

Charge Item                 Quantity    Unit Rate    Amount
File Preparation            8 hours     $50.00       $400.00
Training                    30 hours    $70.00       $2,100.00
Subtotal:                                            $2,500.00
Project Management (10%)                             $250.00
Total:                                               $2,750.00
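The totals in the cost table can be checked with the short calculation below. It is a sketch only: the hours, hourly rates, and the 10% project management fee come from the table, while the variable names are hypothetical.

```python
# Line items from the cost table: (charge item, hours, hourly rate in USD).
line_items = [
    ("File Preparation", 8, 50.00),
    ("Training", 30, 70.00),
]
PM_RATE = 0.10  # project management fee as a share of the subtotal

subtotal = round(sum(hours * rate for _, hours, rate in line_items), 2)
project_management = round(subtotal * PM_RATE, 2)
total = round(subtotal + project_management, 2)

for name, hours, rate in line_items:
    print(f"{name}: {hours} hours x ${rate:,.2f} = ${hours * rate:,.2f}")
print(f"Subtotal: ${subtotal:,.2f}")
print(f"Project Management (10%): ${project_management:,.2f}")
print(f"Total: ${total:,.2f}")
```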

Deliverables

When the project is done, we will deliver the actual results of the training process, including the overall time, the source and target word counts, and the improvement in the translation quality of the system. As more relevant data is added to the tuning data, we expect to see an increase in the BLEU score.

In the presentation, we will discuss the training process in detail, including how we devised and executed the overall training workflow, how each attempt affected the BLEU score, and ultimately how effective the PEMT process is compared to the traditional human translation, editing, and proofreading (TEP) workflow.