Sanzang: Introduction

Abstract

Machine translation (MT) has existed for several decades, but the approaches have differed greatly. Most applications have been aimed at translation of modern texts, and particularly those in western languages. Few have addressed Chinese, and even fewer have been aimed at being applicable to ancient texts. We present here the Sanzang translation system, which is software designed to generate rough translations from the “CJK” languages (Chinese, Japanese, and Korean) into other languages such as English. These rough translations may serve as an aid for human readers and translators.

Background

When creating a translation system for the CJK languages, it is essential to first understand clearly what we wish to accomplish. If we consider the full history of machine translation, we can see clearly that many setbacks and disappointments were due to researchers not adequately understanding the problem they wished to solve. For our purposes, we may define an ideal system for assisting in the translation of CJK languages as one that helps to convey the meaning of the original source text clearly and effectively to a human translator. It may be further defined as one that suggests possible words, phrases, or sentences in the target language which might be used in the finished translation.

We might consider the most popular method used today, statistical machine translation (SMT). Using statistical methods and a parallel corpus, such systems seek the fully automatic translation of human languages. The approach, however, can be characterized as one of “best effort.” If there is some ambiguity in the language, or if a term has multiple meanings, then the most common one is selected based on the system’s statistical model. Hence, the goal of SMT is to produce a translation that is an educated guess. Translators cannot accept the results of such methods, especially when dealing with ancient Chinese. In an ancient work, each word or character may have a meaning that is critical to understanding the sentence. Poems may be even more unforgiving, since a line of four characters may convey the meaning of an entire sentence. For such works, conventional statistical methods would be a poor choice.

At the other end of the translation spectrum is the humble reference dictionary. In many ways, a simple reference dictionary does what we would want. It may contain words and phrases, and it may be reasonably complete depending on its contents. The definition of a word or phrase may contain useful information such as etymologies, explanations, comparisons, etc. The main problem with a reference dictionary is that its contents are too verbose to serve as a continuous source of reference when reading a text. Still, its basic approach of considering each element in a sentence may serve as a useful starting point for identifying a more appropriate method.

Criteria

What we propose for translating from the CJK languages is a modification of the approach of a reference dictionary:

  1. Identifying terms can be accomplished by a simple string match. Words and phrases are matched directly as strings of characters. Special rules for grammar or syntax are not necessary when closely following the source text.

  2. Rather than providing a full dictionary definition for each term, our system would provide a list of common possible translations for that term (e.g. “wish/desire”). This gives the human translator fuller access to alternate meanings, or to a broader semantic range.

  3. Rather than acting as a simple bilingual dictionary, it should accept any number of target languages, and this approach should be flexible enough to encompass transliteration as well. That is to say, the definition of a target “language” should not be needlessly limited to a conventional human language.

  4. Rather than providing a list of definitions for each sentence, the output should instead be a list of equivalent sentences in the defined target languages. This can increase readability by allowing the human translator to read the “translations” from left to right.

What we have described here is a translation method that closely follows the source text, rewriting it in other languages while still following the original grammar. When this is done with Chinese for the purpose of language learning, the result has been termed “Chinese-ordered English” (COE). A translation system like the one described here would rewrite Chinese into something resembling “Chinese-ordered English.”

Workflow

The software for accomplishing such rough-but-useful translations is called Sanzang. Although a full description of its methods and capabilities is beyond the scope of this article, we should first give an overview of the translation process.

  1. The user either develops or downloads a file containing translation rules.
  2. Sanzang reformats the source text so it may be translated more reliably.
  3. Sanzang translates by loading the translation rules and applying them.
  4. The user may update the translation rules if any errors were found.

Translation rules for the Sanzang translator are stored in a simple delimited file. For CJK languages, we only need to match on fixed strings, so each translation rule is a line that includes the source term, along with equivalent translated terms in the target languages. For example:

你好|nǐhǎo|hello
世界|shìjiè|world
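
As an illustration, a table in this format could be read with a few lines of Python. This is only a minimal sketch, assuming `|` is the delimiter as in the two rules above; the function name `parse_rules` is hypothetical and is not taken from Sanzang Utils.

```python
def parse_rules(lines, delimiter="|"):
    """Parse delimited rule lines into (source, [translations]) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(delimiter)
        # First field is the source term; the rest are one term per target
        rules.append((fields[0], fields[1:]))
    return rules

rules = parse_rules(["你好|nǐhǎo|hello", "世界|shìjiè|world"])
```

Each rule keeps its target terms in column order, so the first column after the source term always belongs to the first target "language" (here, a pinyin transliteration) and the second to the next (here, English).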

Since CJK languages typically do not include spaces between words, a line break may occur in the middle of a phrase or even in the middle of a word. To ease the translation process, Sanzang provides a formatting function. This formatter completely reformats the source text according to the spacing and punctuation it finds in the text, removing arbitrary line breaks.
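
The idea can be sketched roughly as follows. This is not Sanzang's actual formatter: the set of sentence-ending punctuation and the function name `reformat` are assumptions made for illustration, and the sketch joins lines with no separator, which suits CJK text but would run together Latin words split across lines.

```python
import re

# Assumed set of sentence-ending punctuation; the real formatter may differ.
SENTENCE_END = "。！？!?"

def reformat(text):
    """Remove arbitrary line breaks, then break after sentence-ending punctuation."""
    # Join the lines directly, since CJK text has no spaces between words
    joined = "".join(line.strip() for line in text.splitlines())
    # Insert a newline after each run of sentence-ending punctuation
    broken = re.sub("([" + re.escape(SENTENCE_END) + "]+)", r"\1\n", joined)
    return broken.strip()
```

With this sketch, a text whose line breaks fall in the middle of words comes out with one sentence per line, so that every term is intact when translation rules are applied.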

For generating a translation, Sanzang uses a simple algorithm that produces translations through string substitution, beginning with the longest terms and ending with the shortest. These rough translations are then collated with the source text for easy reading and cross-referencing. For example:

1.1|你好,世界!
1.2| nǐhǎo , shìjiè !
1.3| hello , world !
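
A minimal Python sketch of longest-first substitution and collation is shown here. It is an illustration under stated assumptions rather than Sanzang's own code: the names `translate_line` and `collate` are hypothetical, every rule is assumed to list the same number of target terms, and each substituted term is padded with a space on both sides.

```python
def translate_line(line, rules):
    """Return one rough translation of `line` per target language."""
    # Apply longer source terms first so shorter terms cannot split them
    ordered = sorted(rules, key=lambda rule: len(rule[0]), reverse=True)
    n_targets = len(ordered[0][1]) if ordered else 0
    outputs = []
    for i in range(n_targets):
        out = line
        for source, targets in ordered:
            out = out.replace(source, " " + targets[i] + " ")
        outputs.append(out)
    return outputs

def collate(lines, rules):
    """Interleave each source line with its translations, numbered line.row."""
    listing = []
    for number, line in enumerate(lines, start=1):
        rows = [line] + translate_line(line, rules)
        for row_number, row in enumerate(rows, start=1):
            listing.append(f"{number}.{row_number}|{row}")
    return listing

rules = [("你好", ["nǐhǎo", "hello"]), ("世界", ["shìjiè", "world"])]
print("\n".join(collate(["你好,世界!"], rules)))
```

Run on the two example rules, this prints the same three lines as the listing above. Plain sequential replacement is a simplification: a later rule could in principle match text that an earlier rule has already inserted, which a more careful implementation would need to guard against.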

Using Sanzang

The features of the Sanzang system are accessed through a variety of tools, collectively called Sanzang Utils. Each Sanzang function has a separate command, following the Unix philosophy. Sanzang Utils programs read from the standard input stream (stdin), and write to the standard output stream (stdout), so they can function as filters. Options are specified with command line switches, and other information is provided through command line parameters. Typical usage of the Sanzang Utils programs in a command shell might be the following:

$ szu-r *.txt | szu-t my_rules.txt
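
The filter pattern itself is simple to express in Python. The following is a generic sketch of a line-oriented filter, not the source of any Sanzang Utils program; `run_filter` and the placeholder transform are hypothetical names, and in-memory streams stand in for a real pipe.

```python
import io
import sys

def run_filter(transform, instream=None, outstream=None):
    """Apply `transform` to each input line and write the result."""
    instream = sys.stdin if instream is None else instream
    outstream = sys.stdout if outstream is None else outstream
    for line in instream:
        outstream.write(transform(line.rstrip("\n")) + "\n")

# Demonstration with in-memory streams instead of a real pipe
demo_in = io.StringIO("你好\n世界\n")
demo_out = io.StringIO()
run_filter(lambda line: line.replace("你好", " hello "), demo_in, demo_out)
```

Because such a program touches only its input and output streams, it can be chained with other tools in a pipeline like the one shown above.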

Breaking up functionality into different tools ensures modularity and flexibility within the system. In addition to being flexible, the programs are also portable. They are implemented in the Python programming language, so they can run on a wide variety of platforms such as Unix, Linux, BSD, OS X, and MS Windows.

Example usage

As an example of Sanzang usage and output, the following would be a realistic use of the system. In this example, the file zh-en_modern.txt contains a set of translation rules. The file sample.txt contains the Chinese text that we would like to translate, shown here:

Linux是一种自由和开放源代码的类UNIX操作系统。

A professional human translator might render this into English as the following:

Linux is a type of free and open-source UNIX-like operating system.

Sanzang would instead produce a rough translation closely following the Chinese:

$ szu-t zh-en_modern.txt sample.txt
1.1|Linux是一种自由和开放源代码的类UNIX操作系统。
1.2|Linux is/this one-type free-&-open-source => category UNIX operating-system 。

Because the translation rules are well-defined, the output is easily understandable for a native English speaker somewhat familiar with the subject matter.

Maintaining Rules

The design and implementation of the Sanzang system are very simple, but this simplicity comes at a certain cost. Because the translation program itself knows practically nothing about languages, it relies entirely on a translation table file which stores all the translation rules. Developing these translation rules is the responsibility of the user. The exact ways of doing this, whether entirely manual or driven by computer analysis, are beyond the scope of this article. It does seem that generating somewhat reliable translations will require in excess of 10,000 translation rules, and perhaps far more depending on the difficulty and content of the text. In this respect, a set of translation rules may be comparable to the vocabulary of a human being.

Suitability

Sanzang is described as a translation system for translating from the CJK languages, and it appears that this description is fairly well founded. The system has undergone preliminary testing and is able to render comprehensible translations that are limited mainly by the quality of the translation rules available. The system has progressed to the point where readers can often understand the content of passages without any background in the source language and without any training in translation.

The system has also been tested with ancient and difficult or atypical Chinese texts, such as the Daodejing (道德經), and early Buddhist translations by An Shigao (安世高) and Lokakṣema (支婁迦讖). In all cases, the translation process has been similar to that of modern Chinese, and poses no particular problems as long as the translation rules are crafted to adequately represent the underlying vocabulary. This suggests that the development of adequate translation rules is the critical element in the suitability of such a system.
