The Sanzang Manual

Introduction

Sanzang is a compact, cross-platform machine translation system. It was developed specifically to help translators render the Chinese Buddhist canon into other languages. However, the translation method it uses is general enough that it may extend to other translation domains as well, especially those with CJK source languages (Chinese, Japanese, and Korean). The name Sanzang (三藏) is a literal translation of the Sanskrit word “Tripitaka,” which is a general term for the Buddhist canon.

Sanzang is implemented as a Unix-style “command suite” program that runs in your operating system’s command shell. The sanzang command includes a subcommand for each of the functions the system provides.

Sanzang is programmed in the Ruby programming language and is free software (“free as in freedom”). This program is licensed under the GNU General Public License, version 3, which ensures that anyone can use the program for any purpose, and that extensions to this program will remain freely available to others.

Background

At the time of this writing, nearly all available machine translation systems are based on statistical machine translation (SMT), a method that has not proven very useful for ancient Chinese texts. Sanzang was developed as an alternative because there was a practical need for a better and more reliable tool.

The most significant difference between Sanzang and other machine translation systems is that it does not attempt to interpret or translate grammar. Instead, it simply translates names, terms, and phrases based on rules stored in a translation table. The Sanzang translator applies this translation table at runtime to generate a text listing as the output.

This method is simple and efficient, and produces predictable results that can be made immediately available to the user for verification. To facilitate this task, all translation listings generated by the program are collated line-by-line with the original source text.

Translation Method

Sanzang is primarily a simple machine translation engine. To use this engine, you also need a text file in which all translation rules are defined. This file is called a translation table, and its format is simple delimited text. At runtime, the translation rules in this file are applied to the source text to generate the translation listing.

In a translation table text file, each line is a translation rule containing a source term and its equivalent meanings in other languages. Each rule has terms separated by the “|” character. In the translation table, the first column represents the source language, while subsequent columns represent destination languages. In this example, we want to create a table capable of rendering the following title into English:

金剛般若波羅蜜經

We start by creating a new text file, named table.txt, or something similar. In this text file, we may add the following rules:

波羅蜜|pāramitā
金剛|diamond
般若|prajñā
經|sūtra/classic

After we have written this table file, we can then run the Sanzang translation engine with our table. When it reads the Chinese title as the input text, it then produces the following translation listing:

1.1 金剛般若波羅蜜經
1.2  diamond prajñā pāramitā sūtra/classic

The program first sorted our rules by the length of the source term, and then applied each rule in sequence. It then collated the output with the source text to create a translation listing. The numbers in the left margin give the line number of the source text, followed by the column number of the translation table: 1.1 is line 1 in the source column, and 1.2 is the same line rendered with column 2 of the table.
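The steps described above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not Sanzang's actual implementation: the names TABLE_LINES, parse_table, and listing are hypothetical, and the sketch assumes that longer source terms are applied before shorter ones.

```ruby
# Hypothetical sketch of table-driven translation; not Sanzang's actual code.
TABLE_LINES = [
  "波羅蜜|pāramitā",
  "金剛|diamond",
  "般若|prajñā",
  "經|sūtra/classic"
]

# Split each rule on "|" into columns: [source, destination, ...].
# Blank lines produce empty arrays and are dropped.
def parse_table(lines)
  lines.map { |line| line.chomp.split("|") }.reject(&:empty?)
end

# Produce the listing for one source line: the source text first, then one
# line per destination column with every source term replaced.
def listing(rules, source, line_no)
  # Assumption: longer source terms are applied before shorter ones.
  sorted = rules.sort_by { |rule| -rule.first.length }
  rows = ["#{line_no}.1 #{source}"]
  (1...sorted.map(&:length).max).each do |col|
    text = sorted.reduce(source) do |acc, rule|
      # Each replacement is prefixed with a space to separate the terms.
      acc.gsub(rule.first, " #{rule[col]}")
    end
    rows << "#{line_no}.#{col + 1} #{text}"
  end
  rows
end

puts listing(parse_table(TABLE_LINES), "金剛般若波羅蜜經", 1)
```

Run against the four rules above, this sketch prints the same two lines as the listing shown earlier. Applying longer terms first matters once a table contains rules whose source terms overlap, because a short rule applied too early could break up a longer term.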

As a final example, below is a snippet from an Indian Buddhist meditation text, which was processed by the Sanzang translation engine in the same manner:

105.1 阿難白佛言。
105.2  ānán bái-fó-yán ¶
105.3  ānanda addressed-the-buddha-saying ¶

106.1 唯然世尊。
106.2  wéi-rán shìzūn ¶
106.3  just-so bhagavān ¶

107.1 願樂欲聞。
107.2  yuànlè-yù-wén ¶
107.3  joyfully-wish-to-hear ¶

Here we can see a three-column translation table at work. The first column has the traditional Chinese source text, the second column contains the Pinyin transliteration, and the third column contains English. In this example we can see that well-defined translation rules lead to a clear translation listing, in which the meaning of the original text is readily understandable in English.

The examples above also show that knowledge of the source language and expertise in the relevant literary field are often still necessary. This translation system does not position itself as a “silver bullet” for creating finished translations, but rather as a practical tool for assisting human readers and translators.

Installation

Requirements

The standard way of installing Sanzang is as a Ruby gem. To do this, the only requirement is Ruby 1.9 or later, along with an Internet connection.

If you do not have an Internet connection available on the host computer, then you may download the gem file and install it manually. If you choose this route, then please be aware that the “sanzang” gem depends on the “parallel” gem for multiprocessing.

For users of Microsoft Windows, please be aware that Windows ports of Ruby typically do not include full multiprocessing support using the standard Unix system calls. If you will be using Sanzang in a Windows environment and you require support for fast batch processing, then you should use the Cygwin port of Ruby, which does not have this limitation. Unix-based platforms such as Linux, BSD, and Mac OS X are unaffected by this issue.

In addition to installation requirements, it may also be very useful to have a text editor that is aware of Unicode and other encodings, and able to display multilingual texts. One such application that is known to work well for this task is the gedit text editor, which is free software and is available on a variety of platforms.

Installation

To install Sanzang, the following command should suffice.

# gem install sanzang

This command will download and install Sanzang into your Ruby environment.

If you have installed Ruby 1.9 but cannot run the gem command, then you may need to set up your PATH environment variable first, so you can run ruby and gem from the command line.

Commands

Sanzang functions are accessible through the sanzang command and its subcommands. Runtime behavior is set through command line options and parameters. This allows sanzang to be easily scripted and automated.

Sanzang subcommands can also be abbreviated for the sake of convenience. For example, the “translate” subcommand could be abbreviated as “trans”, “tr”, or even the one letter “t”.
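One common way to support this kind of abbreviation is unique prefix matching. The sketch below illustrates that technique only; how Sanzang itself resolves abbreviations is not described in this manual, and the names SUBCOMMANDS and resolve_subcommand are hypothetical.

```ruby
# Hypothetical sketch of unique-prefix matching; not Sanzang's actual code.
SUBCOMMANDS = %w[batch reflow translate]

# Return the one subcommand that starts with the abbreviation,
# or nil when nothing matches or the abbreviation is ambiguous.
def resolve_subcommand(abbrev)
  matches = SUBCOMMANDS.select { |name| name.start_with?(abbrev) }
  matches.length == 1 ? matches.first : nil
end

%w[t tr trans r b x].each do |abbrev|
  puts "#{abbrev} -> #{resolve_subcommand(abbrev).inspect}"
end
```

Because the three subcommand names begin with different letters, every one-letter prefix is unambiguous in this sketch.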

sanzang

The main sanzang program acts as a front-end to subcommands, and also includes options for printing platform and version information.

Usage: sanzang [options]
Usage: sanzang <command> [options] [args]

Sanzang commands:
    batch       translate many files in parallel
    reflow      format CJK text for translation
    translate   standard single text translation

Options:
    -h, --help                       show this help message and exit
    -P, --platform                   show platform information and exit
    -V, --version                    show version number and exit

sanzang batch

The sanzang batch command translates files in parallel. A list of files is read from STDIN, while progress information is printed to STDERR. The list of output files written is printed to STDOUT at the end of the batch. The output directory is specified as a parameter.

Usage: sanzang batch [options] table output_dir < queue

Options:
    -h, --help                       show this help message and exit
    -E, --encoding=ENC               set data encoding to ENC
    -L, --list-encodings             list possible encodings
    -j, --jobs=N                     allow N concurrent processes

sanzang reflow

The command sanzang reflow can reformat Chinese, Japanese, or Korean text, in which terms are often split between lines. This formatter “reflows” the text based on its punctuation and horizontal spacing, separating the source text into lines that are much safer for translation.

Usage: sanzang reflow [options]

Options:
    -h, --help                       show this help message and exit
    -E, --encoding=ENC               set data encoding to ENC
    -L, --list-encodings             list possible encodings
    -i, --infile=FILE                read input text from FILE
    -o, --outfile=FILE               write output text to FILE
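To see why reflowing makes a text safer for translation, consider the following minimal sketch. It is not Sanzang's algorithm, which also takes horizontal spacing into account; the method name reflow_sketch and the choice of punctuation marks are assumptions made for illustration.

```ruby
# Hypothetical sketch of punctuation-based reflow; Sanzang's rules may differ.
def reflow_sketch(text)
  # Rejoin the text so that terms split across line breaks become whole again.
  joined = text.gsub(/\r?\n/, "")
  # Start a new line after each sentence-ending punctuation mark (assumed set).
  joined.gsub(/([。！？；])/) { "#{$1}\n" }
end

# "白佛言" and "世尊" are split across lines in this input.
puts reflow_sketch("阿難白\n佛言。唯然\n世尊。")
```

In the input above, a term such as 世尊 straddles a line break, so a rule for it could never match line by line. After reflowing, each line ends at a punctuation mark and the term is intact.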

sanzang translate

The command sanzang translate can perform translation of a single text stream or file. By default, this command reads from STDIN and writes to STDOUT. For concurrent translation of multiple files, see the batch command.

Usage: sanzang translate [options] table

Options:
    -h, --help                       show this help message and exit
    -E, --encoding=ENC               set data encoding to ENC
    -L, --list-encodings             list possible encodings
    -i, --infile=FILE                read input text from FILE
    -o, --outfile=FILE               write output text to FILE

Basic Usage

In the following example, we are working with a small text that we want to translate. With the first command, we reformat the text using reflow. Then we run translate with our translation table, to generate a translation listing.

$ sanzang reflow -i xinjing.txt -o lines.txt
$ sanzang translate -i lines.txt -o trans.txt TABLE.txt

The next two commands illustrate how these programs use standard input and output streams by default, how they can easily operate as text filters, and the way that sanzang subcommands can be abbreviated.

$ sanzang r < xinjing.txt | sanzang t TABLE.txt > trans.txt
$ cat xinjing.txt | sanzang r | sanzang t TABLE.txt | less

Advanced Usage

Batch Mode

We may have thousands of texts that we want to generate translation listings for with our translation table. For example, if our translation table was updated recently, we may want to regenerate an entire corpus of translation listings. To do this, we can use the find command to retrieve the file paths to our text files, and then pipe that output into sanzang batch.

$ find /srv/texts -type f | sanzang batch TABLE.txt /srv/trans

This command will find all files in the location specified, and then feed the file paths to sanzang batch, which will process them as a batch. If multiprocessing is supported on your platform, then the batch will be divided among all available processors.
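The same batch can also be driven from a Ruby script. The sketch below relies only on the documented interface of sanzang batch (table and output directory as arguments, the optional “-j” option, and the file list on STDIN); the helper names build_queue and run_batch are hypothetical, and the paths in the commented usage are the ones from the example above.

```ruby
require "open3"

# Build the newline-separated list of file paths that sanzang batch
# reads from STDIN. Unlike find, Dir.glob skips hidden files by default.
def build_queue(dir)
  Dir.glob(File.join(dir, "**", "*")).select { |path| File.file?(path) }.sort.join("\n")
end

# Run sanzang batch and return [stdout, status]. STDOUT holds the list of
# output files written; progress on STDERR still goes to the terminal.
def run_batch(table, out_dir, queue, jobs = nil)
  cmd = ["sanzang", "batch"]
  cmd += ["-j", jobs.to_s] if jobs
  cmd += [table, out_dir]
  Open3.capture2(*cmd, stdin_data: queue)
end

# Usage (requires sanzang to be installed and on the PATH):
#   written, status = run_batch("TABLE.txt", "/srv/trans", build_queue("/srv/texts"))
#   puts written if status.success?
```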

To determine whether multiprocessing is available in your version of Ruby, you can run the sanzang command with the “-P” option:

$ sanzang -P

If you see that “Fork implemented” is “true”, then your platform supports Unix style multiprocessing, and you can gain performance benefits from using the batch command. You should also examine the value for “Processors found”. This is the number of logical processors detected on your platform.

If you see that only one processor is detected and you know that there are more available on your platform, then you may want to use the “-j” flag when running the batch command, to manually specify the number of processes to use. This number can be set to the number of CPU cores available on your system. This option may be necessary on some less common platforms, such as some BSD distributions and commercial Unix variants.

Microsoft Windows ports of Ruby typically do not support Unix-style multiprocessing. If you require higher performance and want to utilize multiple CPU cores, then you should look into the Cygwin port of Ruby, which does not have this limitation.

The performance benefits of running in batch mode with multiprocessing may be very significant. The performance increases are typically proportional to the number of CPU cores available on the system. For example, on a large SMP system with 50 processors available, sanzang batch can run up to 50x as fast.

Text Encodings

Sanzang supports many possible text encodings. Option “-L” will list all available text encodings. Option “-E” will set the encoding to be used for all text data such as input texts, output texts, and table files. The other program I/O, such as messages for the terminal, will still be in the default encoding of the environment. For example, in a Windows environment that by default uses the IBM-437 encoding, specifying “-E” with a value of “UTF-16LE” will cause Sanzang to read and write all text data in UTF-16LE, but all other program messages will still be displayed in the console’s native IBM-437 encoding.

$ sanzang t -E UTF-16LE -i in.txt -o out.txt TABLE.txt

If the “-E” option is not specified, then Sanzang will use the default data encoding for that environment. The data encoding can be seen by running sanzang with the “--version” or “--platform” options.
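The difference between the data encoding and the environment's default encoding can be explored with Ruby's standard String and Encoding APIs. The sketch below is only an illustration; the helper name transcode is hypothetical, and the assumption that the names printed by “-L” correspond to Ruby's own encoding names is not confirmed by this manual.

```ruby
# Illustration using Ruby's standard String and Encoding APIs only.

# Convert text to another encoding: same characters, different bytes.
def transcode(text, encoding)
  text.encode(encoding)
end

title = "金剛般若波羅蜜經"
utf16 = transcode(title, "UTF-16LE")

puts title.bytesize                          # size in bytes as UTF-8
puts utf16.bytesize                          # size in bytes as UTF-16LE
puts transcode(utf16, "UTF-8") == title      # the round trip preserves the text

# Encoding names known to this Ruby (assumed to resemble the "-L" listing).
puts Encoding.name_list.include?("UTF-16LE")

# The environment's default encoding for external data.
puts Encoding.default_external
```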

Responsible Use

With comprehensive translation tables, Sanzang can often be quite accurate and effective. However, this program is still comparable to a simple machine, and it can never replace a human translator. Please understand the scope of this translation system when using it. No machines can take responsibility for a poor translation. In the end, it is you who are responsible for any and all publications.
