Sanzang Utils Tutorial
- Translation Tables
- Text Reformatting: szu-r
- Translation: szu-t
- String Substitution: szu-ss
- Table Editor: szu-ed
- Man Pages
- Going Further
- Getting Help
The Sanzang Utilities are a small group of programs based around a simple way of translating from the CJK languages (Chinese, Japanese, and Korean). Using these tools, you can develop your own translation rules and apply them to generate rough translations.
The Sanzang Utilities were originally developed with the intention of building a system for assisting translators of the Chinese Buddhist canon into other languages. This is where the software gets the name “Sanzang” (三藏), meaning Tripitaka, which is a general designation for the Buddhist canon. However, these programs may also be very useful for translating from modern Chinese and other related languages. The Sanzang system is different from others in part because it closely follows the original language and grammar, allowing it to be useful even for some subtle or difficult passages where others fail.
These programs are implemented in the Python programming language (Python 3), which allows them to be portable and multi-platform. They are written as small command-line programs that may be run from a command shell. Because they are developed following the Unix philosophy, each program is meant to do one thing well, and each handles data through standard I/O streams. Due to this general orientation, it is advisable to use these programs in an environment like Unix, Linux, BSD, OS X, etc. For running these programs on Windows, a Cygwin environment is best.
In this tutorial, text typed at a command shell is prefixed with the sigil “#” (superuser) or “$” (normal user), following a common Unix convention for shell prompts. In general, examples of commands and command output are shown as if they were typed into a standard Unix shell.
All text should be saved and handled as UTF-8 text, as the Sanzang Utilities do all input and output in UTF-8. The terminal should be set to UTF-8 as well, and should ideally be capable of displaying CJK characters.
The tutorial is organized as follows. The first section is a guide for installing the Sanzang Utils software. The next section covers the translation table format that is used by several programs, and which is central to the translation system. The next several sections deal with each program in turn, and how the program may be used. The last sections deal with taking the next steps toward advanced usage, and how to get help.
Installation
Windows users: installation in a Cygwin environment is advisable.
Sanzang Utils may be installed in one of two ways. The first way is the “Python way” of installing programs. To follow this method, first navigate to the Sanzang Utils directory, and then enter:
# python3 setup.py install
This will install the programs onto your system very simply, but any uninstallation would need to be performed manually.
The other way of installing Sanzang Utils is the classic “Unix way” of installing programs: make. To use this method instead, first navigate to the Sanzang Utils directory and then enter:
# make install
Using make is more flexible and allows for easy uninstallation, which is performed as follows:
# make uninstall
You may choose the method that you find more suitable for your system.
Translation Tables
Some of the main Sanzang Utils programs utilize a type of formatted text file that we call a translation table. It is important to understand the format for this type of file. To begin with, a translation table is a Unicode text file in UTF-8 format. We can see an example of a translation table below:
宮保雞丁|kung-pao-chicken
烏龍茶|oolong-tea
餃子|dumplings
Each line of text is one record in the table, and the “|” (vertical bar) character separates the fields of data within a record. The first field is the source term, and each subsequent field is an equivalent term in a target language. The number of fields for each record must be the same throughout the entire table. The records with longer source terms are also placed before records with shorter source terms, which is a requirement of the translation table format.
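The format rules above can be checked mechanically. The following is a minimal Python sketch, not part of Sanzang Utils: the function names `load_table` and `is_longest_first` are hypothetical, and the sketch assumes that term length is counted in characters.

```python
def load_table(text):
    """Parse translation table text into a list of records (lists of fields)."""
    records = [line.split("|") for line in text.splitlines() if line.strip()]
    if not records:
        return records
    # Every record must have the same number of fields as the first one.
    width = len(records[0])
    for record in records:
        if len(record) != width:
            raise ValueError(
                f"expected {width} fields, got {len(record)}: {record!r}"
            )
    return records


def is_longest_first(records):
    """Check that source terms never get longer further down the table."""
    lengths = [len(record[0]) for record in records]
    return all(a >= b for a, b in zip(lengths, lengths[1:]))


TABLE_TEXT = "宮保雞丁|kung-pao-chicken\n烏龍茶|oolong-tea\n餃子|dumplings\n"
records = load_table(TABLE_TEXT)
print(records[0])
print(is_longest_first(records))
```

A table whose records have differing field counts raises `ValueError` in this sketch, and a table with a short source term listed before a longer one fails the ordering check.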
In this example, we can see that the source language is traditional Chinese, while the target language is English. The use of lowercase and hyphens between English words that we see is merely a useful convention and is not a requirement. In fact, the Sanzang Utils programs do not care which languages and scripts you decide to include in your translation tables, but we should keep in mind that the translation program is mostly useful for translating from Chinese, Japanese, and Korean, due to its translation methodology.
Text Reformatting: szu-r
The first utility to learn is szu-r, the reflow or reformatting utility. This is a formatter for CJK text that first collapses all vertical spacing, and then reorganizes the text into lines based on the horizontal spacing and punctuation found. It divides CJK text into logical lines that may be translated more safely. Usage information for this tool is shown below:
Usage: szu-r [options] [file ...]

Reformat CJK text into lines safe for translation.

Options:
  -h, --help       print this help message and exit
  -v, --verbose    include information useful for debugging
Suppose we have a text in which words are broken up between lines:
世
尊食時，
著
衣
持缽
We can easily fix the text so it will translate safely by running it through szu-r. This is what the command and its output will look like:
$ szu-r bad.txt
世尊食時，
著衣持缽
This is exactly what we want: lines of text that can be safely translated.
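To make the idea concrete, here is a rough Python sketch of this kind of reflow. It is not the szu-r algorithm: it simply drops all whitespace and starts a new line after each punctuation mark, it ignores the horizontal spacing that szu-r also takes into account, and the `BREAK_AFTER` set is an assumption, since this tutorial does not list the punctuation that szu-r uses.

```python
# Assumed set of punctuation marks that end a logical line; the list used by
# the real tool is not given in this tutorial.
BREAK_AFTER = "，。；：！？"


def reflow(text):
    """Join broken lines, then start a new line after each break character."""
    joined = "".join(text.split())  # drop all whitespace, including newlines
    lines, current = [], ""
    for char in joined:
        current += char
        if char in BREAK_AFTER:
            lines.append(current)
            current = ""
    if current:  # keep any trailing text that has no closing punctuation
        lines.append(current)
    return lines


# The characters from the example above, with breaks in the middle of words.
broken = "世\n尊食時，\n著\n衣\n持缽\n"
for line in reflow(broken):
    print(line)
```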
Of course, we could also redirect the output to a file:
$ szu-r bad.txt > good.txt
This is a normal way to preprocess a text before running the translator.
Translation: szu-t
The next step is to learn szu-t, the translator. Since this is the main translation program, it is important to understand its method and features. The usage information for szu-t is shown below:
Usage: szu-t [options] table_file [file ...]

Translate CJK text using a translation table file.

Options:
  -h, --help       print this help message and exit
  -v, --verbose    include information useful for debugging
As we can see from the usage information above, the translator requires a translation table file. A translation table is a set of rules for translating text, organized into a tabular format, and saved as a simple UTF-8 text file. The format, however, is quite specific. As our first example, we can try to translate the Chinese version of the expression “Hello, world!”:
你好，世界！
We can divide the Chinese expression up into two parts (你好 and 世界), and then write the following translation table:
你好|nǐhǎo|hello
世界|shìjiè|world
In our first column, we have the source language (Chinese), followed by the pronunciation (Pinyin), and finally the English. If we now run szu-t with our translation table, and input the Chinese phrase, it will generate a translation for us:
$ echo '你好，世界！' | szu-t mytable.txt
1.1| 你好，世界！
1.2| nǐhǎo ， shìjiè ！
1.3| hello ， world ！
Although this example is trivial, it shows some of the simplicity and flexibility of the system. By just defining a table of translation rules, you are able to define your own way of translating. The translation method scales up as well, as it is anticipated that a truly capable translation system naturally requires thousands of translation rules.
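The core idea of table-driven translation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the szu-t implementation: the hypothetical function `translate_line` applies the rules in table order and pads each target term with spaces, which happens to reproduce the spacing seen in the sample output.

```python
def translate_line(line, records):
    """Return the source line followed by one rough rendering per target column."""
    outputs = [line]
    for column in range(1, len(records[0])):
        rendered = line
        # Rules are applied in table order, so the table is assumed to list
        # longer source terms first. Replacing into Latin-script targets keeps
        # later CJK rules from matching inside earlier substitutions.
        for record in records:
            rendered = rendered.replace(record[0], " " + record[column] + " ")
        outputs.append(rendered)
    return outputs


TABLE = [
    ["你好", "nǐhǎo", "hello"],
    ["世界", "shìjiè", "world"],
]

# The "1.N|" prefix only imitates the numbering in the sample output.
for number, text in enumerate(translate_line("你好，世界！", TABLE), start=1):
    print(f"1.{number}|{text}")
```

The ordering requirement matters for this approach: if a rule for 世 alone were applied before the rule for 世界, the longer term would already be broken up and could never match.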
Consider the following translated excerpt from a traditional Chinese text:
43.1|善男子、善女人，
43.2| shàn-nánzǐ & shàn-nǚrén ,
43.3| good-man & good-woman ,
44.1|發阿耨多羅三藐三菩提心，
44.2| fā-ānòuduōluó-sānmiǎo-sānpútí-xīn ,
44.3| develop-the-mind-of-anuttarā-samyaksaṃbodhi ,
45.1|應如是住，
45.2| yìng rúshì zhù ,
45.3| should/worthy thusly abide/in ,
46.1|如是降伏其心。」
46.2| rúshì xiángfú qí xīn ¶ ”
46.3| thusly subdue (preceding) mind/heart ¶ ”
Although the grammar is rough and close to the original Chinese grammar, the basic meaning becomes apparent from the translation alone. From this, we can see that the system is useful and readily applicable so long as the translation table rules are adequately defined.
String Substitution: szu-ss
Along with the standard translator, there is also a smaller tool, szu-ss, for string substitution, or string-to-string translation. Using this tool, you can perform fixed string substitutions within the source text itself, and uppercase and lowercase variations are handled automatically. The usage information is as follows:
Usage: szu-ss [options] table_file [file ...]

Make string substitutions using two-column translation table

Options
  -h, --help       print this help message and exit
  -v, --verbose    include information useful for debugging
As we can see from the usage information, szu-ss requires a two-column translation table of source terms and their equivalents. For example, we could use a translation table for inserting the proper diacritics into names:
Purnamaitrayaniputra|Pūrṇamaitrāyaniputra
Mahakausthila|Mahākauṣṭhila
Grdhrakuta|Gṛdhrakūṭa
Rajagrha|Rājagṛha
Using this table, we could automatically make these replacements in our text by using szu-ss:
$ szu-ss mytable.txt input.txt > output.txt
Of course, szu-ss may be used for many other purposes as well, wherever string substitutions are needed in text. You could just as easily use it for translating from traditional Chinese to simplified Chinese, or any number of other different string or character replacement tasks.
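As an illustration of the traditional-to-simplified case, the following Python sketch applies a tiny two-column table as fixed string substitutions. It is only a sketch of the general technique: the function name `substitute` and the three-rule table are made up for this example, and the automatic handling of uppercase and lowercase variations that szu-ss performs is not modeled here.

```python
def substitute(text, rules):
    """Apply each (source, replacement) rule in order as a fixed string substitution."""
    for source, replacement in rules:
        text = text.replace(source, replacement)
    return text


# A tiny illustrative table mapping traditional characters to simplified ones.
RULES_TEXT = "烏|乌\n龍|龙\n餃|饺\n"
rules = [tuple(line.split("|")) for line in RULES_TEXT.splitlines()]

print(substitute("烏龍茶", rules))
```

Unlike the translator sketch, nothing is padded with spaces here, because the replacements are meant to stay inside the running text.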
Table Editor: szu-ed
Because a translation table file must be saved in a specific format, a table editor program is available: szu-ed. This program can easily handle normal table editing tasks with some simple commands. Because it reads user input one line at a time from the standard input, it can also be easily scripted to automate table changes. The usage information for szu-ed is shown below:
Usage: szu-ed [options] table_file

Edit translation table rules using a program of simple commands.

Options:
  -h, --help       print this help message and exit
We can see that a translation table file must be specified when invoking the editor. If the translation table file does not yet exist, then the editor will create and open a new translation table file. Once the editor has started, we can enter commands or data. A command starts with the “\” character, and has no arguments or parameters. Commands for setting the editor mode are the following:
\get – Get a table rule (requires a source term)
\set – Set a table rule (requires a table rule)
\rm – Remove a table rule (requires a source term)
Commands for immediate editor operations are the following:
\p – Print the current table
\q – Quit without saving
\w – Write back to the file
\wq – Write back to the file and quit
Along with editor commands, you can also specify data. A line of data is any line that does not begin with the “\” character. When the editor receives a line of data, it will interpret that data based on the current editor mode. For example, if the editor receives a line of data that is a translation table rule, and the mode is “set”, then the editor will set a translation table rule accordingly. If only a source term is provided and the mode is “rm”, then it will remove a rule from the table for that source term. The default editor mode is “set”.
A normal pattern of editing a table is to set the mode, enter data for that mode, etc., and finally write the table rules back to the translation table file. An example of invoking the editor to create our earliest translation table would be the following:
$ szu-ed mytable.txt
\set
你好|nǐhǎo|hello
世界|shìjiè|world
\wq
The editor will handle the translation table sorting as long as the records given to it are properly specified.
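Because the editor reads its commands from standard input, an editing session can be generated by a script. The Python sketch below builds such a session as text, using only the commands described above. The helper name `build_editor_script` is hypothetical, and the commented-out `subprocess` call assumes that szu-ed is installed and on the PATH.

```python
def build_editor_script(set_rules=(), remove_terms=()):
    """Build szu-ed input: mode commands, data lines, then write and quit."""
    lines = []
    if remove_terms:
        lines.append("\\rm")  # switch to remove mode
        lines.extend(remove_terms)  # one source term per line
    if set_rules:
        lines.append("\\set")  # switch to set mode
        lines.extend("|".join(fields) for fields in set_rules)
    lines.append("\\wq")  # write back to the file and quit
    return "\n".join(lines) + "\n"


script = build_editor_script(
    set_rules=[("你好", "nǐhǎo", "hello"), ("世界", "shìjiè", "world")]
)
print(script, end="")

# To apply the script (assuming szu-ed is on the PATH):
#   import subprocess
#   subprocess.run(["szu-ed", "mytable.txt"], input=script,
#                  encoding="utf-8", check=True)
```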
Man Pages
When you are working with Sanzang Utils, you may want a reference for each utility. Each has a standard Unix manual page that can be used for reference. You can check the manual page for any of these programs as follows:
$ man szu-t
These manual pages will provide a short description of the utility, usage information, and technical details such as exit codes.
Going Further
The goal of this tutorial is to get you started, but in order to really understand the system thoroughly, you will need to spend time becoming familiar with these tools. In particular, learning how to make suitable translation tables that can produce clear and comprehensible translations is something that simply requires practice. However, the time and effort to do so may be worth it, especially when considering the ease and transparency of sharing these files as simple UTF-8 text files.
You will also find that, because these programs use standard input and output streams, there are many ways to combine them with other programs that were not originally anticipated. This flexibility creates opportunities to automate these tools, use them to build bigger systems, mine data, perform research, etc. We hope that the more time you spend with these tools, the more useful ways you will find to use them.
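One way to automate the tools from another program is to chain them through their standard streams. The following Python sketch uses the standard `subprocess` module to run commands one after another, passing each command's output to the next command's input. The helper `run_pipeline` is hypothetical, and the stand-in command only exists so that the example runs without Sanzang Utils installed. The commented-out call assumes that szu-r and szu-t are on the PATH and that szu-r, like szu-t, reads standard input when no file is given.

```python
import subprocess
import sys


def run_pipeline(commands, text):
    """Run commands in sequence, feeding each one the previous command's output."""
    for command in commands:
        result = subprocess.run(
            command,
            input=text,
            capture_output=True,
            encoding="utf-8",  # the tools do all input and output in UTF-8
            check=True,  # raise an error if a command exits with a failure code
        )
        text = result.stdout
    return text


# Stand-in filter so the sketch is runnable anywhere: it uppercases its input.
upper = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
print(run_pipeline([upper], "hello\n"), end="")

# With Sanzang Utils installed, the same helper could chain the real tools:
#   run_pipeline([["szu-r"], ["szu-t", "mytable.txt"]], source_text)
```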
Getting Help
If you have read the tutorial and manual pages, and still have problems, please contact us. If you run into an issue that you think may be a bug, please file a bug report with the project. If there is a feature or issue that you think has been left undocumented, please notify us as well.
Web: Sanzang Homepage