Chinese cannot be treated the same way as English. By its nature, which in this respect is much like Japanese, Korean or Old Vietnamese, it calls for different approaches to most NLP tasks. Chinese text has no visually separated words and no explicitly marked lexemes, and this causes most of the trouble in lexical or morphological text analysis.
Here we give a few guidelines on how to pre-process a Chinese electronic text in order to automate Chinese dictionary lookup or any other Natural Language Processing operation.
As an example, we take the simple task of pre-processing a body of Chinese text. The procedure splits the source text into pieces so that another part of the application can look them up automatically in an electronic dictionary.
To make this possible, the following steps should be implemented:
- Split the text at Chinese punctuation marks;
- Split the text wherever spaces, tabs and new lines occur;
- Split the text into single characters;
- Generate all possible substrings in order to get the list of candidate Chinese words, phrases, idioms, etc.
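The first three steps can be sketched in Python roughly as follows. This is only a minimal illustration: the `PUNCTUATION` string is an assumed, non-exhaustive set of marks, and the names `split_segments` and `split_characters` are made up for this example.

```python
import re

# Assumed, non-exhaustive set of Chinese (full-width) and ASCII punctuation.
PUNCTUATION = "，。、；：？！…—·「」『』（）《》〈〉【】“”‘’" + ",.;:?!()[]\"'"

# One or more punctuation marks or whitespace characters act as a separator.
# In Python 3, \s also matches the full-width (ideographic) space.
SEPARATORS = re.compile("[" + re.escape(PUNCTUATION) + r"\s]+")


def split_segments(text):
    """Split on punctuation, spaces, tabs and new lines; drop empty pieces."""
    return [piece for piece in SEPARATORS.split(text) if piece]


def split_characters(segment):
    """Split a cleaned segment into single characters."""
    return list(segment)


segments = split_segments("你好，世界！ 我们\t学习\n中文。")
characters = split_characters("中文")
```

Punctuation and whitespace are handled in a single pass here because both simply mark places where no word can continue; they could just as well be two separate passes.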
Steps 1 and 2 are common to almost any language, but steps 3 and 4 need a little more detail. The order above reflects the priority of the operations, which in short may be described as ‘from bigger to smaller’ or ‘from general to particular’.
Single-character splitting is the translator’s so-called last resort when searching a dictionary. Modern Chinese has no uniform way of marking word boundaries, so a single character is the smallest lexeme that can be found in a Chinese dictionary.
Substring generation is performed on the parts of the source text that have already been cleaned of punctuation and other printed signs. In our case the overall movement is from left to right, one character per step, and the shortest substring is two characters long. Because every contiguous substring is generated, such an algorithm covers 100% of the source text.
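A possible sketch of the substring generation in Python is given below. The function name `generate_substrings` and the `min_length` parameter are invented for this illustration, and listing the longer candidates first is an assumption based on the ‘from bigger to smaller’ priority, not a requirement.

```python
def generate_substrings(segment, min_length=2):
    """Return every contiguous substring of at least min_length characters.

    The start position moves left to right, one character per step.
    For each start position the longer candidates are listed first.
    """
    candidates = []
    for start in range(len(segment)):
        # end runs from the end of the segment down to start + min_length
        for end in range(len(segment), start + min_length - 1, -1):
            candidates.append(segment[start:end])
    return candidates


candidates = generate_substrings("中华人民")
# candidates: 中华人民, 中华人, 中华, 华人民, 华人, 人民
```

A segment of n characters yields n(n-1)/2 candidates of two or more characters, so it helps to keep the segments short by splitting on punctuation first.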
It should also be mentioned that in some circumstances the text needs to be cleaned again in order to ‘wipe out’ non-Chinese inclusions such as numbers, letters, words, brackets, parentheses, etc., if any.
Anyone who has built an application that does the job described above is in a position to add automated sub-tasks to it, such as dictionary lookup.
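To show how the pieces might fit together, here is a rough sketch of a lookup over one cleaned segment. `TOY_DICTIONARY` is a hypothetical stand-in for a real electronic dictionary, and `lookup_segment` is a name invented for this example; multi-character candidates are tried first and single characters serve as the fallback.

```python
# Hypothetical toy dictionary; a real application would load a full lexicon.
TOY_DICTIONARY = {
    "中文": "Chinese language",
    "学习": "to study",
    "我": "I",
    "学": "to learn",
    "习": "to practise",
    "中": "middle",
    "文": "writing",
}


def lookup_segment(segment, dictionary):
    """Collect dictionary hits: multi-character candidates, then characters."""
    hits = []
    # Substrings of two or more characters, left to right, longest first.
    for start in range(len(segment)):
        for end in range(len(segment), start + 1, -1):
            candidate = segment[start:end]
            if candidate in dictionary:
                hits.append((candidate, dictionary[candidate]))
    # Single characters as the last resort.
    for character in segment:
        if character in dictionary:
            hits.append((character, dictionary[character]))
    return hits


hits = lookup_segment("我学习中文", TOY_DICTIONARY)
```

The result is a list of candidate entries, not a segmentation: choosing which hits are the intended words is a separate problem.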

1 comment:
The Adso program automates most of this and is considered industry standard at this point. It is not usually necessary to write this sort of program from scratch these days.
http://popupchinese.com/words/downloads
We're working on semantic analysis, but it is still an incredible amount of work and there is much room for improvement.