Segmentation - Why is Smart Chinese Reader more accurate than other segmenters out there?

Word segmentation is a necessary first step in Chinese language processing, and it is a difficult problem. Consider the sentence "日本是亚洲内人口老化最严重的国家" ("Japan is the country with the most severe population aging in Asia"), focusing on the subsequence "内人口". In this sentence, "内" is a word by itself, meaning "inside" or "within", while "人" and "口" together form the word "人口", meaning "population".

Traditional programs rely on dictionaries to recognize Chinese words. Consulting a dictionary, they find that "内人" ("wife") is a word, and then treat the leftover "口" as a word by itself. Here is the incorrect segmentation produced by one traditional program:

日本 是 亚洲 内人 口 老化 最 严重 的 国家
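
To make this failure mode concrete, here is a minimal sketch of one common dictionary-based strategy, forward maximum matching, which greedily takes the longest dictionary word at each position. The text above does not say which traditional program or strategy was used, so the choice of algorithm, the function name forward_max_match, and the tiny dictionary are all assumptions made for illustration.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; a single character is always accepted.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy dictionary containing only the multi-character words we need.
TOY_DICTIONARY = {"日本", "亚洲", "内人", "人口", "老化", "严重", "国家"}

segmented = forward_max_match("日本是亚洲内人口老化最严重的国家", TOY_DICTIONARY)
print(" ".join(segmented))
```

Because the greedy matcher commits to "内人" as soon as it sees it, it never considers that "人口" would be a better word for the following characters.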

Our Chinese word segmentation algorithm is based on statistical natural language processing (NLP). It analyzes a sentence as a whole. A sentence usually has a very large number of possible segmentations; our algorithm chooses the one with the highest overall probability and thereby achieves higher accuracy. The underlying technique is the Hidden Markov Model (HMM).

For the sentence above, "亚洲-内" and "人口-老化" are both frequently occurring word sequences and have much higher probabilities than the sequence "亚洲-内人-口-老化". Smart Chinese Reader therefore arrives at the following segmentation:

日本 是 亚洲 内 人口 老化 最 严重 的 国家
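
The idea of picking the segmentation with the highest overall probability can be sketched as a Viterbi-style search over all candidate words. This is only an illustration, not the actual Smart Chinese Reader implementation: it uses a simple word-bigram model, and the vocabulary, the probability values, the unseen_prob fallback, and the function name segment are all made up for the example.

```python
import math


def segment(sentence, vocab, bigram_prob, unseen_prob=1e-6, max_len=4):
    """Return the segmentation with the highest total word-bigram probability."""
    n = len(sentence)
    # best[i][w] = (log-probability of the best path ending at position i with
    # last word w, back pointer (previous position, previous word)).
    best = [dict() for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, None)
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = sentence[start:end]
            # Single characters are always allowed; longer words must be known.
            if len(word) > 1 and word not in vocab:
                continue
            for prev, (score, _) in best[start].items():
                p = bigram_prob.get((prev, word), unseen_prob)
                candidate = score + math.log(p)
                if word not in best[end] or candidate > best[end][word][0]:
                    best[end][word] = (candidate, (start, prev))
    # Follow the back pointers from the best final state to recover the words.
    pos, word = n, max(best[n], key=lambda w: best[n][w][0])
    words = []
    while word != "<s>":
        words.append(word)
        _, (pos, word) = best[pos][word]
    return words[::-1]


# Hypothetical vocabulary and bigram probabilities, chosen only for illustration.
VOCAB = {"日本", "亚洲", "内人", "人口", "老化", "严重", "国家"}
BIGRAM_PROB = {
    ("亚洲", "内"): 0.01,
    ("内", "人口"): 0.005,
    ("人口", "老化"): 0.02,
    ("亚洲", "内人"): 0.00001,
    ("内人", "口"): 0.00001,
    ("口", "老化"): 0.00001,
}

result = segment("日本是亚洲内人口老化最严重的国家", VOCAB, BIGRAM_PROB)
print(" ".join(result))
```

Unlike a greedy dictionary lookup, the search keeps every candidate word alive until the whole sentence has been scored, so the frequent pairs "亚洲-内" and "人口-老化" can outweigh the locally attractive word "内人".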

Because the segmenter is based on a statistical method, its accuracy ultimately depends on the quality of its statistics at two levels: first, character-character combinations in the construction of Chinese words, and second, word-word combinations in the construction of Chinese sentences. We computed these statistics from one year of articles from 人民日报 (People's Daily, the most authoritative newspaper in China), which had been split into words and annotated with part-of-speech (POS) tags by human editors. These data cover a large share of the possible Chinese character-character and word-word combinations. Combined with the advanced segmentation algorithm, they lead to high accuracy in segmenting Chinese words.
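
As a rough illustration of how word-word statistics can be derived from a hand-segmented corpus, the sketch below counts adjacent word pairs and turns the counts into conditional probabilities. It covers only the word-word level and ignores POS tags; the three-sentence corpus is invented for the example and is not taken from the People's Daily data, and the function name bigram_probabilities is hypothetical.

```python
from collections import Counter


def bigram_probabilities(segmented_sentences):
    """Estimate P(current word | previous word) from pre-segmented sentences."""
    context_counts = Counter()
    pair_counts = Counter()
    for words in segmented_sentences:
        tokens = ["<s>"] + words  # "<s>" marks the start of a sentence
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    # Relative frequency: count(prev, cur) / count(prev as a context).
    return {pair: count / context_counts[pair[0]]
            for pair, count in pair_counts.items()}


# Invented toy corpus; a real one would contain a year of newspaper text.
TOY_CORPUS = [
    ["亚洲", "内", "人口", "老化"],
    ["人口", "老化", "严重"],
    ["亚洲", "人口", "增长"],
]

probs = bigram_probabilities(TOY_CORPUS)
print(probs[("人口", "老化")])
```

In this toy corpus "人口" is followed by "老化" in two of its three occurrences, which is the kind of evidence that lets a statistical segmenter prefer "人口-老化" over rarer alternatives.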