Study on the English Corresponding Unit of Chinese Clause

This paper annotates the English corresponding units of Chinese clauses in Chinese-English translation and statistically analyzes them. Firstly, based on Chinese clause segmentation, we segment English target text into corresponding units (clause) to get a Chinese-to-English clause-aligned parallel corpus. Then, we annotate the grammatical properties of the English corresponding clauses in the corpus. Finally, we find the distribution characteristics of grammatical properties of English corresponding clauses by statistically analyzing the annotated corpus: there are more clauses (1631, 74.41%) than sentences (561, 25.59%); there are more major clauses (1719, 78.42%) than subordinate clauses (473, 21.58%); there are more adverbial clauses (392, 82.88%) than attributive clauses (81, 17.12%) and more non-defining clauses (358, 75.69%) than restrictive relative clauses (115, 24.31%) in subordinate clauses; and there are more simple clauses (1142, 52.1%) than coordinate clauses (1050, 47.9%).

The clause is the basic unit of discourse translation. Previous research has shown that, in machine translation systems, the acceptance rate of clause-based translation is 45% higher than that of sentence-based translation [1]. The clause-based translation model has thus become an important subject for discourse-based machine translation studies [2]. Statistical translation today relies on large-scale bilingual aligned samples and bilingual grammatical knowledge. Therefore, one of the most important tasks for discourse-based machine translation studies is building clause-aligned and grammatically annotated parallel corpora.
In Chinese-to-English translation, English corresponding units of Chinese clauses (ECUCC) in the translation are complex and diverse. One key issue of translation is to select the proper form of ECUCC for translation. For example, in (1), Chinese clauses C1 and C2 are translated to English corresponding units E1 and E2. Functionally, E1 is the major clause and E2 is the subordinate clause (the adverbial clause); structurally, E1 is the restrictive relative clause and E2 is the non-defining clause (the present participle).
As shown in (1), providing bilingual clause-aligned samples with grammatical annotations for clauses would not only offer an effective guide for the translation of parallel clauses, but would also lay a foundation for discourse-based machine translation studies. At present, several Chinese-English parallel corpora have been built on sentence (usually marked with a period) or paragraph alignment [3,4]. Studies on clause-aligned parallel corpora are still at a preliminary stage [2], and there are few annotated resources of grammatical knowledge for parallel texts segmented into clauses.
In this paper, we recount our experience in annotating ECUCC and statistically analyze them. Firstly, based on Chinese clause segmentation, we segment the English texts into corresponding units to get a Chinese-to-English clause-aligned parallel corpus (Sect. 1). Then, we annotate the grammatical properties of the ECUCC in the corpus to get a grammatically annotated corpus (Sect. 2). Finally, we find the distribution characteristics of the grammatical properties of ECUCC by statistically analyzing the annotated corpus (Sect. 3).

Chinese-to-English Clause-Aligned Parallel Corpus
Building a Chinese-to-English clause-aligned parallel corpus is based on the following principles: (1) define the rules for Chinese clause segmentation in Chinese-English parallel texts; (2) based on the results of Chinese clause segmentation, divide English translated texts into units, and get the best English corresponding units in a linear sequence, which are ECUCC.
The rules for Chinese clause segmentation in our study apply the definition of clause by Li [6,7]: "Clause is the basic unit of discourse analysis, including the traditional simple sentences and clauses in compound sentences. Structurally, an independent clause contains at least one predicate and at least one proposition; functionally, an independent clause is not used as any grammatical component to other clauses, and there is only propositional relationship between two independent clauses; formally, there must be punctuation (comma, semicolon, or period) between two independent clauses. Besides, some traditional phrases, which are similar to typical clauses in structure, function and forms are treated as clauses." Studies [5,6] have shown that this definition of the Chinese clause is operable enough to support the creation and automatic analysis of large-scale annotated corpora.
The phrase "based on the results of Chinese clause segmentation, divide English translated texts into units" means that we divide the English translated text according to the segmentation of the Chinese clauses. In example (2), the Chinese text is divided into three clauses, marked as C1, C2 and C3, and the English text is accordingly divided into the corresponding units E1, E2 and E3. Grammatically, E1 is a typical clause, whereas E2 and E3 are not: E2 is a clause group and E3 is an infinitive phrase. By English grammar alone, E2 would be divided into two English clauses ("…expand…" and "use…"). However, following the rules for Chinese clause segmentation, we treat E2, the unit corresponding to Chinese clause C2, as the final unit. We therefore call E1, E2 and E3 the English corresponding units of C1, C2 and C3. For convenience, we sometimes also refer to these corresponding units of Chinese clauses as "English clauses".
(2) C1 浙江省今后将进一步提高对外开放水平, // C2 努力扩大对外贸易、利用外资和国际经济技术合作, / C3 并逐步完善对外经贸营销网络。 E1 Zhejiang Province will further raise the level of opening up to the outside world, // E2 diligently expand its foreign trade, and use foreign funds and international economic and technical co-operation, / E3 to progressively perfect its marketing network of foreign economic and trade business.
(3) E1 This number was an increase of 20.3 billion yuan, / E2 a growth of 27% compared to the end of the previous year.
The "best English corresponding units in a linear sequence" means that the segmentation of English corresponding units should follow the Chinese clauses in linear order, but not necessarily in semantics. For example, in (3), E1 and E2 are not semantically equivalent to C1 and C2 because of the position of the adverbial ("compared to the end of the previous year"). In this case, E1 and E2 are nevertheless the best English corresponding units of C1 and C2 in a linear sequence.
Based on the above principles, we select 100 Chinese-English parallel texts (news) to build a Chinese-to-English clause-aligned parallel corpus, in which Chinese clauses and their English corresponding units are aligned.
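The alignment principle described in this section can be illustrated with a minimal Python sketch. It assumes that an aligned pair is stored as two marked-up strings in the notation of example (2), with unit labels (C1, E1, ...) and "/" or "//" boundary marks; the function names `split_units` and `align` are hypothetical and are not part of any tool used to build the corpus.

```python
import re


def split_units(text: str, prefix: str) -> dict[int, str]:
    """Split 'E1 ... // E2 ... / E3 ...' into {1: '...', 2: '...', 3: '...'}."""
    # re.split with a capture group keeps the unit numbers in the result list.
    parts = re.split(rf"\b{prefix}(\d+)\b", text)
    units = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        # Drop surrounding whitespace and the trailing '/' or '//' boundary mark.
        units[int(number)] = body.strip().rstrip("/").strip()
    return units


def align(chinese: str, english: str) -> list[tuple[int, str, str]]:
    """Pair the n-th Chinese clause with the n-th English unit (linear order)."""
    c_units = split_units(chinese, "C")
    e_units = split_units(english, "E")
    if c_units.keys() != e_units.keys():
        raise ValueError("Chinese clauses and English units do not line up")
    return [(n, c_units[n], e_units[n]) for n in sorted(c_units)]


# Example (2) from the text.
chinese = (
    "C1 浙江省今后将进一步提高对外开放水平, // "
    "C2 努力扩大对外贸易、利用外资和国际经济技术合作, / "
    "C3 并逐步完善对外经贸营销网络。"
)
english = (
    "E1 Zhejiang Province will further raise the level of opening up to the "
    "outside world, // E2 diligently expand its foreign trade, and use foreign "
    "funds and international economic and technical co-operation, / "
    "E3 to progressively perfect its marketing network of foreign economic "
    "and trade business."
)

pairs = align(chinese, english)
for n, c, e in pairs:
    print(f"C{n} <-> E{n}: {c} | {e}")
```

The pairing is purely positional, which mirrors the "linear sequence" principle: the sketch makes no attempt to check that paired units are semantically equivalent.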

ECUCC Grammatically Annotated Corpus
In the ECUCC grammatically-annotated corpus, 2192 ECUCC taken from the Chinese-to-English Clause-Aligned Corpus are analyzed and annotated. Grammatical properties of ECUCC are analyzed and annotated under certain principles and systems.

Grammatical Analytic Principles of ECUCC
To deal with problems of grammatical analysis of English corresponding units, we formulated the analytic principles through analysis and verification.
First, in identifying the grammatical properties of ECUCC, both their inner structure and their external function should be considered. As shown in example (1), structurally, E1 and E2 differ in their core verbs (restrictive relative versus non-defining); functionally, they differ in their role in the global structure (major versus subordinate).
Second, in identifying the major object of ECUCC, the global function takes priority over the local function. Sometimes the inner structure of an ECUCC is complicated, and it is difficult to identify its grammatical properties. In this case, the identification of the structure and function is based on the major object of the unit, while the identification of the major object is based on the function of the unit in its global structure. For example, in (4), E1 has a complicated inner structure (it consists of a major clause and an adverbial clause, and the adverbial clause is in turn composed of coordinate attributive clauses). The whole sentence is a complex sentence: E1 is the major clause and E2 is a subordinate clause. Thus, E1 can be identified as "major clause + finite structure" according to the function of its major object ("recently there were…").
(4) E1 According to the Pudong New Region's Economy and Trade Bureau follow-up investigation into 157 large industrial projects that were introduced in the seven years of Pudong's development, and that have more than 10 million US dollars invested, recently there were 116 that finished construction and went into operation, / E2 with the percentage of going into operation reaching up to 73.9%.
Third, omissions in ECUCC sometimes influence the identification of their grammatical properties. In this case, the analysis should be based on the completed sentence. For example, in (5), the preposition "with" is elided in units E3 and E4, so E3 and E4 must be completed before the analysis. E3 and E4 are thus identified as "coordinate", "prepositional phrase" and "adverbial".
(5) E1 From January to November of last year, the inland signed 1,474 new contracts for foreign contracted projects and cooperation of labor service and design consultation in Hong Kong, / E2 with a contracted value of 2.094 billion US dollars, // E3 a completed turnover of 1.58 billion US dollars // E4 and 21,153 man-times of labor service sent to Hong Kong.

Grammatical Analytic System of ECUCC
Based on the studies of the corpus, the grammatical analytic system has been functionally and structurally formed [7] (Further details follow in Sect. 3).
Functionally: firstly, according to the grammatical properties of the whole sentence (simple sentence, coordinate sentence, complex sentence) and the position of a clause within it, English clauses can be divided into independent clauses, coordinate clauses, major clauses and subordinate clauses; secondly, according to their function, clauses can be divided into adverbial clauses, attributive clauses and so on; finally, according to the number of clauses with the same function in a sentence, clauses can be divided into simple clauses and coordinate clauses.
Structurally: firstly, according to the properties of predicate verbs, clauses can be divided into restrictive relative clauses and non-defining clauses; secondly, depending on particular conditions, non-defining clauses can be divided into infinitive, present participle, past participle, non-verb, preposition structure and other subcategories.
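As a rough illustration, the functional and structural dimensions described above could be encoded as a small annotation record. The sketch below is only one possible encoding: the class and field names (`Position`, `Function`, `Arity`, `Structure`, `NonDefiningType`, `Annotation`) are assumptions introduced here, not the format actually used in the corpus. The usage lines encode the labels the text assigns to E3 in example (5): coordinate, prepositional phrase, adverbial.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Position(Enum):  # position of the unit in the whole sentence
    INDEPENDENT = "independent clause"
    COORDINATE = "coordinate clause"
    MAJOR = "major clause"
    SUBORDINATE = "subordinate clause"


class Function(Enum):  # grammatical function of the unit
    ADVERBIAL = "adverbial"
    ATTRIBUTIVE = "attributive"


class Arity(Enum):  # one unit, or several units sharing the same function
    SIMPLE = "simple"
    COORDINATE = "coordinate"


class Structure(Enum):  # distinction based on the predicate verb
    RESTRICTIVE_RELATIVE = "restrictive relative"
    NON_DEFINING = "non-defining"


class NonDefiningType(Enum):  # subcategories of non-defining clauses
    INFINITIVE = "infinitive"
    PRESENT_PARTICIPLE = "present participle"
    PAST_PARTICIPLE = "past participle"
    NON_VERB = "non-verb"
    NOMINATIVE_ABSOLUTE = "nominative absolute structure"
    PREPOSITIONAL_PHRASE = "prepositional phrase"


@dataclass(frozen=True)
class Annotation:
    position: Position
    arity: Arity
    structure: Structure
    function: Optional[Function] = None
    subtype: Optional[NonDefiningType] = None

    def __post_init__(self) -> None:
        # A subcategory only makes sense for non-defining clauses.
        if self.subtype is not None and self.structure is not Structure.NON_DEFINING:
            raise ValueError("subtype requires a non-defining structure")

    def label(self) -> str:
        """Join the labels that are set, functional dimensions first."""
        fields = [self.position, self.function, self.arity, self.structure, self.subtype]
        return " + ".join(f.value for f in fields if f is not None)


# E3 in example (5): coordinate, prepositional phrase, adverbial.
e3 = Annotation(
    position=Position.SUBORDINATE,
    arity=Arity.COORDINATE,
    structure=Structure.NON_DEFINING,
    function=Function.ADVERBIAL,
    subtype=NonDefiningType.PREPOSITIONAL_PHRASE,
)
print(e3.label())
```

Keeping the functional and structural dimensions as separate fields reflects the first analytic principle, that inner structure and external function are both considered when a unit is identified.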

Sentences and Clauses
An ECUCC may be a sentence or a clause. A separate sentence, as in example (6), and a clause group, as in example (7), can independently perform an utterance function. Clauses, which include coordinate clauses and the various types of major and subordinate clauses (see Sect. 3.2), cannot perform an utterance function on their own; they must be combined with other clauses to form a complete sentence.
(6) E1 Construction is a principal economic activity in developing Pudong. / E2 These years there have been several hundred construction companies and over four thousand construction sites that have spread out all over this stretch of hot turf.
(7) The unification of the world economy is intensifying with each passing day. Facing the Challenges brought by globalization, each country should continuously implement sound economic policies and deepen structural reform so as to enjoy the most benefits from the process of globalization and to minimize the negative effects.
The statistical distribution of sentences and clauses of ECUCC is given in Fig. 1. The results show that clauses are about three times as frequent as sentences, which indicates that Chinese clauses are more likely to be translated as English clauses than as English sentences.

Major Clauses and Subordinate Clauses
ECUCC may be characterized as major clauses or subordinate clauses. English major clause units include simple major clauses (example 9) and coordinate major clauses (example 10) (details of simple/coordinate clauses follow in Sect. 3.5 below), coordinate clauses (example 8), sentences (example 6) and clause groups (example 7). The major clause units are generally finite structures and can be used independently as sentences. Subordinate clause units include 20 kinds of clauses such as attributive clauses, adverbial clauses, infinitive clauses and present participle clauses (see Sects. 3.3, 3.4 and Table 1). Subordinate clause units are rarely used as independent sentences.
(8) E1 Last October, the Import and Export Bank of China invited Nomura Securities of Japan to be advisors, / E2 and submitted a formal assessment application to the Commune Bond Research Institute of Japan, a famous assessment institution in Japan (Coordinate clause).
(9) …… C1 而是借鉴发达国家和深圳等特区的经验教训, / C2 聘请国内外有关专家学者, // C3 积极、及时地制定和推出法规性文件, // C4 使这些经济活动一出现就被纳入法制轨道。 ……. E1 Instead, Pudong is taking advantage of the lessons from experience of developed countries and special regions such as Shenzhen / E2 by hiring appropriate domestic and foreign specialists and scholars, // E3 by actively and promptly formulating and issuing regulatory documents, // E4 and by ensuring that these economic activities are incorporated into the sphere of influence of the legal system as soon as they appear (Simple major clause).
(10) C1 当前经济的关键不是争取更高的增长速度 // C2 而是调整结构,提高效益, E1 The key of the current economy is not striving for a higher growth rate, // E2 but is adjusting structures and increasing benefits, / E3 so as to make macro controls which were implemented from the second half year of 1993 obtain greater achievements // E4 and push the national economy onto a road of constant, rapid and healthy development (Coordinate major clause).
The statistical distribution of major clauses and subordinate clauses in English corresponding units is given in Fig. 2. The results show that major clauses are more than three and a half times as frequent as subordinate clauses (1719 vs. 473), which indicates that Chinese clauses are more likely to be translated as English major clauses than as subordinate clauses.
(11) C1 如果亚洲的经济形势恶化或者金融危机对外界的影响增大, / C2 全球原油需求量的增长幅度可能会进一步缩小。 E1 If the Asian economic situation deteriorates or the outside influence of the financial crisis becomes larger, / E2 the growth rate of worldwide demand for crude oil may possibly further decrease (Simple adverbial clause).
(13) E1 Recently, the Import and Export Bank of China won a bond credit rating of AA in Japan, / E2 which is currently the highest bond rating given to a Chinese bank by the Japanese financial market (Simple attributive clause).
(14) C1 据统计, 在目前已投产外资大企业的主要产品中, 有一百零二个品牌, E1 According to statistics, among the main products of large foreign funded enterprises that have currently been put into production, there are 102 brands, / E2 of which 52 are foreign brands // E3 and 50 are domestic brands (Coordinate attributive clause).
The statistical distribution of adverbial clauses and attributive clauses in English corresponding subordinate clauses is given in Fig. 3. The results show that adverbial clauses are nearly five times as frequent as attributive clauses (392 vs. 81), which indicates that English corresponding subordinate clauses are adverbial clauses in most situations.

Structures of Subordinate Clauses: Restrictive Relative and Non-defining
Depending on their core verbs, English subordinate clauses can be divided into restrictive relative clauses and non-defining clauses. Core verbs in restrictive relative clauses vary in tense (see examples 11-14), whereas core verbs in non-defining clauses either do not vary in tense or are omitted. Non-defining clauses can be divided into infinitive (example 15), present participle (example 16), past participle (example 17), non-verb (example 18), nominative absolute structure (example 19), prepositional phrase (example 20) and other structural forms.
(15) E1 The reason behind the decision by the Import and Export Bank of China to obtain a credit rating in Japan first is to prepare for entry into the international capital market for financing, / E2 so as to expand sources of funds (Infinitive).
(18) E1 This informal meeting of heads of Eastern Asian countries, the first time in history, E2 is a good start (Non-verb).
(19) E1 The report said that 1997 was the first year of economic growth for those countries with transitioning economies since they had stopped implementing centrally planned economies, E2 the rate reaching 1.7%, E3 and estimated to grow by 3.25% for 1998 (Nominative absolute structure).
The statistical distribution of restrictive relative clauses and non-defining clauses in English corresponding subordinate clauses is given in Fig. 4. The results show that: (1) There are more non-defining clauses (358, 75.69%) than restrictive relative clauses (115, 24.31%), which indicates that English corresponding subordinate clauses are non-defining clauses in most cases. (2) Among the non-defining clauses, three categories (present participle, infinitive and prepositional phrase) account for nearly 90% of the total, while the other three structures (non-verb, past participle and nominative absolute structure) account for 11%. This indicates that English corresponding non-defining clauses are more likely to be present participle, infinitive or prepositional phrase structures than other forms.
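The percentages and ratios quoted for these distributions follow directly from the raw counts reported in the paper. The short Python sketch below recomputes them as a consistency check; the helper name `shares` and the layout of the `REPORTED` dictionary are assumptions made for illustration, while the counts and percentages are the ones stated in the text.

```python
def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each category's share of the total, in percent (2 decimals)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Counts and percentages as reported in the paper.
REPORTED = {
    "clause vs. sentence": (
        {"clause": 1631, "sentence": 561},
        {"clause": 74.41, "sentence": 25.59},
    ),
    "major vs. subordinate": (
        {"major": 1719, "subordinate": 473},
        {"major": 78.42, "subordinate": 21.58},
    ),
    "adverbial vs. attributive": (
        {"adverbial": 392, "attributive": 81},
        {"adverbial": 82.88, "attributive": 17.12},
    ),
    "non-defining vs. restrictive relative": (
        {"non-defining": 358, "restrictive relative": 115},
        {"non-defining": 75.69, "restrictive relative": 24.31},
    ),
}

for name, (counts, reported) in REPORTED.items():
    computed = shares(counts)
    larger, smaller = counts.values()
    # Tolerance-based comparison avoids relying on exact float equality.
    consistent = all(abs(computed[k] - reported[k]) < 0.005 for k in counts)
    print(f"{name}: {computed}, ratio {larger / smaller:.2f}, consistent={consistent}")
```

The first two distributions are taken over all 2192 units, while the last two are taken over the 473 subordinate clause units only, which is why their counts sum to a different total.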

Simple Clauses and Coordinate Clauses
English clauses are divided into simple clauses and coordinate clauses according to their function. In coordinate clauses, two or more English clauses perform the same function. Simple clauses can be divided into sentences (example 6), simple major clauses (example 9), simple adverbial subordinate clauses (example 11) and simple attributive subordinate clauses (example 13). Coordinate clauses can be divided into coordinate clauses (example 8), coordinate major clauses (example 10), coordinate adverbial subordinate clauses (example 12) and coordinate attributive subordinate clauses (example 14).

Distribution of Types of ECUCC
Table 1 summarizes the distribution of types of ECUCC. ECUCC are grouped in the table by frequency range (high-frequency, intermediate-frequency and low-frequency). The table shows that: (1) There are 4 high-frequency types of ECUCC (X > 10%) in the corpus, which account for 78.38% of the total distribution. Unlike the other grammatical types (except the clause group), these four types are major clause units (see Table 2). (2) There are 8 intermediate-frequency types of ECUCC (1% < X < 10%), which account for 18.93% of the total distribution, including 7 adverbial subordinate clause types (16.01%) and 1 attributive subordinate clause type (2.92%). (3) There are 13 low-frequency types of ECUCC (X < 1%), which account for 2.69% of the corpus. Apart from the clause group, the remaining 12 categories are subordinate clause units.
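The grouping into frequency ranges can be sketched as a simple thresholding step. In the Python sketch below, the function names `band` and `group_by_band` are hypothetical, and the handling of shares that fall exactly on 1% or 10% is an assumption, since the text does not state it. The type labels and counts in `toy_counts` are invented for illustration only and are not figures from Table 1; only the band totals checked at the end come from the paper.

```python
def band(share: float) -> str:
    """Map a percentage share to a frequency range.

    Assumption: exactly 10% counts as intermediate and exactly 1% as low.
    """
    if share > 10:
        return "high"
    if share > 1:
        return "intermediate"
    return "low"


def group_by_band(counts: dict[str, int]) -> dict[str, dict[str, float]]:
    """Group unit types by frequency range, keeping each type's share in percent."""
    total = sum(counts.values())
    groups: dict[str, dict[str, float]] = {"high": {}, "intermediate": {}, "low": {}}
    for label, n in counts.items():
        share = 100 * n / total
        groups[band(share)][label] = round(share, 2)
    return groups


# Invented toy counts (NOT data from Table 1), used only to exercise the code.
toy_counts = {"type_a": 120, "type_b": 61, "type_c": 18, "type_d": 1}
groups = group_by_band(toy_counts)
print(groups)

# Consistency check on the band totals reported in the paper.
reported_band_shares = {"high": 78.38, "intermediate": 18.93, "low": 2.69}
reported_band_types = {"high": 4, "intermediate": 8, "low": 13}
print(round(sum(reported_band_shares.values()), 2), sum(reported_band_types.values()))
```

The reported band shares sum to 100% across 25 types, so the three ranges cover every unit type in the corpus.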

Distribution of Grammatical Functions of English Corresponding Units
The distribution of grammatical functions of English corresponding units is given in

Conclusion and Further Research
In this paper, we annotate and present the grammatical properties of ECUCC in the Chinese-to-English clause-aligned parallel corpus, which is of great significance to Chinese-to-English translation. However, three points should be noted. (1) Chinese-to-English translation is different from English-to-Chinese translation, and the two translation directions must be distinguished during analysis. The next step of our work is to build an English-to-Chinese clause-aligned corpus, following the same basic idea as for the Chinese-to-English clause-aligned and annotated parallel corpus. (2) The grammatical properties of the Chinese clauses in the source texts are still unknown because they have not been annotated. In future work, the grammatical properties of the Chinese clauses will therefore also be annotated, and the problem of Chinese clauses will be addressed in a separate paper. (3) The Chinese-to-English clause-aligned and annotated parallel corpus is grounded in the theoretical framework of the Chinese-English discourse structure parallel corpus [8]. The grammatical annotation of ECUCC is one of the important problems from the perspective of discourse structure. In subsequent work, we will improve both corpora and expand their scale.