Comparable and translation corpora in cross-linguistic research
Design, analysis and applications
Sylviane Granger
Centre for English Corpus Linguistics
Université catholique de Louvain
1. Introduction
The history of Contrastive Linguistics has been characterized by a pattern of success-decline-
success. Contrastive Linguistics (CL) was originally a purely applied enterprise, aiming to
produce more efficient foreign language teaching methods and tools. Based on the general
assumption that difference equals difficulty, CL, which in those days was called Contrastive
Analysis (CA), consisted in charting areas of similarity and difference between languages and
basing the teaching syllabus on the contrastive findings. Advances in the understanding of
Second Language Acquisition (SLA) mechanisms led to a questioning of the very basis of
CA. Interlingual factors were found to be less prevalent than other factors, including
intralingual mechanisms such as the overgeneralization of target rules and external factors
such as the influence of teaching methods or personal factors like motivation. This led to the
decline of CA, but not to its death. At first, it gave rise to some drastic pedagogical decisions,
which in some cases culminated in a total ban of the mother tongue in FL teaching. But
research (see Odlin 1989, Selinker 1992, James 1998) re-established transfer as a major – if
not the major – factor in SLA, which in turn led to a progressive – albeit limited – return of
contrastive considerations in teaching. More importantly, the questioning of the contrastive
approach to FL teaching did not impede its extension to other fields. The globalisation of
society led to an increased awareness of the importance of interlingual and intercultural
communication and played a major role in the revival of CL. Another factor which helped
boost contrastive studies was the emergence and rapid development of corpus linguistics and
natural language processing, which are increasingly focusing on cross-linguistic issues. Large
bilingual corpora gave contrastive linguists and NLP specialists a much more solid empirical
basis than had ever been previously available. Previous research had been largely intuition-
based. Vinay & Darbelnet (1958/1995) and Malblanc (1968) are well-known exemplars of
this type of approach. As the authors had an excellent knowledge of the languages they
compared, these books contain a wealth of interesting contrastive statements. However,
intuitions can be misleading and a few striking differences can lead to dangerous over-
generalisations. For instance, the absence in English of connectors corresponding to the
French ‘or’ or ‘en effet’ has led to the general conclusion that French favours explicit linking
while English tends to leave links implicit (Vinay & Darbelnet 1958: 222, Newmark 1988:
59, Hervey & Higgins 1992: 49). Like many others, this contrastive claim still awaits
empirical investigation. Contrastive linguists now have a way of testing and quantifying
intuition-based contrastive statements in a body of empirical data that is vastly superior – both
qualitatively and quantitatively – to the type of contrastive data that had hitherto been
available to them.
The domain of Translation Studies (TS) underwent a similar corpus-based trend in the early
1990s under the impetus of Mona Baker, who laid down the agenda for corpus-based TS (Baker
1993, 1995) and started collecting corpora of translated texts with a view to uncovering the
distinctive patterns of translation. Her investigations brought to light a number of potential
‘translation universals’ (Baker 1993) which further corpus studies are helping to confirm or
disprove (see Puurtinen 2007). Researchers in both CL and TS have thus come to rely on
corpora to verify, refine or clarify theories that hitherto had had little or no empirical support
and to achieve a higher degree of descriptive adequacy.
Section 2 gives an overview of the types of corpus used in cross-linguistic studies and
suggests a unified terminology. Section 3 presents the different types of corpus-based
comparison and section 4 highlights the respective advantages and disadvantages of bilingual
comparable vs translation corpora. Section 5 gives a brief overview of some of the
applications of corpus-based cross-linguistic research and the last section offers some
concluding remarks.
2. Corpora in cross-linguistic research
In the corpus, scholars of contrastive linguistics and translation studies now have a common
resource. Unfortunately, as in many new scientific fields, the terminology has not yet been
firmly established, leading to a great deal of confusion.
Contrastive linguists distinguish between two main types of corpus for use in cross-linguistic
research:
- corpora consisting of original texts in one language and their translations into one or
more languages – let us call these translation corpora;
- corpora consisting of original texts in two or more languages, matched by criteria such
as the time of composition, text category, intended audience, etc. – let us call these
comparable corpora. (Johansson & Hasselgård 1999).
It should be noted, however, that even among contrastive linguists the terminology is not
entirely consistent. The term parallel corpus is sometimes used to refer to a comparable
corpus (Aijmer et al 1996: 79, Schmied & Schäffler 1996: 41), a translation corpus (Hartmann
1980: 37) or a combined comparable/translation corpus (Johansson et al 1996). TS researchers,
on the other hand, use the terms translation corpus, parallel corpus and comparable corpus to
cover different types of texts. The term comparable corpus is used to refer to ‘two separate
collections of texts in the same language: one corpus consists of original texts in the language
in question and the other consists of translations in that language from a given source
language or languages’ (Baker 1995: 234). The term translation (or translational) corpus is
used to refer to the corpus of translated texts (see Baker 1999 and Puurtinen 2007). While in
standard CL terminology, comparable corpora are usually multilingual (comparable original
texts in different languages), in TS terminology they are usually monolingual (original and
translated texts in the same language). Within the TS framework the term parallel corpus
usually refers to ‘corpora that contain a series of source texts aligned with their corresponding
translations’ (Malmkjaer 1998: 539), in other words what contrastive linguists usually refer to
as translation corpora.
Over and above the terminological difference, there is a more fundamental discrepancy
between the two cross-linguistic approaches. In the TS framework, translated texts are
considered as texts in their own right, which are analysed in order to “understand what
translation is and how it works” (Baker 1993: 243). In the CL framework they are often
presented as unreliable as the cross-linguistic similarities and differences that they help
establish may be ‘distorted’ by the translation process, i.e. may be the result of interference
from the source texts.
Faced with the terminological diversity that characterises current cross-linguistic research, I
feel that unified terminology is desirable and would like to suggest the general typology
illustrated in Figure 1.
Corpora in cross-linguistic research
  Multilingual
    Translation
      Unidirectional
      Bidirectional
    Comparable
      Original texts
      Translated texts
  Monolingual
    Comparable
      Original and translated texts
      Native and learner texts
Figure 1: Corpora in cross-linguistic research
In this typology, a primary distinction is made between multilingual and monolingual corpora.
Multilingual corpora involve more than one language. They may be of two main types: (a)
translation corpora (which contain source texts and their translations and may be
unidirectional – from language X to language Z – or bi/multidirectional) and (b) comparable
corpora (which contain non-translated or translated texts of the same genre). The monolingual
corpora relevant for cross-linguistic research are all comparable corpora. They may contain
(a) original and translated texts in one and the same language or (b) native and learner texts
in one and the same language1. In this typology, the term parallel corpus is not used in view
of its ambiguity in the literature, where it has been used to refer to corpora of source texts and
their translations, comparable corpora or as a generic term to refer to any type of multilingual
corpus (Teubert 1996: 245).
This diagram does not include the many extralinguistic features that influence the data and
therefore need to be carefully recorded, such as the translator’s status (professional or student)
or the direction of the translation process (into the translator’s mother tongue or not).
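For researchers who catalogue their corpora electronically, the typology and the extralinguistic features just mentioned could be recorded in a small metadata structure. The following Python sketch is only an illustration: the class names, field names and validation rules are assumptions made for this example and do not belong to any existing corpus tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Scope(Enum):
    MULTILINGUAL = "multilingual"
    MONOLINGUAL = "monolingual"


class Kind(Enum):
    TRANSLATION = "translation"
    COMPARABLE = "comparable"


@dataclass(frozen=True)
class CorpusDescription:
    """Hypothetical metadata record following the typology in Figure 1."""
    name: str
    scope: Scope
    kind: Kind
    content: str                                # e.g. "original texts"
    bidirectional: Optional[bool] = None        # translation corpora only
    translator_status: Optional[str] = None     # "professional" or "student"
    into_mother_tongue: Optional[bool] = None   # direction of translation

    def __post_init__(self):
        # In the typology, monolingual corpora are all comparable corpora.
        if self.scope is Scope.MONOLINGUAL and self.kind is not Kind.COMPARABLE:
            raise ValueError("monolingual corpora must be comparable corpora")
        # Directionality is only defined for translation corpora.
        if self.kind is Kind.COMPARABLE and self.bidirectional is not None:
            raise ValueError("directionality applies to translation corpora only")


# Invented example entry, for illustration only.
example = CorpusDescription(
    name="toy English-to-French translation corpus",
    scope=Scope.MULTILINGUAL,
    kind=Kind.TRANSLATION,
    content="source texts and their translations",
    bidirectional=False,
    translator_status="professional",
    into_mother_tongue=True,
)
```

Rejecting combinations that the typology does not allow at construction time is one possible design choice; a project could equally record them and flag them for manual review.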
3. Types of corpus-based comparison
With these different corpus types, a variety of comparisons can be undertaken. Table 1
presents an overview of the different types of cross-linguistic comparison and the disciplines
within which they are undertaken (see also Johansson 2007a).
1 For a description of this special type of contrastive research called Contrastive Interlanguage Analysis, see
Granger 1996 and Gilquin 2000/2001.
Type of comparison   Type of corpus                                                   Discipline
1. OLx ⇔ OLy         Multilingual comparable corpus of original texts                 CL
2. SLx ⇔ TLy         Multilingual translation corpus                                  CL & TS
3. SLx ⇔ TLx         Monolingual comparable corpus of original and translated texts   TS & CL
4. TLx ⇔ TLy         Multilingual comparable corpus of translated texts               TS
OL = original language
SL = source language
TL = translated language
Table 1: Types of corpus-based cross-linguistic comparison
The first type of comparison, between corpora of original texts in different languages (x and
y), is the CL domain of expertise par excellence. However, there is a growing awareness
among TS researchers of the interest of this type of research for translation studies. The
second type of comparison is the most obvious meeting point between CL and TS.
Researchers in both fields use the same resource but to different ends: uncovering differences
and similarities between two (or more) languages for CL and capturing the distinctive features
of the translation process and product for TS. The third type of comparison, which contrasts
original and translated varieties of one and the same language, is the ideal method for
uncovering the distinctive features of translated texts and hence seems at first sight to fall
exclusively within TS. However, this type of comparison is increasingly being used by CL
researchers who interpret differences between OL and TL as indirect evidence of differences
between the languages involved (see Johansson & Hasselgård 1999 and Johansson 2007a).
Finally, the comparison of translated varieties in different languages is quite clearly the
prerogative of TS. However, it is essential that contrastive linguists pay attention to this type
of study. Failing to properly understand the nature of translated texts might lead them to
attribute some difference between OL and TL to interference from OL when in fact the
phenomenon may simply be a manifestation of a translation universal.
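The four comparison types can be read as a simple decision over two properties of each sub-corpus: its language and whether its texts are original or translated. The Python sketch below makes that reading explicit. The names `SubCorpus`, `comparison_type` and `DISCIPLINE` are hypothetical, and for type 2 the function assumes, without checking, that the translated sub-corpus contains translations of the original one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubCorpus:
    language: str     # e.g. "en", "sv"
    translated: bool  # False = original texts, True = translated texts


# Disciplines associated with each comparison type in Table 1.
DISCIPLINE = {1: "CL", 2: "CL & TS", 3: "TS & CL", 4: "TS"}


def comparison_type(a: SubCorpus, b: SubCorpus) -> int:
    """Return the comparison type (1-4) for two sub-corpora."""
    same_language = a.language == b.language
    if a.translated == b.translated:
        if same_language:
            # Same language and same status: none of the four types applies.
            raise ValueError("not a cross-linguistic comparison")
        # Both original -> type 1; both translated -> type 4.
        return 4 if a.translated else 1
    # Exactly one sub-corpus is translated.
    # Same language -> type 3; different languages -> type 2
    # (assuming the translations derive from the original sub-corpus).
    return 3 if same_language else 2


english_originals = SubCorpus("en", translated=False)
swedish_originals = SubCorpus("sv", translated=False)
swedish_translations = SubCorpus("sv", translated=True)

kind = comparison_type(swedish_originals, swedish_translations)
print(kind, DISCIPLINE[kind])
```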
4. Advantages and disadvantages of bilingual comparable and translation corpora
Table 2 summarizes the advantages and disadvantages of the two main types of multilingual
corpus: the comparable corpus and the translation corpus. It appears clearly from the table that
what constitutes an advantage for one type of corpus constitutes a disadvantage for the other
and vice versa.
+ / -   Translation corpora                          Comparable corpora
+       Text type comparability                      Wide availability of texts
        L1-L2 equivalence                            Original language
                                                     (reliable frequency and use)
-       Limited availability of texts                Text type comparability
        Translated language                          L1-L2 equivalence
        (translationese & translation universals)
Table 2: Bilingual translation vs comparable corpora
AVAILABILITY
The most easily accessible corpora for cross-linguistic research are undoubtedly comparable
corpora of original languages. English is particularly well equipped with large balanced
corpora such as the British National Corpus or the Bank of English. For other languages, there
are electronic text collections, notably newspaper archives, that are regularly used for cross-
linguistic research, but they tend to be less representative than the English mega corpora. Less
widespread languages may not have any corpus resources at all or access to them may be
severely limited. As regards translation corpora, however, electronic resources are scarce. It is
not always possible to find translations of all texts, either because of the text type – letters and
e-mail messages, for instance, are not usually translated – or because there are more
translations in one direction (English to Chinese, for instance) than in another (Chinese to
English). Available translation corpora tend to include older, copyright-free texts (cf. Project
Gutenberg2, which contains c. 30,000 free books) or, alternatively, highly specialised texts
such as documents from the European Union or the World Health Organization, the
disadvantage of which is that it is often impossible to determine the source and target
languages, a major variable for both CL and TS studies. While we are witnessing a rapid
growth in the number of bilingual (and multilingual) resources, some of which can even be
explored online, many high quality resources remain inaccessible to the academic community.
This is the case, for instance, of the excellent English-Norwegian and English-Swedish
corpora, which are only available to a limited group of researchers because of copyright
restrictions.
TEXT TYPE COMPARABILITY
Translation corpora are an ideal resource for establishing equivalence between languages
since they convey the same semantic content and are pragmatically and textually comparable
(cf. James 1980: 178). In the case of comparable corpora, however, it is much more difficult
to ensure text type comparability. Some types of text are culture-specific and simply have no
exact equivalent in other languages. For example, when compiling the Lancaster Corpus of
Mandarin Chinese (LCMC), McEnery & Xiao (2004) designed the corpus as an exact replica
of the FLOB corpus to ensure comparability of the data. However, they encountered some
difficulty, notably with the category of ‘western and adventure fiction’, which has no exact
correspondent in Chinese and which they decided to replace by a category of ‘martial arts
fiction’.
L1-L2 EQUIVALENCE
Cross-linguistic comparison requires a “common platform of comparison” (Connor & Moreno
2005), a “background of sameness” (James 1980: 169) against which differences can be
described. This constant, which is usually referred to as the tertium comparationis3 (TC), is
relatively easy to establish in the case of translation corpora but constitutes a major stumbling
block in the case of comparable corpora. In translation corpus studies, the TC is the
relationship between a unit in the source language and its translation in the target language,
2 Cf. http://www.gutenberg.org
3 The term tertium comparationis has been used in a wide range of meanings in the contrastive literature. Connor
& Moreno (2005), for instance, use the term TC for all levels of research, including the selection of corpora.
viz. translation equivalence. For example, in Aijmer’s (1999) study of epistemic modality in
English and Swedish, the TC is the relationship between the English modal verb may and the
corpus-attested equivalents in Swedish (modal verbs, modal adverbs or a combination of the
two). With comparable corpora, however, there is no readily available tertium comparationis.
And yet, researchers need to establish one if they want to make sure that they will compare
like with like. As regards grammar, James (1980: 167) reminds us that “the fact that we use
the labels ‘tense’ or ‘articles’ to refer to a certain grammatical category in two different
languages should not be taken to mean that we are talking about the same thing”. It is
therefore necessary to establish a basis for comparison. However, James (ibid: 168) hastens to
point out that “comparability does not presuppose absolute identity, but merely a degree of
shared similarity”. In the case of articles, the TC could be “a small class of function words
that occur in pronominal position and seem to indicate the specificness or genericness of the
noun” (ibid: 168). This is a thorny issue whatever the languages involved but the problem is
particularly acute in the case of very different language systems, such as English and Chinese
(cf. McEnery & Xiao’s 1999 comparison of aspect marking in English and Chinese). It is all
the more important to establish a clear TC in areas such as phraseology where units such as
idioms or collocations tend to be ill-defined.
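When translation equivalence serves as the TC, one basic operation is to tally the corpus-attested equivalents of a source item and express them as proportions. The Python sketch below shows this operation under stated assumptions: the function name `equivalent_profile` is hypothetical, the input is assumed to be a list of already aligned and annotated pairs, and the toy data are invented for illustration. They are not results from Aijmer (1999) or any other study.

```python
from collections import Counter


def equivalent_profile(aligned_pairs, source_item):
    """Return the proportion of each equivalent category for source_item.

    aligned_pairs: iterable of (source_item, equivalent_category) tuples,
    assumed to come from a manually checked alignment.
    """
    counts = Counter(
        category for item, category in aligned_pairs if item == source_item
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Invented toy data: each pair links an English modal to the category of
# its equivalent in the aligned target sentence.
toy_pairs = [
    ("may", "modal verb"),
    ("may", "modal verb"),
    ("may", "modal adverb"),
    ("may", "modal verb + modal adverb"),
    ("must", "modal verb"),
]

profile = equivalent_profile(toy_pairs, "may")
for category, share in sorted(profile.items()):
    print(f"{category}: {share:.2f}")
```

No comparable shortcut exists for comparable corpora, where the categories to be counted in each language have to be defined independently before any tally is meaningful.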
RELIABILITY OF LANGUAGE
Comparable corpora have the major advantage of representing original texts in the two (or
more) languages under comparison, i.e. language spontaneously produced by native speakers
of those languages. They are therefore in principle free from the influence of other languages4
and hence arguably more reliable, especially to assess frequency and patterns of use.
Translation corpora, on the other hand, display two main types of features that mark them off
from original texts. On the one hand, they often contain features of what is usually referred to
as ‘translationese’, i.e. “deviance in translated texts induced by the source language” (Johansson
& Hofland 1994:26).5 On the other hand, they also display universal features, i.e. “features
which typically occur in translated text rather than original utterances and which are not the
result of interference from specific linguistic systems” (Baker 1993: 243). Gellerstam (1986)
gives ample lexical evidence of translationese in translated Swedish. The main characteristics he
lists are: a higher proportion of English loanwords, fewer colloquialisms, a higher frequency of
standard ‘press-the-button’ translations of English words, and international words such as lokal,
massiv, drastic used with new shades of meaning (for further examples of translationese, see
Borin & Prütz 2001, Frankenberg-Garcia 2008, Wang & Qin 2008). In an interesting article,
Rayson et al (2008) show how translationese can be detected fully automatically by
comparing the frequencies of words and phrases in three ICT (Information and
Communications Technology) corpora: a corpus of original Chinese texts, a corpus of
translations of these texts into English by a proficient Chinese translator and a corpus of
edited English, containing the versions of the Chinese translations corrected by a native
speaker of English. The authors focus on multiword units and uncover interesting differences,
4 This is obviously not entirely true. Newspaper texts, for example, have often been found to contain traces of the
(usually) English texts on which the journalists have based their articles.
5 The term ‘translationese’ is used in a range of meanings in contrastive and translation studies. It can be used in
a neutral sense to refer to any source language-related feature that distinguishes translated language from original
ack
excludes
language or in a clearly negative sense to refer to f