The Uppsala Student English corpus (USE) is a machine-readable collection
of essays from the Department of English, Uppsala University, spanning
the years 1999-2001.
USE was set up by Ylva Berglund and Margareta Westergren Axelsson
with the aim of creating a powerful tool for research
into the process and results of foreign language teaching and acquisition,
as manifest in the written English of Swedish university students.
The corpus consists of 1,489 essays written by 440 Swedish
university students of English at three different levels, the majority
in their first term of full-time studies. The total number of words is
1,221,265, which means an average essay length of 820 words. A typical
first-term essay is somewhat shorter, averaging 777 words.
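The averages quoted above follow directly from the totals reported in this manual. A quick check, as a minimal sketch in Python (the variable names are illustrative only):

```python
# Totals as reported in this manual (see Tables 1-4).
total_words = 1_221_265
total_essays = 1_489
first_term_words = 961_557
first_term_essays = 1_238

# Average essay length, rounded to the nearest whole word.
avg_all = round(total_words / total_essays)
avg_first_term = round(first_term_words / first_term_essays)

print(avg_all, avg_first_term)
```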
The essays cover set topics of different types. They were written out
of class, against a deadline of two to three weeks, with length limits
imposed (usually 700-800 words) and a suitable text structure suggested.
First-term students were admitted for both spring (January 20 - June 6)
and autumn terms (September 1 - January 19).
First-term essays:
a1. "English, my English." Students describe their experience of the
English language, evaluating their reading, writing, speaking, and listening
proficiency. Personal, involved style. Written late January or early September.
a2. Argumentation. Students argue for or against a statement concerning
a topical issue. Formal style. Written in mid-February or early October.
a3. Reflections. Students reflect on the medium of television and its
impact on people, or on related issues of their choice. Personal/formal
style. Written in March or October.
a4. Literature course assignment. Students choose between a discussion
of theme/character/narrator and a close-reading based analysis of a set
passage. Formal style. Written in early April or November.
a5. Culture course assignment. Students study topics in set secondary
sources and compose an essay using this material, often quoting and listing
these sources. Topics include issues such as 19th-century education of
women, the industrial revolution, slavery, and utopias. Written in late
April or November.
Second-term essays:
b1. Causal analysis. Students discuss causes of some recent trend of
their choice. Formal style. Suitable in content and style for comparison
or combination with essay a3.
b2. Argumentation. Students present counter-arguments to views expressed
in articles or letters to the editor. Similar in approach and tone to essay
a2.
b3. Short papers in English linguistics, on various topics, e.g. loan
words in English, English spelling, British and American English, the semantic
properties of synonymous pairs. Academic style. Lengthy tables, lists of
words, and appendices, irrelevant to the study of learner English, were
removed (the place was marked in the document). Essays may still contain
words in other languages than English, or from earlier periods of English,
items quoted directly from dictionaries, and lists of references.
b4. English literature. A discussion of character, theme etc., produced
in a survey course, dealing with Shakespeare's Julius Caesar or
contemporary novels. Essays may contain quotations, sometimes also
references to secondary sources. Academic style.
b5. American literature. Similar to b4. Essays formed part of a course
on American contemporary novels and may contain quotations and references
to secondary sources. Academic style.
In the autumn of 1999, 30 additional essays (coded b6-b8) were produced
by second-term teacher trainees, namely
b6. Taboo, not taboo. (12 essays)
b7. Politics and education. (15 essays)
b8. School visit reports. (3 essays)
Third-term essays:
c1. Collected only in the spring term, 2000. Seven longer essays, all
literature course assignments.
Tables 1-4 provide a survey of the USE corpus, tabulating
its content and size across the three years of collection, thus illuminating
the history of the corpus production. The number of words has been calculated
with the Wordlist option in WordSmith Tools, version 2.0, set to count
numbers as words, and hyphenated words as one word.
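For readers without WordSmith Tools, the counting convention described above (numbers count as words, hyphenated words count as one word) can be approximated as follows. This is only a rough sketch: the token pattern is an assumption, it does not reproduce WordSmith's actual tokenisation, and its counts may differ slightly from those in the tables.

```python
import re

# Assumption: a token is a run of letters or digits, optionally joined by
# internal hyphens or apostrophes. This only approximates WordSmith Tools.
TOKEN = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")

def count_words(text: str) -> int:
    """Count tokens, treating numbers as words and hyphenated words as one."""
    return len(TOKEN.findall(text))
```

With this pattern, "well-known" is one word and "1999" is one word.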
Table 1. Number of essays written by first-term students (a) and number of words
| Essay type | Spring 1999 | Autumn 1999 | Spring 2000 | Autumn 2000 | Spring 2001 | Autumn 2001 | Total N essays & words | Average words/essay |
| Student id | 0100-238 | 1000-1121 | 2000-75 | 3000-72 | 4000-48 | 5000-49 | | |
| Evaluation (a1), essays | 115 | 84 | 63 | 31 | 6 | 4 | 303 | |
| Evaluation (a1), words | 83,285 | 64,802 | 45,318 | 12,702 | 2,586 | 1,656 | 210,349 | 703/414 |
| Argumentation (a2), essays | 105 | 67 | 58 | 26 | 42 | 46 | 344 | |
| Argumentation (a2), words | 78,390 | 52,250 | 41,916 | 20,104 | 30,326 | 32,388 | 255,374 | 742 |
| Reflections (a3), essays | 94 | 56 | 54 | 18 | 36 | 34 | 292 | |
| Reflections (a3), words | 67,832 | 40,811 | 38,725 | 14,110 | 25,992 | 24,273 | 211,743 | 725 |
| Literature (a4), essays | 73 | 49 | 48 | 8 | 6 | 1 | 185 | |
| Literature (a4), words | 66,125 | 43,123 | 43,140 | 8,085 | 5,218 | 1,214 | 166,905 | 902 |
| Culture (a5), essays | 50 | 40 | 24 | --- | --- | --- | 114 | |
| Culture (a5), words | 51,736 | 39,867 | 25,583 | | | | 117,186 | 1,028 |
| Total N of essays | 437 | 296 | 247 | 83 | 90 | 85 | 1,238 | |
| Total N of words | 347,368 | 240,853 | 194,682 | 55,001 | 64,122 | 59,531 | 961,557 | |
| Average w/essay | | | | | | | | 777 |
Notes on Table 1:
Evaluation (a1): From 2001 these essays were limited to about
400 words, and collection for the corpus was officially discontinued. A
few essays that were nevertheless submitted have been included.
Literature (a4): Collection for the corpus was discontinued after
a sharp drop in students' interest toward the end of the autumn term, 2000.
A few essays that were still submitted have been included.
Culture (a5) was dropped from the curriculum as of autumn, 2000.
Table 2. Essays and papers written by second-term students (b) and number of words
| Text type | Autumn 1999 | Spring 2000 | Autumn 2000 | Spring 2001 | Autumn 2001 | Total N essays & words | Average words/essay |
| Student id | 0100-318 | 1000-, 1500-41 | 2000- | 3000-, 3500-25 | 4000-, 4500-8 | | |
| Causal analysis (b1), essays | 15 | 21 | 12 | 22 | 6 | 76 | |
| Causal analysis (b1), words | 11,469 | 16,479 | 8,597 | 16,762 | 4,559 | 57,866 | 761 |
| Argumentation (b2), essays | 10 | 15 | 8 | 17 | 3 | 53 | |
| Argumentation (b2), words | 9,728 | 14,386 | 5,866 | 14,730 | 2,741 | 47,451 | 895 |
| Linguistics (b3), essays | 18 | 15 | 2 | --- | --- | 35 | |
| Linguistics (b3), words | 26,662 | 26,679 | 2,996 | | | 56,337 | 1,610 |
| Literature (b4), essays | 14 | 12 | 3 | --- | --- | 29 | |
| Literature (b4), words | 17,437 | 14,131 | 3,307 | | | 34,875 | 1,203 |
| Literature (b5), essays | 6 | 8 | 5 | 2 | --- | 21 | |
| Literature (b5), words | 8,118 | 11,368 | 6,328 | 2,355 | | 28,169 | 1,341 |
| Total N of essays | 63 | 71 | 30 | 41 | 9 | 214 | |
| Total N of words | 73,414 | 83,043 | 27,094 | 33,847 | 7,300 | 224,698 | |
| Average w/essay | | | | | | | 1,050 |
Notes on Table 2:
In the last two terms, only causal analysis (b1) and argumentation
(b2) essays were requested from the students.
Table 3. Number of essays written by second-term teacher trainees
and number of words. Only from the autumn term, 1999, student id. codes
0100-238
| Text type | N of essays | N of words |
| Taboo, not taboo (b6) | 12 | 8,377 |
| Politics and education (b7) | 15 | 12,132 |
| School visit report (b8) | 3 | 3,782 |
| Total | 30 | 24,291 |
Table 4. Number of literature course essays written by third-term
students (c) and number of words. Only from the spring term, 2000, student
id. codes 0100-238, 0500-2
| Text type (c1) | N of essays | N of words |
| American literature (0140 & 0165) | 2 | 4,136 |
| English literature | 5 | 6,583 |
| Total | 7 | 10,719 |
Notes on Table 4:
The literature essays (c1) were produced in elective courses
of English and American literature. Five of the seven students taking part
also submitted essays at the lower levels. These students keep their
original codes in the range of 0100-328. The new participants have the
codes 0500-2. All essays are named with the student identification code
plus the extension 'c1'.
Each essay in USE is a separate file in plain text format.
The first line always has a begin-document tag as the only word of the
line. That tag also provides the file name of the text document (e.g. <doc.id
= 2031.a3>). An end-document tag (</doc>) is the only word on the last
line of the document. The file name shows the student identity number (2031)
followed by an extension giving the term/level of writing and the type
of essay (e.g. a3, where a = first term, 3 = essay 3, Reflections). As
shown in the tables, the first digit of the student identity code denotes
the term the student entered the project (0 = spring term, 1999; 5 = autumn
term, 2001); the following digits are only numbers marking the order in
which students volunteered.
The student identity codes thus make it possible to select essays from
a particular term, if so desired. It is also possible to follow individual
students over time, as, once a student entered the project, her/his identity
code remained the same. The extension denotes the term or level the essay
belongs to. Thus, student 2012 may have produced several essays, such as
2012.a2, 2012.a3 (both on first-term level) and 2012.b1 (second-term level).
Normally students continue their studies on consecutive terms, so that
a student beginning first-term studies in the autumn of 2000 will proceed
to second-term studies in the spring of 2001. Four students, however,
interrupted their studies, returning to take the second-term courses one
or more terms later. Such second-term essays have an "i" (for "interrupted"
period of study) added to the file extension (2012.b1i). The exact term
when the student wrote her/his second-term essays is shown in the database.
This time factor may be important to consider, if such an essay is included
in a longitudinal sub-corpus.
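The naming scheme described above can be decoded mechanically. The following Python sketch is one possible way to do so, under stated assumptions: the regular expression assumes a four-digit student code, and the mapping from first digit to term is read off Table 1. Codes in the upper half of a block (such as 1500-41 in Table 2) belong to students who entered later, so the first digit is best read as a cohort label. The names `EssayCode` and `parse_essay_code` are hypothetical and not part of the USE distribution.

```python
import re
from typing import NamedTuple

# First digit of the student code -> first-term cohort, as listed in Table 1.
COHORT = {
    "0": "spring 1999", "1": "autumn 1999", "2": "spring 2000",
    "3": "autumn 2000", "4": "spring 2001", "5": "autumn 2001",
}
LEVEL = {"a": "first term", "b": "second term", "c": "third term"}

class EssayCode(NamedTuple):
    student_id: str    # e.g. "2031"
    level: str         # e.g. "first term"
    essay_type: str    # e.g. "a3"
    interrupted: bool  # True if the extension ends in "i"
    cohort: str        # e.g. "spring 2000"

# Assumed pattern: four digits, a dot, level letter, essay number,
# and an optional "i" marking an interrupted period of study.
CODE = re.compile(r"^(\d{4})\.([abc])(\d)(i?)$")

def parse_essay_code(name: str) -> EssayCode:
    """Decode a USE file name such as '2031.a3' or '2012.b1i'."""
    match = CODE.match(name)
    if match is None:
        raise ValueError(f"not a USE file name: {name!r}")
    student, letter, number, flag = match.groups()
    return EssayCode(
        student_id=student,
        level=LEVEL[letter],
        essay_type=letter + number,
        interrupted=(flag == "i"),
        cohort=COHORT.get(student[0], "unknown"),
    )
```

For example, `parse_essay_code("2012.b1i")` yields a second-term essay of type b1 with the interrupted flag set.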
Some editing of the essays has been done: author names have been removed
(deletion marked <name>) along with other identifying information. Apostrophes
have been standardised. Formatting characters (hard line and page breaks,
extra line spacing, etc.) have been removed. Paragraph breaks (end of paragraph)
have been kept, standardised as CR + LF (return and line feed, ASCII 13,
10) to enable study of text organisation. Three spaces have been substituted
for tabs (HT, ASCII 9). Titles, if any (some essays are untitled), are
preceded by <title> and followed by </title> to enable exclusion
of titles, if desired.
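Given these conventions, an essay file can be split into its identity code, optional title, and paragraphs. The sketch below is a minimal illustration in Python, not a tool supplied with the corpus: the function name `parse_essay` is hypothetical, the sample text is invented, and the begin-document tag is matched only loosely because its exact spelling is taken from the example above.

```python
import re

DOC_ID = re.compile(r"\d{4}\.[abc]\di?")
TITLE = re.compile(r"<title>(.*?)</title>", re.DOTALL)

def parse_essay(raw: str) -> dict:
    """Split one essay file into id, title (or None) and body paragraphs."""
    # Paragraphs end in CR + LF; bare LF is tolerated in case a tool
    # converted the line endings.
    lines = [ln.strip() for ln in raw.replace("\r\n", "\n").split("\n")]
    lines = [ln for ln in lines if ln]
    if len(lines) < 2 or lines[-1] != "</doc>":
        raise ValueError("missing end-document tag")
    id_match = DOC_ID.search(lines[0])
    if id_match is None:
        raise ValueError("missing begin-document tag")
    body = "\n".join(lines[1:-1])
    title_match = TITLE.search(body)
    title = title_match.group(1).strip() if title_match else None
    body = TITLE.sub("", body)  # exclude the title from the running text
    paragraphs = [p.strip() for p in body.split("\n") if p.strip()]
    return {"id": id_match.group(0), "title": title, "paragraphs": paragraphs}

# Invented sample, not taken from the corpus.
SAMPLE = (
    "<doc.id = 2031.a3>\r\n"
    "<title>Television</title>\r\n"
    "First paragraph, written by <name>.\r\n"
    "Second paragraph.\r\n"
    "</doc>\r\n"
)
essay = parse_essay(SAMPLE)
```

Keeping the `<name>` markers in the paragraphs is a deliberate choice here; a study of word counts might prefer to strip them first.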
In connection with a grammar lecture by one of the compilers,
students were informed about the USE project, its aims and practical organisation.
They were encouraged to enrol, although on an entirely voluntary basis.
Consent to enrol and permission to use essays in the corpus were given
in writing (Appendix 1) and students also completed a questionnaire providing
information for a database (see below and Appendix 2).
All essays were written without supervision or time constraints (apart
from date deadlines), and with access to dictionaries and to written and
electronic sources for facts. As essay deadlines approached, students were
reminded to hand in electronic copies of their original essays to the USE
compilers at the same time as they submitted a printed original to their
essay tutors. Electronic copies were handed in on disk, copied into e-mails,
or provided as e-mail attachments.
The USE compilers removed the students' names and other means of identification,
converted the texts to plain text format, standardised certain items (see
above) and saved the files under the identity codes allocated during the
enrolment procedure.
A consequence of the set topics is that the essays can
be expected to represent different text types and registers, i.e. they
exhibit different levels of formality, certain kinds of vocabulary etc.
This is why essay type (numbered a1-a5, b1-b8, and c1) rather than
the term of production has been chosen as the main principle of
organisation in the final version of the corpus, which facilitates grouping
of similar texts, in order to obtain larger samples. It also makes it possible
to compare similar text types on different levels. Table 5 shows the different
types of essays. Evidently, some essays represent similar text types on
different levels of proficiency. In terms of formality level and general
topic areas, argumentation and discussion essays are related, usually dealing
with topical subjects in society. This means that two large categories
of essays can be discerned, one about matters of interest in society and
the other about literature. Evaluation, culture and linguistics can be
seen as more specialised categories.
Table 5. Essay types in the USE corpus and how they are interrelated
| | First term (a) | Second term (b) | Third term (c) |
| Evaluation | a1 | b8 | |
| Argumentation | a2 | b2 | |
| Discussion | a3 | b1, b6, b7 | |
| Literature | a4 | b4, b5 | c1 |
| Culture | a5 | | |
| Linguistics | | b3 | |
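When sub-corpora are assembled by text category, the relations in Table 5 can be kept as a simple lookup. The Python sketch below encodes the table as given; the names `ESSAY_TYPES` and `folders_for` are illustrative and not part of the corpus distribution.

```python
# Table 5 as a mapping: text category -> essay types per level (a, b, c).
ESSAY_TYPES = {
    "Evaluation":    {"a": ["a1"], "b": ["b8"], "c": []},
    "Argumentation": {"a": ["a2"], "b": ["b2"], "c": []},
    "Discussion":    {"a": ["a3"], "b": ["b1", "b6", "b7"], "c": []},
    "Literature":    {"a": ["a4"], "b": ["b4", "b5"], "c": ["c1"]},
    "Culture":       {"a": ["a5"], "b": [], "c": []},
    "Linguistics":   {"a": [], "b": ["b3"], "c": []},
}

def folders_for(category: str) -> list[str]:
    """Return every essay-type folder for one category, across all levels."""
    levels = ESSAY_TYPES[category]
    return [code for level in "abc" for code in levels[level]]

# All essay types covered by the table.
all_types = [code for category in ESSAY_TYPES for code in folders_for(category)]
```

For instance, `folders_for("Literature")` lists the folders needed to compare literature essays on all three levels.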
Student background information
All 440 students in the USE project filled in a questionnaire,
answering questions about themselves: their first language, their
parents' first languages, grades in English, previous studies, exposure
to English, etc. This information is coded in a Microsoft Excel database
(see below). Gaps in the database arise because some students overlooked
the second page of the questionnaire or chose not to answer all the
questions, and because a shortened, simplified form of the questionnaire
was used in the last two terms.
USE consists of three separate parts, shown in Figure 1.
| USE corpus | USE database | USE manual |
| 1,489 essays in plain text format, organised in 14 essay type categories (each contained in one folder marked a1, a2, etc.). Untagged text. For number of files, see Tables 1-4. | Information about the 440 students coded in a Microsoft Excel file. | Detailed information about the corpus and the database in a Microsoft Word file. |
Figure 1. USE resources
As yet, there is no special interface or search engine
created for the USE corpus. Most studies conducted on the material have
been carried out with the software program WordSmith Tools, version 3.0
or earlier.
Depending on the research question, the corpus can be
used in different ways. If the investigated feature is expected to be frequent,
a smaller sample of texts may be sufficient. If the investigation deals
with a feature sensitive to register variation, it is important to choose
essay type(s) suitable for the purpose.
Comparisons of Swedish students' English can be made with standard corpora
of authentic written English or with other corpora of learner English (see
Pravec 2002). Internal comparisons of samples of essays in the corpus may
also be of interest, for instance, to see to what extent a syntactic construction
or lexical unit is mastered on different levels of study (see Axelsson
and Berglund 2002).
The USE database provides an overview of the resources
available. By using the sorting or filter functions in Excel, one can easily
identify the essays relevant for a specific research question and then
sample the selected essays from the corpus.
The information in the database makes it possible to study the progression
of individual students who have submitted several essays over time. This
can mean several essays in one term, two terms or three terms. The number
of essays in each student’s production varies from one to eleven. About
sixty students have handed in varying numbers of essays on both the first-
and second-term levels, five of these even on the third-term level.
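Students who can be followed over several levels can also be identified from the file names alone. The sketch below, in Python, groups file names by student code and keeps those with essays on every required level. The function names and the sample file names are invented for illustration.

```python
from collections import defaultdict

def students_by_level(file_names):
    """Map each student code to the set of levels (a, b, c) written on."""
    levels = defaultdict(set)
    for name in file_names:
        student, _, extension = name.partition(".")
        if not extension:
            continue  # skip anything that is not of the form code.extension
        levels[student].add(extension[0])
    return levels

def longitudinal(file_names, required=("a", "b")):
    """Return student codes with at least one essay on every required level."""
    levels = students_by_level(file_names)
    return sorted(s for s, seen in levels.items() if set(required) <= seen)

# Invented file names, used only to show the call.
names = ["2012.a2", "2012.a3", "2012.b1i", "2031.a3",
         "0140.a1", "0140.b4", "0140.c1"]
both_levels = longitudinal(names)
all_three = longitudinal(names, required=("a", "b", "c"))
```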
The following variables are coded in the database:
Column A. Student identity code. The code was given when the
student first entered the project and then followed him/her during all
the terms of participation. Table 1 shows the codes allotted in each
first term. Students enrolling for the first time during the second-term
programme have their own set of identity codes (see Table 2 and the database).
The majority of students who contributed both first- and second-term essays
did so on consecutive terms. As mentioned above, four students submitted
second-term essays (b) after a break of one term or more. These essays
are marked "i" for "interrupted" and their term of production is given
in column AJ.
Columns B-O. Essays submitted. Each essay type has its own column.
Column Q. Sex. Female (f) or male (m).
Column R. Age.
Column S. Year of birth.
Column T. Course=programme. A1, B1, C1 = general programme. A2,
B2, A4, B4 and A6 = programmes for teacher trainees (A2, B2: upper secondary
level; A4, B4: school years 4-9, and A6: school years 1-7).
Column U. Mother tongue, defined as "language spoken at home".
Abbreviations are self-explanatory, for example, sw=Swedish, fi=Finnish,
nor=Norwegian, spa=Spanish, ger=German. If less than obvious, "ot" = "other"
is given and specified in column AJ.
Column V. Mother's first language.
Column W. Father's first language.
Column X. Answers the question "How many years have you studied
English at school?"
Column Y. Answers the question "When did you first go to university?"
Column Z. Answers the question "What was your grade in English
in Swedish upper secondary/high school?" Several changes in the Swedish
grading system explain the variation in the data:
A few, older students have grades from a system ranging across A, a,
AB, Ba, B, B? (with A as the best, very unusual grade, and B? a bare pass).
Many have grades 5, 4 and 3 (5 being the best, and 3 or better the requirement
for English at university level).
The most recent system comprises several programmes in English at upper-secondary
level: a minimum, a standard and a supplementary course, each graded MVG,
VG and G - excellent, pass with distinction, pass. G on the standard course
is required for university studies. Single grades refer to the standard
course, double grades to standard course/supplementary course.
Column AA. Grade in Swedish. See above.
Column AB. University credits (points) in language studies at a Swedish
university. The figure entered = weeks of study. Linguistics has been
counted in this category.
Column AC. University credits (points) in language studies at a university
abroad. The figure entered = weeks of study.
Column AD. University credits (points) in other subjects than languages
at a Swedish university. The figure entered = weeks of study.
Column AE. University credits (points) in other subjects than languages
at a university abroad. The figure entered = weeks of study.
Column AF. Worked in an English-speaking country. The figure
entered = months.
Column AG. Studied in an English-speaking country. The figure
entered = months. A college year in the United States, and earlier periods
of school or summer schools in some English-speaking country are coded
here.
Column AH. Total time spent in an English-speaking environment,
broadly defined as "where English is used every day, abroad or in Sweden".
The figure entered = months.
Column AI. Answers the question "Is there anything in particular
that has affected your command of English?" A common answer refers to visits
to English-speaking countries and has been coded as "Stay in Eng", regardless
of country.
Column AJ. Clarification as to the first language, term of resumed
studies on the second-term level, foreign grades, and other pertinent information
offered.
An empty cell means that the student did not answer the question.
For the last two terms of the USE project, the questionnaire was simplified,
which explains the empty columns AF, AG and AI.
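When the database is processed outside Excel, the column letters used above have to be turned into positions. The Python sketch below converts a column label to a zero-based index and filters rows on one column. It assumes the sheet has been exported to rows of cells in the original column order; the helper names are hypothetical.

```python
def column_index(label: str) -> int:
    """Convert an Excel column label (A ... Z, AA, AB, ...) to a 0-based index."""
    if not label:
        raise ValueError("empty column label")
    index = 0
    for ch in label.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column label: {label!r}")
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1

def rows_where(rows, label, value):
    """Yield the rows whose cell in the given column equals value."""
    i = column_index(label)
    for row in rows:
        if len(row) > i and row[i] == value:
            yield row
```

For example, `rows_where(rows, "U", "sw")` would select the students whose mother tongue is coded as Swedish in column U.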
Acknowledgement of the origin of USE
Anyone using USE for research is obligated to acknowledge
the source of data as follows: USE = Uppsala Student English corpus, compiled
by Margareta Westergren Axelsson and Ylva Berglund, the Department of English,
Uppsala University, 1999-2001.
The corpus can be used for research and educational purposes.
It can be accessed on the Internet from the Oxford Text Archive at http://www.ota.ahds.ac.uk/.
For Uppsala students and researchers, it is also available on a CD at the
Department of English (Professor Merja Kytö and Senior lecturers supervising
language project work).
About USE and other learner corpora
The two corpus compilers and students at the Department
of English have used material from the corpus for investigations. The titles
of studies finished so far are given in the list below.
Axelsson, Margareta Westergren (1999) 'Project USE (Uppsala Student English),'
ASLA Information 25:2, 25-6.
Axelsson, Margareta Westergren (2000) 'USE - The Uppsala Student English
Corpus: An instrument for needs analysis,' ICAME Journal 24:155-7.
Available online at http://nora.hd.uib.no/icame/ij24/.
Axelsson, Margareta Westergren (2000) 'The use of a corpus of students'
written production in university English teaching,' in Gunilla Byrman,
Hans Lindquist and Magnus Levin (eds.) Corpora in research and teaching:
Papers from the ASLA symposium on corpora in research and teaching, Växjö,
11-12 November 1999. ASLA:s skriftserie 13. 293-303.
Axelsson, Margareta Westergren and Angela Hahn (2001) 'The use of the
progressive in Swedish and German advanced learner English - a corpus-based
study,' ICAME Journal 25:5-30. Available online at http://nora.hd.uib.no/icame/ij25/.
Axelsson, Margareta Westergren and Ylva Berglund (2002) 'The Uppsala Student
English Corpus (USE): A multi-faceted resource for research and course
development,' in Lars Borin (ed.) Parallel corpora, parallel worlds.
Selected papers from a symposium on parallel and comparable corpora at
Uppsala University, Sweden, 22-23 April, 1999. Amsterdam: Rodopi. 79-90.
Berglund, Ylva and Oliver Mason (2002) 'The influence of external factors
on learner performance,' in Bernhard Kettemann and George Marko (eds.)
Teaching and learning by doing corpus analysis. Amsterdam: Rodopi. 205-215.
Borin, Lars and Klas Prytz, '"New wine in old skins?" A corpus investigation
of L1 syntactic transfer in learner language.' Poster at TALC 2002, Fifth
International Conference on Teaching and Language Corpora, 27-31 July,
2002. Bertinoro, Italy.
Granger, Sylviane (ed.) (1998) Learner English on computer. London:
Longman.
Mason, Oliver and Ylva Berglund (2002) 'Low-level parameters reflecting
the naturalness of text,' in Conference publication of JADT 2002, 6th International
Conference on the Statistical Analysis of Textual Data. Saint-Malo, France.
Mason, Oliver and Ylva Berglund, '"But this formula doesn't mean anything!"
- some reflections on parameters of texts and their significance.' Forthcoming
in Festschrift for Geoffrey Leech (Peter Lang).
Pravec, Norma A. (2002) 'A survey of learner corpora,' ICAME Journal
26:81-114. Available online at http://nora.hd.uib.no/icame/ij26/.
Students' third-term (C) and fourth-term (D) papers
(unpublished, filed by the Department of English)
Blomberg, Karin (2000) Swedish learners' use of the progressive aspect
in English.
Eiman, Carin (2000) Adjectives and attitudes: A linguistic study of how
male and female students use adjectives when describing their knowledge
of and proficiency in the English language.
Hellén, Christina (2001) Swedish students' use of hypothetical conditional
sentences in English. (D-course)
Linerstad, Andrea (2002) The development of students' skills in handling
S-V concord.
Svensson, Jenny (2001) Noun compound, noun-compound or nouncompound?
Three different constructions of a noun+noun compound. (D-course)
Appendix 1. Uppsala Student English Project (USE)
Projektet syftar till att skapa en korpus (datorläsbar
textsamling) bestående av material producerat av studenter. Korpusen
kommer att användas för forskning, undervisning och läromedelsframställning.
Att deltaga i projektet är frivilligt och de deltagande kommer att
vara anonyma (namnen avlägsnas från korpusen). Frågor
besvaras av projektledarna, Margareta Westergren Axelsson och Ylva Berglund.
The project aims at creating a corpus (computer-readable collection
of texts) of material submitted by students. The corpus will be used for
language research, teaching, and production of teaching material. All contribution
to the project is voluntary and anonymity will be maintained by the removal
of participants' names in the corpus. Questions will be answered by the
project coordinators, Margareta Westergren Axelsson and Ylva Berglund.
Medgivande
Consent form
Jag ger 'The Uppsala Student English Project' rätten att använda
det jag lämnar till projektet för forskning, undervisning, läromedelsframställning,
publicering och presentation (tex. på konferenser och seminarier).
Jag godkänner att mitt material eventuellt publiceras (helt eller
delvis) i någon form, tex. elektroniskt eller i tryck.
I hereby give to 'The Uppsala Student English Project' the right to
use the material I submit to the project for research, teaching, publication
and presentation (conferences, workshops, etc.). I consent to the possible
publication of my material (as a whole or in part) in various forms, including
paper and electronic media.
Namnteckning / signature
....................................................................................................
Namnförtydligande /printed name
.........................................................................................
Datum / date ...........................
Appendix 2. Background data for corpus of student English
Name
(in the final corpus, all contributors will be anonymous, with only
a code for identification):
.................................................................................................................
Sex: [ ] female [ ] male
Year of birth: 19...........
Mother tongue (what language do you speak at home? ):
[ ] Swedish [ ] English [ ] Other,
namely ...........................................................
Mother tongue of parents
a) mother
[ ] Swedish [ ] English [ ] Other, namely
............................................................
b) father
[ ] Swedish [ ] English [ ] Other, namely
............................................................
How many years have you studied English at school? ...................
What year did you first go to university? 19............
Have you taken any previous language courses at university level ?
[ ] no [ ] yes (please specify language and points, for
example French 20 p, Russian 5 p, etc.):
.................................................................................................................
.................................................................................................................
Have you taken any other courses at university?
[ ] no [ ] yes (please specify course and
points, for example Economics 20 p, Law 5 p, etc.):
.................................................................................................................
.................................................................................................................
.................................................................................................................
Have you studied/worked abroad? (See also the following question.)
[ ] no [ ] yes (please specify country, type of activity,
length):
.................................................................................................................
.................................................................................................................
.................................................................................................................
How much time have you spent in an English-speaking environment (where
English was used every day), abroad or in Sweden? Please specify where,
for how long and to what extent if possible (for example prolonged stay
in an English-speaking country, long holidays and travels abroad, work
in an international environment):
.................................................................................................................
.................................................................................................................
.................................................................................................................
What was your grade in English in Swedish upper secondary/high school?
.........................
What was your grade in Swedish (language) in upper secondary/high school?
.....................
Is there anything in particular you feel has affected your command of
English?
[ ] no [ ] yes (please specify):
..................................................................................