Department of English

Uppsala Student English Corpus (USE)

The Uppsala Student English corpus (USE) is a machine-readable collection of essays from the Department of English, Uppsala University, spanning the years 1999-2001.

Aim

USE was set up by Ylva Berglund and Margareta Westergren Axelsson with the aim of creating a powerful tool for research into the process and results of foreign language teaching and acquisition, as manifest in the written English of Swedish university students.

Contents

The corpus consists of 1,489 essays written by 440 Swedish university students of English at three different levels, the majority in their first term of full-time studies. The total number of words is 1,221,265, which means an average essay length of 820 words. A typical first-term essay is somewhat shorter, averaging 777 words.
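
The totals and averages quoted here can be cross-checked against the per-table totals reported in Tables 1-4. The following is only a small arithmetic sketch; the variable names are illustrative and the figures are copied from this manual.

```python
# Essay totals as reported in Tables 1-4 of this manual.
essays_per_table = {
    "first term (Table 1)": 1238,
    "second term (Table 2)": 214,
    "teacher trainees (Table 3)": 30,
    "third term (Table 4)": 7,
}
total_words = 1_221_265      # corpus total stated in the text
first_term_words = 961_557   # total number of words in Table 1

total_essays = sum(essays_per_table.values())
average_all = round(total_words / total_essays)
average_first_term = round(
    first_term_words / essays_per_table["first term (Table 1)"]
)

print(total_essays, average_all, average_first_term)  # 1489 820 777
```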

The essays cover set topics of different types. They were written out of class against a deadline of two to three weeks, with length limits imposed (usually 700-800 words) and a suitable text structure suggested. First-term students were admitted for both spring (January 20 - June 6) and autumn terms (September 1 - January 19).

First-term essays:

a1. "English, my English." Students describe their experience of the English language, evaluating their reading, writing, speaking, and listening proficiency. Personal, involved style. Written late January or early September.

a2. Argumentation. Students argue for or against a statement concerning a topical issue. Formal style. Written in mid-February or early October.

a3. Reflections. Students reflect on the medium of television and its impact on people, or on related issues of their choice. Personal/formal style. Written in March or October.

a4. Literature course assignment. Students choose between a discussion of theme/character/narrator and a close-reading based analysis of a set passage. Formal style. Written in early April or November.

a5. Culture course assignment. Students study topics in set secondary sources and compose an essay using this material, often quoting and listing these sources. Topics include issues such as 19th-century education of women, the industrial revolution, slavery, and utopias. Written in late April or November.

Second-term essays:

b1. Causal analysis. Students discuss causes of some recent trend of their choice. Formal style. Suitable in content and style for comparison or combination with essay a3.

b2. Argumentation. Students present counter-arguments to views expressed in articles or letters to the editor. Similar in approach and tone to essay a2.

b3. Short papers in English linguistics, on various topics, e.g. loan words in English, English spelling, British and American English, the semantic properties of synonymous pairs. Academic style. Lengthy tables, lists of words, and appendices, irrelevant to the study of learner English, were removed (the place was marked in the document). Essays may still contain words in other languages than English, or from earlier periods of English, items quoted directly from dictionaries, and lists of references.

b4. English literature. A discussion of character, theme etc., produced in a survey course, dealing with Shakespeare's Julius Caesar or contemporary novels. Essays may contain quotations, sometimes also references to secondary sources. Academic style.

b5. American literature. Similar to b4. Essays formed part of a course on American contemporary novels and may contain quotations and references to secondary sources. Academic style.

In the autumn of 1999, 30 additional essays (coded b6-b8) were produced by second-term teacher trainees:

b6. Taboo, not taboo. (12 essays)

b7. Politics and education. (15 essays)

b8. School visit reports. (3 essays)

Third-term essays:

c1. Collected only in the spring term, 2000. Seven longer essays, all literature course assignments.

A quantitative overview

Tables 1-4 provide a survey of the USE corpus, tabulating its content and size across the three years of collection, thus illuminating the history of the corpus production. The number of words has been calculated with the Wordlist option in WordSmith Tools, version 2.0, set to count numbers as words, and hyphenated words as one word.
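
The two counting settings matter when the word counts are reproduced with other tools. The following is a minimal sketch in Python, assuming a simple regular-expression tokenizer; the name `count_words` and the pattern are illustrative, not part of USE or WordSmith Tools. WordSmith Tools may treat edge cases (apostrophes, markup, symbols) differently, so counts obtained this way will not necessarily match the tables exactly.

```python
import re

# Assumed tokenizer: a token is a run of letters or digits, optionally
# continued by internal hyphens or apostrophes. Numbers therefore count
# as words, and a hyphenated word counts as one word.
TOKEN = re.compile(r"[^\W_]+(?:[-'’][^\W_]+)*")

def count_words(text: str) -> int:
    """Count tokens, with numbers as words and hyphenated words as one."""
    return len(TOKEN.findall(text))

sample = "In 1999, well-known 19th-century novels weren't read."
print(count_words(sample))  # 7
```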

Table 1. Number of essays written by first-term students (a) and number of words

Essay type          Spring 1999  Autumn 1999  Spring 2000  Autumn 2000  Spring 2001  Autumn 2001  Total      Average words/essay
Student id          0100-238     1000-1121    2000-75      3000-72      4000-48      5000-49
Evaluation (a1)
  Essays            115          84           63           31           6            4            303
  Words             83,285       64,802       45,319       12,702       2,586        1,656        210,349    703/414
Argumentation (a2)
  Essays            105          67           58           26           42           46           344
  Words             78,390       52,250       41,916       20,104       30,326       32,388       255,374    742
Reflections (a3)
  Essays            94           56           54           18           36           34           292
  Words             67,832       40,811       38,725       14,110       25,992       24,273       211,743    725
Literature (a4)
  Essays            73           49           48           8            6            1            185
  Words             66,125       43,123       43,140       8,085        5,218        1,214        166,905    902
Culture (a5)
  Essays            50           40           24           ---          ---          ---          114
  Words             51,736       39,867       25,583       ---          ---          ---          117,186    1028
Total
  Essays            437          296          247          83           90           85           1238
  Words             347,368      240,853      194,682      55,001       64,122       59,561       961,557    777

Notes on Table 1:

Evaluation (a1): From 2001 these essays were limited to about 400 words, and collection for the corpus was officially discontinued. A few essays that were nevertheless submitted have been included.

Literature (a4): Collection for the corpus was discontinued after a sharp drop in student interest toward the end of the autumn term, 2000. A few essays that were still submitted have been included.

Culture (a5) was dropped from the curriculum as of autumn, 2000.

Table 2. Essays and papers written by second-term students (b) and number of words 

Text type           Autumn 1999   Spring 2000   Autumn 2000   Spring 2001   Autumn 2001   Total      Average words/essay
Student id          0100-318      1000-1500-41  2000-         3000-3500-25  4000-4500-8
Causal analysis (b1)
  Essays            15            21            12            22            6             76
  Words             11,469        16,479        8,597         16,762        4,559         57,886     761
Argumentation (b2)
  Essays            10            15            8             17            3             53
  Words             9,728         14,386        5,866         14,730        2,741         47,451     895
Linguistics (b3)
  Essays            18            15            2             ---           ---           35
  Words             26,662        26,679        2,996         ---           ---           56,337     1,610
Literature (b4)
  Essays            14            12            3             ---           ---           29
  Words             17,437        14,131        3,307         ---           ---           34,875     1,203
Literature (b5)
  Essays            6             8             5             2             ---           21
  Words             8,118         11,368        6,328         2,355         ---           28,169     1,341
Total
  Essays            63            71            30            41            9             214
  Words             73,414        83,043        27,094        33,847        7,300         224,698    1050

Notes on Table 2

In the last two terms, only causal analysis (b1) and argumentation (b2) essays were requested from the students.

Table 3. Number of essays written by second-term teacher trainees and number of words. Only from the autumn term, 1999, student id. codes 0100-238

Text type N of essays N of words
Taboo, not taboo (b6) 12 8,377
Politics and education (b7) 15 12,132
School visit report (b8) 3 3,782
Total 30 24,291

Table 4. Number of literature course essays written by third-term students (c) and number of words. Only from the spring term, 2000, student id. codes 0100-238, 0500-2

Text type (c1) N of essays N of words
American literature (0140 & 0165) 2 4,13
English literature 5 6,58
Total 7 10,71

Notes on Table 4

The literature essays (c1) were produced in elective courses of English and American literature. Five of the seven students taking part also submitted essays at the lower levels. These students keep their original codes in the range of 0100-328. The new participants have the codes 0500-2. All are coded with the student identification code with the addition 'c1'.

File system and encoding

Each essay in USE is a separate file in plain text format. The first line always has a begin-document tag as the only word of the line. That tag also provides the file name of the text document (e.g. <doc.id = 2031.a3>). An end-document tag (</doc>) is the only word on the last line of the document. The file name shows the student identity number (2031) followed by an extension giving the term/level of writing and the type of essay (e.g. a3, where a = first term, 3 = essay 3, Reflections). As shown in the tables, the first digit of the student identity code denotes the term the student entered the project (0 = spring term, 1999; 5 = autumn term, 2001); the following digits are only numbers marking the order in which students volunteered.

The student identity codes thus make it possible to select essays from a particular term, if so desired. It is also possible to follow individual students over time, as, once a student entered the project, her/his identity code remained the same. The extension denotes the term or level the essay belongs to. Thus, student 2012 may have produced several essays, such as 2012.a2, 2012.a3 (all on first-term level) and 2012.b1 (second-term level).

Normally students continue their studies on consecutive terms, so that a student beginning first-term studies in the autumn of 2000 will proceed to second-term studies in the spring of 2001. Four students, however, interrupted their studies, returning to take the second-term courses one or more terms later. Such second-term essays have an "i" (for "interrupted" period of study) added to the file extension (2012.b1i). The exact term when the student wrote her/his second-term essays is shown in the database. This time factor may be important to consider, if such an essay is included in a longitudinal sub-corpus.
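
The naming scheme described above can be decoded mechanically. The sketch below is an illustration only: `EssayId`, `parse_essay_id` and `ENTRY_TERMS` are hypothetical names, and it assumes four-digit student codes and file names without any further extension (such as .txt). The text states only the meaning of the first digits 0 and 5; the values for 1-4 are read off the student id row of Table 1 and may not hold for students who enrolled later than their first term.

```python
import re
from typing import NamedTuple

# First digit of the student code -> term of entry (first-term cohorts).
# 0 and 5 are given in the text; 1-4 are inferred from Table 1.
ENTRY_TERMS = {
    "0": "spring 1999", "1": "autumn 1999", "2": "spring 2000",
    "3": "autumn 2000", "4": "spring 2001", "5": "autumn 2001",
}

class EssayId(NamedTuple):
    student: str       # student identity code, e.g. "2031"
    level: str         # "a", "b" or "c" (first, second or third term)
    essay: int         # essay number within the level, e.g. 3
    interrupted: bool  # True if the extension carries the "i" suffix

# Assumed shape of a file name: four digits, a dot, level letter,
# essay digit and an optional "i".
_PATTERN = re.compile(r"(\d{4})\.([abc])(\d)(i?)")

def parse_essay_id(name: str) -> EssayId:
    """Split a file name such as '2012.b1i' into its components."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a USE-style file name: {name!r}")
    student, level, essay, flag = match.groups()
    return EssayId(student, level, int(essay), flag == "i")

parsed = parse_essay_id("2012.b1i")
print(parsed, ENTRY_TERMS[parsed.student[0]])
```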

Some editing of the essays has been done: author names have been removed (deletion marked <name>) along with other identifying information. Apostrophes have been standardised. Formatting characters (hard line and page breaks, extra line spacing, etc.) have been removed. Paragraph breaks (end of paragraph) have been kept, standardised as CR + LF (return and line feed, ASCII 13, 10) to enable study of text organisation. Three spaces have been substituted for tabs (HT, ASCII 9). Titles, if any (some essays are untitled), are preceded by <title> and followed by </title> to enable exclusion of titles, if desired.
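
The markup conventions above are simple enough to process with a few lines of code. The following sketch, with the hypothetical function name `split_essay`, separates the title from the paragraphs of one essay. It assumes the begin-document tag starts with "<doc" and that the title tags are not nested; it accepts both CR + LF and bare LF line ends, since many tools convert line ends when reading text. The sample text is invented for the demonstration.

```python
import re

def split_essay(raw: str):
    """Return (title, paragraphs) for the text of one essay file."""
    lines = [ln for ln in re.split(r"\r?\n", raw) if ln.strip()]
    # Drop the begin-document and end-document tags if present.
    if lines and lines[0].lstrip().startswith("<doc"):
        lines = lines[1:]
    if lines and lines[-1].strip() == "</doc>":
        lines = lines[:-1]
    body = "\n".join(lines)
    # Titles are optional; some essays are untitled.
    match = re.search(r"<title>(.*?)</title>", body, flags=re.DOTALL)
    title = match.group(1).strip() if match else None
    if match:
        body = body[:match.start()] + body[match.end():]
    paragraphs = [p.strip() for p in body.split("\n") if p.strip()]
    return title, paragraphs

# Invented sample text, not taken from the corpus.
raw = ("<doc.id = 2031.a3>\r\n<title>Television</title>\r\n"
       "First paragraph by <name>.\r\nSecond paragraph.\r\n</doc>\r\n")
title, paragraphs = split_essay(raw)
print(title, len(paragraphs))  # Television 2
```

Excluding titles from a word count then amounts to counting only the returned paragraphs.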

Collection procedure

In connection with a grammar lecture by one of the compilers, students were informed about the USE project, its aims and practical organisation. They were encouraged to enrol, although on an entirely voluntary basis. Consent to enrol and permission to use essays in the corpus were given in writing (Appendix 1) and students also completed a questionnaire providing information for a database (see below and Appendix 2).

All essays were written without supervision or time constraints (apart from date deadlines), and with access to dictionaries, and written and electronic sources for facts. Essay deadlines approaching, students were reminded to hand in electronic copies of their original essays to the USE compilers at the same time as they submitted a printed original to their essay tutors. Electronic copies were handed in on disk, copied into e-mails, or provided as e-mail attachments.

The USE compilers removed the students' names and other means of identification, converted the texts to plain text format, standardised certain items (see above) and saved the files under the identity codes allocated during the enrolment procedure.

Text types

A consequence of the set topics is that the essays can be expected to represent different text types and registers, i.e. they exhibit different levels of formality, certain kinds of vocabulary etc. This is why essay type (numbered a1-a5, b1-b8, and c1) rather than the term of production has been chosen as the main principle of organisation in the final version of the corpus, which facilitates grouping of similar texts, in order to obtain larger samples. It also makes it possible to compare similar text types on different levels. Table 5 shows the different types of essays. Evidently, some essays represent similar text types on different levels of proficiency. In terms of formality level and general topic areas, argumentation and discussion essays are related, usually dealing with topical subjects in society. This means that two large categories of essays can be discerned, one about matters of interest in society and the other about literature. Evaluation, culture and linguistics can be seen as more specialised categories.
 

Table 5. Essay types in the USE corpus and how they are interrelated

                First term (a)  Second term (b)  Third term (c)
Evaluation      a1              b8
Argumentation   a2              b2
Discussion      a3              b1, b6, b7
Literature      a4              b4, b5           c1
Culture         a5
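
For sampling by text type, the relations in Table 5 can be written down as a small lookup table. This is a sketch only; `RELATED_TYPES`, `category_of` and `types_by_level` are illustrative names. Essay types that do not appear in Table 5 (such as b3) are simply reported as having no category.

```python
# Categories and essay types as listed in Table 5.
RELATED_TYPES = {
    "Evaluation": ["a1", "b8"],
    "Argumentation": ["a2", "b2"],
    "Discussion": ["a3", "b1", "b6", "b7"],
    "Literature": ["a4", "b4", "b5", "c1"],
    "Culture": ["a5"],
}

def category_of(essay_type: str):
    """Return the Table 5 category of an essay type, or None if unlisted."""
    for category, types in RELATED_TYPES.items():
        if essay_type in types:
            return category
    return None

def types_by_level(category: str) -> dict:
    """Group the essay types of one category by level letter (a, b, c)."""
    grouped = {}
    for essay_type in RELATED_TYPES[category]:
        grouped.setdefault(essay_type[0], []).append(essay_type)
    return grouped

print(category_of("b6"))             # Discussion
print(types_by_level("Literature"))  # {'a': ['a4'], 'b': ['b4', 'b5'], 'c': ['c1']}
```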

Student background information

All 440 students in the USE project filled in a questionnaire, answering questions about themselves, concerning their first language, parents' first language, grades in English, previous studies, exposure to English etc. This information is coded in a Microsoft Excel database, see below. Incomplete data in the database are due to some students overlooking the second page of the questionnaire, choosing not to answer all the questions, or (in the last two terms) a shortened, simplified form of the questionnaire.

USE resources

USE consists of three separate parts, shown in Figure 1.
 

USE corpus: 1,489 essays in plain text format, organised in 14 essay type categories (each contained in one folder marked a1, a2, etc.). Untagged text. For number of files, see Tables 1-4.

USE database: Information about the 440 students coded in a Microsoft Excel file.

USE manual: Detailed information about the corpus and the database in a Microsoft Word file.

Figure 1. USE resources
 

Researching the corpus

Interface

As yet, there is no special interface or search engine created for the USE corpus. Most studies conducted on the material have been carried out with the software program WordSmith Tools, version 3.0 or earlier.

Sampling

Depending on the research question, the corpus can be used in different ways. If the investigated feature is expected to be frequent, a smaller sample of texts may be sufficient. If the investigation deals with a feature sensitive to register variation, it is important to choose essay type(s) suitable for the purpose.

Comparisons of Swedish students' English can be made with standard corpora of authentic written English or with other corpora of learner English (see Pravec 2002). Internal comparisons of samples of essays in the corpus may also be of interest, for instance, to see to what extent a syntactic construction or lexical unit is mastered on different levels of study (see Axelsson and Berglund 2002).

Database

The USE database provides an overview of the resources available. By using the sorting or filter functions in Excel, one can easily identify the essays relevant for a specific research question and then sample the selected essays from the corpus.

The information in the database makes it possible to study the progression of individual students who have submitted several essays over time. This can mean several essays in one term, two terms or three terms. The number of essays in each student’s production varies from one to eleven. About sixty students have handed in varying numbers of essays on both the first- and second-term levels, five of these even on the third-term level.
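
The same kind of longitudinal selection can be made directly from the file names, since the student code stays constant across terms. The sketch below groups file names by student and keeps those students who have essays on more than one level. The function names are illustrative and the file names in the demonstration are made up.

```python
from collections import defaultdict

def essays_by_student(file_names):
    """Map each student code to the sorted list of its essay extensions."""
    grouped = defaultdict(list)
    for name in file_names:
        student, _, extension = name.partition(".")
        grouped[student].append(extension)
    return {student: sorted(exts) for student, exts in grouped.items()}

def longitudinal(grouped, min_levels=2):
    """Keep students with essays on at least min_levels levels (a, b, c)."""
    return {
        student: exts
        for student, exts in grouped.items()
        if len({ext[0] for ext in exts}) >= min_levels
    }

# Invented file names for the demonstration.
names = ["2012.a2", "2012.b1", "2012.a3", "2031.a3"]
grouped = essays_by_student(names)
print(longitudinal(grouped))  # {'2012': ['a2', 'a3', 'b1']}
```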

The following variables are coded in the database:

Column A. Student identity code. The code was given when the student first entered the project and then followed him/her during all the terms of participation. Table 1 shows the codes allotted in the different first terms. Students enrolling for the first time during the second-term programme have their own set of identity codes (see Table 2 and the database). The majority of students who contributed both first- and second-term essays did so on consecutive terms. As mentioned above, four students submitted second-term essays (b) after a break of one term or more. These essays are marked "i" for "interrupted" and their term of production is given in column AJ.

Columns B-O. Essays submitted. Each essay type has its own column.

Column Q. Sex. Female (f) or male (m).

Column R. Age.

Column S. Year of birth.

Column T. Course=programme. A1, B1, C1 = general programme. A2, B2, A4, B4 and A6 = programmes for teacher trainees (A2, B2: upper secondary level; A4, B4: school years 4-9, and A6: school years 1-7).

Column U. Mother tongue, defined as "language spoken at home". Abbreviations are self-explanatory, for example, sw=Swedish, fi=Finnish, nor=Norwegian, spa=Spanish, ger=German. If less than obvious, "ot" = "other" is given and specified in column AJ.

Column V. Mother's first language.

Column W. Father's first language.

Column X. Answers the question "How many years have you studied English at school?"

Column Y. Answers the question "When did you first go to university?"

Column Z. Answers the question "What was your grade in English in Swedish upper secondary/high school?" Several changes in the Swedish grading system explain the variation in the data:

A few, older students have grades from a system ranging across A, a, AB, Ba, B, B? (with A as the best, very unusual grade, and B? a bare pass).

Many have grades 5, 4 and 3 (5 being the best, and 3 or better the requirement for English at university level).

The most recent system comprises several programmes in English at upper-secondary level: a minimum, a standard and a supplementary course, each graded MVG, VG and G - excellent, pass with distinction, pass. G on the standard course is required for university studies. Single grades refer to the standard course, double grades to standard course/supplementary course.

Column AA. Grade in Swedish. See above.

Column AB. University credits (points) in language studies at a Swedish university. The figure entered = weeks of study. Linguistics has been counted in this category.

Column AC. University credits (points) in language studies at a university abroad. The figure entered = weeks of study.

Column AD. University credits (points) in other subjects than languages at a Swedish university. The figure entered = weeks of study.

Column AE. University credits (points) in other subjects than languages at a university abroad. The figure entered = weeks of study.

Column AF. Worked in an English-speaking country. The figure entered = months.

Column AG. Studied in an English-speaking country. The figure entered = months. A college year in the United States, and earlier periods of school or summer schools in some English-speaking country are coded here.

Column AH. Total time spent in an English-speaking environment, broadly defined as "where English is used every day, abroad or in Sweden". The figure entered = months.

Column AI. Answers the question "Is there anything in particular that has affected your command of English?" A common answer refers to visits to English-speaking countries and has been coded as "Stay in Eng", regardless of country.

Column AJ. Clarification as to the first language, term of resumed studies on the second-term level, foreign grades, and other pertinent information offered.

An empty cell means that the student did not answer the question. For the last two terms of the USE project, the questionnaire was simplified, which explains the empty columns AF, AG and AI.
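
The column letters listed above can be used directly if the database is processed outside Excel. The sketch below assumes that the sheet has been exported to rows of cell values (for example via CSV), which is an assumption and not something the USE distribution provides. `column_index`, `select` and `make_row` are hypothetical helpers, and the two demonstration rows are invented.

```python
def column_index(letters: str) -> int:
    """Convert an Excel column label ('A', 'Q', 'AJ') to a zero-based index."""
    if not letters:
        raise ValueError("empty column label")
    index = 0
    for ch in letters.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column label: {letters!r}")
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1

def select(rows, conditions):
    """Keep rows whose cells equal the given values, keyed by column letter."""
    wanted = {column_index(col): value for col, value in conditions.items()}
    return [
        row for row in rows
        if all(i < len(row) and row[i] == v for i, v in wanted.items())
    ]

def make_row(**cells):
    """Demo helper: build a row spanning columns A-AJ with some cells filled."""
    row = [""] * (column_index("AJ") + 1)
    for col, value in cells.items():
        row[column_index(col)] = value
    return row

# Invented rows: column A = student code, Q = sex, U = mother tongue.
rows = [make_row(A="2012", Q="f", U="sw"), make_row(A="2031", Q="m", U="fi")]
print([row[0] for row in select(rows, {"Q": "f", "U": "sw"})])  # ['2012']
```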

Acknowledgement of the origin of USE

Anyone using USE for research is obligated to acknowledge the source of data as follows: USE = Uppsala Student English corpus, compiled by Margareta Westergren Axelsson and Ylva Berglund, the Department of English, Uppsala University, 1999-2001.

Availability of USE

The corpus can be used for research and educational purposes. It can be accessed on the Internet from the Oxford Text Archive at http://www.ota.ahds.ac.uk/. For Uppsala students and researchers, it is also available on a CD at the Department of English (Professor Merja Kytö and Senior lecturers supervising language project work).

About USE and other learner corpora

The two corpus compilers and students at the Department of English have used material from the corpus for investigations. The titles of studies finished so far are given in the list below.

• Axelsson, Margareta Westergren (1999) 'Project USE (Uppsala Student English),' ASLA Information 25:2, 25-6.
• Axelsson, Margareta Westergren (2000) 'USE - The Uppsala Student English Corpus: An instrument for needs analysis,' ICAME Journal 24:155-7. Available online at http://nora.hd.uib.no/icame/ij24/.
• Axelsson, Margareta Westergren (2000) 'The use of a corpus of students' written production in university English teaching,' in Gunilla Byrman, Hans Lindquist and Magnus Levin (eds.) Corpora in research and teaching: Papers from the ASLA symposium on corpora in research and teaching, Växjö, 11-12 November 1999. ASLA:s skriftserie 13. 293-303.
• Axelsson, Margareta Westergren and Angela Hahn (2001), 'The use of the progressive in Swedish and German advanced learner English - a corpus-based study,' ICAME Journal 25:5-30. Available online at http://nora.hd.uib.no/icame/ij25/.
• Axelsson, Margareta Westergren and Ylva Berglund (2002), 'The Uppsala Student English Corpus (USE): A multi-faceted resource for research and course development,' in Lars Borin (ed.) Parallel corpora, parallel worlds. Selected papers from a symposium on parallel and comparable corpora at Uppsala University, Sweden, 22-23 April, 1999. Amsterdam: Rodopi. 79-90.
• Berglund, Ylva and Oliver Mason (2002), 'The influence of external factors on learner performance,' in Bernhard Kettemann and George Marko (eds.) Teaching and learning by doing corpus analysis. Amsterdam: Rodopi. 205-215.
• Borin, Lars and Klas Prytz, '"New wine in old skins?" A corpus investigation of L1 syntactic transfer in learner language.' Poster at TALC 2002, Fifth International Conference on Teaching and Language Corpora, 27-31 July, 2002. Bertinoro, Italy.
• Granger, Sylviane (ed.) (1998) Learner English on computer. London: Longman.
• Mason, Oliver and Ylva Berglund (2002), 'Low-level parameters reflecting the naturalness of text,' in Conference publication of JADT 2002, 6th International Conference on the Statistical Analysis of Textual Data. Saint-Malo, France.
• Mason, Oliver and Ylva Berglund, '"But this formula doesn't mean anything!" - some reflections on parameters of texts and their significance.' Forthcoming in Festschrift for Geoffrey Leech (Peter Lang).
• Pravec, Norma A. (2002) 'A survey of learner corpora,' ICAME Journal 26:81-114. Available online at http://nora.hd.uib.no/icame/ij26/

Students' third-term (C) and fourth-term (D) papers (unpublished, filed by the Department of English)

• Blomberg, Karin (2000) Swedish learners' use of the progressive aspect in English.
• Eiman, Carin (2000) Adjectives and attitudes: A linguistic study of how male and female students use adjectives when describing their knowledge of and proficiency in the English language.
• Hellén, Christina (2001) Swedish students' use of hypothetical conditional sentences in English. (D-course)
• Linerstad, Andrea (2002) The development of students' skills in handling S-V concord.
• Svensson, Jenny (2001) Noun compound, noun-compound or nouncompound? Three different constructions of a noun+noun compound. (D-course).


Appendix 1

Uppsala Student English Project (USE)

Projektet syftar till att skapa en korpus (datorläsbar textsamling) bestående av material producerat av studenter. Korpusen kommer att användas för forskning, undervisning och läromedelsframställning. Att deltaga i projektet är frivilligt och de deltagande kommer att vara anonyma (namnen avlägsnas från korpusen). Frågor besvaras av projektledarna, Margareta Westergren Axelsson och Ylva Berglund.
The project aims at creating a corpus (computer-readable collection of texts) of material submitted by students. The corpus will be used for language research, teaching, and production of teaching material. All contribution to the project is voluntary and anonymity will be maintained by the removal of participants' names in the corpus. Questions will be answered by the project coordinators, Margareta Westergren Axelsson and Ylva Berglund.

Medgivande

Consent form

Jag ger 'The Uppsala Student English Project' rätten att använda det jag lämnar till projektet för forskning, undervisning, läromedelsframställning, publicering och presentation (tex. på konferenser och seminarier). Jag godkänner att mitt material eventuellt publiceras (helt eller delvis) i någon form, tex. elektroniskt eller i tryck.

I hereby give to 'The Uppsala Student English Project' the right to use the material I submit to the project for research, teaching, publication and presentation (conferences, workshops, etc.). I consent to the possible publication of my material (as a whole or in part) in various forms, including paper and electronic media.

Namnteckning / signature ....................................................................................................
 

Namnförtydligande /printed name ........................................................................................
 

Datum / date ...........................
 


Appendix 2

Background data for corpus of student English

Name
(in the final corpus, all contributors will be anonymous, with only a code for identification):

.................................................................................................................
 

Sex: [ ] female [ ] male

Year of birth: 19...........

Mother tongue (what language do you speak at home? ):

[ ] Swedish [ ] English [ ] Other, namely ...........................................................
 

Mother tongue of parents
a) mother
[ ] Swedish [ ] English [ ] Other, namely ...........................................................
 

b) father
[ ] Swedish [ ] English [ ] Other, namely ...........................................................

How many years have you studied English at school? ...................

What year did you first go to university? 19............

Have you taken any previous language courses at university level ?
[ ] no [ ] yes (please specify language and points, for example French 20 p, Russian 5 p, etc.):

.................................................................................................................

.................................................................................................................

Have you taken any other courses at university?
[ ] no [ ] yes (please specify course and points, for example Economics 20 p, Law 5 p, etc.):

.................................................................................................................

.................................................................................................................

.................................................................................................................
 

Have you studied/worked abroad? (See also the following question.)
[ ] no [ ] yes (please specify country, type of activity, length):

.................................................................................................................

.................................................................................................................

.................................................................................................................

How much time have you spent in an English-speaking environment (where English was used every day), abroad or in Sweden? Please specify where, for how long and to what extent if possible (for example prolonged stay in an English-speaking country, long holidays and travels abroad, work in an international environment):

.................................................................................................................

.................................................................................................................

.................................................................................................................

What was your grade in English in Swedish upper secondary/high school?

.........................

What was your grade in Swedish (language) in upper secondary/high school?

.....................

Is there anything in particular you feel has affected your command of English?
[ ] no [ ] yes (please specify):