Vocabulary Resources for Material Writers

From The Materials Writers Newsletter
The Newsletter of the Materials Writers' National Special Interest Group
of the Japan Association of Language Teachers
Vol. IV, No. 3, October 1996

John Bauman
Enterprise Training Group

Material written for ESL students needs to use somewhat simplified vocabulary and structure if it is to be accessible to lower- and intermediate-level students. In terms of vocabulary, a writer can try to "keep it simple" while writing, but a more rigorous approach is to compare a text with a list of words prepared for this purpose. A variety of word lists are available, as well as different ways to use them. In this article, I will briefly list and describe some of these lists. I'll also discuss a program that analyzes a text against them, and give some links for further exploration of this topic on the internet. Links to the sites mentioned are given at the end of this article.

Teaching and Learning Vocabulary (Nation 1990) contains a good general discussion of this topic. Nation doesn't hesitate to quantify the issue. His model of an ideal vocabulary teaching sequence starts with the most frequent 2,000 words, which he calls general service vocabulary. Everybody needs to know these words; they make up about 87% of an average written text. After this point, general frequency becomes less useful as a guide to what words to teach. Students are better off studying a list of words specific to their field of interest or need, if one can be found. For the student aiming at English-language higher education, Nation's 800-word University Word List is appropriate. After this, the remaining vocabulary of English is of too little frequency to merit direct study. Instead, skills such as analyzing word parts and guessing from context can be taught.
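A coverage figure like the 87% above is the share of the running words in a text that belong to a given set of frequent words. The following is only a minimal sketch of that arithmetic, not Nation's procedure: it ranks words by their frequency within the sample text itself (not by a general list such as the GSL), counts word forms rather than word families, and uses a simple regular-expression tokenizer. The function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def coverage_of_top_n(tokens, n):
    """Fraction of running words covered by the n most frequent word forms.

    Assumption: frequency is measured inside this text only.
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)

sample = tokenize("the cat and the dog and the bird")
print(coverage_of_top_n(sample, 1))  # share covered by the single most frequent word
print(coverage_of_top_n(sample, 2))  # share covered by the two most frequent words
```

With a real text and a real frequency list, the same calculation shows how quickly coverage levels off as less frequent words are added.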

The number of different words used will depend on the level of the text. Writers of material for ESL learners also have to decide which words to use, or, in a larger sense, to which population of words they should restrict themselves. Here a list becomes necessary. Many have been developed over the years. The following remain relevant.

The General Service List

The General Service List (GSL)(West 1953) is the specific list of 2,000 words that Nation refers to when he writes about the "first 2,000 words." It's based on written texts, it's old, and it's not in frequency order, though frequency numbers are given. The source of the frequency information is even earlier than the publication date, being derived from Thorndike and Lorge (1944). But the list was not compiled based on frequency alone. It was created to be an ideal vocabulary for ESL students to start out with. Through the 1970s, a lot of material, particularly graded readers, was based on this list. Even today, much of this material is sold and used. The GSL is out of print, and somewhat out of favor. The list is available as a component of the Vocabprofile program described below and, in a slightly different form, on this web page.

Thorndike and Lorge

The Teacher's Word Book of 30,000 Words (Thorndike and Lorge, 1944) was created as a resource for elementary and high school teachers in the United States. It is still frequently cited; for example, it's the source of the words above the 2,000 word level in the vocabulary test in Nation (1990). Computer-produced corpora have, however, largely replaced it as an authority on the frequency of words. It's old, and it's based on a compilation of pre-WW2, non-computerized word counts totaling about 18 million written words. As published, it's not in frequency order, but frequency ranks are given for each word.

The University Word List

The University Word List (UWL) (in Nation, 1990) is a list of academic vocabulary composed of about 800 words. It's designed for students who plan to study in an English-language college or university. Essentially, it's the most common 800 words in academic texts, excluding the 2,000 words of the GSL. This list is structurally linked to the GSL. A student who studies the GSL, followed by the UWL, will find no repetition of words. The list is divided into 11 parts. Part 1 has the greatest frequency and range, part 2 the next greatest, and so on. This list is also a component of the Vocabprofile program.

The Brown Corpus

The Brown Corpus (Francis and Kucera, 1982) is the earliest computerized study of English vocabulary. It is an analysis of 1 million words published in the United States in 1961. It's also kind of old, but it's more consistent in its definition of "word" (as a lemma) than the earlier lists. The 1982 publication, which includes both alphabetical and frequency order lists of the words, is a very useful resource.

The LOB Corpus

The LOB Corpus (Hofland and Johansson, 1982) is a study of 1 million words of British text published in 1961. It was designed to be a British counterpart to the Brown Corpus.

The Cambridge English Lexicon

The Cambridge English Lexicon (CEL) (Hindmarsh, 1980) is a list of 4470 words, prepared with reference to the GSL, Thorndike and Lorge, Brown, other sources, and the author's experience as an ESL teacher and material developer. Each item is graded from 1 to 5. The most useful aspect of the list is that the different meanings of the words are also graded on the same scale. Only the CEL and the GSL give separate information on the different meanings of common words (though, of course, dictionaries do also). The GSL gives actual frequency numbers for the different meanings, but the age of the data and the fact that it was gathered by hand may make the CEL a more reliable source for an indication of the relative importance to students of different meanings of words. The grading in the CEL is not based solely on frequency.

Modern Corpora

These days, much is heard about corpora from dictionary publishers, who all boast about the enormous corpora that their learner dictionaries are based on. The British publishers are particularly enthusiastic about this, using either the CoBuild corpus or the British National Corpus (BNC) as a source of lexicographic information. Both of these corpora contain more than 100 million words. Limited access to them is possible through the internet; see the links on the Collocations Homepage listed below. Depending on your purpose, it may be more useful to access these corpora in pre-digested form through the dictionaries based on them. A lemmatized frequency list of the BNC has been prepared by Adam Kilgarriff and is available for FTP.

Vocabprofile

Vocabprofile is a freeware program for PCs that will compare a given text with any properly formatted list. Three lists can be used at a time. The output reports what percentage of the words in the text are on each of the lists. It will also print the text with the words marked to indicate which list they are on, or that they aren't on any list. Vocabprofile is available for FTP at the URL below. The three lists that come with the program are the first 1,000 words of the GSL, the second 1,000 words of the GSL, and the UWL.
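The kind of comparison described above can be sketched in a few lines of Python. This is not Vocabprofile's code or its file format, only an illustration under stated assumptions: the function names are hypothetical, the word lists are tiny stand-ins and not real GSL or UWL entries, and words are matched by exact form, not by word family.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def profile(text, word_lists):
    """Compare a text with an ordered sequence of (label, set_of_words) pairs.

    Returns a dict of percentages per label (plus "off-list") and the text
    with each word tagged by the first list that contains it.
    """
    tokens = tokenize(text)
    counts = {label: 0 for label, _ in word_lists}
    counts["off-list"] = 0
    marked = []
    for token in tokens:
        for label, words in word_lists:
            if token in words:
                break
        else:
            # No list contained the word.
            label = "off-list"
        counts[label] += 1
        marked.append(f"{token}[{label}]")
    total = len(tokens)
    percentages = {
        name: (100.0 * count / total if total else 0.0)
        for name, count in counts.items()
    }
    return percentages, " ".join(marked)

# Toy lists for illustration only; they are NOT real GSL or UWL contents.
toy_lists = [
    ("first", {"the", "cat", "sat"}),
    ("second", {"mat"}),
]
percentages, marked_text = profile("The cat sat on the mat.", toy_lists)
print(percentages)
print(marked_text)
```

A writer could use the marked text to spot off-list words and decide whether to replace or gloss them.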

Concluding Remarks

None of these resources is ideal. Thorndike and Lorge and the GSL are old, old enough that the English of today surely differs significantly. However, the core vocabulary of English changes more slowly, so at the frequency level of the first 2,000 words this may be less of a problem. The GSL offers some advantages as a standard. It was specifically designed as a teaching vocabulary list. It has a long history of use, both in teaching materials and in second language acquisition research. A program to compare it with a given text is readily available. Of the lists above, only the CEL was also compiled for the purpose of facilitating the creation of teaching materials. It's more modern than the GSL, but appears to have had less impact. It is not conveniently available for computerized text comparison.

The Brown Corpus, the LOB Corpus and the lemmatized list from the BNC are useful because they give the lists in frequency order. This allows a population of words to be defined much more precisely, and individual words to be compared with each other. But these lists were prepared for linguistic research, not for teachers. They're lists of lemmas, which means that words are listed more than once if they can act as more than one part of speech. Some derived forms are also considered as separate lemmas, such as comparative and superlative forms of adjectives. These factors affect both the frequency rankings of words and the number of words that appear on a list. In other words, a list of 1,000 words taken from the GSL or CEL would contain more than 1,000 lemmas. These corpus-based lists need substantial adjustment to make them appropriate as vocabulary standards. These adjustments have already been made to the GSL and CEL.
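One part of the adjustment described above, merging entries that differ only in part of speech, can be sketched as follows. This is an illustration under assumptions, not the format of any of the published lists: the entry layout (word, part of speech, frequency) is hypothetical and the frequency figures are invented. The sketch does not handle derived forms such as comparatives, which would need a separate mapping to their base words.

```python
from collections import defaultdict

def collapse_lemmas(lemma_entries):
    """Merge (word, part_of_speech, frequency) entries that share a word form.

    Returns (word, total_frequency) pairs sorted by descending frequency,
    with ties broken alphabetically.
    """
    totals = defaultdict(int)
    for word, _pos, freq in lemma_entries:
        totals[word] += freq
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Invented figures for illustration only.
entries = [
    ("time", "n", 500),
    ("good", "adj", 400),
    ("work", "n", 300),
    ("work", "v", 250),
]
for word, total in collapse_lemmas(entries):
    print(word, total)
```

In this toy data, "work" ranks below "time" and "good" as two separate lemmas but above both once its noun and verb counts are combined, which shows how lemma-based rankings can shift after adjustment.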

An author of EFL material has many vocabulary options available. I hope this discussion of resources is useful and that the bibliography and the internet sites below will be helpful in finding the items that will serve your specific needs.

Links to sites mentioned

Adam Kilgarriff
http://www.itri.brighton.ac.uk/~Adam.Kilgarriff/
Links to his lemmatized, frequency order version of the BNC are here.

John Higgins
http://www.marlodge.supanet.com/index.html
Here you can find Vocabprofile as well as links to other programs.

Bibliography

Francis, W.N. and Kucera, H. (1982). Frequency Analysis of English Usage. Houghton Mifflin, Boston

Hindmarsh, R. (1980). Cambridge English Lexicon. Cambridge University Press, Cambridge

Hofland, K. and Johansson, S. (1982). Word Frequencies in British and American English. NAVF, Bergen

Nation, I.S.P. (1990). Teaching and Learning Vocabulary. Newbury House, New York

Thorndike, E.L. and Lorge, I. (1944). The Teacher's Word Book of 30,000 Words. Teachers College, Columbia University, New York

West, M. (1953). A General Service List of English Words. Longman, London

