readability {koRpus} | R Documentation |
Measure readability
Description
These methods calculate several readability indices.
Usage
readability(txt.file, ...)
## S4 method for signature 'kRp.text'
readability(
txt.file,
hyphen = NULL,
index = c("ARI", "Bormuth", "Coleman", "Coleman.Liau", "Dale.Chall",
"Danielson.Bryan", "Dickes.Steiwer", "DRP", "ELF", "Farr.Jenkins.Paterson", "Flesch",
"Flesch.Kincaid", "FOG", "FORCAST", "Fucks", "Gutierrez", "Harris.Jacobson",
"Linsear.Write", "LIX", "nWS", "RIX", "SMOG", "Spache", "Strain", "Traenkle.Bailer",
"TRI", "Tuldava", "Wheeler.Smith"),
parameters = list(),
word.lists = list(Bormuth = NULL, Dale.Chall = NULL, Harris.Jacobson = NULL, Spache =
NULL),
fileEncoding = "UTF-8",
sentc.tag = "sentc",
nonword.class = "nonpunct",
nonword.tag = c(),
quiet = FALSE,
keep.input = NULL,
as.feature = FALSE
)
## S4 method for signature 'missing'
readability(txt.file, index)
## S4 method for signature 'kRp.readability,ANY,ANY,ANY'
x[i]
## S4 method for signature 'kRp.readability'
x[[i]]
Arguments
txt.file |
An object of class |
... |
Additional arguments for the generics. |
hyphen |
An object of class |
index |
A character vector,
indicating which indices should actually be computed. If set to |
parameters |
A list with named magic numbers, defining the relevant parameters for each index. If none are given, the default values are used. |
word.lists |
A named list providing the word lists for indices which need one. If |
fileEncoding |
A character string defining the character encoding of the |
sentc.tag |
A character vector with POS tags which indicate a sentence ending. The default value |
nonword.class |
A character vector with word classes which should be ignored for readability analysis. The default value
|
nonword.tag |
A character vector with POS tags which should be ignored for readability analysis. Will only be
of consequence if |
quiet |
Logical. If |
keep.input |
Logical. If |
as.feature |
Logical,
whether the output should be just the analysis results or the input object with
the results added as a feature. Use |
x |
An object of class |
i |
Defines the row selector ( |
Details
In the following formulae, W stands for the number of words, St for the number of sentences, C for the number of characters (usually meaning letters), Sy for the number of syllables, W_{3Sy} for the number of words with at least three syllables, W_{<3Sy} for the number of words with fewer than three syllables, W^{1Sy} for the number of words with exactly one syllable, W_{6C} for the number of words with at least six letters, and W_{-WL} for the number of words which are not on a certain word list (explained where needed).
"ARI"
Automated Readability Index:
ARI = 0.5 \times \frac{W}{St} + 4.71 \times \frac{C}{W} - 21.43
If parameters is set to ARI="NRI", the revised parameters from the Navy Readability Indexes are used:
ARI_{NRI} = 0.4 \times \frac{W}{St} + 6 \times \frac{C}{W} - 27.4
If parameters is set to ARI="simple", the simplified formula is calculated:
ARI_{simple} = \frac{W}{St} + 9 \times \frac{C}{W}
Wrapper function: ARI
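The three ARI variants differ only in their coefficients. A minimal base R sketch under that reading follows; `ari_sketch` and its arguments are hypothetical names for illustration and are not part of koRpus, and the counts are assumed to be precomputed totals.

```r
# Hypothetical helper, not a koRpus function: ARI from raw counts.
ari_sketch <- function(words, sentences, chars,
                       variant = c("default", "NRI", "simple")) {
  variant <- match.arg(variant)
  wps <- words / sentences  # W / St
  cpw <- chars / words      # C / W
  switch(variant,
    default = 0.5 * wps + 4.71 * cpw - 21.43,
    NRI     = 0.4 * wps + 6 * cpw - 27.4,
    simple  = wps + 9 * cpw
  )
}

# 100 words, 5 sentences, 450 characters
ari_sketch(100, 5, 450)                    # 9.765
ari_sketch(100, 5, 450, variant = "NRI")   # 7.6
```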
"Bormuth"
Bormuth Mean Cloze & Grade Placement:
B_{MC} = 0.886593 - \left( 0.08364 \times \frac{C}{W} \right) + 0.161911 \times \left(\frac{W_{-WL}}{W} \right)^3 - 0.21401 \times \left(\frac{W}{St} \right) + 0.000577 \times \left(\frac{W}{St} \right)^2 - 0.000005 \times \left(\frac{W}{St} \right)^3
Note: This index needs the long Dale-Chall list of 3000 familiar (English) words to compute W_{-WL}. That is, you must have a copy of this word list and provide it via the word.lists=list(Bormuth=<your.list>) parameter!
B_{GP} = 4.275 + 12.881 \times B_{MC} - (34.934 \times B_{MC}^2) + (20.388 \times B_{MC}^3) + (26.194 \times C_{CS} - 2.046 \times C_{CS}^2) - (11.767 \times C_{CS}^3) - (44.285 \times B_{MC} \times C_{CS}) + (97.620 \times (B_{MC} \times C_{CS})^2) - (59.538 \times (B_{MC} \times C_{CS})^3)
Where C_{CS} represents the cloze criterion score (35% by default).
Wrapper function: bormuth
"Coleman"
Coleman's Readability Formulas:
C_1 = 1.29 \times \left( \frac{100 \times W^{1Sy}}{W} \right) - 38.45
C_2 = 1.16 \times \left( \frac{100 \times W^{1Sy}}{W} \right) + 1.48 \times \left( \frac{100 \times St}{W} \right) - 37.95
C_3 = 1.07 \times \left( \frac{100 \times W^{1Sy}}{W} \right) + 1.18 \times \left( \frac{100 \times St}{W} \right) + 0.76 \times \left( \frac{100 \times W_{pron}}{W} \right) - 34.02
C_4 = 1.04 \times \left( \frac{100 \times W^{1Sy}}{W} \right) + 1.06 \times \left( \frac{100 \times St}{W} \right) + 0.56 \times \left( \frac{100 \times W_{pron}}{W} \right) - 0.36 \times \left( \frac{100 \times W_{prep}}{W} \right) - 26.01
Where W_{pron} is the number of pronouns, and W_{prep} the number of prepositions.
Wrapper function: coleman
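All four Coleman formulas work on percentages per 100 words, so they can share intermediate values. The following base R sketch illustrates this; `coleman_sketch` and its argument names are hypothetical and not part of koRpus, and the pronoun and preposition counts are assumed to come from a POS-tagged text.

```r
# Hypothetical helper, not a koRpus function: Coleman's C1 to C4 from counts.
coleman_sketch <- function(words, sentences, monosyll, pronouns, prepositions) {
  pct_1sy  <- 100 * monosyll / words      # 100 * W^{1Sy} / W
  pct_st   <- 100 * sentences / words     # 100 * St / W
  pct_pron <- 100 * pronouns / words      # 100 * W_{pron} / W
  pct_prep <- 100 * prepositions / words  # 100 * W_{prep} / W
  c(
    C1 = 1.29 * pct_1sy - 38.45,
    C2 = 1.16 * pct_1sy + 1.48 * pct_st - 37.95,
    C3 = 1.07 * pct_1sy + 1.18 * pct_st + 0.76 * pct_pron - 34.02,
    C4 = 1.04 * pct_1sy + 1.06 * pct_st + 0.56 * pct_pron -
         0.36 * pct_prep - 26.01
  )
}

# 100 words, 5 sentences, 60 monosyllabic words, 10 pronouns, 12 prepositions
coleman_sketch(100, 5, 60, 10, 12)
```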
"Coleman.Liau"
First estimates cloze percentage, then calculates grade equivalent:
CL_{ECP} = 141.8401 - 0.214590 \times \frac{100 \times C}{W} + 1.079812 \times \frac{100 \times St}{W}
CL_{grade} = -27.4004 \times \frac{CL_{ECP}}{100} + 23.06395
The short form is also calculated:
CL_{short} = 5.88 \times \frac{C}{W} - 29.6 \times \frac{St}{W} - 15.8
Wrapper function: coleman.liau
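The two-step calculation (cloze percentage first, then grade) and the short form can be compared side by side. This is only a base R sketch of the formulas above; `coleman_liau_sketch` is a hypothetical name and not the koRpus wrapper.

```r
# Hypothetical helper, not a koRpus function: Coleman-Liau from raw counts.
coleman_liau_sketch <- function(words, sentences, chars) {
  # estimated cloze percentage
  ecp <- 141.8401 - 0.214590 * (100 * chars / words) +
         1.079812 * (100 * sentences / words)
  c(
    ECP   = ecp,
    grade = -27.4004 * (ecp / 100) + 23.06395,
    short = 5.88 * chars / words - 29.6 * sentences / words - 15.8
  )
}

# 100 words, 5 sentences, 450 characters
round(coleman_liau_sketch(100, 5, 450), 2)
```

For these counts the grade derived from the cloze percentage and the short form agree to two decimal places (both about 9.18).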
"Dale.Chall"
New Dale-Chall Readability Formula. By default the revised formula (1995) is calculated:
DC_{new} = 64 - 0.95 \times{} \frac{100 \times{} W_{-WL}}{W} - 0.69 \times{} \frac{W}{St}
This will result in a cloze score which is then looked up in a grading table. If parameters is set to Dale.Chall="old", the original formula (1948) is used:
DC_{old} = 0.1579 \times{} \frac{100 \times{} W_{-WL}}{W} + 0.0496 \times{} \frac{W}{St} + 3.6365
If parameters is set to Dale.Chall="PSK", the revised parameters by Powers-Sumner-Kearl (1958) are used:
DC_{PSK} = 0.1155 \times{} \frac{100 \times{} W_{-WL}}{W} + 0.0596 \times{} \frac{W}{St} + 3.2672
Note: This index needs the long Dale-Chall list of 3000 familiar (English) words to compute W_{-WL}. That is, you must have a copy of this word list and provide it via the word.lists=list(Dale.Chall=<your.list>) parameter!
Wrapper function: dale.chall
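To make the role of the word list concrete, here is a base R sketch that derives W_{-WL} from a token vector and then applies the three formulas. `dale_chall_sketch` is hypothetical, the toy list is a stand-in for the real Dale-Chall list, and the naive case-insensitive exact matching is an assumption that may differ from how koRpus actually compares words.

```r
# Hypothetical helper, not a koRpus function.
dale_chall_sketch <- function(tokens, sentences, word_list,
                              variant = c("new", "old", "PSK")) {
  variant <- match.arg(variant)
  words <- length(tokens)
  # W_{-WL}: tokens not on the list (assumed: case-insensitive exact match)
  not_on_list <- sum(!(tolower(tokens) %in% tolower(word_list)))
  pct_unfamiliar <- 100 * not_on_list / words
  wps <- words / sentences
  switch(variant,
    new = 64 - 0.95 * pct_unfamiliar - 0.69 * wps,
    old = 0.1579 * pct_unfamiliar + 0.0496 * wps + 3.6365,
    PSK = 0.1155 * pct_unfamiliar + 0.0596 * wps + 3.2672
  )
}

toy_tokens <- c("the", "cat", "sat", "on", "the", "mat",
                "quietly", "the", "dog", "barked")
# toy stand-in, NOT the real Dale-Chall list
toy_list <- c("the", "cat", "sat", "on", "mat", "dog")

dale_chall_sketch(toy_tokens, sentences = 2, word_list = toy_list)  # 41.55
```

The default result is a cloze score; the lookup in the grading table mentioned above is not part of this sketch.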
"Danielson.Bryan"
DB_1 = \left( 1.0364 \times \frac{C}{Bl} \right) + \left( 0.0194 \times \frac{C}{St} \right) - 0.6059
DB_2 = 131.059 - \left( 10.364 \times \frac{C}{Bl} \right) - \left( 0.194 \times \frac{C}{St} \right)
Where Bl means blanks between words, which are not actually counted in this implementation, but estimated as words - 1. C is interpreted as literally all characters.
Wrapper function: danielson.bryan
"Dickes.Steiwer"
Dickes-Steiwer Handformel:
DS = 235.95993 - \left( 73.021 \times \frac{C}{W} \right) - \left(12.56438 \times \frac{W}{St} \right) - \left(50.03293 \times TTR \right)
Where TTR refers to the type-token ratio, which will be calculated case-insensitive by default.
Wrapper function: dickes.steiwer
"DRP"
Degrees of Reading Power. Uses the Bormuth Mean Cloze Score:
DRP = (1 - B_{MC}) \times 100
This formula itself has no parameters. Note: The Bormuth index needs the long Dale-Chall list of 3000 familiar (English) words to compute W_{-WL}. That is, you must have a copy of this word list and provide it via the word.lists=list(Bormuth=<your.list>) parameter!
Wrapper function: DRP
"ELF"
Fang's Easy Listening Formula:
ELF = \frac{W_{2Sy}}{St}
Wrapper function: ELF
"Farr.Jenkins.Paterson"
A simplified version of Flesch Reading Ease:
FJP = -31.517 - 1.015 \times \frac{W}{St} + 1.599 \times \frac{W^{1Sy}}{W}
If parameters is set to Farr.Jenkins.Paterson="PSK", the revised parameters by Powers-Sumner-Kearl (1958) are used:
FJP_{PSK} = 8.4335 + 0.0923 \times \frac{W}{St} - 0.0648 \times \frac{W^{1Sy}}{W}
Wrapper function: farr.jenkins.paterson
"Flesch"
Flesch Reading Ease:
F_{EN} = 206.835 - 1.015 \times \frac{W}{St} - 84.6 \times \frac{Sy}{W}
Certain internationalisations of the parameters are also implemented. They can be used by setting the Flesch parameter to one of the following language abbreviations.
"de" (Amstad's Verständlichkeitsindex):
F_{DE} = 180 - \frac{W}{St} - 58.5 \times \frac{Sy}{W}
"es" (Fernandez-Huerta):
F_{ES} = 206.835 - 1.02 \times \frac{W}{St} - 60 \times \frac{Sy}{W}
"es-s" (Szigriszt):
F_{ES S} = 206.835 - \frac{W}{St} - 62.3 \times \frac{Sy}{W}
"nl" (Douma):
F_{NL} = 206.835 - 0.93 \times \frac{W}{St} - 77 \times \frac{Sy}{W}
"nl-b" (Brouwer Leesindex):
F_{NL B} = 195 - 2 \times \frac{W}{St} - 67 \times \frac{Sy}{W}
"fr" (Kandel-Moles):
F_{FR} = 209 - 1.15 \times \frac{W}{St} - 68 \times \frac{Sy}{W}
If parameters is set to Flesch="PSK", the revised parameters by Powers-Sumner-Kearl (1958) are used to calculate a grade level:
F_{PSK} = 0.0778 \times \frac{W}{St} + 4.55 \times \frac{Sy}{W} - 2.2029
Wrapper function: flesch
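Since every Flesch variant has the same shape (a constant minus weighted sentence length minus weighted syllables per word), the variants can be expressed as one parameter table. The base R sketch below assumes that reading; `flesch_params` and `flesch_sketch` are hypothetical names, not koRpus objects, and the PSK row simply stores the coefficients with flipped signs so the same expression yields the grade level.

```r
# Hypothetical parameter table: score = const - wps * (W/St) - spw * (Sy/W)
flesch_params <- rbind(
  en     = c(const = 206.835, wps = 1.015, spw = 84.6),
  de     = c(180,     1,       58.5),
  es     = c(206.835, 1.02,    60),
  "es-s" = c(206.835, 1,       62.3),
  nl     = c(206.835, 0.93,    77),
  "nl-b" = c(195,     2,       67),
  fr     = c(209,     1.15,    68),
  PSK    = c(-2.2029, -0.0778, -4.55)  # grade level, signs flipped
)

# Hypothetical helper, not a koRpus function.
flesch_sketch <- function(words, sentences, syllables, variant = "en") {
  p <- flesch_params[variant, ]
  unname(p["const"] - p["wps"] * words / sentences -
         p["spw"] * syllables / words)
}

# 100 words, 5 sentences, 150 syllables
flesch_sketch(100, 5, 150)         # 59.635
flesch_sketch(100, 5, 150, "de")   # 72.25
```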
"Flesch.Kincaid"
Flesch-Kincaid Grade Level:
FK = 0.39 \times \frac{W}{St} + 11.8 \times \frac{Sy}{W} - 15.59
Wrapper function: flesch.kincaid
"FOG"
Gunning Frequency of Gobbledygook:
FOG = 0.4 \times \left( \frac{W}{St} + \frac{100 \times W_{3Sy}}{W} \right)
If parameters is set to FOG="PSK", the revised parameters by Powers-Sumner-Kearl (1958) are used:
FOG_{PSK} = 3.0680 + \left( 0.0877 \times \frac{W}{St} \right) + \left(0.0984 \times \frac{100 \times W_{3Sy}}{W} \right)
If parameters is set to FOG="NRI", the new FOG count from the Navy Readability Indexes is used:
FOG_{new} = \frac{\frac{W_{<3Sy} + (3 \times W_{3Sy})}{\frac{100 \times St}{W}} - 3}{2}
If the text was POS-tagged accordingly, proper nouns and combinations of only easy words will not be counted as hard words, and the syllables of verbs ending in "-ed", "-es" or "-ing" will be counted without these suffixes.
Due to the need to re-hyphenate combined words after splitting them up, this formula takes considerably longer to compute than most others. It will be omitted if you set index="fast" instead of the default.
Wrapper function: FOG
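The arithmetic of the three FOG variants can be sketched in base R as follows. `fog_sketch` is a hypothetical helper, not the koRpus wrapper, and it deliberately skips the POS-based adjustments described above: `hard_words` is assumed to be an already final count of W_{3Sy}.

```r
# Hypothetical helper, not a koRpus function; no POS-based exclusions.
fog_sketch <- function(words, sentences, hard_words,
                       variant = c("default", "PSK", "NRI")) {
  variant <- match.arg(variant)
  wps        <- words / sentences        # W / St
  hard_pct   <- 100 * hard_words / words # 100 * W_{3Sy} / W
  easy_words <- words - hard_words       # W_{<3Sy}
  switch(variant,
    default = 0.4 * (wps + hard_pct),
    PSK     = 3.0680 + 0.0877 * wps + 0.0984 * hard_pct,
    NRI     = ((easy_words + 3 * hard_words) /
               (100 * sentences / words) - 3) / 2
  )
}

# 100 words, 5 sentences, 12 words with three or more syllables
fog_sketch(100, 5, 12)          # 12.8
fog_sketch(100, 5, 12, "NRI")   # 10.9
```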
"FORCAST"
FORCAST = 20 - \frac{W^{1Sy} \times \frac{150}{W}}{10}
If parameters is set to FORCAST="RGL", the parameters for the precise reading grade level are used (see Klare, 1975, pp. 84–85):
FORCAST_{RGL} = 20.43 - 0.11 \times W^{1Sy} \times \frac{150}{W}
Wrapper function: FORCAST
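Both FORCAST variants scale the number of monosyllabic words to a 150-word sample. A small base R sketch of that step, with the hypothetical name `forcast_sketch` (not a koRpus function):

```r
# Hypothetical helper, not a koRpus function.
forcast_sketch <- function(words, monosyll, rgl = FALSE) {
  per150 <- monosyll * 150 / words  # W^{1Sy} scaled to 150 words
  if (rgl) {
    20.43 - 0.11 * per150           # precise reading grade level
  } else {
    20 - per150 / 10
  }
}

# 100 words, 60 of them monosyllabic
forcast_sketch(100, 60)              # 11
forcast_sketch(100, 60, rgl = TRUE)  # 10.53
```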
"Fucks"
Fucks' Stilcharakteristik (Fucks, 1955, as cited in Briest, 1974):
Fucks = \frac{Sy}{W} \times \frac{W}{St}
This simple formula has no parameters.
Wrapper function: fucks
"Gutierrez"
Gutiérrez de Polini's Fórmula de comprensibilidad (Gutiérrez, 1972, as cited in Fernández, 2016) for Spanish:
Gutierrez = 95.2 - \frac{9.7 \times C}{W} - \frac{0.35 \times W}{St}
Wrapper function: gutierrez
"Harris.Jacobson"
Revised Harris-Jacobson Readability Formulas (Harris & Jacobson, 1974):
For primary-grade material:
HJ_1 = 0.094 \times \frac{100 \times{} W_{-WL}}{W} + 0.168 \times \frac{W}{St} + 0.502
For material above third grade:
HJ_2 = 0.140 \times \frac{100 \times{} W_{-WL}}{W} + 0.153 \times \frac{W}{St} + 0.560
For material below fourth grade:
HJ_3 = 0.158 \times \frac{W}{St} + 0.055 \times \frac{100 \times{} W_{6C}}{W} + 0.355
For material below fourth grade:
HJ_4 = 0.070 \times \frac{100 \times{} W_{-WL}}{W} + 0.125 \times \frac{W}{St} + 0.037 \times \frac{100 \times{} W_{6C}}{W} + 0.497
For material above third grade:
HJ_5 = 0.118 \times \frac{100 \times{} W_{-WL}}{W} + 0.134 \times \frac{W}{St} + 0.032 \times \frac{100 \times{} W_{6C}}{W} + 0.424
Note: This index needs the short Harris-Jacobson word list for grades 1 and 2 (English) to compute W_{-WL}. That is, you must have a copy of this word list and provide it via the word.lists=list(Harris.Jacobson=<your.list>) parameter!
Wrapper function: harris.jacobson
"Linsear.Write"
(O'Hayre, undated, see Klare, 1975, p. 85):
LW_{raw} = \frac{100 - \frac{100 \times W_{<3Sy}}{W} + \left( 3 \times \frac{100 \times W_{3Sy}}{W} \right)}{\frac{100 \times St}{W}}
LW(LW_{raw} \leq 20) = \frac{LW_{raw} - 2}{2}
LW(LW_{raw} > 20) = \frac{LW_{raw}}{2}
Wrapper function: linsear.write
"LIX"
Björnsson's Läsbarhetsindex. Originally proposed for Swedish texts, calculated by:
LIX = \frac{W}{St} + \frac{100 \times{} W_{7C}}{W}
Texts with a LIX < 25 are considered very easy, around 40 normal, and > 55 very difficult to read.
Wrapper function: LIX
"nWS"
Neue Wiener Sachtextformeln (Bamberger & Vanecek, 1984):
nWS_1 = 19.35 \times \frac{W_{3Sy}}{W} + 0.1672 \times \frac{W}{St} + 12.97 \times \frac{W_{6C}}{W} - 3.27 \times \frac{W^{1Sy}}{W} - 0.875
nWS_2 = 20.07 \times \frac{W_{3Sy}}{W} + 0.1682 \times \frac{W}{St} + 13.73 \times \frac{W_{6C}}{W} - 2.779
nWS_3 = 29.63 \times \frac{W_{3Sy}}{W} + 0.1905 \times \frac{W}{St} - 1.1144
nWS_4 = 27.44 \times \frac{W_{3Sy}}{W} + 0.2656 \times \frac{W}{St} - 1.693
Wrapper function: nWS
"RIX"
Anderson's Readability Index. A simplified version of LIX:
RIX = \frac{W_{7C}}{St}
Texts with a RIX < 1.8 are considered very easy, around 3.7 normal, and > 7.2 very difficult to read.
Wrapper function: RIX
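LIX and RIX rely on the same long-word count and differ only in how it is normalised. A base R sketch of both follows; `lix_sketch` and `rix_sketch` are hypothetical names (not koRpus functions), and `long_words` is assumed to be whatever count the formulas denote as W_{7C}.

```r
# Hypothetical helpers, not koRpus functions.
lix_sketch <- function(words, sentences, long_words) {
  words / sentences + 100 * long_words / words
}

rix_sketch <- function(sentences, long_words) {
  long_words / sentences
}

# 100 words, 5 sentences, 20 long words
lix_sketch(100, 5, 20)  # 40
rix_sketch(5, 20)       # 4
```

With these counts the LIX value of 40 falls into the "normal" range quoted for LIX, and the RIX value of 4 lies slightly above the "normal" value of 3.7 quoted for RIX.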
"SMOG"
Simple Measure of Gobbledygook. By default calculates formula D by McLaughlin (1969):
SMOG = 1.043 \times \sqrt{W_{3Sy} \times \frac{30}{St}} + 3.1291
If parameters is set to SMOG="C", formula C will be calculated:
SMOG_{C} = 0.9986 \times \sqrt{W_{3Sy} \times \frac{30}{St} + 5} + 2.8795
If parameters is set to SMOG="simple", the simplified formula is used:
SMOG_{simple} = \sqrt{W_{3Sy} \times \frac{30}{St}} + 3
If parameters is set to SMOG="de", the formula adapted to German texts ("Qu", Bamberger & Vanecek, 1984, p. 78) is used:
SMOG_{de} = \sqrt{W_{3Sy} \times \frac{30}{St}} - 2
Wrapper function: SMOG
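All SMOG variants start from the number of polysyllabic words scaled to 30 sentences. The base R sketch below shows that shared step; `smog_sketch` is a hypothetical helper, not the koRpus wrapper.

```r
# Hypothetical helper, not a koRpus function.
smog_sketch <- function(hard_words, sentences,
                        variant = c("D", "C", "simple", "de")) {
  variant <- match.arg(variant)
  per30 <- hard_words * 30 / sentences  # W_{3Sy} * 30 / St
  switch(variant,
    D      = 1.043 * sqrt(per30) + 3.1291,
    C      = 0.9986 * sqrt(per30 + 5) + 2.8795,
    simple = sqrt(per30) + 3,
    de     = sqrt(per30) - 2
  )
}

# 24 words with three or more syllables in 5 sentences
smog_sketch(24, 5)            # 15.6451
smog_sketch(24, 5, "simple")  # 15
smog_sketch(24, 5, "de")      # 10
```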
"Spache"
Spache Revised Formula (1974):
Spache = 0.121 \times \frac{W}{St} + 0.082 \times{} \frac{100 \times{} W_{-WL}}{W} + 0.659
If parameters is set to Spache="old", the original parameters (Spache, 1953) are used:
Spache_{old} = 0.141 \times \frac{W}{St} + 0.086 \times{} \frac{100 \times{} W_{-WL}}{W} + 0.839
Note: The revised index needs the revised Spache word list (see Klare, 1975, p. 73), and the old index the short Dale-Chall list of 769 familiar (English) words to compute W_{-WL}. That is, you must have a copy of this word list and provide it via the word.lists=list(Spache=<your.list>) parameter!
Wrapper function: spache
"Strain"
Strain Index. This index was proposed in [1]:
S = Sy \times{} \frac{1}{St / 3} \times{} \frac{1}{10}
Wrapper function: strain
"Traenkle.Bailer"
Tränkle-Bailer Formeln. These two formulas were the result of a re-examination of the ones proposed by Dickes-Steiwer. They try to avoid the usage of the type-token ratio, which is dependent on text length (Tränkle & Bailer, 1984):
TB1 = 224.6814 - \left(79.8304 \times \frac{C}{W} \right) - \left(12.24032 \times \frac{W}{St} \right) - \left(1.292857 \times \frac{100 \times{} W_{prep}}{W} \right)
TB2 = 234.1063 - \left(96.11069 \times \frac{C}{W} \right) - \left(2.05444 \times \frac{100 \times{} W_{prep}}{W} \right) - \left(1.02805 \times \frac{100 \times{} W_{conj}}{W} \right)
Where W_{prep} refers to the number of prepositions, and W_{conj} to the number of conjunctions.
Wrapper function: traenkle.bailer
"TRI"
Kuntzsch's Text-Redundanz-Index. Intended mainly for German newspaper comments.
TRI = \left(0.449 \times W^{1Sy}\right) - \left(2.467 \times Ptn\right) - \left(0.937 \times Frg\right) - 14.417
Where Ptn is the number of punctuation marks and Frg the number of foreign words.
Wrapper function: TRI
"Tuldava"
Tuldava's Text Difficulty Formula. Supposed to be rather independent of specific languages (Grzybek, 2010).
TD = \frac{Sy}{W} \times \ln\left( \frac{W}{St} \right)
Wrapper function: tuldava
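Tuldava's formula multiplies the mean word length in syllables by the natural logarithm of the mean sentence length. As a base R sketch (with `tuldava_sketch` being a hypothetical name, not the koRpus wrapper):

```r
# Hypothetical helper, not a koRpus function.
tuldava_sketch <- function(words, sentences, syllables) {
  # log() is the natural logarithm in R
  (syllables / words) * log(words / sentences)
}

# 100 words, 5 sentences, 150 syllables
round(tuldava_sketch(100, 5, 150), 3)  # 4.494
```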
"Wheeler.Smith"
Intended for English texts in primary grades 1–4 (Wheeler & Smith, 1954):
WS = \frac{W}{St} \times \frac{10 \times{} W_{2Sy}}{W}
If parameters is set to Wheeler.Smith="de", the calculation stays the same, but grade placement is done according to Bamberger & Vanecek (1984), that is, for German texts.
Wrapper function: wheeler.smith
By default, if the text still has to be tagged, the language definition is queried by calling get.kRp.env(lang=TRUE) internally. If txt.file has already been tagged, the language definition of that tagged object is read and used instead. Set force.lang=get.kRp.env(lang=TRUE), or any other valid value, only if you want to forcibly override this default behaviour. See kRp.POS.tags for all supported languages.
Value
Depending on as.feature, either an object of class kRp.readability, or an object of class kRp.text with the added feature readability containing it.
Note
To get a printout of the default parameters as they are set if no other parameters are specified, call readability(parameters="dput").
If you want to provide different parameters, you must provide either a complete set for an index, or one of the special parameters mentioned in the index descriptions above (e.g., "PSK", if appropriate).
References
Anderson, J. (1981). Analysing the readability of english and non-english texts in the classroom with Lix. In Annual Meeting of the Australian Reading Association, Darwin, Australia.
Anderson, J. (1983). Lix and Rix: Variations on a little-known readability index. Journal of Reading, 26(6), 490–496.
Bamberger, R. & Vanecek, E. (1984). Lesen–Verstehen–Lernen–Schreiben. Wien: Jugend und Volk.
Briest, W. (1974). Kann man Verständlichkeit messen? Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, 27, 543–563.
Coleman, M. & Liau, T.L. (1975). A computer readability formula designed for machine scoring, Journal of Applied Psychology, 60(2), 283–284.
Dickes, P. & Steiwer, L. (1977). Ausarbeitung von Lesbarkeitsformeln für die deutsche Sprache. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 9(1), 20–28.
DuBay, W.H. (2004). The Principles of Readability. Costa Mesa: Impact Information. WWW: http://www.impact-information.com/impactinfo/readability02.pdf; 22.03.2011.
Farr, J.N., Jenkins, J.J. & Paterson, D.G. (1951). Simplification of Flesch Reading Ease formula. Journal of Applied Psychology, 35(5), 333–337.
Fernández, A. M. (2016, November 30). Fórmula de comprensibilidad de Gutiérrez de Polini. https://legible.es/blog/comprensibilidad-gutierrez-de-polini/
Flesch, R. (1948). A new readability yardstick. Journal of Applied Psychology, 32(3), 221–233.
Grzybek, P. (2010). Text difficulty and the Arens-Altmann law. In Peter Grzybek, Emmerich Kelih, Ján Mačutek (Eds.), Text and Language. Structures – Functions – Interrelations. Quantitative Perspectives. Wien: Praesens, 57–70.
Harris, A.J. & Jacobson, M.D. (1974). Revised Harris-Jacobson readability formulas. In 18th Annual Meeting of the College Reading Association, Bethesda.
Klare, G.R. (1975). Assessing readability. Reading Research Quarterly, 10(1), 62–102.
McLaughlin, G.H. (1969). SMOG grading – A new readability formula. Journal of Reading, 12(8), 639–646.
Powers, R.D, Sumner, W.A, & Kearl, B.E. (1958). A recalculation of four adult readability formulas, Journal of Educational Psychology, 49(2), 99–105.
Smith, E.A. & Senter, R.J. (1967). Automated readability index. AMRL-TR-66-22. Wright-Patterson AFB, Ohio: Aerospace Medical Division.
Spache, G. (1953). A new readability formula for primary-grade reading materials. The Elementary School Journal, 53, 410–413.
Tränkle, U. & Bailer, H. (1984). Kreuzvalidierung und Neuberechnung von Lesbarkeitsformeln für die deutsche Sprache. Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie, 16(3), 231–244.
Wheeler, L.R. & Smith, E.H. (1954). A practical readability formula for the classroom teacher in the primary grades. Elementary English, 31, 397–399.
[1] https://strainindex.wordpress.com/2007/09/25/hello-world/
Examples
# code is only run when the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  # call readability() on a tokenized text
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
  # if you call readability() without further arguments,
  # you will get its results directly
  rdb.results <- readability(tokenized.obj)
  # there are [ and [[ methods for these objects
  rdb.results[["ARI"]]
  # alternatively, you can also store those results as a
  # feature in the object itself
  tokenized.obj <- readability(
    tokenized.obj,
    as.feature=TRUE
  )
  # results are now part of the object
  hasFeature(tokenized.obj)
  corpusReadability(tokenized.obj)
} else {}