irtQ-package {irtQ}    R Documentation
irtQ: Unidimensional Item Response Theory Modeling
Description
Fit unidimensional item response theory (IRT) models to a mixture of dichotomous and polytomous data, calibrate online item parameters (i.e., pretest and operational items), estimate examinees' abilities, and provide useful functions related to unidimensional IRT such as IRT model-data fit evaluation and differential item functioning analysis.
For item parameter estimation, marginal maximum likelihood estimation via the expectation-maximization (MMLE-EM) algorithm (Bock & Aitkin, 1981) is used. In addition, the fixed item parameter calibration (FIPC) method (Kim, 2006) and the fixed ability parameter calibration (FAPC) method (Ban, Hanson, Wang, Yi, & Harris, 2001; Stocking, 1988), often called Stocking's Method A, are provided. For ability estimation, several popular scoring methods (e.g., ML, EAP, and MAP) are implemented.
In addition, there are many useful functions related to IRT analyses, such as evaluating IRT model-data fit, analyzing differential item functioning (DIF), importing item and/or ability parameters from popular IRT software, running flexMIRT (Cai, 2017) through R, generating simulated data, computing the conditional distribution of observed scores using the Lord-Wingersky recursion formula, computing item and test information functions, computing item and test characteristic curve functions, and plotting item and test characteristic curves and information functions.
Package: irtQ
Version: 0.2.0
Date: 2023-07-05
Depends: R (>= 4.1)
License: GPL (>= 2)
Details
The following four sections describe a) how to implement online item calibration using FIPC, b) how to implement online item calibration using Method A (FAPC), c) example R scripts for the online calibration, and d) the IRT models used in the irtQ package.
Online item calibration with the fixed item parameter calibration method (e.g., Kim, 2006)
The fixed item parameter calibration (FIPC) is a useful online item calibration method for computerized adaptive testing (CAT), used to place the parameter estimates of pretest items on the same scale as the operational item parameter estimates without post hoc linking/scaling (Ban, Hanson, Wang, Yi, & Harris, 2001; Chen & Wang, 2016). In FIPC, the operational item parameters are fixed to estimate the characteristics of the underlying latent variable prior distribution when calibrating the pretest items. More specifically, the underlying latent variable prior distribution of the operational items is estimated during the calibration of the pretest items to put the parameters of the pretest items on the scale of the operational item parameters (Kim, 2006). In the irtQ package, FIPC is implemented in two main steps:
1. Prepare a response data set and the item metadata of the fixed (or operational) items.
2. Implement FIPC to estimate the item parameters of the pretest items using the est_irt function.
1. Preparing a data set

To run the est_irt function, two data sets are required:

- The item metadata set (i.e., model, score category, and item parameters; see the description of the argument x in the est_irt function).
- Examinees' response data set for the items. It should be in a matrix format in which rows and columns indicate examinees and items, respectively. The order of the columns in the response data set must be exactly the same as the order of the rows in the item metadata.
2. Estimating the pretest item parameters

When FIPC is implemented in the est_irt function, the pretest item parameters are estimated by fixing the operational item parameters. To estimate the item parameters, you need to provide the item metadata in the argument x and the response data in the argument data.

It is worth explaining how to prepare the item metadata set for the argument x. A data frame with a specific form should be used for the argument x. The first column should have item IDs, the second column should contain the numbers of score categories of the items, and the third column should include IRT models. The available IRT models are "1PLM", "2PLM", "3PLM", and "DRM" for dichotomous items, and "GRM" and "GPCM" for polytomous items. Note that "DRM" covers all dichotomous IRT models (i.e., "1PLM", "2PLM", and "3PLM"), and "GRM" and "GPCM" represent the graded response model and the (generalized) partial credit model, respectively. From the fourth column, item parameters should be included. For dichotomous items, the fourth, fifth, and sixth columns represent the item discrimination (or slope), item difficulty, and item guessing parameters, respectively. When "1PLM" or "2PLM" is specified for any item in the third column, NAs should be inserted for the item guessing parameters. For polytomous items, the item discrimination (or slope) parameters should be contained in the fourth column, and the item threshold (or step) parameters should be included from the fifth to the last columns. When the number of categories differs across items, the empty cells of item parameters should be filled with NAs. See est_irt for more details about the item metadata; a minimal sketch of this layout is given below.

Also, you should set fipc = TRUE and specify a particular FIPC method in the argument fipc.method. Finally, you should provide a vector of the locations of the items to be fixed in the argument fix.loc. For more details about implementing FIPC, see the description of the est_irt function.
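For illustration, here is a minimal, hypothetical item metadata set with two dichotomous items and one three-category polytomous item (the item IDs, column names, and parameter values are made up for this sketch):

# hypothetical item metadata: ID, number of score categories, IRT model,
# and item parameters (NA where a parameter does not apply)
x <- data.frame(
  id    = c("ITEM1", "ITEM2", "ITEM3"),  # item IDs
  cats  = c(2, 2, 3),                    # numbers of score categories
  model = c("2PLM", "3PLM", "GRM"),      # IRT models
  par.1 = c(1.2, 0.9, 1.1),              # a (slope)
  par.2 = c(-0.5, 0.3, -1.0),            # b (difficulty) / 1st threshold
  par.3 = c(NA, 0.2, 0.5)                # g (guessing) / 2nd threshold
)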
When implementing FIPC, you can estimate both the empirical histogram and the scale of the latent variable prior distribution by setting EmpHist = TRUE. If EmpHist = FALSE, a normal prior distribution is used during the item parameter estimation, and the scale of the normal prior distribution is updated during the EM cycles.

The est_irt function requires a vector of the numbers of score categories for the items in the argument cats. For example, a dichotomous item has two score categories. If a single numeric value is specified, that value will be recycled across all items. If cats = NULL and all items are binary (i.e., dichotomous) items, it is assumed that all items have two score categories.

If necessary, you can specify whether prior distributions for the item slope and guessing parameters (the latter only for the IRT 3PL model) are used via the arguments use.aprior and use.gprior, respectively. If you decide to use prior distributions, you should specify which distributions are used via the arguments aprior and gprior, respectively. Currently, the Beta, Log-normal, and Normal distributions are available.

In addition, if the response data include missing values, you must indicate the missing value in the argument missing.

Once the est_irt function has been run, you will get a list of several internal objects, such as the item parameter estimates and the standard errors of the parameter estimates. A minimal call sketch is given below.
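Assuming x and resp.data are an item metadata set and a response matrix prepared as described above (the object names and the fixed-item locations are assumptions for this sketch), an FIPC run might look like:

# a minimal FIPC sketch; the inputs and fix.loc are illustrative assumptions
mod.fipc <- est_irt(x = x, data = resp.data, D = 1,
                    use.gprior = TRUE,
                    gprior = list(dist = "beta", params = c(5, 16)),
                    EmpHist = TRUE,       # estimate the empirical histogram
                    fipc = TRUE,          # turn on FIPC
                    fipc.method = "MEM",  # a specific FIPC method
                    fix.loc = 1:5)        # locations of the fixed items
summary(mod.fipc)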
Online item calibration with the fixed ability parameter calibration method (e.g., Stocking, 1988)
In CAT, the fixed ability parameter calibration (FAPC), often called Stocking's Method A, is the simplest and most straightforward online calibration method: maximum likelihood estimation of the item parameters given the proficiency estimates. In CAT, FAPC can be used to place the parameter estimates of pretest items on the same scale as the operational item parameter estimates, or to recalibrate the operational items to evaluate their parameter drift (Chen & Wang, 2016; Stocking, 1988). FAPC is also known to yield accurate, unbiased item parameter calibration when items are randomly, rather than adaptively, administered to examinees, which occurs most commonly with pretest items (Ban, Hanson, Wang, Yi, & Harris, 2001; Chen & Wang, 2016). In the irtQ package, FAPC is implemented in two main steps:
1. Prepare a data set for the calibration of item parameters (i.e., item response data and ability estimates).
2. Implement FAPC to estimate the item parameters using the est_item function.
1. Preparing a data set

To run the est_item function, two data sets are required:

- Examinees' ability (or proficiency) estimates. They should be in the format of a numeric vector.
- Examinees' response data set for the items. It should be in a matrix format in which rows and columns indicate examinees and items, respectively. The order of the examinees in the response data set must be exactly the same as that of the examinees' ability estimates. A minimal sketch of these two inputs is given below.
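For illustration, a minimal, hypothetical pair of inputs (all values are made up):

# ability estimates for five examinees (a numeric vector)
score <- c(-1.2, -0.4, 0.0, 0.7, 1.5)

# response matrix: 5 examinees (rows) by 3 items (columns), listed in the
# same examinee order as the score vector
resp.data <- matrix(c(0, 1, 0,
                      0, 0, 1,
                      1, 1, 0,
                      1, 0, 1,
                      1, 1, 1), nrow = 5, byrow = TRUE)

# the two inputs must line up
stopifnot(length(score) == nrow(resp.data))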
2. Estimating the pretest item parameters

The est_item function estimates the pretest item parameters given the proficiency estimates. To estimate the item parameters, you need to provide the response data in the argument data and the ability estimates in the argument score.

Also, you should provide a character vector of IRT models in the argument model to indicate which IRT model is used to calibrate each item. The available IRT models are "1PLM", "2PLM", "3PLM", and "DRM" for dichotomous items, and "GRM" and "GPCM" for polytomous items. "GRM" and "GPCM" represent the graded response model and the (generalized) partial credit model, respectively. Note that "DRM" is treated as "3PLM" in this function. If a single character string is specified, that model will be recycled across all items.

The est_item function requires a vector of the numbers of score categories for the items in the argument cats. For example, a dichotomous item has two score categories. If a single numeric value is specified, that value will be recycled across all items. If cats = NULL and all items are binary (i.e., dichotomous) items, it is assumed that all items have two score categories.

If necessary, you can specify whether prior distributions for the item slope and guessing parameters (the latter only for the IRT 3PL model) are used via the arguments use.aprior and use.gprior, respectively. If you decide to use prior distributions, you should specify which distributions are used via the arguments aprior and gprior, respectively. Currently, the Beta, Log-normal, and Normal distributions are available.

In addition, if the response data include missing values, you must indicate the missing value in the argument missing.

Once the est_item function has been run, you will get a list of several internal objects, such as the item parameter estimates and the standard errors of the parameter estimates. A minimal call sketch is given below.
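Reusing the hypothetical score vector and response matrix from the preparation step, and treating all three items as 2PLM items, an FAPC run might look like:

# a minimal FAPC sketch; the inputs are the hypothetical objects above
mod.fapc <- est_item(data = resp.data, score = score, D = 1,
                     model = "2PLM",  # a single value is recycled across items
                     cats = 2)        # a single value is recycled across items
summary(mod.fapc)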
Examples of R scripts
The example code below shows how to prepare the data sets and how to implement the online item calibration with FIPC and FAPC:
##----------------------------------------------------------------------------
# Attach the package
library(irtQ)

##----------------------------------------------------------------------------
# 1. The example code below shows how to prepare the data sets and how to
#    implement the fixed item parameter calibration (FIPC):
##----------------------------------------------------------------------------

## Step 1: Prepare a data set
## In this example, we generate examinees' true proficiency parameters and
## simulate the item response data using the function "simdat".

# import the "-prm.txt" output file from flexMIRT
flex_sam <- system.file("extdata", "flexmirt_sample-prm.txt", package = "irtQ")

# select the item metadata
x <- bring.flexmirt(file=flex_sam, "par")$Group1$full_df

# generate 1,000 examinees' latent abilities from N(0.4, 1.3)
set.seed(20)
score <- rnorm(1000, mean=0.4, sd=1.3)

# simulate the response data
sim.dat <- simdat(x=x, theta=score, D=1)

## Step 2: Estimate the item parameters
# fit the 3PL model to all dichotomous items and the GRM to all polytomous items;
# fix the five 3PL items (1st-5th items) and the three GRM items (53rd-55th items);
# also, estimate the empirical histogram of the latent variable
fix.loc <- c(1:5, 53:55)
(mod.fix1 <- est_irt(x=x, data=sim.dat, D=1, use.gprior=TRUE,
                     gprior=list(dist="beta", params=c(5, 16)), EmpHist=TRUE,
                     Etol=1e-3, fipc=TRUE, fipc.method="MEM", fix.loc=fix.loc))
summary(mod.fix1)

# plot the estimated empirical histogram of the latent variable prior distribution
(emphist <- getirt(mod.fix1, what="weights"))
plot(emphist$weight ~ emphist$theta, xlab="Theta", ylab="Density")

##----------------------------------------------------------------------------
# 2. The example code below shows how to prepare the data sets and how to
#    estimate the item parameters using the fixed ability parameter
#    calibration (FAPC):
##----------------------------------------------------------------------------

## Step 1: Prepare a data set
## In this example, we generate examinees' true proficiency parameters and
## simulate the item response data using the function "simdat". Because the
## true proficiency parameters are not known in reality, the true proficiencies
## would be replaced with proficiency estimates for the calibration in practice.
# import the "-prm.txt" output file from flexMIRT
flex_sam <- system.file("extdata", "flexmirt_sample-prm.txt", package = "irtQ")

# select the item metadata
x <- bring.flexmirt(file=flex_sam, "par")$Group1$full_df

# modify the item metadata so that some items follow the 1PLM, 2PLM, and GPCM
x[c(1:3, 5), 3] <- "1PLM"
x[c(1:3, 5), 4] <- 1
x[c(1:3, 5), 6] <- 0
x[c(4, 8:12), 3] <- "2PLM"
x[c(4, 8:12), 6] <- 0
x[54:55, 3] <- "GPCM"

# generate examinees' abilities from N(0, 1)
set.seed(23)
score <- rnorm(500, mean=0, sd=1)

# simulate the response data
data <- simdat(x=x, theta=score, D=1)

## Step 2: Estimate the item parameters
# 1) item parameter estimation: constrain the slope parameters of the 1PLM
#    to be equal
(mod1 <- est_item(x, data, score, D=1, fix.a.1pl=FALSE, use.gprior=TRUE,
                  gprior=list(dist="beta", params=c(5, 17)), use.startval=FALSE))
summary(mod1)

# 2) item parameter estimation: fix the slope parameters of the 1PLM to 1
(mod2 <- est_item(x, data, score, D=1, fix.a.1pl=TRUE, a.val.1pl=1,
                  use.gprior=TRUE, gprior=list(dist="beta", params=c(5, 17)),
                  use.startval=FALSE))
summary(mod2)

# 3) item parameter estimation: fix the guessing parameters of the 3PLM to 0.2
(mod3 <- est_item(x, data, score, D=1, fix.a.1pl=TRUE, fix.g=TRUE, a.val.1pl=1,
                  g.val=.2, use.startval=FALSE))
summary(mod3)
IRT Models
In the irtQ package, both dichotomous and polytomous IRT models are available. For dichotomous items, IRT one-, two-, and three-parameter logistic models (1PLM, 2PLM, and 3PLM) are used. For polytomous items, the graded response model (GRM) and the (generalized) partial credit model (GPCM) are used. Note that the item discrimination (or slope) parameters should be fixed to 1 when the partial credit model is fit to data.
In the following, let Y be the response of an examinee with latent ability \theta on an item, and suppose that there are K unique score categories for each polytomous item.

IRT 1-3PL models

For the IRT 1-3PL models, the probability that an examinee with ability \theta provides a correct answer to an item is given by

P(Y = 1 | \theta) = g + \frac{(1 - g)}{1 + exp(-Da(\theta - b))},

where a is the item discrimination (or slope) parameter, b represents the item difficulty parameter, and g refers to the item guessing parameter. D is a scaling factor used to make the logistic function as close as possible to the normal ogive function; this is achieved when D = 1.702. When the 1PLM is used, a is either fixed to a constant value (e.g., a = 1) or constrained to have the same value across all 1PLM items. When the IRT 1PLM or 2PLM is fit to data, g is set to 0.

GRM

For the GRM, the probability that an examinee with latent ability \theta responds in score category k (k = 0, 1, ..., K-1) of an item is given by

P(Y = k | \theta) = P^{*}(Y \ge k | \theta) - P^{*}(Y \ge k + 1 | \theta),

P^{*}(Y \ge k | \theta) = \frac{1}{1 + exp(-Da(\theta - b_{k}))}, and

P^{*}(Y \ge k + 1 | \theta) = \frac{1}{1 + exp(-Da(\theta - b_{k+1}))},

where P^{*}(Y \ge k | \theta) refers to the category boundary (threshold) function for score category k of the item; its formula is analogous to that of the 2PLM. b_{k} is the difficulty (or threshold) parameter for category boundary k of the item. Note that P(Y = 0 | \theta) = 1 - P^{*}(Y \ge 1 | \theta) and P(Y = K-1 | \theta) = P^{*}(Y \ge K-1 | \theta).

GPCM

For the GPCM, the probability that an examinee with latent ability \theta responds in score category k (k = 0, 1, ..., K-1) of an item is given by

P(Y = k | \theta) = \frac{exp(\sum_{v=0}^{k}{Da(\theta - b_{v})})}{\sum_{h=0}^{K-1}{exp(\sum_{v=0}^{h}{Da(\theta - b_{v})})}},

where b_{v} is the difficulty parameter for category boundary v of the item. In other contexts, the difficulty parameter b_{v} can also be parameterized as b_{v} = \beta - \tau_{v}, where \beta refers to the location (or overall difficulty) parameter and \tau_{v} represents the threshold parameter for score category v of the item. In the irtQ package, K-1 difficulty parameters are necessary when an item has K unique score categories because b_{0} = 0. When the partial credit model is fit to data, a is fixed to 1. A minimal R sketch of these three response functions is given below.
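The following minimal sketch (not the package's internal code) evaluates the three response functions directly from the formulas above; the function names are made up for this illustration, and theta is taken to be a scalar:

# 3PLM: probability of a correct response
p_3pl <- function(theta, a, b, g = 0, D = 1) {
  g + (1 - g) / (1 + exp(-D * a * (theta - b)))
}

# GRM: probabilities of score categories 0, ..., K-1, given K-1 thresholds b
p_grm <- function(theta, a, b, D = 1) {
  # boundary probabilities P*(Y >= k), with P*(Y >= 0) = 1 and P*(Y >= K) = 0
  bnd <- c(1, 1 / (1 + exp(-D * a * (theta - b))), 0)
  # category probabilities are differences of adjacent boundaries
  bnd[-length(bnd)] - bnd[-1]
}

# GPCM: probabilities of score categories 0, ..., K-1, given K-1 step
# parameters b; set a = 1 for the partial credit model. The v = 0 term of the
# sum is taken as 0 here: using b_0 = 0 instead only multiplies every
# numerator by exp(D*a*theta), so the probabilities are unchanged.
p_gpcm <- function(theta, a, b, D = 1) {
  num <- exp(cumsum(c(0, D * a * (theta - b))))  # cumulative sums over v = 0..k
  num / sum(num)
}

# e.g., evaluated at theta = 0.5 with made-up parameter values
p_3pl(0.5, a = 1.2, b = -0.3, g = 0.2, D = 1.702)
p_grm(0.5, a = 1.1, b = c(-1.0, 0.5))   # three-category item
p_gpcm(0.5, a = 1.0, b = c(-0.8, 0.4))  # three-category item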
Author(s)
Hwanggyu Lim hglim83@gmail.com
References
Ames, A. J., & Penfield, R. D. (2015). An NCME Instructional Module on Item-Fit Statistics for Item Response Theory Models. Educational Measurement: Issues and Practice, 34(3), 39-48.
Baker, F. B., & Kim, S. H. (2004). Item response theory: Parameter estimation techniques. CRC Press.
Ban, J. C., Hanson, B. A., Wang, T., Yi, Q., & Harris, D. J. (2001). A comparative study of on-line pretest item calibration/scaling methods in computerized adaptive testing. Journal of Educational Measurement, 38(3), 191-212.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397-479). Reading, MA: Addison-Wesley.
Bock, R. D. (1960). Methods and applications of optimal scaling. Chapel Hill, NC: L. L. Thurstone Psychometric Laboratory.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6(4), 431-444.
Cai, L. (2017). flexMIRT 3.5 Flexible multilevel multidimensional item analysis and test scoring [Computer software]. Chapel Hill, NC: Vector Psychometric Group.
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1-29.
Chen, P., & Wang, C. (2016). A new online calibration method for multidimensional computerized adaptive testing. Psychometrika, 81(3), 674-701.
González, J. (2014). SNSequate: Standard and nonstandard statistical models and methods for test equating. Journal of Statistical Software, 59, 1-30.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Han, K. T. (2016). Maximum likelihood score estimation method with fences for short-length tests and computerized adaptive tests. Applied Psychological Measurement, 40(4), 289-301.
Howard, J. P. (2017). Computational methods for numerical analysis with R. New York: Chapman and Hall/CRC.
Kang, T., & Chen, T. T. (2008). Performance of the generalized S-X2 item fit index for polytomous IRT models. Journal of Educational Measurement, 45(4), 391-406.
Kim, S. (2006). A comparative study of IRT fixed parameter calibration methods. Journal of Educational Measurement, 43(4), 355-381.
Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking (2nd ed.). New York: Springer.
Kolen, M. J. & Tong, Y. (2010). Psychometric properties of IRT proficiency estimates. Educational Measurement: Issues and Practice, 29(3), 8-14.
Laplace, P. S. (1820). Theorie analytique des probabilites (in French). Courcier.
Li, Y., & Lissitz, R. (2004). Applications of the analytically derived asymptotic standard errors of item response theory item parameter estimates. Journal of Educational Measurement, 41(2), 85-117.
Lim, H., & Choe, E. M. (2023). Detecting differential item functioning in CAT using IRT residual DIF approach. Journal of Educational Measurement. doi:10.1111/jedm.12366.
Lim, H., Choe, E. M., & Han, K. T. (2022). A residual-based differential item functioning detection framework in item response theory. Journal of Educational Measurement, 59(1), 80-104. doi:10.1111/jedm.12313.
Lim, H., Zhu, D., Choe, E. M., & Han, K. T. (2023, April). Detecting differential item functioning among multiple groups using IRT residual DIF framework. Paper presented at the Annual Meeting of the National Council on Measurement in Education. Chicago, IL.
Lim, H., Davey, T., & Wells, C. S. (2020). A recursion-based analytical approach to evaluate the performance of MST. Journal of Educational Measurement. DOI: 10.1111/jedm.12276.
Lord, F. & Wingersky, M. (1984). Comparison of IRT true score and equipercentile observed score equatings. Applied Psychological Measurement, 8(4), 453-461.
Magis, D., & Barrada, J. R. (2017). Computerized adaptive testing with R: Recent updates of the package catR. Journal of Statistical Software, 76, 1-19.
McKinley, R., & Mills, C. (1985). A comparison of several goodness-of-fit statistics. Applied Psychological Measurement, 9, 49-57.
Meilijson, I. (1989). A fast improvement to the EM algorithm on its own terms. Journal of the Royal Statistical Society: Series B (Methodological), 51, 127-138.
Muraki, E. & Bock, R. D. (2003). PARSCALE 4: IRT item analysis and test scoring for rating scale data [Computer Program]. Chicago, IL: Scientific Software International. URL http://www.ssicentral.com
Newcombe, R. G. (1998). Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in medicine, 17(8), 857-872.
Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24(1), 50-64.
Orlando, M., & Thissen, D. (2003). Further investigation of the performance of S-X2: An item fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27(4), 289-298.
Pritikin, J. (2018). rpf: Response Probability Functions. R package version 0.59. https://CRAN.R-project.org/package=rpf.
Pritikin, J. N., & Falk, C. F. (2020). OpenMx: A modular research environment for item response theory method development. Applied Psychological Measurement, 44(7-8), 561-562.
Stocking, M. L. (1996). An alternative method for scoring adaptive tests. Journal of Educational and Behavioral Statistics, 21(4), 365-389.
Stocking, M. L. (1988). Scale drift in on-line calibration (Research Rep. 88-28). Princeton, NJ: ETS.
Thissen, D. (1982). Marginal maximum likelihood estimation for the one-parameter logistic model. Psychometrika, 47, 175-186.
Thissen, D., & Wainer, H. (1982). Some standard errors in item response theory. Psychometrika, 47(4), 397-412.
Thissen, D., Pommerich, M., Billeaud, K., & Williams, V. S. (1995). Item Response Theory for Scores on Tests Including Polytomous Items with Ordered Responses. Applied Psychological Measurement, 19(1), 39-49.
Thissen, D. & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & H. Wainer (Eds.), Test scoring (pp.73-140). Mahwah, NJ: Lawrence Erlbaum.
Wainer, H., & Mislevy, R. J. (1990). Item response theory, item calibration, and proficiency estimation. In H. Wainer (Ed.), Computer adaptive testing: A primer (Chap. 4, pp.65-102). Hillsdale, NJ: Lawrence Erlbaum.
Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54(3), 427-450.
Weeks, J. P. (2010). plink: An R Package for Linking Mixed-Format Tests Using IRT-Based Methods. Journal of Statistical Software, 35(12), 1-33. URL http://www.jstatsoft.org/v35/i12/.
Wells, C. S., & Bolt, D. M. (2008). Investigation of a nonparametric procedure for assessing goodness-of-fit in item response theory. Applied Measurement in Education, 21(1), 22-40.
Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158), 209-212.
Woods, C. M. (2007). Empirical histograms in item response theory with ordinal data. Educational and Psychological Measurement, 67(1), 73-87.
Yen, W. M. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245-262.
Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG 3: Multiple-group IRT analysis and test maintenance for binary items [Computer Program]. Chicago, IL: Scientific Software International. URL http://www.ssicentral.com