lsa.bin.log.reg {RALSA} | R Documentation |
Compute binary logistic regression coefficients within specified groups
Description
lsa.bin.log.reg computes binary logistic regression coefficients within groups defined by one or more variables.
Usage
lsa.bin.log.reg(
data.file,
data.object,
split.vars,
bin.dep.var,
bckg.indep.cont.vars,
bckg.indep.cat.vars,
bckg.cat.contrasts,
bckg.ref.cats,
PV.root.indep,
interactions,
standardize = FALSE,
weight.var,
norm.weight = FALSE,
include.missing = FALSE,
shortcut = FALSE,
save.output = TRUE,
output.file,
open.output = TRUE
)
Arguments
data.file |
The file containing |
data.object |
The object in the memory containing |
split.vars |
Categorical variable(s) to split the results by. If no split variables are provided, the results will be for the overall countries' populations. If one or more variables are provided, the results will be split by all but the last variable and the percentages of respondents will be computed by the unique values of the last splitting variable. |
bin.dep.var |
Name of a binary (i.e. just two distinct values) background or contextual variable used as a dependent variable in the model. See details. |
bckg.indep.cont.vars |
Names of continuous independent background or contextual variables used as predictors in the model. See details. |
bckg.indep.cat.vars |
Names of categorical independent background or contextual variables
used as predictors in the model to compute contrasts for (see
|
bckg.cat.contrasts |
String vector with the same length as the length of
|
bckg.ref.cats |
Vector of integers with the same length as the length of
|
PV.root.indep |
The root names for a set of plausible values used as independent variables in the model. See details. |
interactions |
Interaction terms - a list containing vectors of length of two. See details. |
standardize |
Shall the dependent and independent variables be standardized to
produce beta coefficients? The default is |
weight.var |
The name of the variable containing the weights. If no name of a weight variable is provided, the function will automatically select the default weight variable for the provided data, depending on the respondent type. |
norm.weight |
Shall the weights be normalized before applying them, default is
|
include.missing |
Logical, shall the missing values of the splitting variables be
included as categories to split by and all statistics produced for
them? The default ( |
shortcut |
Logical, shall the "shortcut" method for IEA TIMSS, TIMSS Advanced,
TIMSS Numeracy, eTIMSS PSI, PIRLS, ePIRLS, PIRLS Literacy and RLII
be applied? The default ( |
save.output |
Logical, shall the output be saved in an MS Excel file (default) or not (printed to the console or assigned to an object). |
output.file |
If |
open.output |
Logical, shall the output be opened after it has been written? The
default ( |
Details
Either data.file or data.object shall be provided as source of data. If both of them are provided, the function will stop with an error message.
The function computes binary logistic regression coefficients by the categories of the splitting variables. The percentages of respondents in each group are computed within the groups specified by the last splitting variable. If no splitting variables are added, the results will be computed only by country.
If standardize = TRUE, the variables will be standardized before computing any statistics to provide beta regression coefficients.
A binary (i.e. dichotomous) background/contextual variable (numeric or factor) must be provided to bin.dep.var. If more than two categories exist in the variable, the function will exit with an error. The function automatically recodes the two categories of bin.dep.var to 0 and 1 if they are not coded as such (e.g. coded as 1 and 2, as in factors). If the variable of interest has more than two distinct values (lsa.vars.dict can be used to see them), they can be collapsed using lsa.recode.vars.
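The recoding described above can be illustrated with a minimal base R sketch. It does not reproduce the internal code of the function; the variable and its labels are hypothetical and only show the idea of checking for exactly two valid categories and mapping them to 0 and 1.

```r
# Hypothetical two-category factor, e.g. a recoded "agree/disagree" item.
x <- factor(c("Disagree", "Agree", "Agree", "Disagree", NA),
            levels = c("Disagree", "Agree"))

# A binary dependent variable must have exactly two distinct valid values.
n_valid <- length(unique(x[!is.na(x)]))
if (n_valid != 2) {
  stop("The dependent variable must have exactly two categories.")
}

# Factors are stored internally as 1 and 2; subtracting 1 yields 0 and 1.
# Missing values stay missing.
y <- as.integer(x) - 1L
y
```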
Background/contextual variables passed to bckg.indep.cont.vars will be treated as numeric variables in the model. Variables with a discrete number of categories (i.e. factors) passed to bckg.indep.cat.vars will be used to compute contrasts. In this case the type of contrast has to be passed to bckg.cat.contrasts and the number of the reference category for each of the bckg.indep.cat.vars has to be passed to bckg.ref.cats. The number of contrast types and the number of reference categories must be the same as the number of bckg.indep.cat.vars. The currently supported contrast coding schemes are:
- dummy (also called "indicator" in logistic regression) - the odds ratios show how the odds of a positive (i.e. 1) outcome rather than a negative (i.e. 0) outcome in the binary dependent variable for each category of a variable in bckg.indep.cat.vars compare to the odds for the reference category of that dummy coded variable. The intercept shows the log of the odds for the reference category when all other levels are 0.
- deviation (also called "effect" in logistic regression) - comparing the effect of each category (except for the reference) of the deviation coded variable to the overall effect (which is the intercept).
- simple - the same as for the dummy contrast coding, except for the intercept which in this case is the overall effect.
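For orientation, the three coding schemes can be sketched with base R contrast matrices for a hypothetical three-category variable. This only illustrates the general coding schemes (following the UCLA reference listed below); it is an assumption that the function builds its contrasts in an equivalent way.

```r
k <- 3  # number of categories of a hypothetical categorical predictor

# Dummy (indicator) coding with the first category as the reference:
# the reference row is all zeros.
dummy <- contr.treatment(k, base = 1)

# Deviation (effect) coding. Note that contr.sum() always treats the LAST
# category as the reference (the row coded -1).
deviation <- contr.sum(k)

# Simple coding: dummy coding with 1/k subtracted from every cell. The slopes
# are the same comparisons as with dummy coding, but each column sums to zero,
# so the intercept refers to the overall effect instead of the reference group.
simple <- contr.treatment(k, base = 1) - 1 / k

dummy
deviation
simple
```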
Note that when using standardize = TRUE, the contrast coding of bckg.indep.cat.vars is not standardized. Thus, the regression coefficients may not be comparable to other software solutions for analyzing large-scale assessment data which rely on, for example, SPSS or SAS where the contrast coding of categorical variables (e.g. dummy coding) takes place by default. However, the model statistics will be identical.
Multiple continuous or categorical background variables and/or sets of plausible values can be provided to compute regression coefficients for. Please note that in this case the results will slightly differ compared to using each pair of the same background continuous variables or PVs in separate analyses. This is because the cases with missing values are removed in advance and the more variables are provided, the more cases are likely to be removed. That is, the function supports only listwise deletion.
Computation of regression coefficients involving plausible values requires providing a root of the plausible values names in PV.root.indep. All studies (except CivED, TEDS-M, SITES, TALIS and TALIS Starting Strong Survey) have a set of PVs per construct (e.g. in TIMSS five for overall mathematics, five for algebra, five for geometry, etc.). In some studies (say TIMSS and PIRLS) the names of the PVs in a set always start with a character string and end with the sequential number of the PV. For example, the names of the set of PVs for overall mathematics in TIMSS are BSMMAT01, BSMMAT02, BSMMAT03, BSMMAT04 and BSMMAT05. The root of the PVs for this set to be added to PV.root.indep will be "BSMMAT". The function will automatically find all the variables in this set of PVs and include them in the analysis. In other studies like OECD PISA and IEA ICCS and ICILS the sequential number of each PV is included in the middle of the name. For example, in ICCS the names of the set of PVs are PV1CIV, PV2CIV, PV3CIV, PV4CIV and PV5CIV. The root PV name has to be specified in PV.root.indep as "PV#CIV". More than one set of PVs can be added in PV.root.indep.
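The two naming conventions for PV roots can be sketched as follows. The helper find_pvs() is hypothetical and not part of the package; it only shows how a root such as "BSMMAT" or "PV#CIV" could be matched against the variable names in a data set.

```r
# Hypothetical helper (not a RALSA function): return the names of all
# plausible values in 'var_names' that belong to the PV root 'root'.
find_pvs <- function(root, var_names) {
  if (grepl("#", root, fixed = TRUE)) {
    # "PV#CIV" style: the PV number replaces the "#" placeholder.
    pattern <- paste0("^", sub("#", "[0-9]+", root, fixed = TRUE), "$")
  } else {
    # "BSMMAT" style: the PV number is appended to the root.
    pattern <- paste0("^", root, "[0-9]+$")
  }
  grep(pattern, var_names, value = TRUE)
}

# Hypothetical variable names mixing both conventions.
vars <- c("IDSTUD", sprintf("BSMMAT%02d", 1:5), sprintf("BSMALG%02d", 1:5),
          paste0("PV", 1:5, "CIV"))

find_pvs("BSMMAT", vars)
find_pvs("PV#CIV", vars)
```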
The function can also compute two-way interaction effects between independent variables by passing a list to the interactions argument. The list must contain vectors of length two and all variables in these vectors must also be passed as independent variables (see the examples). Note the following:
- When an interaction is between two independent background continuous variables (i.e. both are passed to bckg.indep.cont.vars), the interaction effect will be computed between them as they are.
- When the interaction is between two categorical variables (i.e. both are passed to bckg.indep.cat.vars), the interaction effect will be computed between each possible pair of categories of the two variables, except for the reference categories.
- When the interaction is between one continuous (i.e. passed to bckg.indep.cont.vars) and one categorical (i.e. passed to bckg.indep.cat.vars), the interaction effect will be computed between the continuous variable and each category of the categorical variable, except for the reference category.
- When the interaction is between a continuous variable (i.e. passed to bckg.indep.cont.vars) and a set of PVs (i.e. passed to PV.root.indep), the interaction effect is computed between the continuous variable and each PV in the set and the results are aggregated.
- When the interaction is between a categorical variable (i.e. passed to bckg.indep.cat.vars) and a set of PVs (i.e. passed to PV.root.indep), the interaction effect is computed between each category of the categorical variable (except the reference category) and each PV in the set. The results are aggregated for each of the categories of the categorical variables and the set of PVs.
- When the interaction is between two sets of PVs (i.e. passed to PV.root.indep), the interaction effect is computed between the first PV in the first set and the first PV in the second set, the second PV in the first set and the second PV in the second set, and so on. The results are then aggregated.
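The case of a continuous variable interacting with a set of PVs can be sketched in base R: one model is fitted per PV and the estimates are then combined. The sketch rests on several assumptions: the data are simulated, no weights or replicate weights are used, the model-based variance stands in for the sampling variance, and the combination rule (mean of the variances plus (1 + 1/M) times the variance of the estimates across PVs) is the one commonly used with plausible values, not necessarily the exact computation of the function.

```r
set.seed(1)
n <- 500
M <- 5  # number of plausible values in the hypothetical set

x <- rnorm(n)        # hypothetical continuous predictor
ability <- rnorm(n)  # latent trait behind the hypothetical PVs
pvs <- sapply(seq_len(M), function(i) ability + rnorm(n, sd = 0.5))
y <- rbinom(n, 1, plogis(0.3 * x + 0.5 * ability))

# Fit one logistic regression per PV, each including the x-by-PV interaction.
fits <- lapply(seq_len(M), function(i) {
  pv <- pvs[, i]
  glm(y ~ x * pv, family = binomial)
})

est <- sapply(fits, coef)                           # one column per PV
var_est <- sapply(fits, function(f) diag(vcov(f)))  # per-PV variances

coef_pooled <- rowMeans(est)             # aggregated coefficients
svr <- rowMeans(var_est)                 # stand-in for the sampling variance
mvr <- (1 + 1 / M) * apply(est, 1, var)  # measurement variance across PVs
se_pooled <- sqrt(svr + mvr)

round(cbind(coef_pooled, se_pooled), 3)
```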
If norm.weight = TRUE, the weights will be normalized before being used in the model. This may be necessary in some countries in some studies where extreme weights for some of the cases may result in inflated estimates due to perfect separation in the model. The consequence of normalizing weights is that the number of elements in the population will sum to the number of cases in the sample. Use with caution.
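A minimal sketch of what normalizing weights means, using hypothetical weight values: each weight is rescaled so that the weights sum to the number of cases while their relative sizes stay the same. The function may normalize within each group or country; that detail is not shown here.

```r
w <- c(120.5, 80.2, 300.0, 45.3, 210.7)  # hypothetical sampling weights

# Rescale so that the weights sum to the sample size.
w_norm <- w * length(w) / sum(w)

sum(w)       # estimated population size
sum(w_norm)  # equals the number of cases
```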
If include.missing = FALSE (default), all cases with missing values on the splitting variables will be removed and only cases with valid values will be retained in the statistics. Note that the data from the studies can be exported in two different ways: (1) setting all user-defined missing values to NA; and (2) importing all user-defined missing values as valid ones and adding their codes in an additional attribute to each variable. If include.missing is set to FALSE (default) and the data used were exported using option (2), the output will remove all values from the variable matching the values in its missings attribute. Otherwise, it will include them as valid values and compute statistics for them.
The shortcut argument is valid only for TIMSS, eTIMSS PSI, TIMSS Advanced, TIMSS Numeracy, PIRLS, ePIRLS, PIRLS Literacy and RLII. Previously, in computing the standard errors, these studies used 75 replicates: in each of the 75 JK zones one of the schools had its weights doubled and the other one was taken out. Since TIMSS 2015 and PIRLS 2016 the studies use 150 replicates: in each JK zone a school once has its weights doubled and once is taken out, i.e. the computations are done twice for each zone. For more details see LaRoche, Joncas, & Foy (2016) and LaRoche, Joncas, & Foy (2017). If replication of the tables and figures is needed, the shortcut argument has to be changed to TRUE.
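The difference between the two approaches can be sketched with a toy example of three JK zones with two schools each. The names jkzone, jkrep and w, the helper make_reps(), and the choice of which school is doubled under the shortcut are assumptions made for illustration only; the sketch does not reproduce the code of the function.

```r
# Toy data: 3 JK zones, each with two schools flagged 0/1.
d <- data.frame(
  jkzone = c(1, 1, 2, 2, 3, 3),
  jkrep  = c(0, 1, 0, 1, 0, 1),
  w      = c(10, 12, 8, 9, 15, 14)
)

# Hypothetical helper: build the matrix of replicate weights.
make_reps <- function(d, shortcut) {
  zones <- sort(unique(d$jkzone))
  # One replicate per zone: inside the zone the school flagged 'doubled'
  # gets twice its weight and the other one gets zero; outside the zone
  # the weights are unchanged.
  one_set <- function(doubled) {
    sapply(zones, function(h) {
      in_zone <- d$jkzone == h
      ifelse(in_zone, ifelse(d$jkrep == doubled, 2 * d$w, 0), d$w)
    })
  }
  if (shortcut) one_set(1) else cbind(one_set(1), one_set(0))
}

ncol(make_reps(d, shortcut = TRUE))   # one replicate per zone
ncol(make_reps(d, shortcut = FALSE))  # two replicates per zone
```

With 75 zones the same logic would give 75 replicates under the shortcut and 150 otherwise.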
The function provides two-tailed t-test and p-values for the regression coefficients.
Value
If save.output = FALSE, a list containing the estimates and analysis information. If save.output = TRUE (default), an MS Excel (.xlsx) file (which can be opened in any spreadsheet program), as specified with the full path in the output.file. If the argument is missing, an Excel file with the generic file name "Analysis.xlsx" will be saved in the working directory (getwd()). The workbook contains four spreadsheets. The first one ("Estimates") contains a table with the results by country and the final part of the table contains averaged results from all countries' statistics. The following columns can be found in the table, depending on the specification of the analysis:
- <Country ID> - a column containing the names of the countries in the file for which statistics are computed. The exact column header will depend on the country identifier used in the particular study.
- <Split variable 1>, <Split variable 2>... - columns containing the categories by which the statistics were split. The exact names will depend on the variables in split.vars.
- n_Cases - the number of cases in the sample used to compute the statistics.
- Sum_<Weight variable> - the estimated population number of elements per group after applying the weights. The actual name of the weight variable will depend on the weight variable used in the analysis.
- Sum_<Weight variable>_SE - the standard error of the estimated population number of elements per group. The actual name of the weight variable will depend on the weight variable used in the analysis.
- Percentages_<Last split variable> - the percentages of respondents (population estimates) per groups defined by the splitting variables in split.vars. The percentages will be for the last splitting variable which defines the final groups.
- Percentages_<Last split variable>_SE - the standard errors of the percentages from above.
- Variable - the variable names (background/contextual or PV root names, or contrast coded variable names).
- Coefficients - the logistic regression coefficients (intercept and slopes).
- Coefficients_SE - the standard error of the logistic regression coefficients (intercepts and slopes) for each independent variable (background/contextual or PV root names, or contrast coded variable names) in the model.
- Coefficients_SVR - the sampling variance component for the logistic regression coefficients if root PVs are specified as independent variables.
- Coefficients_<root PV>_MVR - the measurement variance component for the logistic regression coefficients if root PVs are specified as independent variables.
- Wald_Statistic - Wald (z) statistic for each coefficient.
- p_value - the p-value for the regression coefficients.
- Odds_Ratio - the odds ratios of the logistic regression.
- Odds_Ratio_SE - the standard errors for the odds ratios of the logistic regression.
- Wald_L95CI - the lower bound of the 95% model-based confidence interval for the logistic regression coefficients.
- Wald_U95CI - the upper bound of the 95% model-based confidence interval for the logistic regression coefficients.
- Odds_L95CI - the lower bound of the 95% model-based confidence interval for the odds ratios.
- Odds_U95CI - the upper bound of the 95% model-based confidence interval for the odds ratios.
When interaction terms are included, the cells with the interactions in the Variable column will contain the names of the two variables in each of the interaction terms, divided by a colon, e.g. ASBGSSB:ASBGHRL.
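The relation between a coefficient, its standard error and the derived columns can be sketched as below. The numbers are hypothetical, and the sketch assumes a standard normal reference distribution for the Wald statistic and the usual exponentiation of the coefficient limits to obtain the odds ratio limits; the function may differ in these details.

```r
b  <- 0.42  # hypothetical logistic regression coefficient (log-odds)
se <- 0.15  # hypothetical standard error of the coefficient

wald    <- b / se                 # Wald (z) statistic
p_value <- 2 * pnorm(-abs(wald))  # two-tailed p-value

z_crit  <- qnorm(0.975)           # critical value for a 95% interval
wald_ci <- c(lower = b - z_crit * se, upper = b + z_crit * se)

odds_ratio <- exp(b)              # odds ratio
odds_ci    <- exp(wald_ci)        # 95% interval for the odds ratio

round(c(wald = wald, p = p_value, OR = odds_ratio, odds_ci), 3)
```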
The second sheet contains the model statistics:
- <Country ID> - a column containing the names of the countries in the file for which statistics are computed. The exact column header will depend on the country identifier used in the particular study.
- <Split variable 1>, <Split variable 2>... - columns containing the categories by which the statistics were split. The exact names will depend on the variables in split.vars.
- Statistic - a column containing the Null Deviance (-2LL, no predictors in the model, just constant, also called "baseline"), Deviance (-2LL, after adding predictors, residual deviance, also called "new"), DF Null (degrees of freedom for the null deviance), DF Residual (degrees of freedom for the residual deviance), Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), model Chi-Square, different R-Squared statistics (Hosmer & Lemeshow - HS, Cox & Snell - CS, and Nagelkerke - N).
- Estimate - the numerical estimates for each of the above.
- Estimate_SE - the standard errors of the estimates from above.
- Estimate_SVR - the sampling variance component if PVs were included in the model.
- Estimate_MVR - the measurement variance component if PVs were included in the model.
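The model statistics listed above can be illustrated with an unweighted base R model on the built-in mtcars data. This is only a sketch of the textbook definitions; it is an assumption that the function uses the same formulas, and it ignores weights, replicate weights and plausible values.

```r
# Unweighted logistic regression on a built-in data set.
fit <- glm(am ~ wt, data = mtcars, family = binomial)
n <- nrow(mtcars)

d0 <- fit$null.deviance  # Null Deviance (-2LL, intercept only)
d1 <- fit$deviance       # Deviance (-2LL, with predictors)

chi_sq <- d0 - d1                    # model Chi-Square
r2_hl  <- chi_sq / d0                # Hosmer & Lemeshow R-Squared
r2_cs  <- 1 - exp((d1 - d0) / n)     # Cox & Snell R-Squared
r2_n   <- r2_cs / (1 - exp(-d0 / n)) # Nagelkerke R-Squared

c(df_null = fit$df.null, df_residual = fit$df.residual,
  AIC = AIC(fit), BIC = BIC(fit))
round(c(chi_sq = chi_sq, HL = r2_hl, CS = r2_cs, N = r2_n), 3)
```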
The third sheet contains some additional information related to the analysis per country in columns:
- DATA - used data.file or data.object.
- STUDY - which study the data comes from.
- CYCLE - which cycle of the study the data comes from.
- WEIGHT - which weight variable was used.
- DESIGN - which resampling technique was used (JRR or BRR).
- SHORTCUT - logical, whether the shortcut method was used.
- NREPS - how many replication weights were used.
- ANALYSIS_DATE - on which date the analysis was performed.
- START_TIME - at what time the analysis started.
- END_TIME - at what time the analysis finished.
- DURATION - how long the analysis took in hours, minutes, seconds and milliseconds.
The fourth sheet contains the call to the function with values for all parameters as it was executed. This is useful if the analysis needs to be replicated later.
References
LaRoche, S., Joncas, M., & Foy, P. (2016). Sample Design in TIMSS 2015. In M. O. Martin, I. V. S. Mullis, & M. Hooper (Eds.), Methods and Procedures in TIMSS 2015 (pp. 3.1-3.37). Chestnut Hill, MA: TIMSS & PIRLS International Study Center.
LaRoche, S., Joncas, M., & Foy, P. (2017). Sample Design in PIRLS 2016. In M. O. Martin, I. V. S. Mullis, & M. Hooper (Eds.), Methods and Procedures in PIRLS 2016 (pp. 3.1-3.34). Chestnut Hill, MA: Lynch School of Education, Boston College.
UCLA: Statistical Consulting Group. 2020. "R LIBRARY CONTRAST CODING SYSTEMS FOR CATEGORICAL VARIABLES." IDRE Stats - Statistical Consulting Web Resources. Retrieved June 16, 2020 (https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/).
Hilbe, J. M. (2015). Practical Guide to Logistic Regression. CRC Press.
See Also
lsa.convert.data, lsa.vars.dict, lsa.recode.vars, lsa.lin.reg
Examples
# Compute logistic regression predicting the log of the odds the students will respond
# "Agree a lot" when asked if teachers are fair (dependent variable, categorical), as a function
# of their own sense of school belonging (independent variable, continuous) using PIRLS 2016
# student data. Because the dependent variable has four categories, it needs to be recoded first
# into a dichotomous one (using lsa.recode.vars).
## Not run:
lsa.recode.vars(data.file = "C:/temp/test.RData", src.variables = "ASBG12D",
old.new = "1=2;2=2;3=1;4=1;5=3", new.variables = "ASBG12Dr",
new.labels = c("Disagree", "Agree", "Omitted or invalid"),
missings.attr = "Omitted or invalid",
variable.labels = "GEN/AGREE/TEACHERS ARE FAIR - RECODED",
out.file = "C:/temp/test.RData")
lsa.bin.log.reg(data.file = "C:/temp/test.RData", split.vars = "ASBG01",
bin.dep.var = "ASBG12Dr", bckg.indep.cont.vars = "ASBGSSB")
## End(Not run)
# Perform the same analysis from above, this time use the overall student reading achievement
# as a predictor.
## Not run:
lsa.bin.log.reg(data.object = test, split.vars = "ASBG01",
bin.dep.var = "ASBG12Dr", PV.root.indep = "ASRREA")
## End(Not run)
# Compute binary logistic regression with interaction terms using PIRLS 2016 student data.
## Not run:
lsa.bin.log.reg(data.file = "C:/temp/test.RData", bin.dep.var = "ASBG05B",
bckg.indep.cont.vars = "ASBGSSB", bckg.indep.cat.vars = c("ASBG01", "ASBG12B"),
PV.root.indep = c("ASRREA", "ASRLIT"),
interactions = list(c("ASBG12B", "ASBGSSB"), c("ASBG01", "ASRLIT")))
## End(Not run)