Read {lessR} | R Documentation |
Read Contents of a Data File with Optional Variable Labels and Feedback
Description
Abbreviation: rd, rd_lbl, Read2
Reads the contents of the specified data file into an R data table, what R calls a data frame. By default the format of the file is detected from its filetype: comma or tab separated value text file from .csv, SPSS data file from .sav, SAS data file from .sas7bdat, R data file from .rda, Excel file from .xls or .xlsx using Alexander Walker's openxlsx package, or ODS file from .ods using the readODS package by Gerrit-Jan Schutten, Chung-hong Chan, and other contributors. Specify a fixed width formatted text data file to be read with the required R widths option. Identify the data file either by browsing for the file on the local computer system with Read(), or by providing as the first argument a character string in the form of a path name or a web URL (except for .rda files, which must be on the local computer system).
Any variable labels in a native SPSS file are automatically included in the data file. See the details section below for more information. Variable labels can also be added and modified individually with the lessR function label, and more comprehensively with the VariableLabels function.
The function provides feedback regarding the data that is read by invoking the lessR function details. The brief form of this function, invoked by default, lists only the input files, the variable name table, and any variable labels.
The lessR function corRead reads a correlation matrix.
Usage
Read(from=NULL, format=NULL, var_labels=FALSE, widths=NULL,
missing="", n_mcut=1,
miss_show=30, miss_zero=FALSE, miss_matrix=FALSE,
max_lines=30, sheet=1, row_names=NULL,
brief=TRUE, quiet=getOption("quiet"),
fun_call=NULL, ...)
rd(...)
rd_lbl(..., var_labels=TRUE)
Read2(..., sep=";", dec=",")
Arguments
from |
File reference included in quotes, either empty to browse
for the data file, a full path name or web URL, or the name of a
data file included with lessR, such as |
format |
Format of the data in the file, not usually specified because set
by default according to the
file type of the file to read: |
var_labels |
Set |
widths |
Specifies the width of the successive columns for fixed width formatted data. |
missing |
Missing value code, which by default is literally a missing data value in the data table. |
n_mcut |
For the missing value analysis, list the row name and number of
missing values if the number of missing exceeds or equals this cutoff.
Requires |
miss_show |
For the missing value analysis, the number of rows, one row per
observation, that has as many or more missing values as |
miss_zero |
For the missing value analysis, list the variable name or the
row name even for values of 0, that is rows with no missing data.
By default only variables and rows with missing data are listed.
Requires |
miss_matrix |
For the missing value analysis, if there is any missing data, list a version of the complete data table with a 0 for a non-missing value and a 1 for a missing value. |
sep |
Character that separates adjacent values in a text file of data. |
dec |
Character that serves as the decimal separator in a number. |
max_lines |
Maximum number of lines to list of the data and labels. |
sheet |
For Excel files, specifies the work sheet to read. Provide either the worksheet number according to its position, or its name enclosed in quotes. The default is the first work sheet. |
row_names |
|
brief |
If |
quiet |
If set to |
fun_call |
Function call. Used with |
... |
Other parameter values defined with the R read functions, such as the
|
Details
By default Read reads text data files which are either comma delimited, csv, or tab-delimited data files, native Excel files of type .xls or .xlsx, native ODS files of type .ods, native R files with file type of .rda, native SAS files with file type .sas7bdat, and native SPSS files with file type .sav. Invoke the widths option to allow for the reading of fixed width formatted data. Calls the lessR function details to provide feedback regarding details of the data frame that was read. By default, variables defined by non-numeric values are read as character strings. To read them as factors, specify stringsAsFactors as TRUE, unless all the values of a variable are non-numeric and unique, in which case the variable is classified as a character string.
CREATE csv FILE
One way to create a csv data file is to enter the data into a text editor. A more structured method is to use a worksheet application such as MS Excel, LibreOffice Calc, or Apple Numbers. Place the variable names in the first row of the worksheet. Each column of the worksheet contains the data for the corresponding variable. Each subsequent row contains the data for a specific observation, such as for a person or a company.
Call help(read.table) to view the other R options that can also be implemented from Read.
MECHANICS
Specify the file with the Read function to read the data into a data frame. If no arguments are passed to the function, then interactively browse for the file.
Given a csv data file, or tab-delimited text file, read the data into an R data frame called d with Read. Because Read calls the standard R function read.csv, which serves as a wrapper for read.table, the usual options that work with read.table, such as row.names, also can be passed through the call to Read.
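As a minimal sketch of the pass-through described above, the following calls base R's read.csv directly, not Read itself, on a small temporary file. The file contents and the names csv_path and d are made up for illustration, and the assumption is that Read forwards row.names to read.csv in the same way.

```r
# Write a small, made-up csv file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Years,Salary",
             "Ann,7,53000",
             "Bob,,94000",
             "Cy,15,111000"), csv_path)

# row.names=1 uses the first column as row names instead of as a variable
d <- read.csv(csv_path, row.names = 1)

rownames(d)   # the values of the former Name column
names(d)      # the remaining variables: Years and Salary
d             # the empty Years value for Bob is read as NA
```

With row.names=1 the identifying column no longer counts as a variable, so the data frame holds only the two numeric variables.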
SPSS DATA
Relies upon read_spss from the haven package to read data in the SPSS .sav or .zsav format. If the file has a file type of .sav, that is, the file specification ends in .sav, then the format is automatically set to "SPSS". To invoke this option for a relevant data file of any file type, explicitly specify format="SPSS". Each (usually) integer variable with value labels is converted into two R variables: the original numeric code with the original variable name, and the corresponding factor with the value labels, named with the original name plus the suffix _f. The variable labels are also displayed for copying into a variable label file. See the SPSS section from vignette("Read").
R DATA
Relies upon the standard R function load. By convention only, data files in native R format have a file type of .rda. To read a native R data file, if the file type is .rda, the format is automatically set to "R". To invoke this option for a relevant data file of any file type, explicitly specify format="R". Create a native R data file by saving the current data frame, usually d, with the lessR function Write.
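The underlying mechanism can be sketched with base R alone. This example uses the base function save in place of the lessR function Write, which is a substitution made here only to keep the sketch free of package dependencies; the data values and the names rda_path and loaded are illustrative.

```r
# A small, made-up data frame
d <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Save it in native R format; the .rda file type is a convention only
rda_path <- tempfile(fileext = ".rda")
save(d, file = rda_path)

# Remove the object, then restore it under its saved name with load
rm(d)
loaded <- load(rda_path)   # load returns the names of the restored objects

loaded   # the name of the restored object
str(d)   # the data frame is available again as d
```

Note that load restores the object under the name it had when saved, which is why the data frame reappears as d rather than under a newly assigned name.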
Excel DATA
Relies upon the function read.xlsx from Alexander Walker's openxlsx package. Files with a file type of .xlsx are assigned a format of "Excel". The read.xlsx parameter sheet specifies the ordinal position of the worksheet in the Excel file, with a default value of 1. The row.names parameter can only have a value of 1. Dates stored in Excel as an Excel date type are automatically read as an R Date type. See the help file for read.xlsx for additional parameters, such as sheet for the name or number of the worksheet to read and startRow for the row number for which to start reading data.
lessR DATA
lessR has some data sets included with the package: "BodyMeas", "Cars93", "Employee", "Jackets", "Learn", "Mach4", "Reading", and "StockPrice". Read reads each such data set by specifying its name, such as Read("Employee"). Specify neither a format nor a filetype; just enter the name of the data set.
FIXED WIDTH FORMATTED DATA
Relies upon read.fwf. Applies to data files in which the width of the column of data values of a variable is the same for each data value and there is no delimiter to separate adjacent data values. An example is a data file of Likert scale responses from 1 to 5 on a 50 item survey such that the data consist of 50 columns with no spaces or other delimiter to separate adjacent data values. To read this data set, invoke the widths option of read.fwf.
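A scaled-down sketch of such a file, read with base R's read.fwf directly rather than through Read, may make the widths option concrete. The two data lines, the field layout (a 5-column ID, a 2-column Age, and four 1-column items), and the variable names are all made up for illustration.

```r
# Write two made-up fixed width records: ID (5), Age (2), four items (1 each)
fwf_path <- tempfile(fileext = ".txt")
writeLines(c("00001235142",
             "00002414253"), fwf_path)

# widths gives the number of columns occupied by each successive variable
d <- read.fwf(fwf_path,
              widths = c(5, 2, rep(1, 4)),
              col.names = c("ID", "Age", paste0("Q", 1:4)))

d   # one row per record, six variables
```

The widths vector must account for every column of the record in order, so its sum here equals the 11 characters of each line.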
MISSING DATA
By default, Read provides a list of each variable and each row with the display of the number of associated missing values, indicated by the standard R missing value code NA. When reading the data, Read automatically sets any empty values as missing. Note that this is different from the R default in read.table, in which an empty value for a character string variable is treated as a regular data value. Any other valid value for any data type can be set to missing as well with the missing option. To mimic the standard R default for missing character values, set missing=NA.
To not list the variable name or row name of variables or rows without missing data, invoke the miss_zero=FALSE option, which can appreciably reduce the amount of output for large data sets. To view the entire data table in terms of 0's and 1's for non-missing and missing data, respectively, invoke the miss_matrix=TRUE option.
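The difference from the standard R default, and the kind of counts and 0/1 table described above, can be sketched with base R. This is not the lessR implementation: it uses read.csv with na.strings as an assumed analogue of the missing option, and the data and the names d_default, d, and miss are made up for illustration.

```r
# A made-up csv file with an empty character value and a -99 missing code
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Gender,Score",
             "M,12",
             ",-99",
             "W,15"), csv_path)

# Standard R default: the empty Gender value is kept as "", -99 as a number
d_default <- read.csv(csv_path)

# Treat both empty values and the code -99 as missing
d <- read.csv(csv_path, na.strings = c("", "-99"))

colSums(is.na(d))       # number of missing values for each variable
rowSums(is.na(d))       # number of missing values for each row
miss <- 1L * is.na(d)   # 0 for a non-missing value, 1 for a missing value
miss
```

Comparing d_default with d shows why the choice of missing value codes matters: in the first the second row looks complete, in the second both of its values are NA.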
VARIABLE LABELS
Unlike standard R, lessR provides for variable labels, which can be provided for some or all of the variables in a data frame. Store the variable labels in a separate data frame l. The variable labels file that is read by Read consists of one row for each variable for which a variable label is provided. Each row consists of either two columns, the variable name in the first column and the associated variable label in the second column, or three columns with the third column the variable units. Use the units for enhanced readability in conjunction with the automatic markdown generated by the Rmd parameter for the Regression function. The format of the file can be csv or xlsx. The data frame Read constructs from this input consists of one variable, called label, with the variable names as row names.
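The shape of the resulting label data frame can be sketched with base R. This is only an approximation of what Read constructs: it assumes a two-column csv label file with no header row, and the file contents, the labels, and the placeholder column name "name" are made up for illustration.

```r
# A made-up two-column label file: variable name, then variable label
lbl_path <- tempfile(fileext = ".csv")
writeLines(c("Years,Time of Company Employment",
             "Salary,Annual Salary (USD)"), lbl_path)

# No header row is assumed; the first column becomes the row names,
# leaving a single variable called label
l <- read.csv(lbl_path, header = FALSE, row.names = 1,
              col.names = c("name", "label"),
              stringsAsFactors = FALSE)

names(l)               # the single variable, label
rownames(l)            # the variable names
l["Salary", "label"]   # look up the label for one variable
```

Because the variable names serve as row names, a label is retrieved by indexing the row with the variable name.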
The lessR legacy approach is to store the variable labels directly with the data in the same data frame. The problem with this approach is that any transformations of the data with any function other than lessR transformation functions remove the variable labels. The labels option of the Read statement for reading the variable labels is retained for compatibility.
Reading the data from an SPSS file, however, retains the SPSS variable labels as part of the data file. The lessR data analysis functions will properly process these variable labels, but any non-lessR data transformations will remove the labels from the data frame. To retain the labels, copy them to the l data frame with the VariableLabels function with the name of the data frame as the sole argument.
The lessR functions that provide analysis, such as Histogram for a histogram, automatically include the variable labels in their output, such as the title of a graph. Standard R functions can also use these variable labels by invoking the lessR function label, such as setting main=label(I4) to put the variable label for a variable named I4 in the title of a graph.
Value
The read data frame is returned, usually assigned the name of d as in the examples below. This is the default name for the data frame input into the lessR data analysis functions.
Author(s)
David W. Gerbing (Portland State University; gerbing@pdx.edu)
References
Gerbing, D. W. (2020). R Visualizations: Derive Meaning from Data, Chapter 1, NY: CRC Press.
Alexander Walker (2018). openxlsx: Read, Write and Edit XLSX Files. R package version 4.1.0. https://CRAN.R-project.org/package=openxlsx
See Also
read.csv, read.fwf, corRead, label, details, VariableLabels.
Examples
# remove the # sign before each of the following Read statements to run
# to browse for a data file on the computer system, invoke Read with
# the from argument empty
# d <- Read()
# abbreviated name
# d <- rd()
# read the variable labels from
# the specified label file, here an Excel file with two columns,
# the first column of variable names and the second column the
# corresponding labels
# l <- Read("Employee_lbl", var_labels=TRUE)
# read a csv data file from the web
# d <- Read("http://web.pdx.edu/~gerbing/data/twogroup.csv")
# read a csv data file with -99 and XXX set to missing
# d <- Read(missing=c(-99, "XXX"))
# do not display any output
# d <- Read(quiet=TRUE)
# display full output
# d <- Read(brief=FALSE)
# read the built-in data set dataEmployee
d <- Read("Employee")
# read a data file organized by columns, with a
# 5 column ID field, 2 column Age field
# and 75 single columns of data, no spaces between columns
# name the variables with the lessR function to(), so that
# the variable names are Q01, Q02, ..., Q74, Q75
# d <- Read(widths=c(5,2,rep(1,75)), col.names=c("ID", "Age", to("Q", 75)))