| tidy_xlsx {tidyxl} | R Documentation |
Import xlsx (Excel) cell contents into a tidy structure.
Description
tidy_xlsx() is deprecated. Please use xlsx_cells() or xlsx_formats()
instead.
tidy_xlsx() imports data from spreadsheets without coercing it into a
rectangle. Each cell is represented by a row in a data frame, giving the
cell's address, contents, formula, height, width, and keys to look up the
cell's formatting in an adjacent data structure within the list returned by
this function.
Usage
tidy_xlsx(path, sheets = NA)
Arguments
path |
Path to the xlsx file. |
sheets |
Sheets to read. Either a character vector (the names of the sheets), an integer vector (the positions of the sheets), or NA (default, all sheets). |
Details
A cell has two 'values': its content, and sometimes also a formula. It also has formatting applied at the 'style' level, which can be locally overridden.
Content
Depending on the cell, the content may be a numeric value such as 365 or
365.25, it may represent a date/datetime in one of Excel's date/datetime
systems, or it may be an index into an internal table of strings.
tidy_xlsx() attempts to infer the correct data type of each cell,
returning its value in the appropriate column (error, logical, numeric,
date, character). In case this cleverness is unhelpful, the unparsed value
and type information is available in the 'content' and 'type' columns.
Formula
When a cell has a formula, the value in the 'content' column is the result of the formula the last time it was evaluated.
Certain groups of cells may share a formula that differs only by addresses
referred to in the formula; such groups are identified by an index, the
'formula_group'. The xlsx (Excel) file format only records the formula
against one cell in any group, but tidy_xlsx() propagates the formula to
all the cells in the group, making the necessary changes to relative
addresses in the formula.
Array formulas may also apply to a group of cells, identified by an address
'formula_ref', but xlsx (Excel) file format only records the formula
against one cell in the group. Unlike ordinary formulas, tidy_xlsx()
does not propagate these to the other cells in the group.
Formulas that refer to other workbooks currently do not name the workbooks
directly, instead via indices such as [1]. It is planned to
dereference these.
Formatting
Cell formatting is returned in x$formats. There are two types of
formatting: 'style' formatting, such as Excel's built-in styles 'normal',
'bad', etc., and 'local' formatting, which overrides the style. These are
returned in x$formats$style and x$formats$local, with
identical structures. To look up the local formatting of a given cell,
take the cell's 'local_format_id' value (x$data$Sheet1[1, "local_format_id"]), and use it as an index into the format structure.
E.g. to look up the font size,
x$formats$local$font$size[local_format_id]. To see all available
formats, type str(x$formats$local).
Value
A list of the data within each sheet ($data), and the formatting applied to
each cell ($formats).
Each sheet's data is returned as a data frames, one per sheet, by the sheet
name. For example, the data in a sheet named 'My Worksheet' is in
x$data$My Worksheet. Each data frame has the following
columns:
-
addressThe cell address in A1 notation. -
rowThe row number of a cell address (integer). -
colThe column number of a cell address (integer). -
is_blankWhether or not the cell has a value -
data_typeThe type of a cell, referring to the following columns: error, logical, numeric, date, character, blank. -
errorThe error value of a cell. -
logicalThe boolean value of a cell. -
numericThe numeric value of a cell. -
dateThe date value of a cell. -
characterThe string value of a cell. -
character_formattedA data frame of substrings and their individual formatting. -
formulaThe formula in a cell (see 'Details'). -
is_arrayWhether or not the formula is an array formula. -
formula_refThe address of a range of cells group to which an array formula or shared formula applies (see 'Details'). -
formula_groupThe formula group to which the cell belongs (see 'Details'). -
commentThe text of a comment attached to a cell. -
heightThe height of a cell's row, in Excel's units. -
widthThe width of a cell's column, in Excel's units. -
style_formatAn index into a table of style formatsx$formats$style(see 'Details'). -
local_format_idAn index into a table of local cell formatsx$formats$local(see 'Details').
Formula
When a cell has a formula, the value in the 'content' column is the result of the formula the last time it was evaluated.
Certain groups of cells may share a formula that differs only by addresses
referred to in the formula; such groups are identified by an index, the
'formula_group'. The xlsx (Excel) file format only records the formula
against one cell in any group. xlsx_cells() propagates such formulas to
the other cells in a group, making the necessary changes to relative
addresses in the formula.
Array formulas may also apply to a group of cells, identified by an address
'formula_ref', but xlsx (Excel) file format only records the formula
against one cell in the group. xlsx_cells() propagates such formulas to
the other cells in a group. Unlike shared formulas, no changes to
addresses in array formulas are necessary.
Formulas that refer to other workbooks currently do not name the workbooks
directly, instead via indices such as [1]. It is planned to
dereference these.
Formatting
Cell formatting is returned in x$formats. There are two types or scopes
of formatting: 'style' formatting, such as Excel's built-in styles
'normal', 'bad', etc., and 'local' formatting, which overrides particular
elements of the style, e.g. by making it bold. Both types of are returned
in x$formats$style and x$formats$local, with identical structures. To
look up the local formatting of a given cell, take the cell's
'local_format_id' value (x$data$Sheet1[1, "local_format_id"]), and use it
as an index into the format structure. E.g. to look up the font size,
x$formats$local$font$size[local_format_id]. To see all available
formats, type str(x$formats$local).
Colours may be recorded in any of three ways: a hexadecimal RGB string with
or without alpha, an 'indexed' colour, and an index into a 'theme'.
tidy_xlsx dereferences 'indexed' and 'theme' colours to their hexadecimal
RGB string representation, and standardises all RGB strings to have an
alpha channel in the first two characters. The 'index' and the 'theme'
name are still provided. To filter by an RGB string, you could look up
the RGB values in a spreadsheet program (e.g. Excel, LibreOffice,
Gnumeric), and use the grDevices::rgb() function to convert these to a
hexadecimal string. Put the alpha value in first, e.g.
A <- 1; R <- 0.5; G <- 0; B <- 0 rgb(A, R, G, B) # [1] "#FF800000"
Strings can be formatted within a cell, so that a single cell can contain
substrings with different formatting. This in-cell formatting is available
in the column character_formatted, which is a list-column of data frames.
Each row of each data frame describes a substring and its formatting. For
cells without a character value, character_formatted is NULL, so for
further processing you might need to filter out the NULLs first.
Examples
## Not run:
examples <- system.file("extdata/examples.xlsx", package = "tidyxl")
# All sheets
str(tidy_xlsx(examples)$data)
# Specific sheet either by position or by name
str(tidy_xlsx(examples, 2)$data)
str(tidy_xlsx(examples, "Sheet1")$data)
# Data (cell values)
x <- tidy_xlsx(examples)
str(x$data$Sheet1)
# Formatting
str(x$formats$local)
# The formats of particular cells can be retrieved like this:
Sheet1 <- x$data$Sheet1
x$formats$style$font$bold[Sheet1$style_format]
x$formats$local$font$bold[Sheet1$local_format_id]
# To filter for cells of a particular format, first filter the formats to get
# the relevant indices, and then filter the cells by those indices.
bold_indices <- which(x$formats$local$font$bold)
Sheet1[Sheet1$local_format_id %in% bold_indices, ]
# In-cell formatting is available in the `character_formatted` column as a
# data frame, one row per substring.
tidy_xlsx(examples)$data$Sheet1$character_formatted[77]
## End(Not run)