xlsx_cells {tidyxl} | R Documentation |
Import xlsx (Excel) cell contents into a tidy structure.
Description
xlsx_cells()
imports data from spreadsheets without coercing it into a
rectangle. Each cell is represented by a row in a data frame, giving the
cell's address, contents, formula, height, width, and keys to look up the
cell's formatting in the return value of xlsx_formats()
.
Usage
xlsx_cells(
path,
sheets = NA,
check_filetype = TRUE,
include_blank_cells = TRUE
)
Arguments
path |
Path to the xlsx file. |
sheets |
Sheets to read. Either a character vector (the names of the sheets), an integer vector (the positions of the sheets), or NA (default, all sheets). |
check_filetype |
Logical. Whether to check that the filetype is xlsx (or xlsm) by looking at the file itself, rather than using the filename extension. |
include_blank_cells |
Logical. Whether to include cells that have no
value or formula (but might have formatting or comments). Useful when a
whole column of cells has been formatted, but most are empty. Try setting
this to |
Details
A cell has two 'values': its content, and sometimes also a formula. It also has formatting applied at the 'style' level, which can be locally overridden.
Content
Depending on the cell, the content may be a numeric value such as 365 or
365.25, it may represent a date/datetime in one of Excel's date/datetime
systems, or it may be an index into an internal table of strings.
xlsx_cells()
attempts to infer the correct data type of each cell,
returning its value in the appropriate column (error, logical, numeric,
date, character). In case this cleverness is unhelpful, the unparsed value
and type information is available in the 'content' and 'data_type' columns.
Formula
When a cell has a formula, the value in the 'content' column is the result of the formula the last time it was evaluated.
Certain groups of cells may share a formula that differs only by addresses
referred to in the formula; such groups are identified by an index, the
'formula_group'. The xlsx (Excel) file format only records the formula
against one cell in any group. xlsx_cells()
propagates such formulas to
the other cells in a group, making the necessary changes to relative
addresses in the formula.
Array formulas may also apply to a group of cells, identified by an address
'formula_ref', but xlsx (Excel) file format only records the formula
against one cell in the group. xlsx_cells()
propagates such formulas to
the other cells in a group. Unlike shared formulas, no changes to
addresses in array formulas are necessary.
Formulas that refer to other workbooks currently do not name the workbooks
directly, instead via indices such as [1]
. It is planned to
dereference these.
Formatting
Cell formatting is returned by xlsx_formats()
. There are two
types of formatting: 'style' formatting, such as Excel's built-in styles
'normal', 'bad', etc., and 'local' formatting, which overrides the style.
These are returned in the $style
and $local
sublists of
xlsx_formats()
, with identical structures.
To look up the local formatting of a given cell, take the cell's
local_format_id
value (my_cells$Sheet1[1, "local_format_id"]
), and use
it as an index into the format structure. E.g. to look up the font size,
my_formats$local$font$size[local_format_id]
. To see all available
formats, type str(my_formats$local)
.
Strings can be formatted within a cell, so that a single cell can contain
substrings with different formatting. This in-cell formatting is available
in the column character_formatted
, which is a list-column of data frames.
Each row of each data frame describes a substring and its formatting. For
cells without a character value, character_formatted
is NULL
, so for
further processing you might need to filter out the NULL
s first.
Value
A data frame with the following columns.
-
sheet
The worksheet that the cell is from. -
address
The cell address in A1 notation. -
row
The row number of a cell address (integer). -
col
The column number of a cell address (integer). -
is_blank
Whether or not the cell has a value -
content
Raw cell value before type conversion, useful for debugging. -
data_type
The type of a cell, referring to the following columns: error, logical, numeric, date, character, blank. -
error
The error value of a cell. -
logical
The boolean value of a cell. -
numeric
The numeric value of a cell. -
date
The date value of a cell. -
character
The string value of a cell. -
formula
The formula in a cell (see 'Details'). -
is_array
Whether or not the formula is an array formula. -
formula_ref
The address of a range of cells group to which an array formula or shared formula applies (see 'Details'). -
formula_group
The formula group to which the cell belongs (see 'Details'). -
comment
The text of a comment attached to a cell. -
height
The height of a cell's row, in Excel's units. -
width
The width of a cell's column, in Excel's units. -
row_outline_level
The outline level of a cells's row. -
col_outline_level
The outline level of a cells's column. -
style_format
An index into a table of style formatsx$formats$style
(see 'Details'). -
local_format_id
An index into a table of local cell formatsx$formats$local
(see 'Details').
Cell formatting is returned in xlsx_formats()
. There are two types
or scopes of formatting: 'style' formatting, such as Excel's built-in styles
'normal', 'bad', etc., and 'local' formatting, which overrides particular
elements of the style, e.g. by making it bold. Both types are returned, in
the $style
and $local
sublists of xlsx_formats()
, with
identical structures. To look up the local formatting of a given cell, take
the cell's 'local_format_id' value (my_cells$data$Sheet1[1, "local_format_id"]
), and use it as an index into the format structure. E.g.
to look up the font size, my_formats$local$font$size[local_format_id]
. To
see all available formats, type str(my_formats$local)
.
Examples
examples <- system.file("extdata/examples.xlsx", package = "tidyxl")
# All sheets
str(xlsx_cells(examples))
# Specific sheet either by position or by name
str(xlsx_cells(examples, 2))
str(xlsx_cells(examples, "Sheet1"))
# The formats of particular cells can be retrieved like this:
Sheet1 <- xlsx_cells(examples, "Sheet1")
formats <- xlsx_formats(examples)
formats$local$font$bold[Sheet1$local_format_id]
formats$style$font$bold[Sheet1$style_format]
# To filter for cells of a particular format, first filter the formats to get
# the relevant indices, and then filter the cells by those indices.
bold_indices <- which(formats$local$font$bold)
Sheet1[Sheet1$local_format_id %in% bold_indices, ]
# In-cell formatting is available in the `character_formatted` column as a
# data frame, one row per substring.
xlsx_cells(examples)$character_formatted[77]