prepare_data {corporaexplorer} | R Documentation
Prepare data for corpus exploration
Description
Convert data frame or character vector to a ‘corporaexplorerobject’ for subsequent exploration.
Usage
prepare_data(dataset, ...)
## S3 method for class 'data.frame'
prepare_data(
dataset,
date_based_corpus = TRUE,
grouping_variable = NULL,
within_group_identifier = "Seq",
columns_doc_info = c("Date", "Title", "URL"),
corpus_name = NULL,
use_matrix = TRUE,
matrix_without_punctuation = TRUE,
tile_length_range = c(1, 10),
columns_for_ui_checkboxes = NULL,
...
)
## S3 method for class 'character'
prepare_data(
dataset,
corpus_name = NULL,
use_matrix = TRUE,
matrix_without_punctuation = TRUE,
...
)
Arguments
dataset |
Object to convert to corporaexplorerobject:
|
... |
Other arguments to be passed to |
date_based_corpus |
Logical. Set to |
grouping_variable |
Character string.
If |
within_group_identifier |
Character string indicating column name in |
columns_doc_info |
Character vector. The columns from |
corpus_name |
Character string with name of corpus. |
use_matrix |
Logical. Should the function create a document term matrix
for fast searching? If |
matrix_without_punctuation |
Should punctuation and digits be stripped
from the text before constructing the document term matrix?
If |
tile_length_range |
Numeric vector of length two.
Fine-tune the tile lengths in document wall
and day corpus view. Tile length is calculated by
|
columns_for_ui_checkboxes |
Character. Character or factor column(s) in dataset.
Include sets of checkboxes in the app sidebar for
convenient filtering of corpus.
Typically useful for columns with a small set of unique
(and short) values.
Checkboxes will be arranged by |
Details
For data.frame: Each row in dataset
is treated as a base differentiating unit in the corpus,
typically a chapter in a book or a single document in a document collection.
The following column names are reserved and cannot be used in dataset:
"ID",
"Text_original_case",
"Tile_length",
"Year",
"Seq",
"Weekday_n",
"Day_without_docs",
"Invisible_fake_date".
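Because a clash with a reserved name is easy to overlook, it can help to check a data frame before passing it to prepare_data. The following is a minimal base R sketch, not part of corporaexplorer: the helper find_reserved_columns and the "_orig" suffix are hypothetical choices made for illustration only.

```r
# Reserved column names, copied from the list above.
reserved_names <- c(
  "ID", "Text_original_case", "Tile_length", "Year", "Seq",
  "Weekday_n", "Day_without_docs", "Invisible_fake_date"
)

# Hypothetical helper (not provided by corporaexplorer):
# returns the reserved names that occur among the columns of `df`.
find_reserved_columns <- function(df) {
  intersect(colnames(df), reserved_names)
}

# Made-up example data with two clashing columns, "ID" and "Year".
df <- data.frame(
  ID = 1:2,
  Text = c("First document.", "Second document."),
  Year = c(2019, 2020),
  stringsAsFactors = FALSE
)

clashes <- find_reserved_columns(df)

# Rename the clashing columns by appending a suffix (an arbitrary choice).
is_reserved <- names(df) %in% reserved_names
names(df)[is_reserved] <- paste0(names(df)[is_reserved], "_orig")
```

After the renaming step, find_reserved_columns(df) returns an empty character vector, so the data frame no longer uses any reserved name.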
A character vector will be converted to a simple corporaexplorerobject with no metadata.
Value
A corporaexplorer object to be passed as argument to explore and run_document_extractor.
Examples
## From data.frame
# Constructing test data frame:
dates <- as.Date(paste(2011:2020, 1:10, 21:30, sep = "-"))
texts <- paste0(
  "This is a document about ", month.name[1:10], ". ",
  "This is not a document about ", rev(month.name[1:10]), "."
)
titles <- paste("Text", 1:10)
test_df <- tibble::tibble(Date = dates, Text = texts, Title = titles)
# Converting to corporaexplorerobject:
corpus <- prepare_data(test_df, corpus_name = "Test corpus")
if (interactive()) {
  # Running exploration app:
  explore(corpus)
  # Running app to extract documents:
  run_document_extractor(corpus)
}
## From character vector
alphabet_corpus <- prepare_data(LETTERS)
if (interactive()) {
  # Running exploration app:
  explore(alphabet_corpus)
}
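The Usage section shows that the data.frame method also takes date_based_corpus and columns_doc_info. The sketch below illustrates one way these might be combined for a data frame that has no Date column. It rests on an assumption: that date_based_corpus = FALSE is the setting meant for corpora not organised by date (the argument description on this page is incomplete, so this is not confirmed here). The data frame chapters and its contents are made up for illustration.

```r
## From data.frame without a Date column
## (sketch; assumes date_based_corpus = FALSE suits such data)

# Made-up example data: one row per chapter, with "Title" and "Text" columns.
chapters <- data.frame(
  Title = paste("Chapter", 1:5),
  Text = paste("This is the text of chapter", 1:5),
  stringsAsFactors = FALSE
)

# Only attempt the conversion if corporaexplorer is installed.
if (requireNamespace("corporaexplorer", quietly = TRUE)) {
  chapter_corpus <- corporaexplorer::prepare_data(
    chapters,
    date_based_corpus = FALSE,   # assumption: corpus not organised by date
    columns_doc_info = "Title",  # only column available as document info
    corpus_name = "Chapters"
  )
}
```

If the conversion succeeds, chapter_corpus can be passed to explore in the same way as the objects in the examples above.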