corpus_new {act} | R Documentation |
Create a new corpus object
Description
Creates a new corpus object and loads annotation files. Currently 'ELAN' .eaf, 'EXMARaLDA' .exb and 'Praat' .TextGrid files are supported.
The parameter pathsAnnotationFiles
defines where the annotation files are located.
If skipDoubleFiles=TRUE
duplicated files will be skipped; otherwise they will be renamed.
If importFiles=FALSE
the corpus object will be created but files will not be loaded. To load the files afterwards, call corpus_import.
Usage
corpus_new(
pathsAnnotationFiles,
pathsMediaFiles = NULL,
name = "New Corpus",
importFiles = TRUE,
skipDoubleFiles = TRUE,
createFullText = TRUE,
assignMedia = TRUE,
pathNormalizationMatrix = NULL,
namesInclude = character(),
namesExclude = character(),
namesSearchPatterns = character(),
namesSearchReplacements = character(),
namesToUpperCase = FALSE,
namesToLowerCase = FALSE,
namesTrim = TRUE,
namesDefaultForEmptyNames = "no_name"
)
Arguments
pathsAnnotationFiles |
Vector of character strings; paths to annotation files or folders that contain annotation files. |
pathsMediaFiles |
Vector of character strings; paths to media files or folders that contain media files. |
name |
Character string; name of the corpus to be created. |
importFiles |
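Logical; if TRUE, the annotation files are loaded when the corpus object is created; if FALSE, the corpus object is created without loading them (call corpus_import to load them later). |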
skipDoubleFiles |
Logical; if TRUE, duplicated files will be skipped; if FALSE, they will be renamed. |
createFullText |
Logical; if |
assignMedia |
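Logical; if TRUE, the paths defined in x@paths.media.files will be scanned for media files, which are matched to the transcripts based on their names. |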
pathNormalizationMatrix |
Character string; path to the replacement matrix used for normalizing the annotations; if the argument is left as NULL, the default normalization matrix of the package will be used. |
namesInclude |
Vector of character strings; Only files matching this regular expression will be imported into the corpus. |
namesExclude |
Vector of character strings; Files matching this regular expression will be skipped and not imported into the corpus. |
namesSearchPatterns |
Vector of character strings; Search pattern as regular expression. Leave empty for no search-replace in the names. |
namesSearchReplacements |
Vector of character strings; Replacements for search. Leave empty for no search-replace in the names. |
namesToUpperCase |
Logical; Convert transcript names all to upper case. |
namesToLowerCase |
Logical; Convert transcript names all to lower case. |
namesTrim |
Logical; Remove leading and trailing spaces in names. |
namesDefaultForEmptyNames |
Character string; Default value for empty transcript names (e.g., resulting from search-replace operations) |
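The effect of the name options above can be pictured with a small base R sketch. The helper normalize_names is hypothetical and not part of act. The order in which the steps are applied here (search-replace, case conversion, trimming, default for empty names) is an assumption; the package may proceed differently.

```r
# Hypothetical helper, NOT part of act: imitates the documented name options
# with base R functions. The order of the steps is an assumption.
normalize_names <- function(x,
                            patterns = character(),
                            replacements = character(),
                            toUpperCase = FALSE,
                            toLowerCase = FALSE,
                            trim = TRUE,
                            defaultForEmptyNames = "no_name") {
  # Apply each regular-expression search/replace pair in turn
  for (i in seq_along(patterns)) {
    x <- gsub(patterns[i], replacements[i], x)
  }
  if (toUpperCase) x <- toupper(x)
  if (toLowerCase) x <- tolower(x)
  # Remove leading and trailing whitespace
  if (trim) x <- trimws(x)
  # Names that ended up empty receive the default value
  x[x == ""] <- defaultForEmptyNames
  x
}

normalized <- normalize_names(
  c(" interview_01_final", "tmp", "Session B "),
  patterns     = c("_final$", "^tmp$"),
  replacements = c("", ""),
  toUpperCase  = TRUE
)
print(normalized)
```

In this sketch the second name is emptied by the search-replace step and therefore receives the default "no_name", which is the situation the namesDefaultForEmptyNames argument is described as covering.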
Details
The parameter pathsMediaFiles
defines where the corresponding media files are located.
If assignMedia=TRUE
the paths defined in x@paths.media.files
will be scanned for media files and will be matched to the transcript object based on their names.
Only the file types set in options()$act.fileformats.audio
and options()$act.fileformats.video
will be recognized.
You can modify these options to recognize other media types.
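As a minimal sketch of such a modification, the following adds one more audio type before the corpus is created. It assumes that the option holds a character vector of file extensions without a leading dot; "flac" is only an illustrative value, so inspect the current setting first and follow the format you find there.

```r
# Show the audio types currently recognized
# (NULL if the act package has not been loaded yet)
print(getOption("act.fileformats.audio"))

# Append a further extension; "flac" is a hypothetical example value and the
# format (extension without a leading dot) is an assumption
options(act.fileformats.audio = union(getOption("act.fileformats.audio"), "flac"))

audio_formats <- getOption("act.fileformats.audio")
print(audio_formats)
```

The same pattern should apply to options()$act.fileformats.video.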
See @import.results
of the corpus object to check the results of importing the files.
To get a detailed overview of the corpus object use act::info(x); for a summary use act::info_summarized(x).
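The two-step workflow, in which the corpus object is created first and the files are loaded later, might look like the sketch below. It assumes that corpus_import takes the corpus object as its first argument and returns the updated corpus object; check the help page of corpus_import before relying on this. The code only runs if the act package is installed.

```r
# Sketch only: the body runs when the act package is installed
mycorpus <- NULL
if (requireNamespace("act", quietly = TRUE)) {
  # Example annotation files that ship with the package
  path <- system.file("extdata", "examplecorpus", package = "act")

  # Create the corpus object without loading any annotation files yet
  mycorpus <- act::corpus_new(name = "mycorpus",
                              pathsAnnotationFiles = path,
                              importFiles = FALSE)

  # Load the annotation files in a separate step
  # (assumption: the updated corpus object is returned)
  mycorpus <- act::corpus_import(mycorpus)

  # Check what happened to each file during the import
  print(mycorpus@import.results)
}
```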
Value
Corpus object.
Examples
library(act)
# The example files that come with the act library are located here:
path <- system.file("extdata", "examplecorpus", package="act")
# The example corpus comes without media files.
# It is recommended to download a full example corpus also including the media files.
# You can use the following commands.
## Not run:
path <- "EXISTING_FOLDER_ON_YOUR_COMPUTER/examplecorpus"
temp <- tempfile()
download.file(options()$act.examplecorpusURL, temp)
unzip(zipfile=temp, exdir=path)
## End(Not run)
# The following command creates a new corpus object
mycorpus <- act::corpus_new(name = "mycorpus",
                            pathsAnnotationFiles = path,
                            pathsMediaFiles = path)
# Get a summary
mycorpus