R: Create a new corpus object

corpus_new {act}

R Documentation

Create a new corpus object

Description

Creates a new corpus object and loads annotation files. Currently 'ELAN' .eaf, 'EXMARaLDA' .exb and 'Praat' .TextGrid files are supported.

The parameter pathsAnnotationFiles defines where the annotation files are located. If skipDoubleFiles=TRUE, duplicate files will be skipped; otherwise they will be renamed. If importFiles=FALSE, the corpus object will be created but the files will not be loaded. To load the files afterwards, call corpus_import.
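
The two-step workflow (create first, import later) can be sketched as follows. This is a minimal example, assuming that corpus_import accepts the corpus object as its first argument and returns the updated object; its full signature is not documented on this page.

```r
library(act)

# Example annotation files shipped with the package
path <- system.file("extdata", "examplecorpus", package = "act")

# Create the corpus object without loading any annotation files yet
mycorpus <- act::corpus_new(
  pathsAnnotationFiles = path,
  name = "mycorpus",
  importFiles = FALSE
)

# Load the files in a separate step
# (assumption: corpus_import returns the modified corpus object)
mycorpus <- act::corpus_import(mycorpus)
```

Deferring the import may be convenient when the corpus settings should be inspected or adjusted before any files are read.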

Usage

corpus_new(
  pathsAnnotationFiles,
  pathsMediaFiles = NULL,
  name = "New Corpus",
  importFiles = TRUE,
  skipDoubleFiles = TRUE,
  createFullText = TRUE,
  assignMedia = TRUE,
  pathNormalizationMatrix = NULL,
  namesInclude = character(),
  namesExclude = character(),
  namesSearchPatterns = character(),
  namesSearchReplacements = character(),
  namesToUpperCase = FALSE,
  namesToLowerCase = FALSE,
  namesTrim = TRUE,
  namesDefaultForEmptyNames = "no_name"
)

Arguments

`pathsAnnotationFiles`	Vector of character strings; paths to annotation files or folders that contain annotation files.
`pathsMediaFiles`	Vector of character strings; paths to media files or folders that contain media files.
`name`	Character string; name of the corpus to be created.
`importFiles`	Logical; if `TRUE`, annotation files will be imported immediately when the function is called; if `FALSE`, the corpus object will be created without importing the annotation files.
`skipDoubleFiles`	Logical; if `TRUE`, transcripts with the same name will be skipped (only one of them will be added); if `FALSE`, transcripts will be renamed to make the names unique.
`createFullText`	Logical; if `TRUE` full text will be created.
`assignMedia`	Logical; if `TRUE` the folder(s) specified in `@paths.media.files` of your corpus object will be scanned for media.
`pathNormalizationMatrix`	Character string; path to the replacement matrix used for normalizing the annotations; if the argument is left as `NULL`, the default normalization matrix of the package will be used.
`namesInclude`	Vector of character strings; Only files matching this regular expression will be imported into the corpus.
`namesExclude`	Vector of character strings; Files matching this regular expression will be skipped and not imported into the corpus.
`namesSearchPatterns`	Vector of character strings; Search pattern as regular expression. Leave empty for no search-replace in the names.
`namesSearchReplacements`	Vector of character strings; Replacements for search. Leave empty for no search-replace in the names.
`namesToUpperCase`	Logical; Convert transcript names all to upper case.
`namesToLowerCase`	Logical; Convert transcript names all to lower case.
`namesTrim`	Logical; Remove leading and trailing spaces in names.
`namesDefaultForEmptyNames`	Character string; Default value for empty transcript names (e.g., resulting from search-replace operations).
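
The name-related arguments can be combined in a single call. The sketch below is illustrative only: the pattern "_" and the replacement "-" are hypothetical choices, and reading the resulting names from a @transcripts slot is an assumption that is not documented on this page.

```r
library(act)

# Example annotation files shipped with the package
path <- system.file("extdata", "examplecorpus", package = "act")

# Replace underscores with hyphens and convert all names to lower case
# (pattern and replacement are hypothetical, chosen for illustration)
mycorpus <- act::corpus_new(
  pathsAnnotationFiles = path,
  name = "mycorpus",
  namesSearchPatterns = "_",
  namesSearchReplacements = "-",
  namesToLowerCase = TRUE,
  namesDefaultForEmptyNames = "no_name"
)

# Inspect the resulting transcript names
# (assumption: transcripts are stored as a named list in @transcripts)
names(mycorpus@transcripts)
```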

Details

The parameter pathsMediaFiles defines where the corresponding media files are located. If assignMedia=TRUE, the paths defined in x@paths.media.files will be scanned for media files, which will be matched to the transcript objects based on their names. Only the file types set in options()$act.fileformats.audio and options()$act.fileformats.video will be recognized. You can modify these options to recognize other media types.
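
A possible way to inspect and extend the recognized media types is sketched below. The added type "flac" is a hypothetical example; check the notation of the existing entries (for instance, with or without a leading dot) and follow it.

```r
library(act)

# Inspect the media file types that are currently recognized
getOption("act.fileformats.audio")
getOption("act.fileformats.video")

# Add a further audio type; "flac" is a hypothetical example.
# options() returns the previous values, so they can be restored later.
old <- options(
  act.fileformats.audio = c(getOption("act.fileformats.audio"), "flac")
)

# ... create or import the corpus with assignMedia = TRUE here ...

# Restore the previous setting
options(old)
```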

See @import.results of the corpus object to check the results of importing the files. To get a detailed overview of the corpus object use act::info(x), for a summary use act::info_summarized(x).
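
The checks mentioned above might be run like this; a short sketch using the example files shipped with the package. The structure of @import.results is not described on this page, so it is simply printed here.

```r
library(act)

# Example annotation files shipped with the package
path <- system.file("extdata", "examplecorpus", package = "act")
mycorpus <- act::corpus_new(pathsAnnotationFiles = path, name = "mycorpus")

# Results of importing the annotation files
mycorpus@import.results

# Detailed overview of the corpus object
act::info(mycorpus)

# Summary of the corpus object
act::info_summarized(mycorpus)
```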

Value

Corpus object.

Examples

library(act)

# The example files that come with the act library are located here:
path <- system.file("extdata", "examplecorpus", package="act")

# The example corpus comes without media files.
# It is recommended to download a full example corpus also including the media files.
# You can use the following commands.
## Not run: 
   path <- "EXISTING_FOLDER_ON_YOUR_COMPUTER/examplecorpus"
   temp <- tempfile()
   download.file(options()$act.examplecorpusURL, temp)
   unzip(zipfile=temp, exdir=path)

## End(Not run)

# The following command creates a new corpus object
mycorpus <- act::corpus_new(name = "mycorpus",
	pathsAnnotationFiles = path,
	pathsMediaFiles = path)

# Get a summary
mycorpus

[Package act version 1.3.1 Index]