corpus_new {act}                R Documentation

Create a new corpus object

Description

Creates a new corpus object and loads annotation files. Currently 'ELAN' .eaf, 'EXMARaLDA' .exb and 'Praat' .TextGrid files are supported.

The parameter pathsAnnotationFiles defines where the annotation files are located. If skipDoubleFiles=TRUE, duplicate files will be skipped; otherwise they will be renamed. If importFiles=FALSE, the corpus object will be created but the files will not be loaded. To load the files afterwards, call corpus_import.

Usage

corpus_new(
  pathsAnnotationFiles,
  pathsMediaFiles = NULL,
  name = "New Corpus",
  importFiles = TRUE,
  skipDoubleFiles = TRUE,
  createFullText = TRUE,
  assignMedia = TRUE,
  pathNormalizationMatrix = NULL,
  namesInclude = character(),
  namesExclude = character(),
  namesSearchPatterns = character(),
  namesSearchReplacements = character(),
  namesToUpperCase = FALSE,
  namesToLowerCase = FALSE,
  namesTrim = TRUE,
  namesDefaultForEmptyNames = "no_name"
)

Arguments

pathsAnnotationFiles

Vector of character strings; paths to annotation files or folders that contain annotation files.

pathsMediaFiles

Vector of character strings; paths to media files or folders that contain media files.

name

Character string; name of the corpus to be created.

importFiles

Logical; if TRUE, annotation files will be imported immediately when the function is called; if FALSE, the corpus object will be created without importing the annotation files.

skipDoubleFiles

Logical; if TRUE, transcripts with the same name will be skipped (only one of them will be added); if FALSE, transcripts will be renamed to make the names unique.

createFullText

Logical; if TRUE full text will be created.

assignMedia

Logical; if TRUE the folder(s) specified in @paths.media.files of your corpus object will be scanned for media.

pathNormalizationMatrix

Character string; path to the replacement matrix used for normalizing the annotations; if the argument is left as NULL, the default normalization matrix of the package will be used.

namesInclude

Vector of character strings; Only files matching this regular expression will be imported into the corpus.

namesExclude

Vector of character strings; Files matching this regular expression will be skipped and not imported into the corpus.

namesSearchPatterns

Vector of character strings; Search pattern as regular expression. Leave empty for no search-replace in the names.

namesSearchReplacements

Vector of character strings; Replacements for search. Leave empty for no search-replace in the names.

namesToUpperCase

Logical; Convert transcript names all to upper case.

namesToLowerCase

Logical; Convert transcript names all to lower case.

namesTrim

Logical; Remove leading and trailing spaces in names.

namesDefaultForEmptyNames

Character string; Default value for empty transcript names (e.g., resulting from search-replace operations).
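
The names* arguments above can be pictured with a small base R sketch. It does not call the package and is not its implementation: the example names, the patterns, and the order of the steps (filter, search-replace, trim, change case, fill empty names) are assumptions chosen only for illustration.

```r
# Hypothetical transcript names; these do not come from the example corpus.
raw_names <- c("ARG_Santi", " arg_Beto ", "draft_notes", "tmp_")

# namesInclude / namesExclude: keep names matching the include pattern
# and drop names matching the exclude pattern.
names_include <- "ARG|arg|tmp"
names_exclude <- "draft"
keep <- grepl(names_include, raw_names) & !grepl(names_exclude, raw_names)
kept <- raw_names[keep]

# namesSearchPatterns / namesSearchReplacements: apply each pair in turn.
search_patterns     <- c("^tmp_", "_")
search_replacements <- c("", "-")
clean <- kept
for (i in seq_along(search_patterns)) {
  clean <- gsub(search_patterns[i], search_replacements[i], clean)
}

# namesTrim: remove leading and trailing spaces.
clean <- trimws(clean)

# namesToUpperCase: convert to upper case.
clean <- toupper(clean)

# namesDefaultForEmptyNames: replace names that ended up empty.
clean[clean == ""] <- "no_name"

print(clean)
# [1] "ARG-SANTI" "ARG-BETO"  "no_name"
```

The last element shows why a default for empty names is useful: a search-replace that strips a whole name would otherwise leave a transcript without a name.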

Details

The parameter pathsMediaFiles defines where the corresponding media files are located. If assignMedia=TRUE, the paths defined in x@paths.media.files will be scanned for media files, which will be matched to the transcript objects based on their names. Only the file types set in options()$act.fileformats.audio and options()$act.fileformats.video will be recognized. You can modify these options to recognize other media types.
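
As a rough sketch of how these options might be extended, the following base R code adds one more audio type and then illustrates name-based matching on made-up paths. It rests on several assumptions: that the options hold file extensions without a leading dot, that "flac" is a type you want to add, and that matching compares the file name without its extension to the transcript name. The package's actual matching rule may differ. When working with the package itself, it is probably safest to change the options after library(act) so that the defaults are already present.

```r
# Add a hypothetical extra audio extension to the recognized types.
# Assumption: the option stores plain extensions such as "wav".
options(act.fileformats.audio =
          unique(c(getOption("act.fileformats.audio"), "flac")))

# All extensions that would count as media under the current options.
recognized <- c(getOption("act.fileformats.audio"),
                getOption("act.fileformats.video"))

# Made-up media paths and transcript names, for illustration only.
media <- c("/data/ARG_Santi.flac", "/data/ARG_Santi.txt", "/data/BOL_Juan.flac")
transcript_names <- c("ARG_Santi", "BOL_Juan")

# Step 1: keep only files whose extension is recognized.
is_media   <- tolower(tools::file_ext(media)) %in% tolower(recognized)
candidates <- media[is_media]

# Step 2: assign candidates to transcripts by identical base name
# (an assumed rule, not necessarily the one the package uses).
base_names <- tools::file_path_sans_ext(basename(candidates))
matched <- lapply(transcript_names, function(n) candidates[base_names == n])
names(matched) <- transcript_names

print(matched)
```

In this sketch the .txt file is ignored because its extension is not among the recognized types, while each .flac file is attached to the transcript that shares its base name.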

See @import.results of the corpus object to check the results of importing the files. To get a detailed overview of the corpus object use act::info(x), for a summary use act::info_summarized(x).
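
A possible workflow that separates creating the corpus from importing the files, and then inspects the result, is sketched below. It uses only functions and slots named on this page; the corpus name "deferred_import" is arbitrary, and calling corpus_import with the corpus object as its only argument is an assumption about that function's defaults. The code is wrapped in a check so that it does nothing when the package is not installed.

```r
if (requireNamespace("act", quietly = TRUE)) {
  # Example annotation files shipped with the package.
  path <- system.file("extdata", "examplecorpus", package = "act")

  # Create the corpus object without reading the annotation files yet.
  mycorpus <- act::corpus_new(
    pathsAnnotationFiles = path,
    name                 = "deferred_import",
    importFiles          = FALSE
  )

  # Load the annotation files in a separate step.
  mycorpus <- act::corpus_import(mycorpus)

  # Check what happened to each file during the import.
  print(mycorpus@import.results)

  # Detailed overview and condensed summary of the corpus.
  overview   <- act::info(mycorpus)
  summarized <- act::info_summarized(mycorpus)
}
```

Deferring the import can be convenient when you want to adjust settings on the corpus object before the files are read.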

Value

Corpus object.

See Also

corpus_import, examplecorpus

Examples

library(act)

# The example files that come with the act library are located here:
path <- system.file("extdata", "examplecorpus", package="act")

# The example corpus comes without media files.
# It is recommended to download a full example corpus also including the media files.
# You can use the following commands.
## Not run: 
   path <- "EXISTING_FOLDER_ON_YOUR_COMPUTER/examplecorpus"
   temp <- tempfile()
   download.file(options()$act.examplecorpusURL, temp)
   unzip(zipfile=temp, exdir=path)

## End(Not run)

# The following command creates a new corpus object
mycorpus <- act::corpus_new(name = "mycorpus",
	pathsAnnotationFiles = path,
	pathsMediaFiles = path)

# Get a summary
mycorpus


[Package act version 1.3.1 Index]