corpus_new {act}    R Documentation

Create a new corpus object


Description

Creates a new corpus object and loads annotation files. Currently 'ELAN' .eaf, 'EXMARaLDA' .exb and 'Praat' .TextGrid files are supported.

The parameter pathsAnnotationFiles defines where the annotation files are located. If skipDoubleFiles=TRUE, duplicate files will be skipped; otherwise they will be renamed. If importFiles=FALSE, the corpus object will be created but the files will not be loaded. To load the files afterwards, call corpus_import.
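The two behaviours of skipDoubleFiles can be illustrated with base R alone. The following is only a sketch, not the package's implementation: the file paths are made up, and the "_1" suffix produced by make.unique() is an assumption about what a renamed transcript might look like.

```r
# Hypothetical annotation files; two of them share the base name "interview".
files <- c("a/interview.eaf", "b/interview.TextGrid", "c/dinner.exb")

# Derive transcript names from the file names (no folder, no extension).
transcript_names <- tools::file_path_sans_ext(basename(files))

# Comparable to skipDoubleFiles = TRUE: keep only the first file per name.
kept_files <- files[!duplicated(transcript_names)]

# Comparable to skipDoubleFiles = FALSE: keep all files, make the names unique.
# The "_1" suffix is an assumption, not necessarily the scheme used by act.
renamed <- make.unique(transcript_names, sep = "_")

print(kept_files)
print(renamed)
```

With skipping, one of the two "interview" files is dropped; with renaming, both are kept under distinct names.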


Usage

corpus_new(
  pathsAnnotationFiles,
  pathsMediaFiles = NULL,
  name = "New Corpus",
  importFiles = TRUE,
  skipDoubleFiles = TRUE,
  createFullText = TRUE,
  assignMedia = TRUE,
  pathNormalizationMatrix = NULL,
  namesSearchPatterns = character(),
  namesSearchReplacements = character(),
  namesToUpperCase = FALSE,
  namesToLowerCase = FALSE,
  namesTrim = TRUE,
  namesDefaultForEmptyNames = "no_name"
)



Arguments

pathsAnnotationFiles
Vector of character strings; paths to annotation files or folders that contain annotation files.

pathsMediaFiles
Vector of character strings; paths to media files or folders that contain media files.

name
Character string; name of the corpus to be created.

importFiles
Logical; if TRUE, annotation files will be imported immediately when the function is called; if FALSE, the corpus object will be created without importing the annotation files.

skipDoubleFiles
Logical; if TRUE, transcripts with the same name will be skipped (only one of them will be added); if FALSE, transcripts will be renamed to make the names unique.

createFullText
Logical; if TRUE, full text will be created.

assignMedia
Logical; if TRUE, the media folder(s) stored in your corpus object will be scanned for media files.

pathNormalizationMatrix
Character string; path to the replacement matrix used for normalizing the annotations; if left unspecified, the default normalization matrix of the package will be used.

namesSearchPatterns
Vector of character strings; search patterns as regular expressions. Leave empty for no search-replace in the names.

namesSearchReplacements
Vector of character strings; replacements for the search patterns. Leave empty for no search-replace in the names.

namesToUpperCase
Logical; convert transcript names to upper case.

namesToLowerCase
Logical; convert transcript names to lower case.

namesTrim
Logical; remove leading and trailing spaces in names.

namesDefaultForEmptyNames
Character string; default value for empty transcript names (e.g., resulting from search-replace operations).
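To make the interplay of the names* arguments more concrete, here is a minimal base R sketch. The helper clean_names() is hypothetical and not part of act; in particular, the order in which the steps are applied is an assumption, and the package may process names differently.

```r
# Hypothetical helper, not part of act: applies the name options in one plausible order.
clean_names <- function(x,
                        patterns = character(),
                        replacements = character(),
                        to_upper = FALSE,
                        to_lower = FALSE,
                        trim = TRUE,
                        default = "no_name") {
  stopifnot(length(patterns) == length(replacements))
  # Apply each regular expression together with its replacement, in order.
  for (i in seq_along(patterns)) {
    x <- gsub(patterns[i], replacements[i], x)
  }
  if (to_upper) x <- toupper(x)
  if (to_lower) x <- tolower(x)
  # Remove leading and trailing whitespace.
  if (trim) x <- trimws(x)
  # Names that ended up empty receive the default value.
  x[x == ""] <- default
  x
}

cleaned <- clean_names(c(" rec_01_final ", "_final"),
                       patterns = "_final",
                       replacements = "")
print(cleaned)
```

The second name becomes empty after the search-replace step, which is the situation namesDefaultForEmptyNames is meant to cover.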


Details

The parameter pathsMediaFiles defines where the corresponding media files are located. If assignMedia=TRUE, these paths will be scanned for media files, which will be matched to the transcript objects based on their names. Only the file types listed in the corresponding package options (accessed via options()) will be recognized. You can modify these options to recognize other media types.
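The name-based matching described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the file names are invented, the vector of recognized extensions is a stand-in for the package options, and matching on exactly identical base names is a simplification.

```r
# Hypothetical transcript names and media files.
transcript_names <- c("interview", "dinner")
media_files <- c("media/interview.wav", "media/interview.mp4",
                 "media/dinner.mp3", "media/notes.pdf")

# Assumed list of recognized extensions; act reads these from its options.
recognized <- c("wav", "mp3", "mp4")

# Keep only media files with a recognized extension.
ext <- tolower(tools::file_ext(media_files))
candidates <- media_files[ext %in% recognized]

# Match by base name: a media file is assigned to the transcript of the same name.
media_names <- tools::file_path_sans_ext(basename(candidates))
assigned <- lapply(transcript_names, function(n) candidates[media_names == n])
names(assigned) <- transcript_names

print(assigned)
```

Files with an unrecognized extension, such as the .pdf above, are ignored, which is why extending the list of recognized types changes what gets assigned.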

See @import.results of the corpus object to check the results of importing the files. To get a detailed overview of the corpus object use act::info(x), for a summary use act::info_summarized(x).


Value

Corpus object.

See Also

corpus_import, examplecorpus



Examples

# The example files that come with the act library are located here:
path <- system.file("extdata", "examplecorpus", package="act")

# The example corpus comes without media files.
# It is recommended to download a full example corpus also including the media files.
# You can use the following commands.
## Not run: 
   path <- "EXISTING_FOLDER_ON_YOUR_COMPUTER/examplecorpus"
   temp <- tempfile()
   download.file(options()$act.examplecorpusURL, temp)
   unzip(zipfile=temp, exdir=path)

## End(Not run)

# The following command creates a new corpus object
mycorpus <- act::corpus_new(name = "mycorpus",
	pathsAnnotationFiles = path,
	pathsMediaFiles = path)

# Get a summary
act::info_summarized(mycorpus)

[Package act version 1.1.9 Index]