Aligned Corpus Toolkit


The Aligned Corpus Toolkit (act) is designed for linguists who work with time-aligned transcription data. It offers functions to import and export various annotation file formats ('ELAN' .eaf, 'EXMARaLDA' .exb and 'Praat' .TextGrid files), create print transcripts in the style of conversation analysis, search transcripts (span searches across multiple annotations, search in normalized annotations, make concordances etc.), export and re-import search results (.csv and 'Excel' .xlsx format), create cuts for the search results (print transcripts, audio/video cuts using 'FFmpeg' and video subtitles in 'SubRip' .srt format), modify the data in a corpus (search/replace, delete, filter etc.), interact with 'Praat' using 'Praat' scripts, and exchange data with the 'rPraat' package. The package is itself written in R and may be extended by other users.

act functions


Package options

The package has numerous options that change the internal workings of the package. Please see act::options_show and the information given there.



# ========== Example data 
# The act package comes with some example data. 
# The data is stored at the following location:
path <- system.file("extdata", "examplecorpus", package="act")

# Since this folder is quite difficult to access, you might consider copying the 
# contents of this folder to a more convenient location.
# The following commands will create a new folder called 'examplecorpus' in the
# folder 'path'.
# You will find the data there.
## Not run: 
sourcepath <- system.file("extdata", "examplecorpus", package="act")
if (!dir.exists(path)) {dir.create(path)}
file.copy(sourcepath, dirname(path), recursive=TRUE)

## End(Not run)

# The example files that come with the package contain only annotation files.
# Media files are not included.
# The following lines will download the data and create a new folder called 
# 'examplecorpus' in the folder 'path'.  
# You will find the data there.
## Not run: 
sourceurl <- 
temp <- tempfile()
download.file(sourceurl, temp)
unzip(zipfile=temp, exdir=path)

## End(Not run)

# ========== Create a corpus object and load data
# Now that we have the example data accessible, we can create a corpus object.
# The corpus object is a structured collection of all the information that you can 
# work with using act.
# It will contain the information from each transcript, links to media files and further 
# metadata.

# --- Locate folder with annotation files
# When creating a corpus object you will need to specify where your annotation 
# files ('Praat' .TextGrid or 'ELAN' .eaf) are located.
# We will use the example data that we have just located in 'path'.

# If you want to use your own data, you can set the path here:
## Not run: 

## End(Not run)

# --- Create corpus object and load annotation files
# The following command will create a corpus object, with the name 'examplecorpus'.
examplecorpus <- act::corpus_new(
	pathsAnnotationFiles = path,
	pathsMediaFiles = path,
	name = "examplecorpus"
)

# The act package assumes that annotation files and media files have the same base 
# name and differ only in the suffix (e.g. 'filename.TextGrid' and 'filename.wav'/
# 'filename.mp4').
# This allows act to automatically link media files to the transcripts.
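
# --- Sketch: matching files by base name
# The following base R lines only illustrate the matching idea described above.
# They are not the code that act uses internally, and all file names are hypothetical.
demo_annotation_files <- c("interview1.TextGrid", "interview2.eaf")
demo_media_files      <- c("interview1.wav", "interview1.mp4", "interview3.wav")
# Strip the suffix to obtain the base names
demo_annotation_base <- tools::file_path_sans_ext(demo_annotation_files)
demo_media_base      <- tools::file_path_sans_ext(demo_media_files)
# For each annotation file, collect the media files that share its base name
demo_links <- lapply(demo_annotation_base, function(b) demo_media_files[demo_media_base == b])
names(demo_links) <- demo_annotation_files
demo_links
# An annotation file without a matching media file simply gets an empty entry.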

# --- Information about your corpus
# The following command will give you a summary of the data contained in your corpus object.
examplecorpus

# More detailed information about the transcripts in your corpus object is available by 
# calling the function act::info().
act::info(examplecorpus)

# If you are working in RStudio, a nice way of inspecting this information is the following:
## Not run: 

## End(Not run)

# ========== All data
# You can also get all data that is in the loaded annotation files in a data frame:
all_annotations <- act::annotations_all(examplecorpus)
## Not run: 

## End(Not run)

# ========== Search
# Let's do some searches in the data.
# Search for the Spanish first-person singular pronoun 'yo' in the example corpus.
mysearch <- act::search_new(x=examplecorpus, pattern="yo")
# Have a look at the result:
mysearch

# Directly view all search results in the viewer
## Not run: 

## End(Not run)

# --- Search original vs. normalized content
# You can either search in the original 'content' of the annotations,
# or you can search in a 'normalized' version of the annotations.
# Let's compare the two modes.
mysearch.norm <- act::search_new(examplecorpus, pattern="yo", searchNormalized=TRUE)
mysearch.orig <- act::search_new(examplecorpus, pattern="yo", searchNormalized=FALSE)
# There is a difference in the number of results.

# The difference arises because in the normalized version, for instance, capital letters 
# are converted to lower case. 
# In our case, one annotation in the example corpus contains a "yO" with a
# capital letter:
mysearch <- act::search_new(examplecorpus, pattern="yO", searchNormalized=FALSE)

# During normalization, a range of normalization procedures is applied using a 
# replacement matrix. This matrix searches for and replaces certain patterns that you want 
# to exclude from the normalized content.
# By default, normalization removes all GAT transcription conventions. 
# You may also customize the replacement matrix to your own needs and transcription
# conventions.
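
# --- Sketch: normalization with a replacement matrix
# A minimal base R illustration of the idea described above. It is not the replacement 
# matrix that ships with act. The two patterns are simplified, hypothetical stand-ins for 
# transcription conventions (pauses such as "(.)" or "(0.5)", and lengthening marked by ":").
demo_matrix <- data.frame(
	search  = c("\\([0-9.]*\\)", ":"),
	replace = c("", ""),
	stringsAsFactors = FALSE
)
demo_normalize <- function(x, m) {
	# Apply each search/replace pair in turn
	for (i in seq_len(nrow(m))) {
		x <- gsub(m$search[i], m$replace[i], x)
	}
	# Convert to lower case and tidy up white space
	x <- tolower(x)
	trimws(gsub("\\s+", " ", x))
}
demo_normalized <- demo_normalize("pero (.) yO:: no (0.5)", demo_matrix)
demo_normalized
# A search for "yo" matches the normalized string but not the original one.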

# --- Search original content vs. full text
# There are two search modes.
# The 'fulltext' mode will find matches across annotations.
# The 'content' mode will respect the temporal boundaries of the original annotations.

# Let's define a search pattern with a certain span.
myRegEx <- "\\bno\\b.{1,20}pero"
# This regular expression matches the Spanish word "no" ('no') followed by "pero" ('but')
# at a distance of 1 to 20 characters.
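
# --- Sketch: what the pattern matches
# A quick check of the same pattern with base R on made-up strings (not taken from the 
# example corpus). Only the first string matches: in the second, "no" is part of "nos", 
# and in the third, more than 20 characters separate "no" and "pero".
demo_regex   <- "\\bno\\b.{1,20}pero"
demo_strings <- c("no lo se pero bueno",
				  "nos vamos pero tarde",
				  "no lo creo de ninguna manera, pero")
demo_hits <- grepl(demo_regex, demo_strings)
demo_hits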

# The 'content' search mode will not find any hits.
mysearch <- act::search_new(examplecorpus, pattern=myRegEx, searchMode="content")

# The 'fulltext' search mode will find two hits that extend over several annotations.
mysearch <- act::search_new(examplecorpus, pattern=myRegEx, searchMode="fulltext")
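
# --- Sketch: why the two search modes differ
# A base R illustration with three made-up consecutive annotations. It only mimics the 
# idea behind the two modes and is not how act implements them.
demo_annotations <- c("bueno no", "lo creo", "pero vale")
demo_pattern     <- "\\bno\\b.{1,20}pero"
# 'content'-like search: each annotation is tested on its own, so nothing matches
demo_content_hit <- any(grepl(demo_pattern, demo_annotations))
demo_content_hit
# 'fulltext'-like search: the annotations are joined first, so a match may span them
demo_fulltext     <- paste(demo_annotations, collapse=" ")
demo_fulltext_hit <- grepl(demo_pattern, demo_fulltext)
demo_fulltext_hit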

