getMatches {nzilbb.labbcat}    R Documentation

Search for tokens.

Description

Searches through transcripts for tokens matching the given pattern.

Usage

getMatches(
  labbcat.url,
  pattern,
  participant.expression = NULL,
  transcript.expression = NULL,
  main.participant = TRUE,
  aligned = NULL,
  matches.per.transcript = NULL,
  words.context = 0,
  max.matches = NULL,
  overlap.threshold = NULL,
  anchor.confidence.min = NULL,
  page.length = 1000,
  no.progress = FALSE
)

Arguments

labbcat.url

URL to the LaBB-CAT instance

pattern

An object representing the pattern to search for.

This can be:

  • A string, representing a search of the orthography layer - spaces are taken to be word boundaries

  • A single named list, representing a one-column search - names are taken to be layer IDs

  • A list of named lists, representing a multi-column search - the outer list represents the columns of the search matrix where each column 'immediately follows' the previous, and the names of the inner lists are taken to be layer IDs

  • A named list fully replicating the structure of the search matrix in the LaBB-CAT browser interface, with one element called “columns”, containing a named list for each column.

    Each element in the “columns” named list contains an element named “layers”, whose value is a named list for patterns to match on each layer, and optionally an element named “adj”, whose value is a number representing the maximum distance, in tokens, between this column and the next column - if “adj” is not specified, the value defaults to 1, so tokens are contiguous.

    Each element in the “layers” named list is named after the layer it matches, and the value is a named list with the following possible elements:

    • pattern A regular expression to match against the label

    • min An inclusive minimum numeric value for the label

    • max An exclusive maximum numeric value for the label

    • not TRUE to negate the match

    • anchorStart TRUE to anchor to the start of the annotation on this layer (i.e. the matching word token will be the first at/after the start of the matching annotation on this layer)

    • anchorEnd TRUE to anchor to the end of the annotation on this layer (i.e. the matching word token will be the last before/at the end of the matching annotation on this layer)

    • target TRUE to make this layer the target of the search; the results will contain one row for each match on the target layer
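The inclusive min and exclusive max bounds can be illustrated with a small sketch. The helper labelInRange below is hypothetical and not part of nzilbb.labbcat; it only mimics the documented range semantics on plain numeric values, to show why an upper bound of "2" with no lower bound selects labels whose value is 1.

```r
## Hypothetical helper, not part of nzilbb.labbcat: mimics the documented
## range semantics (min is inclusive, max is exclusive) on plain numbers.
labelInRange <- function(value, min = -Inf, max = Inf) {
  value <- as.numeric(value)
  value >= as.numeric(min) & value < as.numeric(max)
}

frequencies <- c(1, 2, 3)

## max = "2" with no min keeps only values strictly below 2
labelInRange(frequencies, max = "2")
```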

Examples of valid pattern objects include:

## the word 'the' followed immediately by a word starting with an orthographic vowel
pattern <- "the [aeiou]"

## a word spelt with "k" but pronounced "n" word initially
pattern <- list(orthography = "k.*", phonemes = "n.*")

## the word 'the' followed immediately by a word starting with a phonemic vowel
pattern <- list(
    list(orthography = "the"),
    list(phonemes = "[cCEFHiIPqQuUV0123456789~#\\$@].*"))

## the word 'the' followed immediately or with one intervening word by
## a hapax legomenon (word with a frequency of 1) that doesn't start with a vowel
pattern <- list(columns = list(
    list(layers = list(
           orthography = list(pattern = "the")),
         adj = 2),
    list(layers = list(
           phonemes = list(not = TRUE, pattern = "[cCEFHiIPqQuUV0123456789~#\\$@].*"),
           frequency = list(max = "2")))))

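One way to read the four pattern forms is that the shorter ones are shorthand for the full “columns” structure. The sketch below makes that reading concrete with a hypothetical helper, expandPattern, which is not part of nzilbb.labbcat and is not necessarily how the package converts patterns internally. It assumes each short-form value is a regular expression for the named layer.

```r
## Hypothetical helper, not part of nzilbb.labbcat: rewrites the shorter
## pattern forms into the full "columns" structure described above.
expandPattern <- function(pattern) {
  ## a string: one column per space-delimited word, on the orthography layer
  if (is.character(pattern)) {
    words <- strsplit(pattern, " ", fixed = TRUE)[[1]]
    pattern <- lapply(words, function(w) list(orthography = w))
  }
  if (!is.null(names(pattern))) {
    ## already the full form: return it unchanged
    if ("columns" %in% names(pattern)) return(pattern)
    ## a single named list: wrap it as a one-column search
    pattern <- list(pattern)
  }
  ## a list of named lists: each value becomes list(pattern = value);
  ## adj is omitted, so the documented default of 1 applies
  list(columns = lapply(pattern, function(column) {
    list(layers = lapply(column, function(regex) list(pattern = regex)))
  }))
}

## inspect the expanded form of the simplest example above
str(expandPattern("the [aeiou]"))
```

Writing the full form by hand is only needed when a column requires something the short forms cannot express, such as adj, not, min, max, anchoring or target.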
participant.expression

An optional participant query expression for identifying participants whose utterances are to be searched. This should be the output of expressionFromIds, expressionFromAttributeValue, or expressionFromAttributeValues, or more than one of these concatenated together and delimited by ' && '. If not supplied, utterances of all participants will be searched.

transcript.expression

An optional transcript query expression for identifying transcripts to search in. This should be the output of expressionFromIds, expressionFromTranscriptTypes, expressionFromAttributeValue, or expressionFromAttributeValues, or more than one of these concatenated together and delimited by ' && '. If not supplied, all transcripts will be searched.
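For both participant.expression and transcript.expression, joining several expressions with ' && ' can be done with paste. The sketch below uses a hypothetical helper, combineExpressions, which is not part of nzilbb.labbcat; the strings passed to it are placeholders standing in for the output of the expressionFrom... functions.

```r
## Hypothetical helper, not part of nzilbb.labbcat: joins any number of
## query expressions with ' && ', skipping any that are NULL.
combineExpressions <- function(...) {
  expressions <- Filter(Negate(is.null), list(...))
  ## nothing to combine: return NULL, which means "no restriction"
  if (length(expressions) == 0) return(NULL)
  paste(unlist(expressions), collapse = " && ")
}

## placeholder strings stand in for real query expressions
combineExpressions("<expression A>", NULL, "<expression B>")
```

Returning NULL when no expression is given matches the default value of both parameters, so the result can be passed straight through.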

main.participant

TRUE to search only main-participant utterances, FALSE to search all utterances.

aligned

This parameter is deprecated and will be removed in future versions; please use anchor.confidence.min=50 instead.

matches.per.transcript

Optional maximum number of matches per transcript to return. NULL means all matches.

words.context

Number of words of context to include in the ‘Before.Match’ and ‘After.Match’ columns in the results.

max.matches

The maximum number of matches to return, or NULL to return all.
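The three parameters matches.per.transcript, words.context and max.matches are useful together when previewing a search before running it in full. A minimal sketch, assuming a hypothetical wrapper named previewMatches with arbitrarily chosen limits:

```r
## Hypothetical wrapper, not part of nzilbb.labbcat; the limits are arbitrary.
## Defining it does not contact the server; calling it does.
previewMatches <- function(labbcat.url, pattern) {
  nzilbb.labbcat::getMatches(
    labbcat.url, pattern,
    matches.per.transcript = 5,  # at most 5 matches from any one transcript
    words.context = 2,           # 2 words in Before.Match and After.Match
    max.matches = 100)           # return at most 100 matches overall
}
```

It would be called in the same way as getMatches, for example previewMatches(labbcat.url, "the [aeiou]").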

overlap.threshold

The percentage overlap with other utterances before simultaneous speech is excluded, or NULL to include overlapping speech.

anchor.confidence.min

The minimum confidence for alignments, e.g.

  • 0 – return all alignments, regardless of confidence;

  • 50 – return only alignments that have been at least automatically aligned;

  • 100 – return only manually-set alignments.
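Code that previously relied on the deprecated aligned parameter can pass anchor.confidence.min = 50 instead. A minimal sketch, assuming a hypothetical wrapper named getAlignedMatches:

```r
## Hypothetical wrapper, not part of nzilbb.labbcat: restricts results to
## tokens whose alignment has been at least automatically set.
## Defining it does not contact the server; calling it does.
getAlignedMatches <- function(labbcat.url, pattern, ...) {
  nzilbb.labbcat::getMatches(
    labbcat.url, pattern,
    anchor.confidence.min = 50,  # replaces the deprecated 'aligned' parameter
    ...)
}
```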

page.length

In order to prevent timeouts when there are a large number of matches or the network connection is slow, rather than retrieving matches in one big request, they are retrieved using many smaller requests. This parameter controls the number of results retrieved per request.
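The trade-off behind page.length can be estimated with simple arithmetic. The sketch below assumes that matches are fetched in consecutive pages of page.length results each; requestsNeeded is a hypothetical helper, not part of nzilbb.labbcat.

```r
## Hypothetical helper, not part of nzilbb.labbcat.
## Assumption: matches are fetched in consecutive pages of page.length results.
requestsNeeded <- function(match.count, page.length = 1000) {
  ceiling(match.count / page.length)
}

requestsNeeded(2500)        # 3 requests with the default page.length
requestsNeeded(2500, 250)   # 10 requests with smaller pages
```

A smaller page.length means more requests, each of which returns less data and so is less likely to time out.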

no.progress

TRUE to suppress the visual progress bar. Otherwise, a progress bar is shown when interactive() is TRUE.

Value

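A data frame identifying matches, with one row for each match on the target layer. Its columns include:

  • MatchId An ID for the match, which can be used to access results using other functions

  • Text The text that matched

  • Before.Match The context before the match, as set by words.context

  • After.Match The context after the match, as set by words.context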
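Because the result is an ordinary data frame, it can be summarised and filtered with base R. The sketch below uses a mock data frame in place of a real search result; only the column names MatchId, Text, Before.Match and After.Match come from this page, and every value is an invented placeholder.

```r
## Mock data frame for illustration only; the values are invented
## placeholders, not real search results or real MatchId values.
results <- data.frame(
  MatchId = c("match-1", "match-2", "match-3"),
  Before.Match = c("saw", "in", "at"),
  Text = c("the apple", "the end", "the apple"),
  After.Match = c("fall", "of", "tree"),
  stringsAsFactors = FALSE)

## how often each matched text occurs
table(results$Text)

## the MatchId values for one particular matched text
results[results$Text == "the apple", "MatchId"]
```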

See Also

getFragments

getSoundFragments

getMatchLabels

getMatchAlignments

processWithPraat

getParticipantIds

Examples

## Not run: 
## define the LaBB-CAT URL
labbcat.url <- "https://labbcat.canterbury.ac.nz/demo/"

## the word 'the' followed immediately by a word starting with an orthographic vowel
theThenOrthVowel <- getMatches(labbcat.url, "the [aeiou]")

## a word spelt with "k" but pronounced "n" word initially
knWords <- getMatches(labbcat.url, list(orthography = "k.*", phonemes = "n.*"))

## the word 'the' followed immediately by a word starting with a phonemic vowel
theThenPhonVowel <- getMatches(
  labbcat.url, list(
    list(orthography = "the"),
    list(phonemes = "[cCEFHiIPqQuUV0123456789~#\\$@].*")))

## the word 'the' followed immediately or with one intervening word by
## a hapax legomenon (word with a frequency of 1) that doesn't start with a vowel
results <- getMatches(
  labbcat.url, list(columns = list(
    list(layers = list(
           orthography = list(pattern = "the")),
         adj = 2),
    list(layers = list(
           phonemes = list(not=TRUE, pattern = "[cCEFHiIPqQuUV0123456789~#\\$@].*"),
           frequency = list(max = "2"))))),
  overlap.threshold = 5)

## all tokens of the KIT vowel, from the interview or monologue
## of the participants AP511_MikeThorpe and BR2044_OllyOhlson
results <- getMatches(labbcat.url, list(segment="I"),
  participant.expression = expressionFromIds(c("AP511_MikeThorpe","BR2044_OllyOhlson")),
  transcript.expression = expressionFromTranscriptTypes(c("interview","monologue")))

## all tokens of the KIT vowel for male speakers who speak English
results <- getMatches(labbcat.url, list(segment="I"),
  participant.expression = paste(
    expressionFromAttributeValue("participant_gender", "M"),
    expressionFromAttributeValues("participant_languages_spoken", "en"),
    sep=" && "))

## results$Text is the text that matched
## results$MatchId can be used to access results using other functions

## End(Not run)


[Package nzilbb.labbcat version 1.3-0 Index]