sm.compositionality {cultevo}  R Documentation 
Spike's segmentation and measure of additive compositionality.
Description
Implementation of the Spike-Montague segmentation and measure of additive compositionality (Spike 2016), which finds the most predictive associations between meaning features and substrings. Computation is deterministic and fast.
Usage
sm.compositionality(x, y, groups = NULL, strict = FALSE)
sm.segmentation(x, y, strict = FALSE)
Arguments
x 
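a list or vector of character sequences specifying the signals to
be analysed. Alternatively, a formula such as signal ~ colour + animal, naming the signal column and the meaning columns of the data frame passed as y (see Examples).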
y 
a matrix or data frame with as many rows as there are signals,
indicating the presence/value of the different meaning dimensions along
columns (see section Meaning data format). If 
groups 
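a list or vector with as many items as strings, used to split
the signals and their meanings into groups for which the measure is computed separately (one row of output per group).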
strict 
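logical: if TRUE, candidate segments are restricted to those which are non-overlapping and of maximal length (see Value and Examples).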
Details
The algorithm works on compositional meanings that can be expressed as sets of categorical meaning features (see below), and does not take the order of elements into account. Rather than looking directly at how complex meanings are expressed, the measure captures the degree to which a homonymy- and synonymy-free signalling system exists at the level of individual semantic features.
The segmentation algorithm provided by sm.segmentation() scans through all substrings found in strings to find the pairings of meaning features and substrings whose respective presence is most predictive of each other. Mathematically, for every meaning feature f \in M, it finds the substring s_{ij} from the set of strings S that yields the highest mutual predictability across all signals,
mp(f,S) = \max_{s_{ij}\in S}\ P(f|s_{ij}) \cdot P(s_{ij}|f)\;.
Based on the mutual predictability levels obtained for the individual meaning features, sm.compositionality then computes the mean mutual predictability weighted by the individual features' relative frequencies of attestation, i.e.
mp(M,S) = \sum_{f\in M} freq_f \cdot mp(f,S)\;,
as a measure of the overall compositionality of the signalling system.
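The two formulas above can be illustrated with a minimal base R sketch. It is not the package's implementation: the helpers all.substrings() and mutual.predictability() are hypothetical names introduced here, the search is a brute-force scan, and freq_f is assumed to be each feature's share of all feature attestations (so that the weights sum to 1). Every feature is assumed to be attested at least once.

```r
# Hypothetical helper (not part of cultevo): all distinct substrings of one string.
all.substrings <- function(s) {
  n <- nchar(s)
  idx <- expand.grid(start = seq_len(n), end = seq_len(n))
  idx <- idx[idx$start <= idx$end, ]
  unique(substring(s, idx$start, idx$end))
}

# Hypothetical helper: highest mutual predictability between one meaning
# feature (a logical vector, one entry per signal) and any attested substring.
mutual.predictability <- function(signals, feature) {
  candidates <- unique(unlist(lapply(signals, all.substrings)))
  mp <- vapply(candidates, function(sub) {
    present <- grepl(sub, signals, fixed = TRUE)
    both <- sum(present & feature)
    # P(f|s) * P(s|f)
    (both / sum(present)) * (both / sum(feature))
  }, numeric(1))
  max(mp)
}

signals <- c("a", "b", "ab")
meanings <- cbind(a = c(TRUE, FALSE, TRUE), b = c(FALSE, TRUE, TRUE))

# mp(f,S) for every feature, then the frequency-weighted mean mp(M,S)
mp <- apply(meanings, 2, function(f) mutual.predictability(signals, f))
freq <- colSums(meanings) / sum(meanings)  # assumed definition of freq_f
mean.mp <- sum(freq * mp)
```

For this perfectly regular toy system both features reach a mutual predictability of 1, so the weighted mean is 1 as well.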
Since mutual predictability is determined separately for every meaning
feature, the most predictive substrings posited for different meaning
features as returned by sm.segmentation() can overlap, and even coincide
completely. Such results are generally indicative of either limited data
(in particular frequent co-occurrence of the meaning features in question),
or spurious results in the absence of a consistent signalling system. The
latter will also be indicated by the significance level of the given mutual
predictability.
Value
sm.segmentation provides detailed information about the most
predictably co-occurring segments for every meaning feature. It returns
a data frame with one row for every meaning feature, in descending order
of the mutual predictability from (and to) their corresponding string
segments. The data frame has the following columns:
N
The number of signal-meaning pairings in which this meaning feature was attested.
mp
The highest mutual predictability between this meaning feature and one (or more) segments that was found.
p
Significance level of the given mutual predictability, i.e. the probability that the given mutual predictability level could be reached by chance. The calculation depends on the frequency of the meaning feature as well as the number and relative frequency of all substrings across all signals (see below).
ties
The number of substrings found in strings which have this same level of mutual predictability with the meaning feature.
segments
For strict=FALSE: a list containing the tied substrings in descending order of their length (the ordering is for convenience only and not inherently meaningful). When strict=TRUE, the lists of segments for each meaning feature are all of the same length, with a meaningful relationship of the order of segments across the different rows: every set of segments which are found in the same position for each of the different meaning features constitutes a valid segmentation where the segments' occurrences in the actual signals do not overlap.
sm.compositionality calculates the weighted average of the
mutual predictability of all meaning features and their most predictably
co-occurring strings, as computed by sm.segmentation. The function
returns a data frame of three columns:
N is the total number of signals (utterances) on which the computation
was based, M the number of distinct meaning features attested across
all signals, and meanmp the mean mutual predictability across all these
features, weighted by the features' relative frequency. When groups is
not NULL, the data frame contains one row for every group.
Null distribution and p-value calculation
A perfectly unambiguous mapping between a meaning feature and a specific
string segment will always yield a mutual predictability of 1. In the
absence of such a regular mapping, on the other hand, chance co-occurrences
of strings and meanings will in most cases stop the mutual predictability
from going all the way down to 0. In order to help distinguish chance
co-occurrence levels from significant signal-meaning associations,
sm.segmentation() provides significance levels for the mutual
predictability levels obtained for each meaning feature.
What is the baseline level of association between a meaning feature and a set of substrings that we would expect to be due to chance co-occurrences? This depends on several factors, from the number of data points on which the analysis is based to the frequency of the meaning feature in question and, perhaps most importantly, the overall makeup of the different substrings that are present in the signals. Since every substring attested in the data is a candidate for signalling the presence of a meaning feature, the absolute number of different substrings greatly affects the likelihood of chance signal-meaning associations. (Diversity of the set of substrings is in turn heavily influenced by the size of the underlying alphabet, a factor which is often not appreciated.)
For every candidate substring, the degree of association with a specific meaning feature that we would expect by chance is again dependent on the absolute number of signals in which the substring is attested.
Starting from the simplest case, take a meaning that is featured in m
of the total n signals (where 0 < m \leq n). Assume next that
there is a string segment that is attested in s of these signals
(where again 0 < s \leq n). The degree of association between the
meaning feature and string segment is dependent on the number of times that
they co-occur, which can be no more than c_{max} = \min(m, s) times.
The null probability of getting a given number of co-occurrences can be
obtained by considering all possible reshufflings of the meaning feature in
question across all signals: if s signals contain a given substring,
how many of s randomly drawn signals from the pool of n signals
would contain the meaning feature if a total of m signals in the pool
did? Approached from this angle, the likelihood of the number of
co-occurrences follows the hypergeometric distribution,
with c being the number of successes when taking s draws without
replacement from a population of size n with fixed number of successes m.
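As an illustration of this sampling view, the following sketch uses base R's dhyper() with made-up counts (10 signals, 4 of them carrying the meaning feature, 3 containing the substring). The variable names are chosen here for readability and are not taken from the package.

```r
n.signals   <- 10  # n: total number of signals
n.feature   <- 4   # m: signals in which the meaning feature is attested
n.substring <- 3   # s: signals in which the substring is attested

# possible numbers of co-occurrences, 0 .. c_max = min(m, s)
cooc <- 0:min(n.feature, n.substring)

# dhyper(x, m, n, k): x successes in k draws from an urn holding
# m successes and n failures, so the failures here are n.signals - n.feature
null.prob <- dhyper(cooc, n.feature, n.signals - n.feature, n.substring)
names(null.prob) <- cooc
null.prob
```

The probabilities sum to 1 over the possible co-occurrence counts, and the chance of the substring co-occurring with the feature in all three of its signals is 4/120.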
For every number of co-occurrences c \in [0, c_{max}], one can
compute the corresponding mutual predictability level as
p(c|s) \cdot p(c|m) to obtain the null distribution of mutual
predictability levels between a meaning feature and one substring of a
particular frequency s, where f is the probability mass function of the hypergeometric distribution:
Pr(mp = p(c|s) \cdot p(c|m)) = f(k=c; N=n, K=m, n=s)
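A small worked example of this per-substring null distribution, again with made-up counts and illustrative variable names. It assumes that the mutual predictability reached with c co-occurrences is (c/s) * (c/m), and p.at.least() is a hypothetical helper, not a package function.

```r
n.signals   <- 10  # n
n.feature   <- 4   # m
n.substring <- 3   # s

cooc <- 0:min(n.feature, n.substring)

# mutual predictability reached with each number of co-occurrences
mp.levels <- (cooc / n.substring) * (cooc / n.feature)

# hypergeometric probability of each number of co-occurrences
null.prob <- dhyper(cooc, n.feature, n.signals - n.feature, n.substring)

# Hypothetical helper: chance probability of a mutual predictability at
# least as high as 'mp' (small tolerance guards against rounding error)
p.at.least <- function(mp, tol = 1e-12) sum(null.prob[mp.levels >= mp - tol])

p.at.least(1 / 3)  # level reached with 2 co-occurrences
```

With these counts a mutual predictability of 1 cannot be reached at all, because the substring and the meaning feature differ in frequency.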
From this, we can now derive the null distribution for the entire set of
attested substrings as follows: making the simplifying assumption that the
occurrences of different substrings are independent of each other, we first
aggregate over the null distributions of all the individual substrings to
obtain the mean probability p=Pr(X\ge mp) of finding a mutual
predictability level at least as high as mp for one randomly drawn
string from the entire population of substrings. Assuming the total number
of candidate substrings is |S|, the overall null probability that at
least one of them would yield a mutual predictability at least as high is
Pr(X\ge 1), X \sim B(n=|S|, p=p)\;.
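The final aggregation step, the probability that at least one of the candidate substrings reaches the given level by chance, can be sketched with base R's pbinom(). The values of p.single and n.candidates below are invented for illustration only.

```r
p.single     <- 0.01  # mean chance probability for one randomly drawn substring
n.candidates <- 50    # total number of candidate substrings

# probability of at least one success in n.candidates independent trials:
# pbinom(0, ..., lower.tail = FALSE) gives Pr(X > 0)
p.any <- pbinom(0, size = n.candidates, prob = p.single, lower.tail = FALSE)

# the same quantity via the complement of "no candidate succeeds"
p.any.check <- 1 - (1 - p.single)^n.candidates
```

Even a small per-substring probability adds up quickly: with 50 candidates at 1% each, the chance of at least one spurious association is roughly 39%.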
Note that, since the null distribution also depends on the frequency with which the meaning feature is attested, the significance levels corresponding to a given mutual predictability level are not necessarily identical for all meaning features, even within one analysis.
(In theory, one can also compute an overall p-value of the weighted mean
mutual predictability as calculated by sm.compositionality. However, the
significance levels for the individual meaning features are much more
insightful and should therefore be consulted directly.)
Meaning data format
The meanings argument (y) can be a matrix or data frame in one of two formats.
If it is a matrix of logicals (TRUE/FALSE values), then the columns are
assumed to refer to meaning features, with individual cells indicating
whether the meaning feature is present or absent in the signal represented
by that row (see binaryfeaturematrix() for an explanation). If y
is a data frame or matrix of any other type, it is assumed that the columns
specify different meaning dimensions, with the cell values showing the
levels with which the different dimensions can be realised. This
dimension-based representation is automatically converted to a
feature-based one by calling binaryfeaturematrix(). As a consequence,
whatever the actual types of the columns in the meaning matrix, they will
be treated as categorical factors for the purpose of this algorithm, also
discarding any explicit knowledge of which 'meaning dimension' they might
belong to.
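To make the difference between the two formats concrete, here is a base R sketch of such a conversion. It does not call binaryfeaturematrix() and may differ from it in detail; in particular the dimension.level column naming is an assumption made for this illustration.

```r
# dimension-based representation: one column per meaning dimension
meanings <- data.frame(animal = c("dog", "dog", "cat", "cat"),
                       colour = c("green", "blue", "green", "blue"))

# feature-based representation: one logical column per dimension/level pair
features <- do.call(cbind, lapply(names(meanings), function(dimension) {
  values <- as.factor(meanings[[dimension]])
  m <- sapply(levels(values), function(l) values == l)
  # column naming scheme is hypothetical
  colnames(m) <- paste(dimension, levels(values), sep = ".")
  m
}))
features
```

Every row of the resulting logical matrix has exactly one TRUE per original dimension, but the matrix itself no longer records which columns belonged to the same dimension.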
References
Spike, M. (2016). Minimal requirements for the cultural evolution of language. PhD thesis, The University of Edinburgh. http://hdl.handle.net/1842/25930
See Also
binaryfeaturematrix(), ssm.compositionality()
Examples
# perfect communication system for two meaning features (which are marked
# as either present or absent)
sm.compositionality(c("a", "b", "ab"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))
sm.segmentation(c("a", "b", "ab"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))

# not quite perfect communication system
sm.compositionality(c("as", "bas", "basf"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))
sm.segmentation(c("as", "bas", "basf"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)))

# same communication system, but force candidate segments to be
# non-overlapping via the 'strict' option
sm.segmentation(c("as", "bas", "basf"),
  cbind(a=c(TRUE, FALSE, TRUE), b=c(FALSE, TRUE, TRUE)), strict=TRUE)

# the function also accepts meaning-dimension based matrix definitions:
print(twobytwoanimals <- enumerate.meaningcombinations(c(animal=2, colour=2)))

# note how there are many more candidate segments than just the full length
# ones. the less data we have, the more likely it is that shorter substrings
# will be just as predictable as the full segments that contain them.
sm.segmentation(c("greendog", "bluedog", "greencat", "bluecat"), twobytwoanimals)

# perform the same analysis, but using the formula interface
print(twobytwosignalingsystem <- cbind(twobytwoanimals,
  signal=c("greendog", "bluedog", "greencat", "bluecat")))
sm.segmentation(signal ~ colour + animal, twobytwosignalingsystem)

# since there is no overlap in the constituent characters of the identified
# 'morphemes', they are all tied in their mutual predictiveness with the
# (shorter) substrings they contain
#
# to reduce the pool of candidate segments to those which are
# non-overlapping and of maximal length, again use the 'strict=TRUE' option:
sm.segmentation(signal ~ colour + animal, twobytwosignalingsystem, strict=TRUE)