skeleton {pcalg} | R Documentation |
Estimate (Initial) Skeleton of a DAG using the PC / PC-Stable Algorithm
Description
Estimate the skeleton of a DAG without latent and selection variables using the PC Algorithm or estimate an initial skeleton of a DAG with arbitrarily many latent and selection variables using the FCI and the RFCI algorithms.
If used in the PC algorithm, it estimates the order-independent
“PC-stable” ("stable"
) or original PC ("original"
)
“skeleton” of a directed acyclic graph (DAG) from observational
data.
When used in the FCI and RFCI algorithms, this function estimates only an initial order-independent (or PC original) “skeleton”. Because of the presence of latent and selection variables, to find the final skeleton those algorithms need to perform additional tests later on and consequently some edges can be further deleted.
Usage
skeleton(suffStat, indepTest, alpha, labels, p,
method = c("stable", "original", "stable.fast"), m.max = Inf,
fixedGaps = NULL, fixedEdges = NULL, NAdelete = TRUE,
numCores = 1, verbose = FALSE)
Arguments
suffStat |
Sufficient statistics: List containing all necessary
elements for the conditional independence decisions in the
function |
indepTest |
Predefined |
alpha |
significance level (number in |
labels |
(optional) character vector of variable (or
“node”) names. Typically preferred to specifying |
p |
(optional) number of variables (or nodes). May be specified
if |
method |
Character string specifying method; the default,
|
m.max |
Maximal size of the conditioning sets that are considered in the conditional independence tests. |
fixedGaps |
logical symmetric matrix of dimension p*p. If entry
|
fixedEdges |
a logical symmetric matrix of dimension p*p. If entry
|
NAdelete |
logical needed for the case |
numCores |
number of processor cores to use for parallel computation.
Only available for |
verbose |
if |
Details
Under the assumption that the distribution of the observed variables
is faithful to a DAG and that there are no latent and selection
variables, this function estimates the skeleton of the DAG. The
skeleton of a DAG is the undirected graph resulting from removing all
arrowheads from the DAG. Edges in the skeleton of a DAG have the
following interpretation:
There is an edge between i
and j
, i
– j
,
if and only if variables i
and j
are conditionally
dependent given S
for all possible subsets S
of the
remaining nodes.
On the other hand, the distribution of the observed variables
is faithful to a DAG with arbitrarily many latent and selection
variables, skeleton()
estimates the initial skeleton of the
DAG. Edges in this initial skeleton of a DAG have the
following interpretation:
There is an edge i
– j
if and only if variables i
and
j
are conditionally dependent given S
for all possible
subsets S
of the neighbours of i
and the neighbours of
j
.
The data are not required to follow a specific distribution,
but one should make sure that the conditional indepedence test used in
indepTest
is appropriate for the data. Pre-programmed versions
of indepTest
are available for Gaussian data
(gaussCItest
), discrete data (disCItest
),
and binary data (see binCItest
). Users may also specify
their own indepTest
function.
The PC algorithm (Spirtes, Glymour and Scheines, 2000)
(method = "original"
) is known to be order-dependent, in the
sense that the output may depend on the order in which the variables
are given. Therefore, Colombo and Maathuis (2014) proposed a simple
modification, called “PC-stable”, which yields
order-independent adjacencies in the skeleton, provided by pc()
with the new default method = "stable"
. This stable variant
of the algorithm is also available with the method = "stable.fast"
:
it runs the algorithm of Colombo and Maathuis (2014) faster than
method = "stable"
in general, but should be regarded as an
experimental option at the moment.
The algorithm starts with a complete undirected graph. In each
step, it visits all pairs (i,j)
of adjacent nodes in the
current graph, and determines based on conditional independence tests
whether the edge i--j
should be removed. In particular, for each step
m
(m=0,1,\dots
) of the size of the conditioning sets, the
algorithm at first determines the neighbours a(i)
of each node
i
in the graph. Then, the algorithm visits all pairs (i,j)
of adjacent nodes in the current graph, and the edge i--j
is
kept if and only if the null hypothesis
i
and j
are conditionally independent given S
rejected at significance level alpha
for all subsets S
of size
m
of a(i)
and of a(j)
(as judged by the function
indepTest
). For the "stable"
method, the neighborhoods
a(i)
are kept fixed within each value of m
, and this
makes the algorithm order-independent. Method "original"
,
the original PC algorithm would update the neighbour list after each
edge change.
The algorithm stops when m
is larger than the largest
neighbourhood size of all nodes, or when m
has reached the limit
m.max
which may be set by the user.
Since the FCI (Spirtes, Glymour and Scheines, 2000) and RFCI (Colombo
et al., 2012) algorithms are built up from the PC algorithm, they are also
order-dependent in the skeleton. To resolve their order-dependence
issues in the skeleton is more involved, see Colombo and Maathuis
(2014). However now, with method = "stable"
, this function
estimates an initial order-independent skeleton in these algorithms
(for additional details on how to make the final skeleton of FCI fully
order-independent see fci
and Colombo and Maathuis (2014)).
The information in fixedGaps
and fixedEdges
is used as follows.
The gaps given in fixedGaps
are introduced in the very beginning of
the algorithm by removing the corresponding edges from the complete
undirected graph. Pairs (i,j)
in fixedEdges
are skipped
in all steps of the algorithm, so that these edges remain in the graph.
Note: Throughout, the algorithm works with the column positions of the variables in the adjacency matrix, and not with the names of the variables.
Value
An object of class
"pcAlgo"
(see
pcAlgo
) containing an estimate of the skeleton of
the underlying DAG, the conditioning sets (sepset
) that led to
edge removals and several other parameters.
Author(s)
Markus Kalisch (kalisch@stat.math.ethz.ch), Martin Maechler, Alain Hauser, and Diego Colombo.
References
D. Colombo and M.H. Maathuis (2014).Order-independent constraint-based causal structure learning. Journal of Machine Learning Research 15 3741-3782.
D. Colombo, M. H. Maathuis, M. Kalisch, T. S. Richardson (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables. Ann. Statist. 40, 294-321.
M. Kalisch and P. Buehlmann (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm, JMLR 8 613-636.
P. Spirtes, C. Glymour and R. Scheines (2000). Causation, Prediction, and Search, 2nd edition, MIT Press.
See Also
pc
for generating a partially directed graph
using the PC algorithm; fci
for generating a partial
ancestral graph using the FCI algorithm; rfci
for
generating a partial ancestral graph using the RFCI algorithm;
udag2pdag
for converting the skeleton to a CPDAG.
Further, gaussCItest
, disCItest
,
binCItest
and dsepTest
as examples for
indepTest
.
Examples
##################################################
## Using Gaussian Data
##################################################
## Load predefined data
data(gmG)
n <- nrow (gmG8$x)
V <- colnames(gmG8$x) # labels aka node names
## estimate Skeleton
skel.fit <- skeleton(suffStat = list(C = cor(gmG8$x), n = n),
indepTest = gaussCItest, ## (partial correlations)
alpha = 0.01, labels = V, verbose = TRUE)
if (require(Rgraphviz)) {
## show estimated Skeleton
par(mfrow=c(1,2))
plot(skel.fit, main = "Estimated Skeleton")
plot(gmG8$g, main = "True DAG")
}
##################################################
## Using d-separation oracle
##################################################
## define sufficient statistics (d-separation oracle)
Ora.stat <- list(g = gmG8$g, jp = RBGL::johnson.all.pairs.sp(gmG8$g))
## estimate Skeleton
fit.Ora <- skeleton(suffStat=Ora.stat, indepTest = dsepTest, labels = V,
alpha=0.01) # <- irrelevant as dsepTest returns either 0 or 1
if (require(Rgraphviz)) {
## show estimated Skeleton
plot(fit.Ora, main = "Estimated Skeleton (d-sep oracle)")
plot(gmG8$g, main = "True DAG")
}
##################################################
## Using discrete data
##################################################
## Load data
data(gmD)
V <- colnames(gmD$x) # labels aka node names
## define sufficient statistics
suffStat <- list(dm = gmD$x, nlev = c(3,2,3,4,2), adaptDF = FALSE)
## estimate Skeleton
skel.fit <- skeleton(suffStat,
indepTest = disCItest, ## (G^2 statistics independence test)
alpha = 0.01, labels = V, verbose = TRUE)
if (require(Rgraphviz)) {
## show estimated Skeleton
par(mfrow = c(1,2))
plot(skel.fit, main = "Estimated Skeleton")
plot(gmD$g, main = "True DAG")
}
##################################################
## Using binary data
##################################################
## Load binary data
data(gmB)
X <- gmB$x
## estimate Skeleton
skel.fm2 <- skeleton(suffStat = list(dm = X, adaptDF = FALSE),
indepTest = binCItest, alpha = 0.01,
labels = colnames(X), verbose = TRUE)
if (require(Rgraphviz)) {
## show estimated Skeleton
par(mfrow = c(1,2))
plot(skel.fm2, main = "Binary Data 'gmB': Estimated Skeleton")
plot(gmB$g, main = "True DAG")
}