textplot_correlation_lines {textplot} | R Documentation |
Document/Term Correlation Plot
Description
Plots the strongest correlations among terms.
Terms are drawn as nodes and the correlations between them as lines (edges) between the nodes.
The line width of each edge is proportional to the strength of the correlation.
Plotting uses the plot function for graphNEL objects from the Rgraphviz package.
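As a rough illustration of the idea described above, the following base R sketch turns a small document-term matrix into a list of term pairs with their correlations, which is the kind of information the plot encodes as nodes and edges. The toy counts and the names dtm_toy, term_cor and edges are made up for this sketch; it uses a dense matrix and cor() and is not necessarily how the function computes its correlations internally.

```r
# Toy document-term matrix: 4 documents (rows) x 3 terms (columns).
# The counts are invented for illustration only.
dtm_toy <- matrix(
  c(2, 1, 0,
    1, 1, 0,
    0, 0, 3,
    3, 2, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("prijs", "privacy", "locatie"))
)

# Pearson correlation between the term columns (base R, dense matrix).
term_cor <- cor(dtm_toy)

# Each unordered pair of terms is one candidate edge between two nodes.
idx <- which(upper.tri(term_cor), arr.ind = TRUE)
edges <- data.frame(
  term1 = rownames(term_cor)[idx[, "row"]],
  term2 = colnames(term_cor)[idx[, "col"]],
  cor   = term_cor[idx],
  stringsAsFactors = FALSE
)
edges
```

Each row of edges corresponds to one potential line in the plot; the terms are the nodes and the cor column would drive the line width.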
Usage
textplot_correlation_lines(x, ...)
## Default S3 method:
textplot_correlation_lines(
  x,
  terms = colnames(x),
  threshold = 0.05,
  top_n,
  attrs = textplot_correlation_lines_attrs(),
  terms_highlight,
  label = FALSE,
  cex.label = 1,
  col.highlight = "red",
  lwd = 1,
  ...
)
Arguments
x |
a document-term matrix of class dgCMatrix |
... |
other arguments passed on to plot |
terms |
a character vector with terms present in the columns of x. Defaults to colnames(x). |
threshold |
a threshold to show only correlations between the terms with absolute values above this threshold. Defaults to 0.05. |
top_n |
an integer giving the number of correlations to show. Only the top_n correlations with the highest absolute value are plotted, e.g. set it to 20 to show only the 20 strongest correlations. |
attrs |
a list of attributes with graph visualisation elements, passed on to the plot function of an object of class graphNEL. Defaults to textplot_correlation_lines_attrs(). |
terms_highlight |
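a character vector of terms to highlight. The examples also pass a named numeric vector, e.g. c(prijs = 0.8, privacy = 0.1), to highlight and increase those terms. |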
label |
logical indicating whether to label each line between two nodes with the size of the correlation. Defaults to FALSE. |
cex.label |
character expansion factor (cex) of the correlation labels. Defaults to 1. |
col.highlight |
color to use for highlighted terms specified in terms_highlight. Defaults to "red". |
lwd |
numeric value - graphical parameter used to increase the edge thickness which indicates the correlation strength. Defaults to 1. |
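The interplay of threshold and top_n can be sketched in base R as follows. The edge list and its values are hypothetical, and the order of the two steps (threshold first, then top_n) is an assumption made for this illustration rather than a statement about the function's internals.

```r
# Hypothetical edge list of term correlations (values are made up).
edges <- data.frame(
  term1 = c("prijs", "prijs", "privacy", "locatie"),
  term2 = c("privacy", "locatie", "locatie", "metro"),
  cor   = c(0.40, -0.02, -0.30, 0.10),
  stringsAsFactors = FALSE
)

threshold <- 0.05
top_n <- 2

# Keep correlations whose absolute value is above the threshold ...
kept <- edges[abs(edges$cor) > threshold, ]
# ... then keep the top_n strongest ones by absolute value.
kept <- kept[order(abs(kept$cor), decreasing = TRUE), ]
kept <- head(kept, top_n)
kept
```

Because the selection is on absolute values, a strong negative correlation (here -0.30) is kept ahead of a weaker positive one (0.10).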
Value
The plot, returned invisibly.
Examples
## Construct a document-term matrix
library(graph)
library(Rgraphviz)
library(udpipe)
data(brussels_reviews_anno, package = 'udpipe')
exclude <- c(32337682L, 27210436L, 26820445L, 37658826L, 33661134L, 48756422L,
             23454554L, 30461127L, 23292176L, 32850277L, 30566303L, 21595142L,
             20441279L, 38097066L, 28651065L, 29011387L, 37316020L, 22135291L,
             40169379L, 38627667L, 29470172L, 24071827L, 40478869L, 36825304L,
             21597085L, 21427658L, 7890178L, 32322472L, 39874379L, 32581310L,
             43865675L, 31586937L, 32454912L, 34861703L, 31403168L, 35997324L,
             29002317L, 33546304L, 47677695L)
dtm <- brussels_reviews_anno
dtm <- subset(dtm, !doc_id %in% exclude)
dtm <- subset(dtm, xpos %in% c("NN") & language == "nl" & !is.na(lemma))
dtm <- document_term_frequencies(dtm, document = "doc_id", term = "lemma")
dtm <- document_term_matrix(dtm)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 5)
dtm <- dtm_remove_tfidf(dtm, top = 500)
## Plot top 25 correlations with an absolute value above 0.01
textplot_correlation_lines(dtm, top_n = 25, threshold = 0.01)
## Plot top 25 correlations, with correlation labels and thicker lines
textplot_correlation_lines(dtm, top_n = 25, label = TRUE, lwd = 5)
## Plot top 25 correlations and highlight some terms
textplot_correlation_lines(dtm, top_n = 25, label = TRUE, lwd = 5,
                           terms_highlight = c("prijs", "privacy"),
                           main = "Top correlations in topic xyz")
## Plot top 25 correlations and highlight + increase some terms
textplot_correlation_lines(dtm, top_n = 25, label = TRUE, lwd = 5,
                           terms_highlight = c(prijs = 0.8, privacy = 0.1),
                           col.highlight = "red")
## Plot correlations between specific terms
w <- dtm_colsums(dtm)
w <- head(sort(w, decreasing = TRUE), 100)
textplot_correlation_lines(dtm, terms = names(w), top_n = 20, label = TRUE)
## Customise the graph attributes (node shape, edge color)
attrs <- textplot_correlation_lines_attrs()
attrs$node$shape <- "rectangle"
attrs$edge$color <- "steelblue"
textplot_correlation_lines(dtm, top_n = 20, label = TRUE,
                           attrs = attrs)