duplist {tosca} | R Documentation |
Creating List of Duplicates
Description
Creates a list of the different types of duplicates found in a textmeta object.
Usage
duplist(object, paragraph = FALSE)
is.duplist(x)
## S3 method for class 'duplist'
print(x, ...)
## S3 method for class 'duplist'
summary(object, ...)
Arguments
object |
A textmeta-object. |
paragraph |
Logical: Should be set to |
x |
An R Object. |
... |
Further arguments for print and summary. Not implemented. |
Details
This function helps to identify the different types of duplicates in a corpus and makes it possible to exclude them from further analysis (e.g. LDA).
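The exclusion step mentioned above might look roughly like the following sketch. It only uses functions named on this page (textmeta, deleteAndRenameDuplicates, duplist, is.duplist) plus ordinary list subsetting. The three-document corpus is made up for illustration, and the assumption that the texts are stored in a `text` component named by ID is inferred from the `textmeta(meta=..., text=...)` call in the Examples, not confirmed here.

```r
library(tosca)

# Hypothetical toy corpus: A and B share the same text, C is distinct.
texts <- list(A = "So Long, and Thanks for All the Fish",
              B = "So Long, and Thanks for All the Fish",
              C = "A fake duplicate")
meta <- data.frame(id = c("A", "B", "C"),
                   title = c("Don't panic!", "towel day", "Fake duplicate"),
                   date = c("1979-03-04", "1979-03-05", "1885-01-03"),
                   stringsAsFactors = FALSE)
corpus <- textmeta(meta = meta, text = texts)

# Same preparation step as in the Examples section.
clean <- deleteAndRenameDuplicates(object = corpus)
dl <- duplist(object = clean, paragraph = FALSE)

# Assumption: texts live in clean$text and are named by ID, so the
# uniqueTexts IDs can be used directly to keep one copy of each text.
deduped <- clean$text[dl$uniqueTexts]
```

The resulting `deduped` list could then be passed on to later preprocessing in place of the full text list.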
Value
Named List:
uniqueTexts |
Character vector of IDs such that each text occurs exactly once. If a text occurs two or more times in the corpus, the ID of its first occurrence in list order is returned |
notDuplicatedTexts |
Character vector of IDs of texts which are represented only once in the whole corpus |
idFakeDups |
List of character vectors: IDs of texts which originally had the same ID but belong to different texts, grouped by their original ID |
idRealDups |
List of character vectors: IDs of texts which originally had the same ID and the same text but different meta information, grouped by their original ID |
allTextDups |
List of character vectors: IDs of texts which occur two or more times, grouped by text equality |
textMetaDups |
List of character vectors: IDs of texts which occur two or more times and have the same meta information, grouped by text and meta equality |
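To make the difference between uniqueTexts, notDuplicatedTexts and allTextDups concrete, here is a minimal base R sketch of the semantics described above. It is not tosca's implementation; the toy vector and all variable names are hypothetical and chosen only for illustration.

```r
# Toy stand-in for a corpus: a named character vector, names act as IDs.
texts <- c(a = "same text", b = "same text", c = "only once")

# One ID per distinct text, keeping the first occurrence in order
# (the documented meaning of uniqueTexts).
unique_ids <- names(texts)[!duplicated(texts)]

# IDs of texts that occur exactly once in the whole vector
# (the documented meaning of notDuplicatedTexts).
is_dup <- duplicated(texts) | duplicated(texts, fromLast = TRUE)
once_ids <- names(texts)[!is_dup]

# Groups of IDs sharing an identical text, keeping only groups of size > 1
# (the documented meaning of allTextDups).
groups <- split(names(texts), texts)
all_text_dups <- unname(groups[lengths(groups) > 1])
```

In this toy case the text shared by `a` and `b` contributes its first ID to `unique_ids` but neither ID to `once_ids`, which is the distinction the two components are meant to capture.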
Examples
library(tosca)

# Six texts under three IDs; the repeated names are intentional.
# Continuation lines inside the strings are left unindented so the text is unchanged.
texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
  A="A fake duplicate",
  B="So Long, and Thanks for All the Fish",
  B="So Long, and Thanks for All the Fish",
  C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.",
  C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

# Meta data: A has two different texts, B has equal texts with different meta,
# C has equal texts and equal meta.
corpus <- textmeta(meta=data.frame(id=c("A", "A", "B", "B", "C", "C"),
  title=c("Fishing", "Fake duplicate", "Don't panic!", "towel day", "Sir Ronald", "Sir Ronald"),
  date=c("1885-01-02", "1885-01-03", "1979-03-04", "1979-03-05", "1951-05-06", "1951-05-06"),
  stringsAsFactors=FALSE), text=texts)

# Make the IDs unique first, then list the duplicate types.
duplicates <- deleteAndRenameDuplicates(object=corpus)
duplist(object=duplicates, paragraph = FALSE)