duplist {tosca}    R Documentation

Creating List of Duplicates

Description

Creates a list of the different types of duplicates in a textmeta object.

Usage

duplist(object, paragraph = FALSE)

is.duplist(x)

## S3 method for class 'duplist'
print(x, ...)

## S3 method for class 'duplist'
summary(object, ...)

Arguments

object

A textmeta object.

paragraph

Logical: Set to TRUE if each article is a list of character strings representing its paragraphs.

x

An R object.

...

Further arguments for print and summary. Not implemented.

Details

This function helps to identify different types of duplicates and makes it possible to exclude them from further analysis (e.g. LDA).
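The exclusion step can be sketched in base R. This is only an illustration: texts and dups are hypothetical stand-ins written by hand, where dups mimics the uniqueTexts element of the returned list rather than being an actual duplist result.

```r
# Hypothetical stand-ins: a named list of texts and a hand-written list that
# only mimics the uniqueTexts element described under Value.
texts <- list(A = "a fish", B = "so long", B_dup = "so long")
dups <- list(uniqueTexts = c("A", "B"))

# Keep one copy of each text by subsetting on the retained IDs.
deduplicated <- texts[dups$uniqueTexts]
names(deduplicated)
```

With a real corpus, the same subsetting idea would presumably be applied to the text component of the textmeta object, with the meta data filtered to match.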

Value

Named list:

uniqueTexts

Character vector of IDs such that each text occurs exactly once. If a text occurs more than once in the corpus, the ID of its first occurrence in list order is returned.

notDuplicatedTexts

Character vector of IDs of texts that occur only once in the whole corpus.

idFakeDups

List of character vectors: IDs of texts that originally had the same ID but belong to different texts, grouped by their original ID.

idRealDups

List of character vectors: IDs of texts that originally had the same ID and the same text but different meta information, grouped by their original ID.

allTextDups

List of character vectors: IDs of texts that occur twice or more, grouped by text equality.

textMetaDups

List of character vectors: IDs of texts that occur twice or more and have the same meta information, grouped by text and meta equality.
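The ideas behind uniqueTexts, notDuplicatedTexts and allTextDups can be illustrated with a minimal base R sketch on a toy named character vector. The data and variable names are hypothetical, and this is not the package's implementation, only one way to compute comparable groupings.

```r
# Toy corpus with unique IDs (hypothetical data).
texts <- c(A = "a fish", B = "so long", B2 = "so long",
           C = "fisher", C2 = "fisher", C3 = "fisher")

# First occurrence of each distinct text (compare uniqueTexts).
unique_ids <- names(texts)[!duplicated(texts)]

# Texts that occur exactly once (compare notDuplicatedTexts).
counts <- table(texts)
once <- names(counts)[as.vector(counts) == 1]
not_dup_ids <- names(texts)[texts %in% once]

# IDs grouped by text equality, keeping groups with at least two members
# (compare allTextDups).
groups <- split(names(texts), texts)
all_text_dups <- unname(groups[lengths(groups) > 1])
```

Grouping by text and meta equality, as in textMetaDups, would additionally require comparing the corresponding rows of the meta data.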

Examples

texts <- list(A="Give a Man a Fish, and You Feed Him for a Day.
Teach a Man To Fish, and You Feed Him for a Lifetime",
A="A fake duplicate",
B="So Long, and Thanks for All the Fish",
B="So Long, and Thanks for All the Fish",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.",
C="A very able manipulative mathematician, Fisher enjoys a real mastery
in evaluating complicated multiple integrals.")

corpus <- textmeta(meta=data.frame(id=c("A", "A", "B", "B", "C", "C"),
  title=c("Fishing", "Fake duplicate", "Don't panic!", "towel day", "Sir Ronald", "Sir Ronald"),
  date=c("1885-01-02", "1885-01-03", "1979-03-04", "1979-03-05", "1951-05-06", "1951-05-06"),
  stringsAsFactors=FALSE), text=texts)

# Resolve repeated IDs first, then list the remaining duplicate types
duplicates <- deleteAndRenameDuplicates(object=corpus)
duplist(object=duplicates, paragraph = FALSE)

[Package tosca version 0.3-2 Index]