tCorpus_modify_by_reference {corpustools}R Documentation

Modify tCorpus by reference

Description

(back to overview)

Details

If any tCorpus method is used that changes the corpus (e.g., set, subset), the change is made by reference. This is convenient when working with a large corpus, because it means that the corpus does not have to be copied when changes are made, which is slower and less memory efficient.

To illustrate, for a tCorpus object named 'tc', the subset method can be called like this:

tc$subset(doc_id %in% selection)

The 'tc' object itself is now modified, and does not have to be assigned to a name, as would be the more common R philosophy. Like this:

tc = tc$subset(doc_id %in% selection)

The results of both lines of code are the same. The assignment in the second approach is not necessary, but doesn't harm either because tc$subset returns the modified corpus invisibly (see ?invisible if that sounds spooky).

Be aware, however, that the following does not work!!

tc2 = tc$subset(doc_id %in% selection)

In this case, tc2 does contain the subsetted corpus, but tc itself will also be subsetted!!

Using the R6 method for subset forces this approach on you, because it is faster and more memory efficient. If you do want to make a copy, there are several solutions.

Firstly, for some methods we provide identical functions. For example, instead of the $subset() R6 method, we can use the subset() function.

tc2 = subset(tc, doc_id %in% selection)

We promise that only the R6 methods (called as tc$method()) will change the data by reference.

A second option is that R6 methods where copying is often usefull have copy parameter Modifying by reference only happens in the R6 methods

tc2 = tc$subset(doc_id %in% selection, copy=TRUE)

Finally, you can always make a deep copy of the entire tCorpus before modifying it, using the $copy() method.

tc2 = tc$copy()


[Package corpustools version 0.4.10 Index]