R: SolrFrame

SolrFrame-class {rsolr}

R Documentation

SolrFrame

Description

The SolrFrame object makes Solr data accessible through a data.frame-like interface. This is the typical way an R user accesses data from a Solr core. Much of its methods are shared with SolrList, which has very similar behavior.

Details

A SolrFrame should more or less behave analogously to a data frame. It provides the same basic accessors (nrow, ncol, length, rownames, colnames, [, [<-, [[, [[<-, $, $<-, head, tail, etc) and can be coerced to an actual data frame via as.data.frame. Supported types of data manipulations include subset, transform, sort, xtabs, aggregate, unique, summary, etc.

Mapping a collection of documents to a tablular data structure is not quite natural, as the document collection is ragged: a given document can have any arbitrary set of fields, out of a set that is essentially infinite. Unlike some other document stores, however, Solr constrains the type of every field through a schema. The schema achieves flexibility through “dynamic” fields. The name of a dynamic field is a wildcard pattern, and any document field that matches the pattern is expected to obey the declared type and other constraints.

When determining its set of columns, SolrFrame takes every actual field present in the collection, and (by default) adds all non-dynamic (static) fields, in the order specified by the schema. Note that is very likely that many columns will consist entirely or almost entirely of NAs.

If a collection is extremly ragged, where few fields are shared between documents, it may make more sense to treat the data as a list, through SolrList, which shares almost all of the functionality of SolrFrame but in a different shape.

The rownames are taken from the field declared in the schema to represent the unique document key. Schemas are not strictly required to declare such a field, so if there is no unique key, the rownames are NULL.

Field restrictions passed to e.g. [ or subset(fields=) may be specified by name, or wildcard pattern (glob). Similarly, a row index passed to [ must be either a character vector of identifiers (of length <= 1024, NAs are not supported, and this requires a unique key in the schema) or a SolrPromise/SolrExpression, but note that if it evaluates to NAs, the corresponding rows are excluded from the result, as with subset. Using a SolrPromise or SolrExpression is recommended, as filtering happens at the database.

A special feature of SolrFrame, vs. an ordinary data frame, is that it can be grouped into a GroupedSolrFrame, where every column is modeled as a list, split by some combination of grouping factors. This is useful for aggregation and supports the implementation of the aggregate method, which is the recommended high-level interface.

Another interesting feature is laziness. One can defer a SolrFrame, so that all column retrieval, e.g., via $ or eval, returns a SolrPromise object. Many operations on promises are deferred, until they are finally fulfilled by being shown or through explicit coercion to an R vector.

A note for developers: SolrList and SolrFrame share common functionality through the base Solr class. Much of the functionality mentioned here is actually implemented as methods on the Solr class.

Accessors

These are some accessors that SolrFrame adds on top of the basic data frame accessors. Most of these are for advanced use only.

ndoc(x): Gets the number of documents (rows); serves as an abstraction over SolrFrame and SolrList
nfield(x): Gets the number of fields (columns); serves as an abstraction over SolrFrame and SolrList
ids(x): Gets the document unique identifiers (may be NULL, treated as rownames); serves as an abstraction over SolrFrame and SolrList
fieldNames(x, includeStatic=TRUE, ...): Gets the name of each field represented by any document in the Solr core, with ... being passed down to fieldNames on SolrCore. Fields must be indexed to be reported, with the exception that when includeStatic is TRUE, we ensure all static (non-dynamic) fields are present in the return value. Names are returned in an order consistent with the order in the schema. Note that two different “instances” of the same dynamic field do not have a specified order in the schema, so we use the index order (lexicographical) for those cases.
core(x): Gets the SolrCore wrapped by x
query(x): Gets the query that is being constructed by x

Extended API

Most of the typical data frame accessors and data manipulation functions will work analogously on SolrFrame (see Details). Below, we list some of the non-standard methods that might be seen as an extension of the data frame API.

aggregate(x, data, FUN, ..., subset, na.action, simplify = TRUE, count = FALSE): If x is a formula, aggregates data, grouping by x, by either applying FUN, or evaluating an aggregating expression in ..., on each group. If count is TRUE, a “count” column is added with the number of elements in each group. The rest of the arguments behave like those for the base aggregate.

There are two main modes: aggregating with FUN, or, as an extension to the base aggregate, aggregating with expressions in ..., similar to the interface for transform. If FUN is specified, then behavior is much like the original, except one can omit the LHS on the formula, in which case the entire frame is passed to FUN. In the second mode, there is a column in the result for each argument in ..., and there must not be an LHS on the formula.

See the documentation for the underlying facet function for details on what is supported on the formula RHS.

For global aggregation, simply pass the SolrFrame as x, in which case the data argument does not exist.

Note that the function or expressions are only conceptually evaluated on each group. In reality, the computations occur on grouped columns/promises, which are modeled as lists. Thus, there is potential for conflict, in particular with length, which return the number of groups, instead of operating group-wise. One should use the abstraction ndoc instead of length, since ndoc always returns document counts, and thus will return the size of each group.
rename(x, ...): Renames the columns of x, where the names and character values of ... indicates the mapping (newname = oldname).
group(x, by): Returns a GroupedSolrFrame that is grouped by the factors in by, typically a formula. To get back to x, call ungroup(x).
grouping(x): Just returns NULL, since a SolrFrame is not grouped (unless extended to be groupable).
defer(x): Returns a SolrFrame that yields SolrPromise objects instead of vectors whenever a field is retrieved
searchDocs(x, q): Performs a conventional document search using the query string q. The main difference to filtering is that (by default) Solr will order the result by score, i.e., how well each document matches the query.

Constructor

SolrFrame(uri): Constructs a new SolrFrame instance, representing a Solr core located at uri, which should be a string or a RestUri object. The ... are passed to the SolrQuery constructor.

Evaluation

eval(expr, envir, enclos): Evaluates expr in the SolrFrame envir, using enclos as the enclosing environment. The expr can be an R language object or a SolrExpression, either of which are lazily evaluated if defer has been called on envir.

Coercion

as.data.frame(x, row.names=NULL, optional=FALSE, fill=TRUE): Downloads the data into an actual data.frame, specifically an instance of DocDataFrame. If fill is FALSE, only the fields represented in at least one document are added as columns.
as.list(x): Essentially as.list(as.data.frame(x)), except returns a list of promises if x is deferred.

Author(s)

Michael Lawrence

Examples


     schema <- deriveSolrSchema(mtcars)
     solr <- TestSolr(schema)
     sr <- SolrFrame(solr$uri)
     sr[] <- mtcars
     dim(sr)
     head(sr)
     subset(sr, mpg > 20 & cyl == 4)
     solr$kill()
     ## see the vignette for more

[Package rsolr version 0.0.13 Index]