R: Get N-grams of works

oa_ngrams {openalexR}

R Documentation

Get N-grams of works

Description

Some work entities in OpenAlex include N-grams (word sequences and their frequencies) of their full text. The N-grams are obtained from the Internet Archive, which uses the spaCy parser to index scholarly works. See <https://docs.openalex.org/api-entities/works/get-n-grams> for coverage and more technical details.

Usage

oa_ngrams(
  works_identifier,
  ...,
  endpoint = "https://api.openalex.org",
  verbose = FALSE
)

Arguments

`works_identifier`	Character. OpenAlex ID(s) of "works" entities as item identifier(s). These IDs start with "W". See more at <https://docs.openalex.org/api-entities/works#id>.
`...`	Unused.
`endpoint`	Character. URL of the OpenAlex API endpoint. Defaults to `endpoint = "https://api.openalex.org"`.
`verbose`	Logical. If TRUE, print information on the querying process. Defaults to `verbose = FALSE`. To shorten the printed query URL, set the environment variable openalexR.print to the number of characters to print: `Sys.setenv(openalexR.print = 70)`.
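
As a minimal sketch of the `openalexR.print` setting described for `verbose`, the snippet below sets the variable for the current session and then restores the previous state. The save-and-restore pattern is an illustrative choice, not something the package requires; the call to `oa_ngrams()` is left commented out because it needs network access.

```r
# Remember the current value (NA if the variable is not set)
old_value <- Sys.getenv("openalexR.print", unset = NA)

# Limit the printed query URL to 70 characters, as in the argument description
Sys.setenv(openalexR.print = 70)
current_value <- Sys.getenv("openalexR.print")

# ngrams_data <- oa_ngrams("W1963991285", verbose = TRUE)  # requires network

# Restore the previous state
if (is.na(old_value)) {
  Sys.unsetenv("openalexR.print")
} else {
  Sys.setenv(openalexR.print = old_value)
}
```

Environment variables are stored as strings, so the numeric `70` is read back as `"70"`.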

Value

A data frame of paper metadata and a list-column of N-grams.
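
To illustrate what a data frame with a list-column looks like, the sketch below builds a small mock object by hand; it does not call the API. The `id` column, the `ngram` column, and all values are assumptions made for illustration. Only the `ngrams` list-column, the `ngram_count` column, and the use of `NULL` for missing N-grams come from this page.

```r
# Mock stand-in for a result of oa_ngrams(); IDs and counts are made up
mock <- data.frame(
  id = c("W0000000001", "W0000000002"),
  stringsAsFactors = FALSE
)
mock$ngrams <- list(
  data.frame(
    ngram = c("open access", "citation", "metadata"),  # assumed column name
    ngram_count = c(4L, 9L, 2L),
    stringsAsFactors = FALSE
  ),
  NULL  # a work without N-grams
)

# Which works have N-grams?
has_ngrams <- !vapply(mock$ngrams, is.null, logical(1))
has_ngrams  # TRUE FALSE

# Most frequent N-gram of the first work
first <- mock$ngrams[[1]]
top <- first[order(first$ngram_count, decreasing = TRUE), ][1, ]
top$ngram  # "citation"
```

Each element of the list-column is either a data frame or `NULL`, so it helps to check with `is.null()` before indexing into it.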

Note

A faster implementation is available for 'curl' >= v5.0.0, and 'oa_ngrams' will issue a one-time message about this. This can be suppressed with 'options("oa_ngrams.message.curlv5" = FALSE)'.
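
A short sketch of how one might check for the faster code path and silence the message, using only base R. The variable name `has_fast_curl` is hypothetical; the option name is the one given in the note above.

```r
# TRUE if 'curl' is installed and at least version 5.0.0
has_fast_curl <- requireNamespace("curl", quietly = TRUE) &&
  utils::packageVersion("curl") >= "5.0.0"

# Suppress the one-time message for this session
options("oa_ngrams.message.curlv5" = FALSE)
getOption("oa_ngrams.message.curlv5")  # FALSE
```

Placing the `options()` call in a project `.Rprofile` would apply it to every session, if that is desired.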

Examples

## Not run: 

ngrams_data <- oa_ngrams(c("W1963991285", "W1964141474"))

# 10 most common N-grams in the first work
first_paper_ngrams <- ngrams_data$ngrams[[1]]
head(
  first_paper_ngrams[order(first_paper_ngrams$ngram_count, decreasing = TRUE), ],
  10
)

# Missing N-grams are `NULL` in the `ngrams` list-column
oa_ngrams("https://openalex.org/W2284876136")

## End(Not run)
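
Because missing N-grams are `NULL`, combining the N-grams of several works into one table needs a filtering step first. The following is a minimal sketch on hand-made mock data, not on a real API response; the work IDs, the `ngram` column, the `work_id` column, and all counts are assumptions for illustration.

```r
# Mock list-column content, named by (made-up) work ID
mock_ngrams <- list(
  W0000000001 = data.frame(
    ngram = c("open access", "citation"),
    ngram_count = c(3L, 1L),
    stringsAsFactors = FALSE
  ),
  W0000000002 = NULL,  # work without N-grams
  W0000000003 = data.frame(
    ngram = "citation",
    ngram_count = 5L,
    stringsAsFactors = FALSE
  )
)

# Drop the NULL entries, then tag each row with its work ID
keep <- !vapply(mock_ngrams, is.null, logical(1))
tagged <- Map(
  function(id, df) data.frame(work_id = id, df, stringsAsFactors = FALSE),
  names(mock_ngrams)[keep],
  mock_ngrams[keep]
)

# Stack everything into one long data frame
long <- do.call(rbind, tagged)
rownames(long) <- NULL
long
```

The long format makes it straightforward to aggregate counts across works, for example with `aggregate(ngram_count ~ ngram, data = long, FUN = sum)`.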

[Package openalexR version 1.4.0 Index]