query_category_members {wikkitidy} R Documentation

Explore Wikipedia's category system

Description

These functions provide access to the CategoryMembers endpoint of the Action API.

query_category_members() builds a generator query to return the members of a given category.

build_category_tree() finds all the pages and subcategories beneath the passed category, then recursively finds all the pages and subcategories beneath them, until it can find no more subcategories.

Usage

query_category_members(
  .req,
  category,
  namespace = NULL,
  type = c("file", "page", "subcat"),
  limit = 10,
  sort = c("sortkey", "timestamp"),
  dir = c("ascending", "descending", "newer", "older"),
  start = NULL,
  end = NULL,
  language = "en"
)

build_category_tree(category, language = "en")

Arguments

.req

A query request object

category

The category to start from. query_category_members() accepts either a numeric pageid or the page title. build_category_tree() accepts a vector of page titles.

namespace

Only return category members from the provided namespace.

type

Alternative to namespace: the type of category member to return. Multiple types can be requested using a character vector. Defaults to all.

limit

The number of category members to return in each batch. Maximum 500.

sort

How to sort the returned category members. 'timestamp' sorts them by the date they were included in the category; 'sortkey' sorts them by the category member's unique hexadecimal code.

dir

The direction in which to sort them.

start

If sort == 'timestamp', only return category members included in the category after this date. The argument is parsed by lubridate::as_date().

end

If sort == 'timestamp', only return category members included in the category before this date. The argument is parsed by lubridate::as_date().

language

The language edition of Wikipedia to query, given as a language code such as "en".

Value

query_category_members(): A request object of type generator/query/action_api/httr2_request, which can be passed to next_batch() or retrieve_all(). You can specify which properties to retrieve for each page using query_page_properties().

build_category_tree(): A list containing two dataframes, nodes and edges. nodes lists all the subcategories and pages found underneath the passed categories. edges records the connections between them: the source column gives the pageid of the parent category, the target column gives the pageid of any category, page or file contained within the source category, and the timestamp column records the moment when the target was included in the source category. The two dataframes can be passed to igraph::graph_from_data_frame() for network analysis.

Examples

# Get the first 10 members of 'Category:Physics' on English Wikipedia
physics_members <- wiki_action_request() %>%
  query_category_members("Physics") %>%
  next_batch()
physics_members
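
# A sketch of filtering by inclusion date, using only the arguments listed
# under Usage. The date and the choice of 'subcat' are illustrative, and it
# is an assumption that dir = "newer" pairs with start as the earliest date.
recent_req <- wiki_action_request() %>%
  query_category_members(
    "Physics",
    type = "subcat",
    sort = "timestamp",
    dir = "newer",
    start = "2020-01-01"
  )
recent_subcats <- next_batch(recent_req)
recent_subcats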


# Build the tree of all albums for the Brisbane band Custard
tree <- build_category_tree("Category:Custard_(band)_albums")
tree

# For network analysis and visualisation, you can pass the category tree
# to igraph
tree_graph <- igraph::graph_from_data_frame(tree$edges, vertices = tree$nodes)
tree_graph
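
# A minimal offline sketch of the same step, using a hand-made list shaped
# like the return value described under Value. The pageids, titles and
# timestamps are invented, and the 'nodes' column names are an assumption;
# only the 'source', 'target' and 'timestamp' edge columns are documented.
toy_tree <- list(
  nodes = data.frame(
    pageid = c(1, 2, 3, 4),
    title = c("Category:Root", "Category:Child", "Page A", "Page B")
  ),
  edges = data.frame(
    source = c(1, 1, 2),
    target = c(2, 3, 4),
    timestamp = c(
      "2020-01-01T00:00:00Z",
      "2020-02-01T00:00:00Z",
      "2020-03-01T00:00:00Z"
    )
  )
)
toy_graph <- igraph::graph_from_data_frame(
  toy_tree$edges,
  vertices = toy_tree$nodes
)
# Out-degree counts the direct members recorded for each node
igraph::degree(toy_graph, mode = "out")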

[Package wikkitidy version 0.1.12 Index]