batch_pubmed_download {easyPubMed}    R Documentation

Download PubMed Records in XML or TXT Format

Description

Performs a PubMed query (via the get_pubmed_ids() function), downloads the resulting data (via multiple fetch_pubmed_data() calls), and then saves the data in a series of XML or TXT files on the local drive. The function is suitable for downloading a very large number of records.
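
For orientation, the sketch below approximates the workflow that batch_pubmed_download() automates, using the get_pubmed_ids() and fetch_pubmed_data() functions mentioned above. It is an illustrative simplification (assuming the query returns at least one record), not the actual implementation.

library(easyPubMed)

## Illustrative approximation of the automated workflow (not the actual code):
## run the query, fetch records in batches, write each batch to a local file.
my_query <- "Machine Learning[TI] AND 2016[PD]"
my_ids   <- get_pubmed_ids(my_query)          # ESearch: run the query
tot_rec  <- as.numeric(my_ids$Count)          # total number of matching records
step     <- 400                               # records per file (cf. batch_size)

for (i in seq(0, tot_rec - 1, by = step)) {
  batch    <- fetch_pubmed_data(my_ids, retstart = i, retmax = step, format = "xml")
  out_file <- paste0("easyPubMed_data_", sprintf("%02d", as.integer(i / step) + 1), ".xml")
  writeLines(batch, con = out_file)           # one XML file per batch
}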

Usage

batch_pubmed_download(pubmed_query_string, dest_dir = NULL,
                      dest_file_prefix = "easyPubMed_data_",
                      format = "xml", api_key = NULL,
                      batch_size = 400, res_cn = 1,
                      encoding = "UTF8")

Arguments

pubmed_query_string

String (character vector of length 1): the string used for querying PubMed (the standard PubMed query syntax applies).

dest_dir

String (character vector of length 1): the name of an existing folder where the files will be saved. Existing files will be overwritten. If NULL, the current working directory will be used.

dest_file_prefix

String (character vector of length 1): this string is used as a prefix for the files that are written locally.

format

String (character vector of length 1): data will be requested from Entrez in this format. Acceptable values are: c("medline", "uilist", "abstract", "asn.1", "xml"). When format != "xml", data will be saved as text (txt) files.

api_key

String (character vector of length 1): user-specific API key to increase the limit of queries per second. You can obtain your key from NCBI.

batch_size

Integer (1 < batch_size < 5000): maximum number of records to be saved in a single XML or TXT file.

res_cn

Integer (> 0): numeric index of the data batch from which to start downloading. This parameter is useful for resuming an incomplete download job after a system crash (see the combined sketch after this argument list).

encoding

The encoding of an input/output connection can be specified by name (for example, "ASCII" or "UTF-8"), in the same way as it would be given to the function base::iconv(). See the iconv() help page to find out more about the encodings that can be used on your platform. Here, we recommend using "UTF-8".
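
As a combined illustration of dest_dir, dest_file_prefix, api_key, batch_size and res_cn, the sketch below resumes a hypothetically interrupted download into a dedicated folder; the query string, folder name and API key value are placeholders invented for this example.

library(easyPubMed)

## Hypothetical resume scenario: a first run crashed after writing 3 files,
## so the job is restarted from batch 4 (res_cn = 4). Query, folder name and
## API key are placeholders.
my_query <- "Machine Learning[TI] AND 2016[PD]"
dir.create("ml_2016_records", showWarnings = FALSE)

resumed_files <- batch_pubmed_download(
  pubmed_query_string = my_query,
  dest_dir            = "ml_2016_records",    # existing folder; files may be overwritten
  dest_file_prefix    = "ml_2016_",
  format              = "xml",
  api_key             = "YOUR_NCBI_API_KEY",  # placeholder; raises the queries-per-second limit
  batch_size          = 1000,                 # up to 1000 records per XML file
  res_cn              = 4                     # resume from the 4th batch
)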

Details

Downloads a large number of PubMed records as a set of XML or TXT files that are saved in the folder specified by the user. This function enforces data integrity: if a batch of downloaded data is corrupted, it is discarded and downloaded again, and each download cycle is monitored until the download job is successfully completed. This function should make it possible to download a whole copy of PubMed, if desired. The function informs the user about the current progress by printing to the console the number of batches still queued for download. pubmed_query_string accepts standard PubMed syntax. Because the function queries PubMed multiple times using the same query string, it is recommended to include an [EDAT] or a [PDAT] filter in the query if you want to ensure reproducible results.
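
For example, a query constrained by a publication-date filter keeps the result set stable across the repeated queries; the query below is only an illustration and should be adapted to your own search.

## A date-constrained query helps keep repeated downloads reproducible
## (illustrative query; adjust field tags and date range as needed).
repro_query <- "Machine Learning[TI] AND 2016/01/01:2016/12/31[PDAT]"
repro_files <- batch_pubmed_download(pubmed_query_string = repro_query, batch_size = 500)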

Value

Character vector including the names of the files downloaded to the local system.
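
The returned file names can be passed on to other easyPubMed helpers. The sketch below, which assumes the records were downloaded in XML format, converts the first record of the first file into a data frame via articles_to_list() and article_to_df().

library(easyPubMed)

## Post-processing sketch (assumes format = "xml"): turn the first downloaded
## file into a data frame using other easyPubMed helpers.
my_query   <- "Machine Learning[TI] AND 2016[PD]"
downloaded <- batch_pubmed_download(pubmed_query_string = my_query, batch_size = 400)
records    <- articles_to_list(downloaded[1])   # split the file into individual records
first_df   <- article_to_df(records[[1]], max_chars = 200)
head(first_df)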

Author(s)

Damiano Fantini damiano.fantini@gmail.com

References

https://www.data-pulse.com/dev_site/easypubmed/

Examples

## Not run: 
## Example 01: retrieve data from PubMed and save as XML file
ml_query <- "Machine Learning[TI] AND 2016[PD]"
out1 <- batch_pubmed_download(pubmed_query_string = ml_query, batch_size = 180)
readLines(out1[1])[1:30]
##
## Example 02: retrieve data from PubMed and save as TXT file
ml_query <- "Machine Learning[TI] AND 2016[PD]"
out2 <- batch_pubmed_download(pubmed_query_string = ml_query,
                              batch_size = 180, format = "medline")
readLines(out2[1])[1:30]

## End(Not run)


[Package easyPubMed version 2.13 Index]