JATSdecoder {JATSdecoder} | R Documentation |
JATSdecoder
Description
Extracts and restructures a NISO-JATS coded XML file or text into a list with metadata and text as selectable elements. To process PDF files, first convert them to CERMXML files with CERMINE.
Usage
JATSdecoder(
  x,
  sectionsplit = c("intro", "method", "result", "study", "experiment", "conclu",
    "implica", "discussion"),
  grepsection = "",
  sentences = FALSE,
  paragraph = FALSE,
  abstract2sentences = TRUE,
  output = "all",
  letter.convert = TRUE,
  unify.country.name = TRUE,
  greek2text = FALSE,
  warning = TRUE,
  countryconnection = FALSE,
  authorconnection = FALSE
)
Arguments
x |
a NISO-JATS coded XML file or text. |
sectionsplit |
search patterns for section split of text parts (forced to lower case), e.g. c("intro", "method", "result", "discus"). |
grepsection |
search pattern in regex to reduce text to specific section only. |
sentences |
Logical. If TRUE, text is returned as a sectioned list of sentences. |
paragraph |
Logical. If TRUE, "<New paragraph>" is added at the end of each paragraph to enable manual splitting at paragraphs. |
abstract2sentences |
Logical. If TRUE, the abstract is returned as a vector of sentences. |
output |
selection of specific results to output; one or more of c("all", "title", "author", "affiliation", "journal", "volume", "editor", "doi", "type", "history", "country", "subject", "keywords", "abstract", "sections", "text", "tables", "captions", "references"). |
letter.convert |
Logical. If TRUE converts hexadecimal and HTML coded characters to Unicode. |
unify.country.name |
Logical. If TRUE tries to unify country name/s with list of country names from worldmap(). |
greek2text |
Logical. If TRUE, converts and unifies several Greek letters to a textual representation, e.g. "alpha". |
warning |
Logical. If TRUE, outputs a warning when processing CERMINE converted PDF files. |
countryconnection |
Logical. If TRUE outputs country connections as vector c("A - B","A - C", ...). |
authorconnection |
Logical. If TRUE outputs connections of a maximum of 50 involved authors as vector c("A - B","A - C", ...). |
Value
List with extracted metadata, sectioned text and references.
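As a minimal sketch of how the returned list might be inspected: the path "article.xml" below is a hypothetical local file, not something shipped with the package, and the element names are read from the result rather than assumed.

```r
# Hypothetical path to a local NISO-JATS coded XML file (assumption for illustration)
xml_file <- "article.xml"

# Run only if the package is installed and the file actually exists
if (requireNamespace("JATSdecoder", quietly = TRUE) && file.exists(xml_file)) {
  res <- JATSdecoder::JATSdecoder(xml_file)
  # names of the extracted elements
  print(names(res))
  # compact overview of the list, one level deep
  str(res, max.level = 1)
}
```

Inspecting names(res) first avoids relying on a particular element name when only a subset was requested via the output argument.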
Note
A short tutorial on how to work with JATSdecoder and the generated outputs can be found at: https://github.com/ingmarboeschen/JATSdecoder
Source
An interactive web application for selecting and analyzing extracted article metadata and study characteristics for articles linked to PubMed Central is hosted at: https://www.scianalyzer.com/
The XML version of PubMed Central database articles can be downloaded in bulk from:
https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/
References
Böschen (2021). "Software review: The JATSdecoder package - extract metadata, abstract and sectioned text from NISO-JATS coded XML documents; Insights to PubMed Central’s open access database." Scientometrics. doi: 10.1007/s11192-021-04162-z.
See Also
study.character
for extracting different study characteristics at once.
get.stats
for extracting statistical results from textual input and different file formats.
Examples
# download example XML file via URL
x<-"https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0114876&type=manuscript"
# file name
file<-paste0(tempdir(),"/file.xml")
# download URL as "file.xml" in tempdir() if a connection is possible
tryCatch({
  readLines(x, n = 1)
  download.file(x, file)
},
warning = function(w) message(
  "Something went wrong. Check your internet connection and the link address."),
error = function(e) message(
  "Something went wrong. Check your internet connection and the link address."))
# convert full article to list with metadata, sectioned text and reference list
if(file.exists(file)) JATSdecoder(file)
# extract specific content (here: abstract and text)
if(file.exists(file)) JATSdecoder(file,output=c("abstract","text"))
# or use specific functions, e.g.:
if(file.exists(file)) get.abstract(file)
if(file.exists(file)) get.text(file)
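The grepsection and sentences arguments described above can be combined with output. The following is a sketch only: it assumes the example file was downloaded to tempdir() as "file.xml", and whether any section matches the pattern "method" depends on the headings used in that article.

```r
# Same temporary path as used for the downloaded example file
file <- file.path(tempdir(), "file.xml")

# Run only if the package is installed and the download succeeded
if (requireNamespace("JATSdecoder", quietly = TRUE) && file.exists(file)) {
  # request only the text, restricted to sections matching "method",
  # and split into sentences
  method_text <- JATSdecoder::JATSdecoder(
    file,
    output = "text",
    grepsection = "method",
    sentences = TRUE
  )
  # overview of what was returned
  str(method_text, max.level = 2)
}
```

Because grepsection is a regular expression, a broader pattern such as "method|procedure" could be tried if the first pattern matches nothing.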