cleanup {canprot}    R Documentation

Clean Up Data

Description


Remove proteins with unavailable IDs, ambiguous expression ratios, and duplicated IDs.

Usage


  cleanup(dat, IDcol, up2 = NULL)

Arguments



dat      data frame, protein expression data


IDcol    character, name of column that has the UniProt IDs


up2      logical, TRUE for up-regulated proteins, FALSE for down-regulated proteins

Details


cleanup is used in the pdat_ functions to clean up the dataset given in dat. IDcol names the column that holds the UniProt IDs, and up2 gives the direction of the expression change for each protein. The function removes proteins with unavailable (NA or "") or duplicated IDs. If up2 is provided, the function also removes unquantified proteins (those with NA values of up2) and proteins with ambiguous expression ratios (both up and down for the same ID). After each operation, a message reports the number of proteins that are unavailable, unquantified, ambiguous, or duplicated.

Alternatively, if IDcol is a logical vector rather than a column name, the proteins for which it is TRUE are removed unconditionally.
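
The removal steps described above can be sketched in base R. This is an illustration only, not the code in canprot: the function cleanup_sketch and the toy data are hypothetical, and two behaviours are assumptions, namely that the first row of a duplicated ID is kept and that every row of an ambiguous ID is dropped.

```r
# Hypothetical sketch of the documented steps; NOT canprot's implementation.
cleanup_sketch <- function(dat, IDcol, up2 = NULL) {
  # A logical IDcol flags rows to drop unconditionally
  if (is.logical(IDcol)) return(dat[!IDcol, , drop = FALSE])

  # 1. Unavailable IDs (NA or "")
  ids <- dat[[IDcol]]
  drop <- is.na(ids) | ids == ""
  message(sum(drop), " unavailable")
  dat <- dat[!drop, , drop = FALSE]

  if (!is.null(up2)) {
    up2 <- up2[!drop]
    # 2. Unquantified proteins (NA in up2)
    drop <- is.na(up2)
    message(sum(drop), " unquantified")
    dat <- dat[!drop, , drop = FALSE]
    up2 <- up2[!drop]
    # 3. Ambiguous IDs: the same ID appears as both up and down
    #    (assumption: all rows for such an ID are dropped)
    ids <- dat[[IDcol]]
    drop <- ids %in% intersect(ids[up2], ids[!up2])
    message(sum(drop), " ambiguous")
    dat <- dat[!drop, , drop = FALSE]
  }

  # 4. Duplicated IDs (assumption: the first occurrence is kept)
  drop <- duplicated(dat[[IDcol]])
  message(sum(drop), " duplicated")
  dat[!drop, , drop = FALSE]
}

# Toy data: made-up IDs and ratios, for illustration only
toy <- data.frame(
  Entry = c("P1", NA, "", "P2", "P2", "P3", "P3", "P4", "P5"),
  ratio = c(2.0, 1.5, 0.5, 3.0, 2.5, 2.0, 0.4, NA, 0.5),
  stringsAsFactors = FALSE
)
out <- cleanup_sketch(toy, "Entry", toy$ratio > 1)
# Logical form: drop the rows that have no ratio
out2 <- cleanup_sketch(toy, is.na(toy$ratio))
```

In the toy data, P3 is reported both up and down, P4 has no ratio, and P2 appears twice with the same direction, so each of the four removal steps has something to act on.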

See Also

This function is used extensively by the pdat_ functions, where it is called after check_IDs (if needed).

Examples


# Set up a simple workflow
extdatadir <- system.file("extdata", package="canprot")
datadir <- paste0(extdatadir, "/expression/pancreatic/")
dataset <- "CYD+05"
dat <- read.csv(paste0(datadir, dataset, ".csv.xz"), as.is = TRUE)
up2 <- dat$Ratio..cancer.normal. > 1
# Remove two unavailable proteins and one duplicated protein
dat <- cleanup(dat, "Entry", up2)
# Now we can retrieve the amino acid compositions
pcomp <- protcomp(dat$Entry)

# Read another data file
datadir <- paste0(system.file("extdata", package="canprot"), "/expression/colorectal/")
dataset <- "STK+15"
dat <- read.csv(paste0(datadir, dataset, ".csv.xz"), as.is = TRUE)
# Remove unavailable proteins
dat <- cleanup(dat, "uniprot")
# Remove proteins whose expression ratio is less than 2-fold in either direction
dat <- cleanup(dat, abs(log2(dat$invratio)) < 1)

[Package canprot version 1.1.0 Index]