delevels {rminer} | R Documentation |
Reduce, replace or transform levels of a data.frame or factor variable (useful for preprocessing datasets).
Description
Reduce, replace or transform levels of a data.frame or factor variable (useful for preprocessing datasets).
Usage
delevels(x, levels, label = NULL)
Arguments
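x | a factor or a data.frame.
levels | character vector with several options:
"idf": apply the Inverse Document Frequency (IDF) transform.
"pcp" or c("pcp", perc): apply the Percentage Categorical Pruned (PCP) transform, e.g., c("pcp", 0.1).
other values: the levels to be replaced by label.
Another possibility is to define a vector list, with one levels[[i]] entry for each column of the data.frame (see the Examples).
label | the new label used for all replaced levels; when NULL, the Examples indicate that "_OTHER" is used.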
Details
The Inverse Document Frequency (IDF) uses f(x)= log(n/f_x), where n is the length of x and f_x is the frequency of x.
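The IDF formula can be illustrated with a minimal base R sketch. The helper idf_sketch is hypothetical and is not part of rminer; it only evaluates f(x) = log(n/f_x) for each observation and returns a numeric vector, which may differ from how delevels encodes its result.

```r
# Sketch of f(x) = log(n / f_x) in base R.
# idf_sketch is a hypothetical helper, not the rminer implementation.
idf_sketch <- function(x) {
  x <- as.factor(x)
  n <- length(x)                     # n: number of observations
  freq <- table(x)                   # f_x: frequency of each level
  idf <- log(n / as.numeric(freq))   # one IDF value per level
  names(idf) <- names(freq)
  unname(idf[as.character(x)])       # map each observation to its level's IDF
}

f <- factor(c("A", "A", "B", "B", "C", "D", "E"))
v <- idf_sketch(f)
print(v)
```

Here the frequent levels "A" and "B" both map to log(7/2), while the levels that occur once map to log(7).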
The Percentage Categorical Pruned (PCP) merges all least frequent levels (summing up to perc percent) into a single level.
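One possible reading of the PCP rule is sketched below in base R. The helper pcp_sketch is hypothetical, and several choices are assumptions rather than documented behavior: levels are accumulated from least to most frequent while their cumulative share stays at or below perc, the default perc is set to 0.1 for illustration only, and the merged label "_OTHER" is borrowed from the Examples. The actual rminer cutoff rule may differ.

```r
# Sketch of a PCP-style merge in base R.
# pcp_sketch is a hypothetical helper, not the rminer implementation.
pcp_sketch <- function(x, perc = 0.1, label = "_OTHER") {
  x <- as.factor(x)
  freq <- sort(table(x))               # ascending: least frequent first
  share <- cumsum(freq) / length(x)    # cumulative share of observations
  rare <- names(freq)[share <= perc]   # assumed rule: stay within perc
  y <- as.character(x)
  y[y %in% rare] <- label              # merge the rare levels
  factor(y)
}

x <- factor(c(1, rep(2, 2), rep(3, 3), rep(4, 4), rep(5, 5),
              rep(10, 10), rep(100, 100)))
x2 <- pcp_sketch(x, perc = 0.1)
print(table(x2))
```

Under this assumed rule, levels 1 to 4 (10 of 125 observations, or 8%) are merged, because also adding level 5 would raise the merged share to 12%.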
When other values are used for levels, this function replaces all levels values with the single label value.
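For comparison, the same kind of replacement can be written directly in base R by assigning a shared name to several levels. This is only a sketch of the idea, not the code used inside delevels.

```r
# Base R sketch: merge levels "C", "D" and "E" into a single "CDE" level.
f <- factor(c("A", "A", "B", "B", "C", "D", "E"))
g <- f
# Assigning the same name to several levels merges them into one level.
levels(g)[levels(g) %in% c("C", "D", "E")] <- "CDE"
print(table(g))
```

The resulting factor has the levels "A", "B" and "CDE", with counts 2, 2 and 3.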
Value
Returns a transformed factor or data.frame.
Author(s)
Paulo Cortez http://www3.dsi.uminho.pt/pcortez/
References
PCP transform:
L.M. Matos, P. Cortez, R. Mendes and A. Moreau.
Using Deep Learning for Mobile Marketing User Conversion Prediction. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2019), paper N-19327, Budapest, Hungary, July, 2019 (8 pages), IEEE, ISBN 978-1-7281-2009-6.
https://doi.org/10.1109/IJCNN.2019.8851888
http://hdl.handle.net/1822/62771
IDF transform:
L.M. Matos, P. Cortez, R. Mendes and A. Moreau.
A Comparison of Data-Driven Approaches for Mobile Marketing User Conversion Prediction. In Proceedings of 9th IEEE International Conference on Intelligent Systems (IS 2018), pp. 140-146, Funchal, Madeira, Portugal, September, 2018, IEEE, ISBN 978-1-5386-7097-2.
https://ieeexplore.ieee.org/document/8710472
http://hdl.handle.net/1822/61586
See Also
fit and imputation.
Examples
### simple examples:
f=factor(c("A","A","B","B","C","D","E"))
print(table(f))
# replace "A" with "a":
f1=delevels(f,"A","a")
print(table(f1))
# merge c("C","D","E") into "CDE":
f2=delevels(f,c("C","D","E"),"CDE")
print(table(f2))
# merge c("B","C","D","E") into _OTHER:
f3=delevels(f,c("B","C","D","E"))
print(table(f3))
## Not run:
# larger factor:
x=factor(c(1,rep(2,2),rep(3,3),rep(4,4),rep(5,5),rep(10,10),rep(100,100)))
print(table(x))
# IDF: frequent values are close to zero and
# infrequent ones are closer to each other:
x1=delevels(x,"idf")
print(table(x1))
# PCP: infrequent values are merged
x2=delevels(x,c("pcp",0.1)) # perc=0.1, i.e., around 10%
print(table(x2))
# example with a data.frame:
y=factor(c(rep("a",100),rep("b",20),rep("c",5)))
z=1:125 # numeric
d=data.frame(x=x,y=y,z=z,x2=x)
print(summary(d))
# IDF:
d1=delevels(d,"idf")
print(summary(d1))
# PCP:
d2=delevels(d,"pcp")
print(summary(d2))
# delevels:
L=vector("list",ncol(d)) # one per attribute
L[[1]]=c("1","2","3","4","5")
L[[2]]=c("b","c")
L[[4]]=c("1","2","3") # different on purpose
d3=delevels(d,levels=L,label="other")
print(summary(d3))
## End(Not run) # end dontrun