PrivateLR {PrivateLR} | R Documentation |
Differentially Private Logistic Regression
Description
PrivateLR implements two randomized algorithms for estimating L_2-regularized logistic regression coefficients that allow specifying the maximal effect a single point change in the training data is allowed to have. Specifically, the algorithms take as a parameter the maximum allowed change in the log-likelihood of producing particular coefficients resulting from any single training data point substitution.
Usage
dplr(object, ...)
## S3 method for class 'formula'
dplr(object, data, lambda=NA, eps=1, verbose=0,
rp.dim = 0, threshold='fixed', do.scale=FALSE, ...)
## S3 method for class 'numeric'
dplr(object, x, ...)
## S3 method for class 'logical'
dplr(object, x, ...)
## S3 method for class 'factor'
dplr(object, x, ...)
## S3 method for class 'data.frame'
dplr(object, target=ncol(object),...)
## S3 method for class 'matrix'
dplr(object, target=ncol(object),...)
## S3 method for class 'dplr'
predict(object, data, type = "probabilities", ...)
## S3 method for class 'dplr'
summary(object, ...)
## S3 method for class 'dplr'
print.summary(x, ...)
## S3 method for class 'dplr'
print(x, ...)
scaled(fml, data)
Arguments
object |
can be given as an object of If a In If given as a vector, |
data |
a data frame or matrix containing the variables in the model described by
|
lambda |
the regularization parameter. If |
eps |
the privacy level. The coefficients of the model are computed by a
method that guarantees |
verbose |
regulates how much information is printed: 0 prints nothing, 1 a little, 2 more. |
rp.dim |
if |
threshold |
|
do.scale |
The privacy guarantees are for data where the covariate vectors lie
within the unit ball. If |
type |
|
x |
In the |
target |
the index of the column in |
fml |
A formula that describes the dimensions of the data that should be scaled into the unit ball. |
... |
|
Details
The function dplr implements logistic regression using the differentially private methods by Chaudhuri, Monteleoni, and Sarwate.
The interface is similar but not identical to that of lm, with the added possibility of supplying a data matrix or data.frame together with a target column index (defaults to ncol(data)).
The returned model instance has a convenience function model$pred that takes a data matrix or data frame to be classified as input.
The print function currently prints the summary.
The scaled function scales data such that covariate vectors lie within the unit ball. Note that the response variable is placed in the last column of the returned data frame data. Also, the name of the response column might have changed, depending on the left side of the formula given.
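The idea of scaling covariate vectors into the unit ball can be sketched in base R as follows. This is a minimal illustration and not the implementation of scaled: the helper name scale_to_unit_ball is hypothetical, and dividing every row by the largest row L2 norm is an assumed choice of scaling.

```r
# Hypothetical helper (not part of PrivateLR): divide all rows by the
# largest row L2 norm so that every covariate vector has norm <= 1.
scale_to_unit_ball <- function(X) {
  X <- as.matrix(X)
  norms <- sqrt(rowSums(X^2))   # L2 norm of each covariate vector
  s <- max(norms)
  if (s == 0) s <- 1            # all-zero data needs no scaling
  list(data = X / s, scale = s)
}

res <- scale_to_unit_ball(iris[, 1:4])
max(sqrt(rowSums(res$data^2)))  # largest norm after scaling
```

Because the scaling factor in this sketch is computed from the data itself, it is an example of data-dependent preprocessing that falls outside the privacy guarantee.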
Methods details
A randomized algorithm A, taking a dataset as input, is said to be \epsilon-differentially private if it holds that
|\log(P(A(D) \in S)) - \log(P(A(D') \in S))| \leq \epsilon
for any pair of datasets D, D' that differ in exactly one element, and any set S. We now turn to the algorithms implemented by dplr.
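For sets S with positive probability under both datasets, the definition can equivalently be read as a multiplicative bound on the output probabilities, which may be easier to interpret:

```latex
\left|\log P(A(D) \in S) - \log P(A(D') \in S)\right| \leq \epsilon
\iff
e^{-\epsilon} \leq \frac{P(A(D) \in S)}{P(A(D') \in S)} \leq e^{\epsilon}
```

For example, with \epsilon = 1, substituting one training data point changes the probability of any set of outputs by a factor of at most e, roughly 2.72.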
Let \|v\| denote the L2 norm of a vector v, and let
J(w, \lambda) = ALL(w) + (\lambda/2) \|w\|^2,
where ALL(w) is the average logistic loss over the training data of size n and dimension d with labels y and covariates x. L2-regularized logistic regression computes
w^* = \arg\min_w J(w, \lambda)
for a given \lambda.
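The objective J and its non-private minimizer w^* can be sketched in base R as below. This is an illustration under stated assumptions, not the code used by dplr: labels are assumed to be coded as -1/+1, no intercept is fitted, and the helper names avg_logistic_loss and J are chosen here for illustration.

```r
# Assumptions: labels y coded as -1/+1, no intercept term.
avg_logistic_loss <- function(w, X, y) {
  margins <- y * as.vector(X %*% w)
  mean(log1p(exp(-margins)))          # ALL(w): average logistic loss
}

J <- function(w, lambda, X, y) {
  avg_logistic_loss(w, X, y) + lambda / 2 * sum(w^2)
}

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)     # n = 100, d = 2 synthetic covariates
X <- X / max(sqrt(rowSums(X^2)))      # keep covariates inside the unit ball
y <- ifelse(X[, 1] + 0.5 * X[, 2] > 0, 1, -1)

# Minimize J numerically; extra arguments are passed through to J.
fit <- optim(rep(0, ncol(X)), J, lambda = 0.1, X = X, y = y, method = "BFGS")
w_star <- fit$par                     # non-private minimizer w^*
```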
The function dplr implements two approaches to \epsilon-differentially private L2-regularized logistic regression (see the ... argument op above).
The first is output perturbation, where we compute
w' = w^* + 2/(n \lambda \epsilon) b,
where b is a d-dimensional real vector sampled with probability proportional to \exp(-\|b\|).
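A minimal sketch of the output perturbation step in base R follows. It relies on the fact that a vector with density proportional to \exp(-\|b\|) has a norm following a Gamma distribution with shape d and rate 1, and a direction that is uniform on the unit sphere. The helper names sample_noise and output_perturbation, and the example values of w^*, n, \lambda and \epsilon, are hypothetical and not taken from the package.

```r
# Draw b in R^d with density proportional to exp(-||b||):
# norm ~ Gamma(shape = d, rate = 1), direction uniform on the unit sphere.
sample_noise <- function(d) {
  direction <- rnorm(d)
  direction <- direction / sqrt(sum(direction^2))
  rgamma(1, shape = d, rate = 1) * direction
}

# Hypothetical helper: perturb a given non-private minimizer w_star.
output_perturbation <- function(w_star, n, lambda, eps) {
  b <- sample_noise(length(w_star))
  w_star + 2 / (n * lambda * eps) * b
}

set.seed(42)
w_private <- output_perturbation(c(0.8, -0.3), n = 150, lambda = 0.01, eps = 1)
```

The noise scale 2/(n \lambda \epsilon) shrinks as the training set grows or as \lambda or \epsilon increases, so a stricter privacy level (smaller \epsilon) yields noisier coefficients.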
The second is objective perturbation. Let
F(w, \lambda, \epsilon) = J(w, \lambda) + 2/(\epsilon n) b^T w,
where n and b are as above. Let c = 0.25 and let z = 2\log(1 + c/(\lambda n)). If \epsilon - z > 0, we compute
w' = \arg\min_w F(w, \lambda, \epsilon - z);
otherwise we compute an adjusted-lambda version
w' = \arg\min_w F(w, c/(n (\exp(\epsilon/4) - 1)), \epsilon/2).
The logistic regression model coefficients w' are then \epsilon-differentially private.
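The case distinction in objective perturbation can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's code: labels are assumed to be coded as -1/+1, no intercept is fitted, covariates are assumed to lie in the unit ball, and the function name objective_perturbation and its return fields are hypothetical.

```r
# Assumptions: labels y coded -1/+1, no intercept, rows of X in the unit ball.
objective_perturbation <- function(X, y, lambda, eps, c_const = 0.25) {
  n <- nrow(X)
  d <- ncol(X)
  z <- 2 * log(1 + c_const / (lambda * n))
  if (eps - z > 0) {
    eps_eff <- eps - z                  # remaining privacy budget
    lambda_eff <- lambda
  } else {
    eps_eff <- eps / 2                  # adjusted-lambda version
    lambda_eff <- c_const / (n * (exp(eps / 4) - 1))
  }
  # Noise b with density proportional to exp(-||b||).
  u <- rnorm(d)
  b <- rgamma(1, shape = d, rate = 1) * u / sqrt(sum(u^2))
  # Perturbed objective F(w, lambda_eff, eps_eff).
  f_obj <- function(w) {
    mean(log1p(exp(-y * as.vector(X %*% w)))) +
      lambda_eff / 2 * sum(w^2) +
      2 / (eps_eff * n) * sum(b * w)
  }
  fit <- optim(rep(0, d), f_obj, method = "BFGS")
  list(par = fit$par, lambda = lambda_eff, eps = eps_eff)
}

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
X <- X / max(sqrt(rowSums(X^2)))
y <- ifelse(X[, 1] + 0.5 * X[, 2] > 0, 1, -1)
res <- objective_perturbation(X, y, lambda = 0.01, eps = 1)
```

Note that with the adjusted value \lambda = c/(n (\exp(\epsilon/4) - 1)), the quantity z evaluates to exactly \epsilon/2, which is why \epsilon/2 remains as the third argument of F in that branch.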
Value
The dplr function returns a list object of class "dplr" whose elements include:
par |
the coefficients of the logistic model. |
coefficients |
same as |
value , counts , convergence , message |
these are as returned by the
|
CIndex |
the area under the ROC curve (a.k.a. C-index) of the model on its training data. |
eps |
the supplied privacy level. |
lambda |
the regularization parameter used |
n |
the number of data points |
d |
the dimensionality of the data points |
pred |
a convenience function such that |
p.tr |
this is the classification probability threshold. |
did.rp |
TRUE if random projection was performed. |
rp.dim |
if random projection was performed this contains the number of dimensions projected onto. Only present if random projection was performed. |
rp.p |
the projection matrix used for random projection. Only present if random projection was performed. |
scaled |
TRUE if data was scaled by providing |
status |
a text string indicating the status of the computations.
|
The scaled function returns a list containing the following:
data |
the scaled data frame |
scale |
the scaling factor used. |
Warning
The privacy level is guaranteed only for the coefficients of the model, not for the other returned values, and only when the input data points (potentially after expansion of factors) have L2 norm <= 1. In particular, privacy is not guaranteed when using prediction thresholds estimated from data (methods 'youden' and 'topleft') or the built-in scaling of data. Both of these are turned off by default.
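Whether a data set satisfies the norm condition can be checked before fitting. The sketch below uses base R only and assumes that model.matrix with the intercept column dropped approximates the factor expansion applied to the covariates. That assumption is not confirmed by this page, and the choice of covariates is purely illustrative.

```r
# Expand the factor Species into indicator columns and drop the intercept.
# Assumption: this mirrors how factor covariates are expanded before fitting.
X <- model.matrix(~ Sepal.Length + Species, iris)[, -1]
norms <- sqrt(rowSums(X^2))   # L2 norm of each expanded covariate vector
all(norms <= 1)               # FALSE: raw iris measurements exceed the unit ball
max(norms)                    # largest norm, a candidate scaling factor
```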
Note
This implementation was in part supported by NIH NLM grant 7R01LM007273-07 and NIH Roadmap for Medical Research grant U54 HL108460.
Author(s)
Staal A. Vinterbo <sav@ucsd.edu>
References
Chaudhuri, K., Monteleoni, C., and Sarwate, A. Differentially Private Empirical Risk Minimization. JMLR, 2011, 12, 1069-1109.
See Also
glm and predict
Examples
library(PrivateLR)
data(iris)
# the following two are equivalent
# and predict Species being any
# but the first factor level.
model <- dplr(iris)
model <- dplr(Species ~ ., iris)
# pick a particular factor level and privacy level 2
model <- dplr(I(Species != 'setosa') ~ ., iris, eps=2)
# The following is again equivalent to the two first
# examples. Note that we need to remove 'Species' from the
# covariate matrix/data frame, and
# that the class reported by summary will now
# not be 'Species' but 'dplr.class'.
model <- dplr(iris$Species, iris[,-5])
# two equivalent methods to get at the predicted
# probabilities
p <- model$pred(iris)
p <- predict(model, iris)
# print a summary of the model. Note that
# only the coefficients are guaranteed
# to be generated in an eps-differentially
# private manner.
summary(model)