R: Estimate Dupuy and Galichon's model

estimate.affinity.matrix.lowrank {affinitymatrix}

R Documentation

Estimate Dupuy and Galichon's model

Description

This function estimates the affinity matrix of the matching model of Dupuy and Galichon (2014) under a rank restriction on the affinity matrix, as suggested by Dupuy, Galichon and Sun (2019). In their own words, "to accommodate high dimensionality of the data, they propose a novel method that incorporates a nuclear norm regularization which effectively enforces a rank constraint on the affinity matrix." This function also performs the saliency analysis and the rank tests. The user must supply a matched sample that is treated as the equilibrium matching of a bipartite one-to-one matching model without frictions and with Transferable Utility. For the sake of clarity, in the documentation we take the example of the marriage market and refer to "men" as the observations on one side of the market and to "women" as the observations on the other side. Other applications may include matching between CEOs and firms, firms and workers, buyers and sellers, etc.
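The link between nuclear norm regularization and a low-rank affinity matrix can be illustrated with a minimal base-R sketch of singular value soft-thresholding, the proximal operator of the nuclear norm. The helper name `prox_nuclear`, the toy matrix `A` and the value of `lambda` are assumptions made for illustration; this is not the package's internal code.

```r
# Proximal operator of lambda * (nuclear norm): soft-threshold the singular values.
# `prox_nuclear` is a hypothetical helper, not a function exported by the package.
prox_nuclear <- function(A, lambda) {
  s <- svd(A)
  d <- pmax(s$d - lambda, 0)   # shrink each singular value, floor at zero
  s$u %*% diag(d, nrow = length(d)) %*% t(s$v)
}

# Toy 2 x 2 matrix (illustrative values only)
A <- matrix(c(2.0, 0.5,
              0.5, 0.3), nrow = 2, byrow = TRUE)
A_shrunk <- prox_nuclear(A, lambda = 0.5)

svd(A)$d         # singular values before shrinkage
svd(A_shrunk)$d  # singular values after shrinkage
rank_after <- sum(svd(A_shrunk)$d > 1e-10)  # number of non-zero singular values
```

In this toy case the smaller singular value is below `lambda`, so it is set to zero and the shrunk matrix has a lower rank. A larger `lambda` removes more singular values.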

Usage

estimate.affinity.matrix.lowrank(
  X,
  Y,
  w = rep(1, N),
  A0 = matrix(0, nrow = Kx, ncol = Ky),
  lb = matrix(-Inf, nrow = Kx, ncol = Ky),
  ub = matrix(Inf, nrow = Kx, ncol = Ky),
  pr = 0.05,
  max_iter = 10000,
  tol_level = 1e-08,
  tau = 1,
  scale = 1,
  cross_validation = TRUE,
  manual_lambda = 0,
  lambda_min = 0,
  Nfolds = 5,
  nB = 2000,
  verbose = TRUE
)

Arguments

`X`	The matrix of men's traits. Its rows must be ordered so that the i-th man is matched with the i-th woman: this means that `nrow(X)` must be equal to `nrow(Y)`. Its columns correspond to the different matching variables: `ncol(X)` can be different from `ncol(Y)`. For clarity of exposition when using descriptive tools such as `show.correlations`, it is recommended to assign the same matching variable to the k-th column of `X` and to the k-th column of `Y` whenever possible. If `X` has more matching variables than `Y`, the variables that appear in `X` but not in `Y` should be placed in the last columns of `X` (and vice versa). The matrix is demeaned and rescaled before the start of the estimation algorithm.
`Y`	The matrix of women's traits. Its rows must be ordered so that the i-th woman is matched with the i-th man: this means that `nrow(Y)` must be equal to `nrow(X)`. Its columns correspond to the different matching variables: `ncol(Y)` can be different from `ncol(X)`. The matrix is demeaned and rescaled before the start of the estimation algorithm.
`w`	A vector of sample weights with length `nrow(X)`. Defaults to uniform weights.
`A0`	A vector or matrix with `ncol(X)*ncol(Y)` elements corresponding to the initial values of the affinity matrix to be fed to the estimation algorithm. Optional. Defaults to a matrix of zeros.
`lb`	A vector or matrix with `ncol(X)*ncol(Y)` elements corresponding to the lower bounds of the elements of the affinity matrix. Defaults to `-Inf` for all parameters.
`ub`	A vector or matrix with `ncol(X)*ncol(Y)` elements corresponding to the upper bounds of the elements of the affinity matrix. Defaults to `Inf` for all parameters.
`pr`	A probability indicating the significance level used to compute bootstrap two-sided confidence intervals for `U`, `V` and `lambda`. Defaults to 0.05.
`max_iter`	An integer indicating the maximum number of iterations in the proximal gradient descent algorithm. Defaults to 10000.
`tol_level`	A positive real number indicating the tolerance level in the proximal gradient descent algorithm. Defaults to 1e-8.
`tau`	A positive real number indicating a sensitivity parameter in the proximal gradient descent algorithm. Defaults to 1 and should not be changed unless computational problems arise.
`scale`	A positive real number indicating the scale of the model. Defaults to 1.
`cross_validation`	If `TRUE`, the function looks for a rank restriction through cross validation. The cross validation exercise aims to minimize the covariance mismatch: in other words, it avoids overfitting without excessively reducing the number of free parameters. Defaults to `TRUE`.
`manual_lambda`	A non-negative real number indicating the user-supplied `lambda` used when `cross_validation == FALSE`. The higher `lambda`, the tighter the rank restriction. Defaults to 0.
`lambda_min`	A non-negative real number indicating the minimum value of `lambda` considered during the cross validation. We recommend using 0, but when the number of matching variables is high relative to the sample size, it is reasonable to set `lambda_min` to a higher value. Defaults to 0.
`Nfolds`	An integer indicating the number of folds in the cross validation. Defaults to 5 and can be increased with a large sample size.
`nB`	An integer indicating the number of bootstrap replications used to compute the confidence intervals of `Aopt`, `U`, `V` and `lambda`. Defaults to 2000.
`verbose`	If `TRUE`, the function displays messages to keep track of its progress. Defaults to `TRUE`.
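Since `X` and `Y` are demeaned and rescaled before estimation, the following base-R sketch shows one way such a transformation can be done with sample weights. It assumes that rescaling means dividing each column by its weighted standard deviation; the documentation does not spell out the exact formula, and `standardize_weighted` is a hypothetical helper, not a package function.

```r
# Hypothetical helper: weighted centering and scaling of a trait matrix.
# Assumption: "rescaled" means divided by the weighted standard deviation.
standardize_weighted <- function(M, w) {
  w <- w / sum(w)                  # normalize weights so they sum to one
  mu <- colSums(M * w)             # weighted column means (row i gets weight w[i])
  Mc <- sweep(M, 2, mu)            # subtract the mean from each column
  sdw <- sqrt(colSums(Mc^2 * w))   # weighted standard deviation of each column
  sweep(Mc, 2, sdw, "/")           # divide each column by its standard deviation
}

set.seed(1)
N <- 100
X <- cbind(Height = rnorm(N, 175, 7), BMI = rnorm(N, 24, 3))  # simulated traits
w <- rep(1, N)                                                # uniform weights
Xs <- standardize_weighted(X, w)

colSums(Xs * (w / sum(w)))    # weighted means, numerically zero
colSums(Xs^2 * (w / sum(w)))  # weighted variances, numerically one
```

The function performs this step internally, so raw trait matrices can be passed directly as `X` and `Y`.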

Value

The function returns a list with the following elements:

`X`	The demeaned and rescaled matrix of men's traits.
`Y`	The demeaned and rescaled matrix of women's traits.
`fx`	The empirical marginal distribution of men.
`fy`	The empirical marginal distribution of women.
`Aopt`	The estimated affinity matrix.
`sdA`	The standard errors of `Aopt`.
`tA`	The Z-test statistics of `Aopt`.
`VarCovA`	The full variance-covariance matrix of `Aopt`.
`rank.tests`	A list with all the summaries of the rank tests on `Aopt`.
`U`	A matrix whose columns are the left-singular vectors of `Aopt`.
`V`	A matrix whose columns are the right-singular vectors of `Aopt`.
`lambda`	A vector whose elements are the singular values of `Aopt`.
`UCI`	A matrix whose columns are the lower and the upper bounds of the confidence intervals of `U`.
`VCI`	A matrix whose columns are the lower and the upper bounds of the confidence intervals of `V`.
`lambdaCI`	A matrix whose columns are the lower and the upper bounds of the confidence intervals of `lambda`.
`df.bootstrap`	A data frame resulting from the `nB` bootstrap replications and used to infer the empirical distribution of the estimated objects.
`lambda.rank.restriction`	A positive real number indicating the value of the Lagrange multiplier of the nuclear norm constraint of the affinity matrix, either chosen by the user or through cross validation.
`df.cross.validation`	A data frame containing the detailed results of the cross validation exercise.
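The relationship between `Aopt`, `U`, `V` and `lambda` can be illustrated with a small base-R sketch. The matrices below are made-up numbers, not output of the function. Two further points are assumptions: that the Z statistics are the elementwise ratio of estimate to standard error, and that the confidence intervals are percentile intervals over bootstrap draws.

```r
# Hypothetical 2 x 2 affinity matrix and standard errors (illustrative values only)
Aopt <- matrix(c(0.8, 0.1,
                 0.2, 0.5), nrow = 2, byrow = TRUE)
sdA <- matrix(0.1, nrow = 2, ncol = 2)

# Singular value decomposition: Aopt = U diag(lambda) V'
s <- svd(Aopt)
U <- s$u        # columns are left-singular vectors
V <- s$v        # columns are right-singular vectors
lambda <- s$d   # singular values, in decreasing order
A_rebuilt <- U %*% diag(lambda) %*% t(V)

# Assumed form of the Z statistics: estimate divided by standard error
tA <- Aopt / sdA

# Assumed percentile interval at level pr, using simulated stand-ins for bootstrap draws
set.seed(42)
pr <- 0.05
draws <- rnorm(2000, mean = lambda[1], sd = 0.05)
ci <- quantile(draws, probs = c(pr / 2, 1 - pr / 2))
```

With `pr = 0.05`, the interval spans the 2.5% and 97.5% quantiles of the draws, which corresponds to a two-sided 95% interval.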

Examples


library(affinitymatrix)

# Parameters
Kx = 2; Ky = 2; # number of matching variables on both sides of the market
N = 100 # sample size
mu = rep(0, Kx+Ky) # means of the data generating process
Sigma = matrix(c(1, -0.0244, 0.1489, -0.1301, -0.0244, 1, -0.0553, 0.2717,
                 0.1489, -0.0553, 1, -0.1959, -0.1301, 0.2717, -0.1959, 1),
                 nrow=Kx+Ky)
    # (normalized) variance-covariance matrix of the data generating process
labels_x = c("Height", "BMI") # labels for men's matching variables
labels_y = c("Height", "BMI") # labels for women's matching variables

# Sample
data = MASS::mvrnorm(N, mu, Sigma) # generating sample
X = data[, 1:Kx]; Y = data[, Kx + (1:Ky)] # men's and women's sample data
w = sort(runif(N-1)); w = c(w,1) - c(0,w) # sample weights

# Main estimation
res = estimate.affinity.matrix.lowrank(X, Y, w = w, tol_level = 1e-03,
                                       nB = 50, Nfolds = 2)

# Summarize results
show.affinity.matrix(res, labels_x = labels_x, labels_y = labels_y)
show.diagonal(res, labels = labels_x)
show.test(res)
show.saliency(res, labels_x = labels_x, labels_y = labels_y,
              ncol_x = 2, ncol_y = 2)
show.cross.validation(res)
show.correlations(res, labels_x = labels_x, labels_y = labels_y,
                  label_x_axis = "Husband", label_y_axis = "Wife", ndims = 2)