Skip to content

Commit

Permalink
more work on the documentation
Browse files Browse the repository at this point in the history
  • Loading branch information
BERENZ committed Mar 4, 2025
1 parent 3491a47 commit a6f2aa1
Show file tree
Hide file tree
Showing 8 changed files with 201 additions and 80 deletions.
16 changes: 13 additions & 3 deletions R/method_glm.R
Original file line number Diff line number Diff line change
Expand Up @@ -9,13 +9,15 @@
#' @importFrom survey svymean
#'
#' @description
#' Model for the outcome for the mass imputation estimator
#' Model for the outcome for the mass imputation estimator using generalized linear
#' models via the `stats::glm` function. Estimation of the mean is done using \mjseqn{S_B}
#' probability sample or known population totals.
#'
#' @details Analytical variance
#'
#' The variance of the mean is estimated based on the following approach
#'
#' (a) non-probability part (\mjseqn{S_A})
#' (a) non-probability part (\mjseqn{S_A} with size \mjseqn{n_A}; denoted as `var_nonprob` in the result)
#'
#' \mjsdeqn{
#' \hat{V}_1 = \frac{1}{n_A^2}\sum_{i=1}^{n_A} \hat{e}_i \left\lbrace \boldsymbol{h}(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})^\prime\hat{\boldsymbol{c}}\right\rbrace,
Expand All @@ -27,7 +29,7 @@
#' Under the linear regression model \mjseqn{\boldsymbol{h}\left(\boldsymbol{x}_i ; \widehat{\boldsymbol{\beta}}\right)=\boldsymbol{x}_i} and \mjseqn{\widehat{\boldsymbol{c}}=\left(n_A^{-1} \sum_{i \in A} \boldsymbol{x}_i \boldsymbol{x}_i^{\prime}\right)^{-1} N^{-1} \sum_{i \in B} w_i \boldsymbol{x}_i .}
#'
#'
#' (b) probability part (\mjseqn{S_B})
#' (b) probability part (\mjseqn{S_B} with size \mjseqn{n_B}; denoted as `var_prob` in the result)
#'
#' This part uses functionalities of the `{survey}` package and the variance is estimated using the following
#' equation:
Expand All @@ -37,6 +39,9 @@
#' \frac{m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})}{\pi_i} \frac{m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})}{\pi_j}.
#' }
#'
#' Note that \mjseqn{\hat{V}_2} in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
#'
#' Furthermore, if only population totals/means are known and assumed to be fixed we set \mjseqn{\hat{V}_2=0}.
#'
#' @param y_nons target variable from non-probability sample
#' @param X_nons a `model.matrix` with auxiliary variables from non-probability sample
Expand Down Expand Up @@ -70,6 +75,11 @@
#' \item{family}{family type (character `"glm"`)}
#' }
#'
#' @references
#' Kim, J. K., Park, S., Chen, Y., & Wu, C. (2021). Combining non-probability and probability survey samples
#' through mass imputation. Journal of the Royal Statistical Society Series A: Statistics in Society,
#' 184(3), 941-963.
#'
#' @examples
#'
#' data(admin)
Expand Down
49 changes: 23 additions & 26 deletions R/method_nn.R
Original file line number Diff line number Diff line change
Expand Up @@ -9,56 +9,52 @@
#' @importFrom RANN nn2
#'
#'
#' @description Mass imputation using nearest neighbours approach as described in Yang (2021).
#' @description Mass imputation using nearest neighbours approach as described in Yang et al. (2021).
#' The implementation is currently based on [RANN::nn2] function and thus it uses
#' Euclidean distance to matching.
#'
#' This implementation extends Yang (2021) approach as described in Chlebicki et al. (2025), namely:
#'
#' \describe{
#' \item{pmm_weights}{if k>1 weighted aggregation of the mean for a given unit is used. We use distance
#' matrix returned by [RANN::nn2] function (`pmm_weights` from the [control_out()] function)}
#' \item{nn_exact_se}{if the non-probability sample is small we recommend using a mini-bootstrap
#' approach to estimate variance from the non-probability sample (`nn_exact_se` from the [control_inf()] function)}
#' \item{pmm_k_choice}{the main `nonprob` function allows for dynamic selection of `k` neighbours based on the
#' variance minimization procedure (`pmm_k_choice` from the [control_out()] function)}
#' }
#' Euclidean distance for matching units from \mjseqn{S_A} (nonprobability) to \mjseqn{S_B} (probability).
#' Estimation of the mean is done using \mjseqn{S_B} sample.
#'
#'
#' @details Analytical variance
#'
#' The variance of the mean is estimated based on the following approach
#'
#' (a) non-probability part
#' (a) non-probability part (\mjseqn{S_A} with size \mjseqn{n_A}; denoted as `var_nonprob` in the result)
#'
#' This may be estimated using
#'
#' \mjsdeqn{
#' \hat{V}_1 = \frac{1}{N^2}\sum_{i=1}^{S_A}\frac{1-\hat{\pi}_B(\boldsymbol{x}_i)}{\hat{\pi}_B(\boldsymbol{x}_i)}\hat{\sigma}^2(\boldsymbol{x}_i)
#' }
#'
#' or
#' where \mjseqn{\hat{\pi}_B(\boldsymbol{x}_i)} is an estimator of propensity scores which
#' we currently estimate using \mjseqn{n_A/N} (constant) and \mjseqn{\hat{\sigma}^2(\boldsymbol{x}_i)} is
#' estimated using based on the average of \mjseqn{(y_i - y_i^*)^2}.
#'
#' \mjsdeqn{
#' \hat{V}_1 = \frac{1}{N^2}\sum_{i=1}^{S_A}\frac{1}{\hat{\pi}_B(\boldsymbol{x}_i)}(y_i - y_i^*)
#' }
#' Chlebicki et al. (2025, Algorithm 2) proposed non-parametric mini-bootstrap estimator
#' (without assuming that it is consistent) but with good finite population properties.
#' This bootstrap can be applied using `control_inference(nn_exact_se=TRUE)` and
#' can be summarized as follows:
#'
#' or mini-bootstrap
#' 1. Sample \mjseqn{n_A} units from \mjseqn{S_A} with replacement to create \mjseqn{S_A'} (if pseudo-weights are present inclusion probabilities should be proportional to their inverses).
#' 2. Match units from \mjseqn{S_B} to \mjseqn{S_A'} to obtain predictions \mjseqn{y^*}=\mjseqn{{k}^{-1}\sum_{k}y_k}.
#' 3. Estimate \mjseqn{\hat{\mu}=\frac{1}{N} \sum_{i \in S_B} d_i y_i^*}.
#' 4. Repeat steps 1-3 \mjseqn{M} times (we set \mjseqn{M=50} in our simulations; this is hard-coded).
#' 5. Estimate \mjseqn{\hat{V}_1=\text{var}{\hat{\boldsymbol{\mu}}}} obtained from simulations and save it as `var_nonprob`.
#'
#'
#'
#' (b) probability part
#' (b) probability part (\mjseqn{S_B} with size \mjseqn{n_B}; denoted as `var_prob` in the result)
#'
#' This part uses functionalities of the `{survey}` package and the variance is estimated using the following
#' equation:
#'
#' \mjsdeqn{
#' \hat{V}_1=\frac{1}{N^2} \sum_{i=1}^n \sum_{j=1}^n \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}}
#' \hat{V}_2=\frac{1}{N^2} \sum_{i=1}^n \sum_{j=1}^n \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}}
#' \frac{y_i^*}{\pi_i} \frac{y_j^*}{\pi_j}
#' }
#'
#'where \mjseqn{y^*_i} and \mjseqn{y_j^*} are values imputed imputed as an average of \mjseqn{k}-nearest neighbour,
#'i.e. \mjseqn{{k}^{-1}\sum_{k}y_k}.
#'where \mjseqn{y^*_i} and \mjseqn{y_j^*} are values imputed imputed as an average
#' of \mjseqn{k}-nearest neighbour, i.e. \mjseqn{{k}^{-1}\sum_{k}y_k}. Note that \mjseqn{\hat{V}_2} in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
#'
#'
#' @param y_nons target variable from non-probability sample
Expand Down Expand Up @@ -98,7 +94,7 @@
#' big found data for finite population inference using mass imputation.
#' Survey Methodology, June 2021 29 Vol. 47, No. 1, pp. 29-58
#'
#' Chlebicki, P., Chrostowski, Ł., & Beręsewicz, M. (2025). Data integration of non-probability
#' Chlebicki, P., Chrostowski, Ł., & Beręsewicz, M. (2025). Data integration of non-probability
#' and probability samples with predictive mean matching. arXiv preprint arXiv:2403.13750.
#'
#' @examples
Expand Down Expand Up @@ -186,11 +182,12 @@ method_nn <- function(y_nons,
y_mi_hat <- as.numeric(svydesign_mean)

if (se) {

var_prob <- as.vector(attr(svydesign_mean, "var"))

sigma_hat <- mean((y_nons - y_nons_pred)^2)
est_ps <- NROW(X_nons) / pop_size
var_nonprob <- NROW(X_rand) / pop_size^2 * (1 - est_ps) / est_ps * sigma_hat
var_nonprob <- 1/ pop_size^2 * (1 - est_ps) / est_ps * sigma_hat
var_total <- var_prob + var_nonprob

if (control_inference$nn_exact_se) {
Expand Down
40 changes: 37 additions & 3 deletions R/method_npar.R
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
#'
#' @description
#' Model for the outcome for the mass imputation estimator using loess via `stats::loess`.
#'
#' Estimation of the mean is done using the \mjseqn{S_B} probability sample.
#'
#' @details Analytical variance
#'
Expand All @@ -23,7 +23,8 @@
#' }
#'
#' where \mjseqn{\hat{e}_i=y_i - \hat{m}(x_i)} is the residual and \mjseqn{\hat{g}_B(\boldsymbol{x}_i) = \left\lbrace \pi_B(\boldsymbol{x}_i) \right\rbrace^{-1}} can be estimated
#' various ways. In our case we estimate it using \mjseqn{\pi_B(\boldsymbol{x}_i)=E(R | \boldsymbol{x})} as suggested by Chen et al. (2022, p. 6). Currently, this is estimated using `stats::loess` with `"gaussian"` family.
#' various ways. In the package we estimate \mjseqn{\hat{g}_B(\boldsymbol{x}_i)} using \mjseqn{\pi_B(\boldsymbol{x}_i)=E(R | \boldsymbol{x})} as suggested by Chen et al. (2022, p. 6). In particular,
#' we currently support this using stats::loess` with `"gaussian"` family.
#'
#' (b) probability part (\mjseqn{S_B} with size \mjseqn{n_B}; denoted as `var_prob` in the result)
#'
Expand All @@ -32,11 +33,12 @@
#'
#' \mjsdeqn{
#' \hat{V}_2=\frac{1}{N^2} \sum_{i=1}^{n_B} \sum_{j=1}^{n_B} \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}}
#' \frac{y_i}{\pi_i} \frac{y_j}{\pi_j}.
#' \frac{\hat{m}(x_i)}{\pi_i} \frac{\hat{m}(x_j)}{\pi_j}.
#' }
#'
#' Note that \mjseqn{\hat{V}_2} in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
#'
#'
#' @param y_nons target variable from non-probability sample
#' @param X_nons a `model.matrix` with auxiliary variables from non-probability sample
#' @param X_rand a `model.matrix` with auxiliary variables from non-probability sample
Expand Down Expand Up @@ -66,6 +68,38 @@
#' \item{var_nonprob}{variance for the non-probability sampl component}
#' \item{model}{model type (character `"npar"`)}
#' }
#'
#' @references
#' Chen, S., Yang, S., & Kim, J. K. (2022). Nonparametric mass imputation for data integration.
#' Journal of Survey Statistics and Methodology, 10(1), 1-24.
#'
#' @examples
#'
#' set.seed(123123123)
#' N <- 10000
#' n_a <- 500
#' n_b <- 1000
#' n_b1 <- 0.7*n_b
#' n_b2 <- 0.3*n_b
#' x1 <- rnorm(N, 2, 1)
#' x2 <- rnorm(N, 2, 1)
#' y1 <- rnorm(N, 0.3 + 2*x1+ 2*x2, 1)
#' y2 <- rnorm(N, 0.3 + 0.5*x1^2+ 0.5*x2^2, 1)
#' strata <- x1 <= 2
#' pop <- data.frame(x1, x2, y1, y2, strata)
#' sample_a <- pop[sample(1:N, n_a),]
#' sample_a$w_a <- N/n_a
#' sample_a_svy <- svydesign(ids=~1, weights=~w_a, data=sample_a)
#' pop1 <- subset(pop, strata == TRUE)
#' pop2 <- subset(pop, strata == FALSE)
#' sample_b <- rbind(pop1[sample(1:nrow(pop1), n_b1), ],
#' pop2[sample(1:nrow(pop2), n_b2), ])
#' res_y_npar <- nonprob(outcome = y1 + y2 ~ x1 + x2,
#' data = sample_b,
#' svydesign = sample_a_svy,
#' method_outcome = "npar")
#' res_y_npar
#'
#' @export
method_npar <- function(y_nons,
X_nons,
Expand Down
36 changes: 27 additions & 9 deletions R/method_pmm.R
Original file line number Diff line number Diff line change
Expand Up @@ -3,28 +3,46 @@
#' \loadmathjax
#'
#' @description
#' Model for the outcome for the mass imputation estimator
#' Model for the outcome for the mass imputation estimator. The implementation is currently based on [RANN::nn2] function and thus it uses Euclidean distance for matching units from \mjseqn{S_A} (nonprobability) to \mjseqn{S_B} (probability) based on predicted values from model \mjseqnm{\boldsymbol{x}_i)} based
#' either on `method_glm` or `method_npar`. Estimation of the mean is done using \mjseqn{S_B} sample.
#'
#' This implementation extends Yang et al. (2021) approach as described in Chlebicki et al. (2025), namely:
#'
#' \describe{
#' \item{pmm_weights}{if k>1 weighted aggregation of the mean for a given unit is used. We use distance
#' matrix returned by [RANN::nn2] function (`pmm_weights` from the [control_out()] function)}
#' \item{nn_exact_se}{if the non-probability sample is small we recommend using a mini-bootstrap
#' approach to estimate variance from the non-probability sample (`nn_exact_se` from the [control_inf()] function)}
#' \item{pmm_k_choice}{the main `nonprob` function allows for dynamic selection of `k` neighbours based on the
#' variance minimization procedure (`pmm_k_choice` from the [control_out()] function)}
#' }
#'
#' @details Analytical variance
#'
#' The variance of the mean is estimated based on the following approach

#' (a) non-probability part (\mjseqn{S_A} with size \mjseqn{n_A}; denoted as `var_nonprob` in the result) is currently estimated using the non-parametric mini-bootstrap estimator proposed by
#' Chlebicki et al. (2025, Algorithm 2). It is not proved to be consistent but with good finite population properties.
#' This bootstrap can be applied using `control_inference(nn_exact_se=TRUE)` and
#' can be summarized as follows:
#'
#' (a) probability part
#' 1. Sample \mjseqn{n_A} units from \mjseqn{S_A} with replacement to create \mjseqn{S_A'} (if pseudo-weights are present inclusion probabilities should be proportional to their inverses).
#' 2. --
#' 3. --
#' 4. --
#' 5. Estimate \mjseqn{\hat{V}_1=\text{var}{\hat{\boldsymbol{\mu}}}} obtained from simulations and save it as `var_nonprob`.
#'
#' (b) probability part (\mjseqn{S_B} with size \mjseqn{n_B}; denoted as `var_prob` in the result)
#'
#' This part uses functionalities of the `{survey}` package and the variance is estimated using the following
#' equation:
#'
#' \mjsdeqn{
#' \hat{V}_1=\frac{1}{N^2} \sum_{i=1}^n \sum_{j=1}^n \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}}
#' \frac{y_i}{\pi_i} \frac{y_j}{\pi_j}
#' \hat{V}_2=\frac{1}{N^2} \sum_{i=1}^{n_B} \sum_{j=1}^{n_B} \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}}
#' \frac{m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})}{\pi_i} \frac{m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})}{\pi_j}.
#' }
#'
#' (b) non-probability part
#'
#' \mjsdeqn{
#' \hat{V}_2 = ..
#' }
#' Note that \mjseqn{\hat{V}_2} in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
#'
#' @param y_nons target variable from non-probability sample
#' @param X_nons a `model.matrix` with auxiliary variables from non-probability sample
Expand Down
17 changes: 14 additions & 3 deletions man/method_glm.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading

0 comments on commit a6f2aa1

Please sign in to comment.