more work on the documentation

ncn-foreigners · Mar 4, 2025 · a6f2aa1 · a6f2aa1
1 parent 3491a47
commit a6f2aa1
Show file tree

Hide file tree

Showing 8 changed files with 201 additions and 80 deletions.
diff --git a/R/method_glm.R b/R/method_glm.R
@@ -9,13 +9,15 @@
 #' @importFrom survey svymean
 #'
 #' @description
-#' Model for the outcome for the mass imputation estimator
+#' Model for the outcome for the mass imputation estimator using generalized linear
+#' models via the `stats::glm` function. Estimation of the mean is done using \mjseqn{S_B}
+#' probability sample or known population totals.
 #'
 #' @details Analytical variance
 #'
 #' The variance of the mean is estimated based on the following approach
 #'
-#' (a) non-probability part (\mjseqn{S_A})
+#' (a) non-probability part  (\mjseqn{S_A} with size \mjseqn{n_A}; denoted as `var_nonprob` in the result)
 #'
 #' \mjsdeqn{
 #' \hat{V}_1 = \frac{1}{n_A^2}\sum_{i=1}^{n_A} \hat{e}_i \left\lbrace \boldsymbol{h}(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})^\prime\hat{\boldsymbol{c}}\right\rbrace,
@@ -27,7 +29,7 @@
 #' Under the linear regression model \mjseqn{\boldsymbol{h}\left(\boldsymbol{x}_i ; \widehat{\boldsymbol{\beta}}\right)=\boldsymbol{x}_i} and \mjseqn{\widehat{\boldsymbol{c}}=\left(n_A^{-1} \sum_{i \in A} \boldsymbol{x}_i \boldsymbol{x}_i^{\prime}\right)^{-1} N^{-1} \sum_{i \in B} w_i \boldsymbol{x}_i .}
 #'
 #'
-#' (b) probability part (\mjseqn{S_B})
+#' (b) probability part (\mjseqn{S_B} with size \mjseqn{n_B}; denoted as `var_prob` in the result)
 #'
 #'  This part uses functionalities of the `{survey}` package and the variance is estimated using the following
 #'  equation:
@@ -37,6 +39,9 @@
 #' \frac{m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})}{\pi_i} \frac{m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})}{\pi_j}.
 #' }
 #'
+#' Note that \mjseqn{\hat{V}_2} in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
+#'
+#' Furthermore, if only population totals/means are known and assumed to be fixed we set \mjseqn{\hat{V}_2=0}.
 #'
 #' @param y_nons target variable from non-probability sample
 #' @param X_nons a `model.matrix` with auxiliary variables from non-probability sample
@@ -70,6 +75,11 @@
 #'   \item{family}{family type (character `"glm"`)}
 #' }
 #'
+#' @references
+#' Kim, J. K., Park, S., Chen, Y., & Wu, C. (2021). Combining non-probability and probability survey samples
+#' through mass imputation. Journal of the Royal Statistical Society Series A: Statistics in Society,
+#' 184(3), 941-963.
+#'
 #' @examples
 #'
 #' data(admin)

diff --git a/R/method_nn.R b/R/method_nn.R
@@ -9,56 +9,52 @@
 #' @importFrom RANN nn2
 #'
 #'
-#' @description Mass imputation using nearest neighbours approach as described in Yang (2021).
+#' @description Mass imputation using nearest neighbours approach as described in Yang et al. (2021).
 #' The implementation is currently based on [RANN::nn2] function and thus it uses
-#' Euclidean distance to matching.
-#'
-#' This implementation extends Yang (2021) approach as described in Chlebicki et al. (2025), namely:
-#'
-#' \describe{
-#'  \item{pmm_weights}{if k>1 weighted aggregation of the mean for a given unit is used. We use distance
-#'  matrix returned by [RANN::nn2] function (`pmm_weights` from the [control_out()] function)}
-#'  \item{nn_exact_se}{if the non-probability sample is small we recommend using a mini-bootstrap
-#'  approach to estimate variance from the non-probability sample  (`nn_exact_se` from the [control_inf()] function)}
-#'  \item{pmm_k_choice}{the main `nonprob` function allows for dynamic selection of `k` neighbours based on the
-#'  variance minimization procedure (`pmm_k_choice` from the [control_out()] function)}
-#' }
+#' Euclidean distance for matching units from \mjseqn{S_A} (nonprobability) to \mjseqn{S_B} (probability).
+#' Estimation of the mean is done using \mjseqn{S_B} sample.
 #'
 #'
 #' @details Analytical variance
 #'
 #' The variance of the mean is estimated based on the following approach
 #'
-#' (a) non-probability part
+#' (a) non-probability part  (\mjseqn{S_A} with size \mjseqn{n_A}; denoted as `var_nonprob` in the result)
 #'
 #' This may be estimated using
 #'
 #' \mjsdeqn{
 #' \hat{V}_1 = \frac{1}{N^2}\sum_{i=1}^{S_A}\frac{1-\hat{\pi}_B(\boldsymbol{x}_i)}{\hat{\pi}_B(\boldsymbol{x}_i)}\hat{\sigma}^2(\boldsymbol{x}_i)
 #' }
 #'
-#' or
+#' where \mjseqn{\hat{\pi}_B(\boldsymbol{x}_i)} is an estimator of propensity scores which
+#' we currently estimate using \mjseqn{n_A/N} (constant) and \mjseqn{\hat{\sigma}^2(\boldsymbol{x}_i)} is
+#' estimated using based on the average of \mjseqn{(y_i - y_i^*)^2}.
 #'
-#' \mjsdeqn{
-#' \hat{V}_1 = \frac{1}{N^2}\sum_{i=1}^{S_A}\frac{1}{\hat{\pi}_B(\boldsymbol{x}_i)}(y_i - y_i^*)
-#' }
+#' Chlebicki et al. (2025, Algorithm 2) proposed non-parametric mini-bootstrap estimator
+#' (without assuming that it is consistent) but with good finite population properties.
+#' This bootstrap can be applied using `control_inference(nn_exact_se=TRUE)` and
+#' can be summarized as follows:
 #'
-#' or mini-bootstrap
+#' 1. Sample \mjseqn{n_A} units from \mjseqn{S_A} with replacement to create \mjseqn{S_A'} (if pseudo-weights are present inclusion probabilities should be proportional to their inverses).
+#' 2. Match units from \mjseqn{S_B} to \mjseqn{S_A'} to obtain predictions \mjseqn{y^*}=\mjseqn{{k}^{-1}\sum_{k}y_k}.
+#' 3. Estimate \mjseqn{\hat{\mu}=\frac{1}{N} \sum_{i \in S_B} d_i y_i^*}.
+#' 4. Repeat steps 1-3 \mjseqn{M} times (we set \mjseqn{M=50} in our simulations; this is hard-coded).
+#' 5. Estimate \mjseqn{\hat{V}_1=\text{var}{\hat{\boldsymbol{\mu}}}} obtained from simulations and save it as `var_nonprob`.
 #'
 #'
-#'
-#' (b) probability part
+#' (b) probability part (\mjseqn{S_B} with size \mjseqn{n_B}; denoted as `var_prob` in the result)
 #'
 #'  This part uses functionalities of the `{survey}` package and the variance is estimated using the following
 #'  equation:
 #'
 #' \mjsdeqn{
-#' \hat{V}_1=\frac{1}{N^2} \sum_{i=1}^n \sum_{j=1}^n \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}}
+#' \hat{V}_2=\frac{1}{N^2} \sum_{i=1}^n \sum_{j=1}^n \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}}
 #' \frac{y_i^*}{\pi_i} \frac{y_j^*}{\pi_j}
 #' }
 #'
-#'where \mjseqn{y^*_i} and \mjseqn{y_j^*} are values imputed imputed as an average of \mjseqn{k}-nearest neighbour,
-#'i.e. \mjseqn{{k}^{-1}\sum_{k}y_k}.
+#'where \mjseqn{y^*_i} and \mjseqn{y_j^*} are values imputed imputed as an average
+#' of \mjseqn{k}-nearest neighbour, i.e. \mjseqn{{k}^{-1}\sum_{k}y_k}. Note that \mjseqn{\hat{V}_2} in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
 #'
 #'
 #' @param y_nons target variable from non-probability sample
@@ -98,7 +94,7 @@
 #'  big found data for finite population inference using mass imputation.
 #'  Survey Methodology, June 2021 29 Vol. 47, No. 1, pp. 29-58
 #'
-#' Chlebicki, P., Chrostowski, Ł., & Beręsewicz, M. (2025). Data integration of non-probability
+#'  Chlebicki, P., Chrostowski, Ł., & Beręsewicz, M. (2025). Data integration of non-probability
 #' and probability samples with predictive mean matching. arXiv preprint arXiv:2403.13750.
 #'
 #' @examples
@@ -186,11 +182,12 @@ method_nn <- function(y_nons,
   y_mi_hat <- as.numeric(svydesign_mean)
 
   if (se) {
+
     var_prob <- as.vector(attr(svydesign_mean, "var"))
 
     sigma_hat <- mean((y_nons - y_nons_pred)^2)
     est_ps <- NROW(X_nons) / pop_size
-    var_nonprob <- NROW(X_rand) / pop_size^2 * (1 - est_ps) / est_ps * sigma_hat
+    var_nonprob <- 1/ pop_size^2 * (1 - est_ps) / est_ps * sigma_hat
     var_total <- var_prob + var_nonprob
 
     if (control_inference$nn_exact_se) {

diff --git a/R/method_npar.R b/R/method_npar.R
@@ -10,7 +10,7 @@
 #'
 #' @description
 #' Model for the outcome for the mass imputation estimator using loess via `stats::loess`.
-#'
+#' Estimation of the mean is done using the \mjseqn{S_B} probability sample.
 #'
 #' @details Analytical variance
 #'
@@ -23,7 +23,8 @@
 #' }
 #'
 #' where \mjseqn{\hat{e}_i=y_i - \hat{m}(x_i)} is the residual and \mjseqn{\hat{g}_B(\boldsymbol{x}_i) = \left\lbrace \pi_B(\boldsymbol{x}_i) \right\rbrace^{-1}} can be estimated
-#' various ways. In our case we estimate it using \mjseqn{\pi_B(\boldsymbol{x}_i)=E(R | \boldsymbol{x})} as suggested by Chen et al. (2022, p. 6). Currently, this is estimated using `stats::loess` with `"gaussian"` family.
+#' various ways. In the package we estimate \mjseqn{\hat{g}_B(\boldsymbol{x}_i)} using \mjseqn{\pi_B(\boldsymbol{x}_i)=E(R | \boldsymbol{x})} as suggested by Chen et al. (2022, p. 6). In particular,
+#' we currently support this using stats::loess` with `"gaussian"` family.
 #'
 #' (b) probability part (\mjseqn{S_B} with size \mjseqn{n_B}; denoted as `var_prob` in the result)
 #'
@@ -32,11 +33,12 @@
 #'
 #' \mjsdeqn{
 #' \hat{V}_2=\frac{1}{N^2} \sum_{i=1}^{n_B} \sum_{j=1}^{n_B} \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}}
-#' \frac{y_i}{\pi_i} \frac{y_j}{\pi_j}.
+#' \frac{\hat{m}(x_i)}{\pi_i} \frac{\hat{m}(x_j)}{\pi_j}.
 #' }
 #'
 #' Note that \mjseqn{\hat{V}_2} in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
 #'
+#'
 #' @param y_nons target variable from non-probability sample
 #' @param X_nons a `model.matrix` with auxiliary variables from non-probability sample
 #' @param X_rand a `model.matrix` with auxiliary variables from non-probability sample
@@ -66,6 +68,38 @@
 #'   \item{var_nonprob}{variance for the non-probability sampl component}
 #'   \item{model}{model type (character `"npar"`)}
 #' }
+#'
+#' @references
+#' Chen, S., Yang, S., & Kim, J. K. (2022). Nonparametric mass imputation for data integration.
+#' Journal of Survey Statistics and Methodology, 10(1), 1-24.
+#'
+#' @examples
+#'
+#' set.seed(123123123)
+#' N <- 10000
+#' n_a <- 500
+#' n_b <- 1000
+#' n_b1 <- 0.7*n_b
+#' n_b2 <- 0.3*n_b
+#' x1 <- rnorm(N, 2, 1)
+#' x2 <- rnorm(N, 2, 1)
+#' y1 <- rnorm(N, 0.3 + 2*x1+ 2*x2, 1)
+#' y2 <- rnorm(N, 0.3 + 0.5*x1^2+ 0.5*x2^2, 1)
+#' strata <- x1 <= 2
+#' pop <- data.frame(x1, x2, y1, y2, strata)
+#' sample_a <- pop[sample(1:N, n_a),]
+#' sample_a$w_a <- N/n_a
+#' sample_a_svy <- svydesign(ids=~1, weights=~w_a, data=sample_a)
+#' pop1 <- subset(pop, strata == TRUE)
+#' pop2 <- subset(pop, strata == FALSE)
+#' sample_b <- rbind(pop1[sample(1:nrow(pop1), n_b1), ],
+#'                   pop2[sample(1:nrow(pop2), n_b2), ])
+#' res_y_npar <- nonprob(outcome = y1 + y2 ~ x1 + x2,
+#'                       data = sample_b,
+#'                       svydesign = sample_a_svy,
+#'                       method_outcome = "npar")
+#' res_y_npar
+#'
 #' @export
 method_npar <- function(y_nons,
                         X_nons,

diff --git a/R/method_pmm.R b/R/method_pmm.R
@@ -3,28 +3,46 @@
 #' \loadmathjax
 #'
 #' @description
-#' Model for the outcome for the mass imputation estimator
+#' Model for the outcome for the mass imputation estimator. The implementation is currently based on [RANN::nn2] function and thus it uses Euclidean distance for matching units from \mjseqn{S_A} (nonprobability) to \mjseqn{S_B} (probability) based on predicted values from model \mjseqnm{\boldsymbol{x}_i)} based
+#' either on `method_glm` or `method_npar`.  Estimation of the mean is done using \mjseqn{S_B} sample.
 #'
+#' This implementation extends Yang et al. (2021) approach as described in Chlebicki et al. (2025), namely:
+#'
+#' \describe{
+#'  \item{pmm_weights}{if k>1 weighted aggregation of the mean for a given unit is used. We use distance
+#'  matrix returned by [RANN::nn2] function (`pmm_weights` from the [control_out()] function)}
+#'  \item{nn_exact_se}{if the non-probability sample is small we recommend using a mini-bootstrap
+#'  approach to estimate variance from the non-probability sample  (`nn_exact_se` from the [control_inf()] function)}
+#'  \item{pmm_k_choice}{the main `nonprob` function allows for dynamic selection of `k` neighbours based on the
+#'  variance minimization procedure (`pmm_k_choice` from the [control_out()] function)}
+#' }
 #'
 #' @details Analytical variance
 #'
 #' The variance of the mean is estimated based on the following approach
+
+#' (a) non-probability part  (\mjseqn{S_A} with size \mjseqn{n_A}; denoted as `var_nonprob` in the result) is currently estimated using the non-parametric mini-bootstrap estimator proposed by
+#' Chlebicki et al. (2025, Algorithm 2). It is not proved to be consistent but with good finite population properties.
+#' This bootstrap can be applied using `control_inference(nn_exact_se=TRUE)` and
+#' can be summarized as follows:
 #'
-#' (a) probability part
+#' 1. Sample \mjseqn{n_A} units from \mjseqn{S_A} with replacement to create \mjseqn{S_A'} (if pseudo-weights are present inclusion probabilities should be proportional to their inverses).
+#' 2. --
+#' 3. --
+#' 4. --
+#' 5. Estimate \mjseqn{\hat{V}_1=\text{var}{\hat{\boldsymbol{\mu}}}} obtained from simulations and save it as `var_nonprob`.
+#'
+#' (b) probability part (\mjseqn{S_B} with size \mjseqn{n_B}; denoted as `var_prob` in the result)
 #'
 #'  This part uses functionalities of the `{survey}` package and the variance is estimated using the following
 #'  equation:
 #'
 #' \mjsdeqn{
-#' \hat{V}_1=\frac{1}{N^2} \sum_{i=1}^n \sum_{j=1}^n \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}}
-#' \frac{y_i}{\pi_i} \frac{y_j}{\pi_j}
+#' \hat{V}_2=\frac{1}{N^2} \sum_{i=1}^{n_B} \sum_{j=1}^{n_B} \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}}
+#' \frac{m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})}{\pi_i} \frac{m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})}{\pi_j}.
 #' }
 #'
-#' (b) non-probability part
-#'
-#' \mjsdeqn{
-#' \hat{V}_2 = ..
-#' }
+#' Note that \mjseqn{\hat{V}_2} in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
 #'
 #' @param y_nons target variable from non-probability sample
 #' @param X_nons a `model.matrix` with auxiliary variables from non-probability sample

diff --git a/man/method_glm.Rd b/man/method_glm.Rd