Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop a single coefficient (one level from categorical variable) from formula/modelmatrix for Likelihood ratio test #745

Open
MaximilianNuber opened this issue Feb 19, 2024 · 6 comments

Comments

@MaximilianNuber
Copy link

Dear all,

I am fitting MixedModels for large single cell datasets.
One approach to test for covariates is to drop a single coefficient from the modelmatrix, which belongs to one level of a categorical variable.
I have not been able to find this possibility in Julia.

As an example, let me use the iris dataset. It may not be a mixed model, but in principle it is the same.

We have the full model:
full = lm(@formula(SepalLength ~ Species), iris)
with results

SepalLength ~ 1 + Species

Coefficients:
─────────────────────────────────────────────────────────────────────────────
                     Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept)          5.936   0.0728022  81.54    <1e-99   5.79213    6.07987
Species: setosa     -0.93    0.102958   -9.03    <1e-15  -1.13347   -0.726531
Species: virginica   0.652   0.102958    6.33    <1e-08   0.448531   0.855469
─────────────────────────────────────────────────────────────────────────────

My goal would be, to drop Species: setosa in the formula or model matrix for another linear model.
This nested model would be compared to the full model by likelihood ratio test.

Everything I found so far would be dropping the Species variable entirely, but it is not what I want.

I have played around with contrasts, where in the example the modelmatrix column for Species: setosa would be all zeros, but got the warning that the model is not full rank.

Is there any solution to this, or possibly a workaround?

Thank you for any help,
Max

@MaximilianNuber
Copy link
Author

P.S. I am posting this question in MixedModels.jl, as in GLM.jl I would be able to use X, y as input for lm or glm.
I haven't found this possibility for MixedModels, but I do see that for specifying random effects, we need the formula language. I just don´t know to do it in this case.

@dmbates
Copy link
Collaborator

dmbates commented Feb 20, 2024

What you want to do is related to profiling the log-likelihood with respect to the fixed-effects coefficients, for which the code is in src/profile/fixefpr.jl. That code is more general than you need but it shows that you can create a reduced model by creating a fixed-effects term and then copying the response, the reterm (which needs a deepcopy or the special copy method in the file mentioned above) and the formula. Try this.

using MixedModels

function dropcol(m::LinearMixedModel, j::Integer)
    fe = m.feterm        # model matrix X, column names, rank, etc.
    x = fe.x             # model matrix
    fe.rank == size(x, 2) || throw(ArgumentError("m.X is not of full column rank"))
    notj = deleteat!(collect(axes(x, 2)), j) # indirectly checks that j is in range
    feterm = MixedModels.FeTerm(x[:, notj], fe.cnames[notj])
    θbak = m.θ           # save a copy of the parameter estimates
    setθ!(m, m.optsum.initial) # ensure m is at initial values
    mnew = LinearMixedModel(collect(m.y), feterm, deepcopy(m.reterms), m.formula)
    objective!(m, θbak)  # restore parameter values in m
    return mnew
end

m1 = let f = @formula y ~ 1 + dept + service + (1|s) + (1|d)
    fit(MixedModel, f, MixedModels.dataset(:insteval))
end

m1a = fit!(dropcol(m1, 2))

MixedModels.likelihoodratiotest(m1a, m1)

@MaximilianNuber
Copy link
Author

Dear @dmbates,

Thank you so much, this works perfectly.

If I may, I would like to ask a follow-up question, about the correctness of my application to GeneralizedLinearMixedModel :

I applied your dropcol to GeneralizedLinearMixedModel by adjusting the function to include a target-vector y.
I assume that the LMM within a GLMM carries a y-vector adjusted to the Distribution of the GLMM, therefore we need to include the original response.

function dropcol2(m::LinearMixedModel, j::Integer, y2)
    fe = m.feterm        # model matrix X, column names, rank, etc.
    x = fe.x             # model matrix
    fe.rank == size(x, 2) || throw(ArgumentError("m.X is not of full column rank"))
    notj = deleteat!(collect(axes(x, 2)), j) # indirectly checks that j is in range
    feterm = MixedModels.FeTerm(x[:, notj], fe.cnames[notj])
    θbak = m.θ           # save a copy of the parameter estimates
    setθ!(m, m.optsum.initial) # ensure m is at initial values
    mnew = LinearMixedModel(collect(y2), feterm, deepcopy(m.reterms), m.formula)
    MixedModels.objective!(m, θbak)  # restore parameter values in m
    return mnew
end

Then, we contruct a new GeneralizedLinearMixedModel as in MixedModels.jl/src/generalizedlinearmixedmodel.jl:

function dropcol_glmm(glmm::GeneralizedLinearMixedModel, j::Integer)
    MixedModels.unfit!(glmm)
    LMM = dropcol2(glmm.LMM, j, glmm.y)
    wts = []
    offset = []
    T = eltype(LMM.Xymat)
    y = copy(LMM.y)
    d = Bernoulli()
    l = canonicallink(d)
    dofit = size(LMM.X, 2) != 0 # GLM.jl kwarg
    gl = glm(LMM.X, y, d, l;
            wts=convert(Vector{T}, wts),
            dofit,
            offset=convert(Vector{T}, offset)
    )
    β = dofit ? coef(gl) : T[]
    u = [fill(zero(eltype(y)), MixedModels.vsize(t), MixedModels.nlevs(t)) for t in LMM.reterms]
    vv = length(u) == 1 ? vec(first(u)) : similar(y, 0)

    res = GeneralizedLinearMixedModel{T,typeof(d)}(
        LMM,
        β,
        copy(β),
        LMM.θ,
        copy.(u),
        u,
        zero.(u),
        gl.rr,
        similar(y),
        oftype(y, wts),
        similar(vv),
        similar(vv),
        similar(vv),
        similar(vv),
    )
    return res
end

As an adhoc example:

data = MixedModels.dataset(:insteval)
data = DataFrame(data)
data.service = map(x -> ifelse(x == "N", 0, 1), data.service)

fm = @formula(service ~ studage + lectage + (1|s))
m = GeneralizedLinearMixedModel(fm, data, Bernoulli())
fit!(m; fast = true)

Generalized Linear Mixed Model fit by maximum likelihood (nAGQ = 1)
  service ~ 1 + studage + lectage + (1 | s)
  Distribution: Bernoulli{Float64}
  Link: LogitLink()

   logLik     deviance      AIC         AICc        BIC     
 -47205.8133  94411.6267  94431.6267  94431.6297  94523.6663

Variance components:
     Column   Variance Std.Dev. 
s (Intercept)  0.437793 0.661659

 Number of obs: 73421; levels of grouping factors: 2972

Fixed-effects parameters:
────────────────────────────────────────────────────
                 Coef.  Std. Error       z  Pr(>|z|)
────────────────────────────────────────────────────
(Intercept)  -0.123563   0.0285926   -4.32    <1e-04
studage: 4   -0.387704   0.0419093   -9.25    <1e-19
studage: 6   -1.04746    0.0421811  -24.83    <1e-99
studage: 8   -1.27521    0.0452258  -28.20    <1e-99
lectage: 2    0.184468   0.0233174    7.91    <1e-14
lectage: 3    0.530911   0.0279009   19.03    <1e-80
lectage: 4    0.718081   0.0297538   24.13    <1e-99
lectage: 5    1.16544    0.0340265   34.25    <1e-99
lectage: 6    1.52207    0.0301753   50.44    <1e-99
────────────────────────────────────────────────────

m2 = dropcol_glmm(m, 2)
fit!(m2, fast = true)

Generalized Linear Mixed Model fit by maximum likelihood (nAGQ = 1)
  service ~ 1 + studage + lectage + (1 | s)
  Distribution: Bernoulli{Float64}
  Link: LogitLink()

   logLik     deviance      AIC         AICc        BIC     
 -47294.2634  94588.5267  94606.5267  94606.5292  94689.3624

Variance components:
     Column   Variance Std.Dev. 
s (Intercept)  0.406962 0.637936

 Number of obs: 73421; levels of grouping factors: 2972

Fixed-effects parameters:
────────────────────────────────────────────────────
                 Coef.  Std. Error       z  Pr(>|z|)
────────────────────────────────────────────────────
(Intercept)  -0.298664   0.0107902  -27.68    <1e-99
studage: 6   -0.926317   0.0167307  -55.37    <1e-99
studage: 8   -1.0441     0.0178362  -58.54    <1e-99
lectage: 2    0.193725   0.0108125   17.92    <1e-71
lectage: 3    0.475043   0.0127252   37.33    <1e-99
lectage: 4    0.674609   0.0137594   49.03    <1e-99
lectage: 5    1.11943    0.0158697   70.54    <1e-99
lectage: 6    1.46943    0.0137901  106.56    <1e-99
────────────────────────────────────────────────────

My point in all of this is to recreate the approach of the MAST package for single-cell analysis, where I need to fit both a LinearMixedModel and a Logistic regression.

Again, thank you for your help.

Best,
Max

@dmbates
Copy link
Collaborator

dmbates commented Feb 22, 2024

I think that will work. We could try to make it easier to construct a Generalized Linear Mixed Model from the feterm and reterms but I am a bit reluctant to make that too easy. In general it is not a good idea to concentrate on individual coefficients from a categorical covariate but I can understand that in single-cell analysis it may be a convenient way to phrase that problem.

I did write a reduced set of types and methods to fit a logistic GLMM in https://embraceuncertaintybook.com/aGHQ.html If you were interested we could try to modify those for a special-purpose GLMM implementation. If this seems worthwhile to you, could you suggest where I should start reading about the MAST approach? Is the 2015 Genome Biology paper the best place to start or should I start with the vignettes in the BioConductor package?

Feel free to contact me at my gmail.com (same username as on github) address to take this discussion offline.

@dmbates
Copy link
Collaborator

dmbates commented Feb 22, 2024

@MaximilianNuber Is it correct that there is only one random-effects term in the LMMs and GLMMs that you wish to fit and that it is of the form (1|samp)? If so, there are considerable simplifications that can be made in fitting the model and refitting it with individual columns in X removed.

@MaximilianNuber
Copy link
Author

@dmbates Yes, the only random effect in the model is in the form (1|sample), with the goal to retain sample structure. In larger or several datasets as my own I have been wondering if the pool/sequencing run/dataset should be added as well, but I haven´t found this attempt elsewhere so I would keep it at the sample level for now.

For more in-depth information on your question I am preparing a mail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants