From bba0fb22ae7e6acadbe4c58ef2e6ebea5535d1be Mon Sep 17 00:00:00 2001
From: Mikis Stasinopoulos
Date: Mon, 21 Oct 2024 12:00:21 +0100
Subject: [PATCH] selection more

---
 vignettes/selection.qmd | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/vignettes/selection.qmd b/vignettes/selection.qmd
index 7e11580..ce1adc1 100644
--- a/vignettes/selection.qmd
+++ b/vignettes/selection.qmd
@@ -32,7 +32,7 @@ g({\theta}_{ki}) &=& b_0 + s_1({x}_{1i}) + \ldots, s_p({x}_{pi})
 \end{split}
 $$ {#eq-GAMLSS}
 where ${D}( )$ is the assumed distribution which depends on parameters $\theta_{1i}, \ldots, \theta_{ki}$ and where all the parameters can be functions of the explanatory variables $({x}_{1i}, \ldots, {x}_{pi})$.
-In reality we do not know the distribution ${D}( )$ and also we do not know **which** and **how** the variables $({x}_{1i}, \ldots, {x}_{pi})$ effect the parameters $\theta_{1i}, \ldots, \theta_{ki}$. So the model selection in a distributional regression model takes the form of;
+In reality we do not know the distribution ${D}( )$, nor do we know **which** of the variables $({x}_{1i}, \ldots, {x}_{pi})$ affect the parameters $\theta_{1i}, \ldots, \theta_{ki}$, or **how** they do so. So model selection in a distributional regression model could take the form:
 
 * select the _best_ fitting distribution;
 
@@ -40,8 +40,25 @@ In reality we do not know the distribution ${D}( )$ and also we do not know **w
 
 * select the _relevant_ variables for the parameters and how they effect the parameters.
 
+So a **general algorithm** for searching for a _best_ model could be:
+
+- **START** by defining a set of appropriate distributions $D_j()$, $j=1,\ldots, J$, for the response.
+
+- **FOR** $j$ in $1,\ldots, J$:
+
+- **SELECT** appropriate variables $({x}_{1i}, \ldots, {x}_{pi})$.
+
+- **SELECT** the distribution $\hat{D}_j()$ and variables with the minimum value of a selected criterion.
+
+The selection criterion could be a criterion such as the AIC evaluated on the training data, or a criterion evaluated on the **out of bag** data. While the above algorithm can work reasonably well for data with a relatively small number of explanatory variables, it can be very slow for data with many explanatory variables. Cutting some corners can improve the speed of the algorithm.
+
+
+
+
 ## Select a distribution
 
+### The range of the response
+
 The first thing to take into the account in the selection of the distribution is that the distribution should be defined in the range of the response variable. @fig-responseType shows the different possibilities depending on whether the response is `continuous`, `discrete` of `factor` If the response is continuous and has negative values a distribution in the real line is appropriate. For positive responses a positive real line distribution is appropriate. For bounded continuous response we have the options to transform the response to values between 0 and 1 or to create an appropriate truncated distribution. For count response the consideration is whether the counts are finite or not. For infinity counts a distribution similar to the Poisson distribution can be used. For finite counts binomial type distributions can be used. The case in which the response is a categorical variable (factor) is called `classification` regression. If the factor is an `ordered` factor appropriate models exist but we will not deal with them here. For unordered factor responses a binomial distribution can be use if the classification is binary otherwise a multinomial distribution. Note that for classification problems, there is a vast literature in machine learning to deal with the problem.
@@ -66,6 +83,11 @@ flowchart LR
 K --> N[binary]
 ```
 
+
+### Select appropriate distribution
+
+
+
 ## Select appropriate variables
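
The START/FOR/SELECT search added by this patch could be sketched in R with the `gamlss` package. This is a minimal sketch, not part of the patch: `chooseDist()`, `getOrder()` and `stepGAICAll.A()` are existing gamlss helpers, but the choice of the `rent` data, the `GA` starting family and the GAIC criterion are assumptions made for the illustration.

```r
## Sketch of the general search algorithm above, assuming the gamlss
## package and its `rent` data set (floor space Fl, year A, heating H,
## location loc) are available.
library(gamlss)

## START: fit an initial model under a default positive-real-line family
m0 <- gamlss(R ~ Fl + A + H + loc, data = rent, family = GA)

## FOR each candidate distribution: chooseDist() refits the model under
## every distribution of the requested type and tabulates a GAIC per fit
crit <- chooseDist(m0, type = "realplus")

## SELECT the distribution: rank candidates by the first criterion column
getOrder(crit, 1)

## SELECT the relevant variables for all parameters by stepwise GAIC
m1 <- stepGAICAll.A(m0, scope = list(lower = ~1, upper = ~ Fl + A + H + loc))
```

Replacing the GAIC here with a criterion computed on held-out (out-of-bag) data would follow the same loop, at the cost of refitting each candidate on the training split only.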