selection more
mstasinopoulos committed Oct 21, 2024
1 parent de21067 commit bba0fb2
Showing 1 changed file with 23 additions and 1 deletion: vignettes/selection.qmd
$$
\begin{split}
g({\theta}_{ki}) &= b_0 + s_1({x}_{1i}) + \ldots + s_p({x}_{pi})
\end{split}
$$ {#eq-GAMLSS}
where ${D}(\,)$ is the assumed distribution, which depends on the parameters $\theta_{1i}, \ldots, \theta_{ki}$, and where all of the parameters can be functions of the explanatory variables $({x}_{1i}, \ldots, {x}_{pi})$.
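A model of this form can be written down directly in the `gamlss` R package. The following is a minimal sketch, not a definitive specification: the data frame `da` and the variables `y`, `x1`, `x2` are hypothetical, and `pb()` fits penalised-spline smooth terms.

```r
library(gamlss)

# A distributional regression model: both mu and sigma are modelled
# as smooth functions of (hypothetical) explanatory variables
m <- gamlss(y ~ pb(x1) + pb(x2),   # predictor for the first parameter (mu)
            sigma.fo = ~ pb(x1),   # predictor for the second parameter (sigma)
            family = GA,           # the assumed distribution D()
            data = da, trace = FALSE)
```

Here `family = GA` plays the role of ${D}(\,)$, and each `pb()` term corresponds to one of the smooth functions $s_p(\cdot)$.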
In reality we do not know the distribution ${D}(\,)$, and we also do not know **which** of the variables $({x}_{1i}, \ldots, {x}_{pi})$ affect the parameters $\theta_{1i}, \ldots, \theta_{ki}$, or **how** they do so. So model selection in a distributional regression model could take the form:
* select the _best_ fitting distribution;
* select the _relevant_ variables for the parameters and how they affect the parameters.
So a **general algorithm** for searching for the _best_ model could be:
- **START** by defining a set of appropriate distributions for the response, $D_j()$ for $j=1,\ldots, J$.
- **FOR** each $j$ in $1,\ldots, J$:
    - **SELECT** appropriate variables $({x}_{1i}, \ldots, {x}_{pi})$.
- **SELECT** the distribution $\hat{D}()$ and variables with the minimum value of a selected criterion.
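The steps above can be sketched in R. This is a minimal illustration under stated assumptions, not the package's own implementation: it assumes the `gamlss` package, a hypothetical data frame `da` with a positive continuous response `y` and candidate variables `x1`, `x2`, `x3`, a small fixed set of candidate distributions, and `stepGAIC()` for the variable-selection step.

```r
library(gamlss)

# START: a set of candidate distributions D_j for a positive response
candidates <- c("GA", "IG", "BCCG")  # gamma, inverse Gaussian, Box-Cox Cole-Green

# FOR each candidate distribution, SELECT variables for mu by stepwise AIC
fits <- lapply(candidates, function(fam) {
  m0 <- gamlss(y ~ 1, family = fam, data = da, trace = FALSE)  # intercept-only start
  stepGAIC(m0, scope = ~ x1 + x2 + x3, trace = FALSE)
})

# SELECT the distribution/variable combination with the minimum criterion
aics <- sapply(fits, AIC)
best <- fits[[which.min(aics)]]
```

The sketch only searches the `mu` predictor; a fuller search would also select terms for the other distribution parameters, which is what makes the exhaustive version of the algorithm slow.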
The selection criterion could be a criterion such as the AIC, defined on the training data, or a criterion defined on the **out of bag** data. While the above algorithm could work reasonably well for data with a relatively small number of explanatory variables, it could be very slow for data with many explanatory variables. Cutting some corners could improve the speed of the algorithm.
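An out-of-bag criterion can be sketched as follows: hold out part of the data, fit on the rest, and evaluate the predictive global deviance on the held-out part. This is a hedged sketch assuming the `gamlss` package and a hypothetical data frame `da`; `predictAll()` returns the fitted parameters for new data.

```r
library(gamlss)

set.seed(1)
in_bag <- sample(nrow(da), size = floor(0.7 * nrow(da)))
train  <- da[in_bag, ]
oob    <- da[-in_bag, ]

m <- gamlss(y ~ x1 + x2, sigma.fo = ~ x1, family = GA,
            data = train, trace = FALSE)

# out-of-bag criterion: predictive global deviance on the held-out data
p   <- predictAll(m, newdata = oob, data = train)
tgd <- -2 * sum(dGA(oob$y, mu = p$mu, sigma = p$sigma, log = TRUE))
```

Models (distributions and variable sets) would then be compared on `tgd` rather than on a training-data criterion such as the AIC.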
## Select a distribution
### The range of the response
The first thing to take into account in the selection of the distribution is that the distribution should be defined on the range of the response variable. @fig-responseType shows the different possibilities depending on whether the response is `continuous`, `discrete` or a `factor`. If the response is continuous and has negative values, a distribution on the real line is appropriate. For positive responses, a distribution on the positive real line is appropriate. For a bounded continuous response we have the option to transform the response to values between 0 and 1, or to create an appropriate truncated distribution. For a count response the consideration is whether the counts are finite or not. For unbounded counts a distribution similar to the Poisson distribution can be used; for finite counts, binomial-type distributions can be used. The case in which the response is a categorical variable (a factor) is called `classification`. If the factor is an `ordered` factor appropriate models exist, but we will not deal with them here. For an unordered factor response a binomial distribution can be used if the classification is binary, otherwise a multinomial distribution. Note that for classification problems there is a vast literature in machine learning dealing with the problem.
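Once the range of the response is identified, the `gamlss` function `fitDist()` can fit all available distributions of a given type and rank them by a generalised AIC. A minimal sketch, assuming a positive continuous response vector `y` (hypothetical here):

```r
library(gamlss)

# Fit all distributions defined on the positive real line and rank them;
# k = 2 gives the usual AIC penalty
f <- fitDist(y, type = "realplus", k = 2)

f$fits     # GAIC values of every fitted distribution, best first
f$family   # the selected (best-fitting) family
```

Other `type` values (for example `"realline"`, `"real0to1"`, `"counts"`, `"binom"`) correspond to the other branches of @fig-responseType.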
```{mermaid}
flowchart LR
K --> N[binary]
```
### Select an appropriate distribution
## Select appropriate variables
