an alternative distance metric, categorical variable handling, and optimization ideas #19

Open
dylanbeaudette opened this issue Sep 1, 2021 · 0 comments

First of all, thank you for the excellent package and companion articles.

While looking over the aoa code, it occurred to me that some of the complexity associated with handling categorical variables could be removed by switching to a different distance metric. Gower's generalized distance metric is well suited because it can integrate mixtures of ratio, nominal, and ordinal data types. The metric also automatically includes scaling / centering. There are a couple of implementations, including gower::gower_dist() and cluster::daisy().
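To make the idea concrete, here is a minimal base R sketch of Gower's distance for one pair of records. The data frame `d`, its values, and the helper `gower_pair()` are hypothetical and exist only for illustration; the sketch covers numeric and nominal variables with equal weights, and it does not handle missing values, ordinal factors, or constant numeric columns.

```r
# Toy data: one numeric and one nominal variable (values are made up)
d <- data.frame(
  clay = c(10, 20, 40),
  texture = factor(c("loam", "loam", "clay"))
)

# Gower distance between rows i and j of a data frame (hypothetical helper)
gower_pair <- function(d, i, j) {
  per_var <- vapply(names(d), function(v) {
    x <- d[[v]]
    if (is.numeric(x)) {
      # numeric: absolute difference scaled by the variable's range, in [0, 1]
      abs(x[i] - x[j]) / diff(range(x))
    } else {
      # nominal: 0 when the categories match, 1 otherwise
      as.numeric(x[i] != x[j])
    }
  }, numeric(1))
  # equal-weight average of the per-variable dissimilarities
  mean(per_var)
}

gower_pair(d, 1, 2)  # (10/30 + 0) / 2 = 1/6
gower_pair(d, 1, 3)  # (30/30 + 1) / 2 = 1
```

Because every per-variable term already lies in [0, 1], no separate scaling step or dummy coding of factors is needed before computing distances.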

It would appear that the knnx.dist function does all of the heavy lifting in aoa.

A quick benchmark of a few candidate methods:

library(gower)
library(cluster)
library(FNN)
library(microbenchmark)

set.seed(10101)
n <- 1000
a <- rnorm(n = n, mean = 0, sd = 2)
x <- rnorm(n = n, mean = 0, sd = 2)
y <- rnorm(n = n, mean = 0, sd = 2)

z <- data.frame(x, y, a)

microbenchmark(
  gower = gower_dist(z[1:10, ], z),
  knn = knnx.dist(data = z, query = z[1:10, ], k = 1),
  daisy = daisy(z, metric = 'gower')
)

The interface and resulting objects aren't directly compatible, but it does seem like gower::gower_dist() is a reasonable candidate in terms of speed. The main reason to consider cluster::daisy is that it can accommodate all variable types, while gower::gower_dist() does not yet differentiate between nominal / ordinal factors.

Unit: microseconds
  expr     min       lq       mean   median      uq      max neval cld
 gower   395.7   444.70    523.737   497.35   559.0    874.3   100  a 
   knn   772.6   794.05    892.615   842.70   925.2   1382.7   100  a 
 daisy 56398.0 73496.70 100253.478 78571.80 88727.8 276262.1   100   b
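Since the interfaces differ, a rough sketch of how a nearest-neighbour distance comparable to knnx.dist(..., k = 1) might be obtained with the gower package is shown below. It uses gower::gower_topn(), which returns the top-n matches in `y` for each row of `x`. The data frames `train` and `new` are made-up stand-ins for training data and new prediction locations, and how gower derives the variable ranges used for scaling should be checked against its documentation before relying on this.

```r
library(gower)

set.seed(10101)
# hypothetical training data and new locations (illustration only)
train <- data.frame(x = rnorm(100), y = rnorm(100))
new   <- data.frame(x = rnorm(5),   y = rnorm(5))

# closest training record (n = 1) for each row of `new`
nn <- gower_topn(x = new, y = train, n = 1)

# nn$distance is an n x nrow(new) matrix; flatten it to a vector,
# one Gower distance per new location
nn_dist <- as.vector(nn$distance)

# row index in `train` of each nearest neighbour
nn_idx <- as.vector(nn$index)
```

Note the orientation: FNN::knnx.dist() returns one row per query record, while gower_topn() returns one column per query record, so any downstream code would need to account for that.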

Profiling data for aoa run in a single thread:
[image: profiling output for aoa]

This was performed with a model based on 1,030 observations, applied to a raster stack with the following dimensions:
dimensions : 3628, 2351, 8529428, 18 (nrow, ncol, ncell, nlayers)

I'll follow up with a small example dataset that contains nominal and ordinal variables.
