Categorical Features Pairwise Euclidean Distances

A Python package for quickly computing pairwise Euclidean distances on datasets with categorical features.

Motivation

When developing machine learning models, I often ran into datasets with categorical features. Most of the time, dealing with them was fairly straightforward: I would use the pandas get_dummies() function to convert each feature into a one-hot-encoded representation.

But when these features had a very large number of categories, computing the Euclidean distance between every pair of samples became extremely slow.
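For reference, the one-hot approach described above can be sketched roughly as follows. This is an illustrative baseline and not part of the package: the helper name one_hot_euclidean and the sample data are made up for the example. Note that the encoded matrix gains one column per category, which is where the cost comes from when there are many categories.

```python
import numpy as np
import pandas as pd

def one_hot_euclidean(df1, df2, categorical_columns):
    """Hypothetical baseline: one-hot encode, then compute all pairwise distances."""
    # Encode both frames together so they end up with the same dummy columns.
    combined = pd.concat([df1, df2], keys=["a", "b"])
    encoded = pd.get_dummies(combined, columns=categorical_columns).astype(float)
    a = encoded.loc["a"].to_numpy()
    b = encoded.loc["b"].to_numpy()
    # Broadcast to shape (len(df1), len(df2), n_encoded_columns).
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

df1 = pd.DataFrame({"col1": ["c1", "c2", "c1"], "col2": [4, 5, 6]})
df2 = pd.DataFrame({"col1": ["c1", "c3", "c2"], "col2": [2, 5, 8]})
baseline = one_hot_euclidean(df1, df2, ["col1"])
```

Encoding the two frames separately would be a pitfall here, since each frame could produce a different set of dummy columns.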

This is where this package comes in. In my own tests, this code ran significantly faster than the scikit-learn pairwise Euclidean distances function applied to one-hot-encoded categorical features.
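One way to see why the one-hot matrix is not strictly needed: for a single one-hot-encoded feature, two samples with the same category contribute 0 to the squared distance, and two samples with different categories contribute 1 + 1 = 2. The sketch below uses that identity with plain NumPy. It is only an illustration of the idea under that assumption, not necessarily how cfed is implemented, and the function name mixed_euclidean is hypothetical.

```python
import numpy as np

def mixed_euclidean(num1, cat1, num2, cat2):
    """Pairwise distances without building one-hot columns (illustrative)."""
    # Squared distances over the numerical columns, shape (n1, n2).
    num_sq = ((num1[:, None, :] - num2[None, :, :]) ** 2).sum(axis=-1)
    # Number of categorical columns whose labels differ, shape (n1, n2).
    mismatches = (cat1[:, None, :] != cat2[None, :, :]).sum(axis=-1)
    # Each mismatch adds 2 to the squared one-hot distance.
    return np.sqrt(num_sq + 2 * mismatches)

num1 = np.array([[4.0], [5.0], [6.0]])
cat1 = np.array([["c1"], ["c2"], ["c1"]])
num2 = np.array([[2.0], [5.0], [8.0]])
cat2 = np.array([["c1"], ["c3"], ["c2"]])
distances = mixed_euclidean(num1, cat1, num2, cat2)
```

The work per pair here depends on the number of categorical columns rather than on the number of distinct categories.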

Prerequisites

See requirements.txt for the full list of prerequisite libraries.

Installation

To start using this package, run this command in a terminal:

pip install cfed

Usage

import pandas as pd
from cfed.pairwise import euclidean_distances

df1 = pd.DataFrame.from_dict({
    'col1': [1, 2, 3],
    'col2': [4, 5, 6],
    'col3': [7, 8, 9]
})
df2 = pd.DataFrame.from_dict({
    'col1': [1, 4, 7],
    'col2': [2, 5, 8],
    'col3': [3, 6, 9],
})

# Treat col1 as categorical, even though its values are integers.
distances = euclidean_distances(df1, df2, categorical_columns=['col1'])

Or, without specifying categorical_columns:

import pandas as pd
from cfed.pairwise import euclidean_distances

df1 = pd.DataFrame.from_dict({
    'col1': ['c1', 'c2', 'c1'],
    'col2': [4, 5, 6],
    'col3': [7, 8, 9]
})
df2 = pd.DataFrame.from_dict({
    'col1': ['c1', 'c3', 'c2'],
    'col2': [2, 5, 8],
    'col3': [3, 6, 9],
})

distances = euclidean_distances(df1, df2)

Or, if the numerical and categorical features are already split into separate DataFrames:

import pandas as pd
from cfed.pairwise import euclidean_distances_from_split

df1_numerical = pd.DataFrame.from_dict({
    'col1': [1, 2, 3],
    'col2': [4, 5, 6],
    'col3': [7, 8, 9]
})
df2_numerical = pd.DataFrame.from_dict({
    'col1': [1, 4, 7],
    'col2': [2, 5, 8],
    'col3': [3, 6, 9],
})

df1_categorical = pd.DataFrame.from_dict({
    'col4': ['c1', 'c1', 'c2'],
})
df2_categorical = pd.DataFrame.from_dict({
    'col4': ['c1', 'c2', 'c2'],
})

distances = euclidean_distances_from_split(
    df1_numerical, df1_categorical, df2_numerical, df2_categorical
)
