Issue Facing While Fitting The Model With Huge Data #64

ganesh-morsu · 2023-08-29T09:34:21Z

My data contains around 166793 records, and I want to fit a TF-IDF model on these records.

from polyfuzz.models import TFIDF
from polyfuzz import PolyFuzz

data = []  # placeholder; the real list contains 166793 records in total
tfidf = TFIDF(n_gram_range=(1, 1), model_id="TF-IDF")
model = PolyFuzz(tfidf)

model.fit(data)

I run into a problem while fitting the model: the server process gets killed (I have tried a machine with 20 GB of RAM).
Is there any solution?

MaartenGr · 2023-08-29T11:50:02Z

That is most likely the result of a large vocabulary. Setting min_df to have a value higher than 1 will reduce the necessary RAM. You can do that by using a custom TF-IDF model.
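As a minimal sketch of what min_df does, the snippet below fits scikit-learn's TfidfVectorizer on a toy list of strings (the strings are made up for illustration and are not from the issue). Terms that appear in fewer than min_df documents are dropped from the vocabulary, so the TF-IDF matrix gets fewer columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, purely illustrative.
docs = ["apple pie", "apple tart", "apple cake", "banana bread", "cherry pie"]

# min_df=1 keeps every term; min_df=2 keeps only terms seen in at least 2 documents.
vocab_all = TfidfVectorizer(min_df=1).fit(docs).vocabulary_
vocab_min2 = TfidfVectorizer(min_df=2).fit(docs).vocabulary_

print(len(vocab_all), sorted(vocab_all))
print(len(vocab_min2), sorted(vocab_min2))
```

Note that min_df only shrinks the number of columns (terms); the number of rows still equals the number of records.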

ganesh-morsu · 2023-09-01T09:18:43Z

I created a custom TF-IDF model and tried increasing the min_df value, but I am still facing the same issue.

Below is the code for the custom model.

from polyfuzz.models import TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer

class CustomTFIDF(TFIDF):
    def __init__(self,
                 n_gram_range=(3, 3),
                 clean_string=True,
                 min_similarity=0.75,
                 top_n=1,
                 cosine_method="sparse",
                 model_id=None,
                 min_df_custom=2):  # Add a custom parameter for min_df
        super().__init__(n_gram_range, clean_string, min_similarity, top_n, cosine_method, model_id)
        self.min_df_custom = min_df_custom  # Set the custom min_df value

    def _extract_tf_idf(self,
                        from_list,
                        to_list=None,
                        re_train=True):
        if to_list:
            if re_train:
                # Customize the TfidfVectorizer with min_df
                self.vectorizer = TfidfVectorizer(min_df=self.min_df_custom, analyzer=self._create_ngrams).fit(
                    to_list + from_list)
                self.tf_idf_to = self.vectorizer.transform(to_list)
            tf_idf_from = self.vectorizer.transform(from_list)
        else:
            if re_train:
                # Customize the TfidfVectorizer with min_df
                self.vectorizer = TfidfVectorizer(min_df=self.min_df_custom, analyzer=self._create_ngrams).fit(
                    from_list)
                self.tf_idf_to = self.vectorizer.transform(from_list)
            tf_idf_from = self.tf_idf_to

        return tf_idf_from, self.tf_idf_to

MaartenGr · 2023-09-02T09:00:00Z

You can try setting the min_df value to much higher than 2. Setting it to at least 10 is most likely to help out.

ganesh-morsu · 2023-09-02T09:40:31Z

I am facing the same issue even after switching to higher values.
I have tried min_df = 10, min_df = 15, and min_df = 20.

The error I am getting:
MemoryError: Unable to allocate 207. GiB for an array with shape (27815314339,) and data type int64
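The numbers in this error can be checked with a few lines of arithmetic. The sketch below only uses figures quoted in the thread (the record count and the array shape). The reading that the array is close to one entry per pair of records is an inference from those figures, not something the thread confirms.

```python
n_records = 166_793            # record count reported in the issue
n_elements = 27_815_314_339    # array shape from the MemoryError
bytes_per_int64 = 8

# Size of the requested int64 array in GiB.
gib = n_elements * bytes_per_int64 / 2**30

# Compare the array length with the number of record pairs (n * n).
n_pairs = n_records ** 2
ratio = n_elements / n_pairs

print(f"{gib:.1f} GiB requested")
print(f"array length is {ratio:.4%} of n_records squared")
```

If that reading is right, the allocation scales with the square of the number of records rather than with the vocabulary size, which would be consistent with raising min_df alone not resolving the error.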

MaartenGr · 2023-09-03T07:41:02Z

Have you tried using pip install polyfuzz[fast]? I believe it should reduce the memory allocation here. Also, you can use "knn" instead of "sparse" to reduce memory. I would advise trying out these two options.
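To illustrate the general idea behind a nearest-neighbour lookup, here is a small sketch that uses scikit-learn directly. It is not PolyFuzz's internal code, and the toy strings and variable names are assumptions made for the example. Instead of keeping a similarity score for every pair of strings, it asks only for the single closest match per input, so the result has one column rather than one column per candidate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy data, purely illustrative.
from_list = ["apple pie", "banana bread"]
to_list = ["apple pies", "banana breads", "cherry cake"]

# Character 3-gram TF-IDF vectors, fitted on both lists.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3)).fit(from_list + to_list)
to_vectors = vectorizer.transform(to_list)
from_vectors = vectorizer.transform(from_list)

# Ask for the single nearest neighbour by cosine distance.
knn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(to_vectors)
distances, indices = knn.kneighbors(from_vectors)

# Cosine similarity is 1 minus cosine distance.
matches = [
    (src, to_list[idx[0]], 1 - dist[0])
    for src, idx, dist in zip(from_list, indices, distances)
]
print(matches)
```

In PolyFuzz itself, the equivalent change would be passing cosine_method="knn" when constructing the TFIDF model, as suggested above.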
