Generate raw data #54

franciscojavierarceo · 2024-10-10T17:56:24Z

Resolves #53

Signed-off-by: Francisco Javier Arceo <[email protected]>

data/generate_raw_data.py

Co-authored-by: Helber Belmiro <[email protected]>

RHRolun · 2025-02-11T08:52:09Z

data/generate_raw_data.py

+        transactions_before["transaction_timestamp"] < transactions_before["created_x"]
+    ]
+    transactions_before["days_between_transactions"] = (
+        transactions_before["transaction_timestamp"] - transactions_before["created_x"]


Suggested change

-        transactions_before["transaction_timestamp"] - transactions_before["created_x"]
+        abs(transactions_before["transaction_timestamp"] - transactions_before["created_x"])

Because only transactions with a timestamp before "created_x" are kept, this difference is always negative, which causes "days_since_last_transaction" and "days_since_first_transaction" to get mixed up.
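To illustrate the sign problem, here is a minimal sketch using only the standard library rather than pandas. The dates are made up, and it assumes "days_since_last_transaction" is computed as the min of "days_between_transactions" (only the max aggregate for "days_since_first_transaction" is visible in this diff).

```python
from datetime import datetime

# Hypothetical values: one user's "created_x" and two earlier transactions.
created = datetime(2024, 1, 10)
transaction_timestamps = [datetime(2024, 1, 1), datetime(2024, 1, 8)]

# Current behaviour: timestamp minus created is negative for every kept row.
signed = [(ts - created).days for ts in transaction_timestamps]

# Suggested behaviour: take the absolute value of the difference.
unsigned = [abs(ts - created).days for ts in transaction_timestamps]

# With signed values, min() selects the oldest transaction and max() the
# most recent one, which is the reverse of what the column names promise.
swapped_last, swapped_first = min(signed), max(signed)

# With absolute values, min() is the most recent and max() the oldest.
days_since_last, days_since_first = min(unsigned), max(unsigned)
```

With the signed values the min aggregate points at the oldest transaction, so the two features end up swapped as well as negative. Taking the absolute value fixes both.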

RHRolun · 2025-02-11T08:56:03Z

data/generate_raw_data.py

+    df = pd.concat([train, test, valid], axis=0).reset_index(drop=True)
+
+    df["user_id"] = [f"user_{i}" for i in range(df.shape[0])]
+    df["transaction_id"] = [f"txn_{i}" for i in range(df.shape[0])]


This transaction_id does not seem to be used again later and is not part of the output parquet files.
Should it be added as a column to transactions_list in generate_random_transactions?

RHRolun · 2025-02-11T09:06:21Z

data/generate_raw_data.py

+
+    print("generating transaction level data...")
+    user_purchase_history = generate_random_transactions(
+        users_df=df[df["repeat_retailer"] == 1].reset_index(drop=True),


I'm curious: why are we only using data from users who have purchased multiple times from the same retailer?

RHRolun · 2025-02-11T09:51:21Z

data/generate_raw_data.py

+            days_since_first_transaction=("days_between_transactions", "max"),
+        )
+        .reset_index()
+        .fillna(0)


Suggested change

-        .fillna(0)

This still leaves NaNs in the dataframe; applying fillna(0) to final_df fixes the issue.

RHRolun · 2025-02-11T09:51:53Z

data/generate_raw_data.py

+        .reset_index(drop=True)
+        .drop("created_x", axis=1)
+    )
+


Suggested change

+    final_df = final_df.fillna(0)
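A small sketch of why filling NaNs on the aggregated frame may not be enough. It assumes the aggregated features are later left-joined onto the full user table, which is not shown in this diff. The frames, the example values and the "min" aggregate are hypothetical.

```python
import pandas as pd

# Hypothetical inputs: user_2 has no transactions at all.
users = pd.DataFrame({"user_id": ["user_0", "user_1", "user_2"]})
transactions = pd.DataFrame(
    {
        "user_id": ["user_0", "user_0", "user_1"],
        "days_between_transactions": [9.0, 2.0, 5.0],
    }
)

features = (
    transactions.groupby("user_id")
    .agg(
        days_since_last_transaction=("days_between_transactions", "min"),
        days_since_first_transaction=("days_between_transactions", "max"),
    )
    .reset_index()
    .fillna(0)  # nothing to fill here: every aggregated row has values
)

# The left join is what introduces NaNs, for users missing from `features`.
final_df = users.merge(features, on="user_id", how="left")
nans_before = int(final_df.isna().sum().sum())

# Filling after the join covers those rows as well.
final_df = final_df.fillna(0)
nans_after = int(final_df.isna().sum().sum())
```

In this sketch the NaNs only appear after the join, so the earlier fillna(0) has nothing to fill and the one on final_df does the work.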

RHRolun · 2025-02-11T12:07:01Z

data/generate_raw_data.py

+
+
+if __name__ == "__main__":
+    main()


Great PR!
Just so I understand this correctly - do you think this should come in as its own data prep section, or that the parquet files this code produces should exist ahead of time and just be used during training/inference?

franciscojavierarceo added 6 commits October 10, 2024 13:55

feat: Adding additional mechanism to generate raw data

f629f4b

Signed-off-by: Francisco Javier Arceo <[email protected]>

Renaming file

89ad706

Signed-off-by: Francisco Javier Arceo <[email protected]>

updated to use absolute path so script can be run in main repo

71eeeef

Signed-off-by: Francisco Javier Arceo <[email protected]>

exporting data using absolute path as well

ab1ac97

Signed-off-by: Francisco Javier Arceo <[email protected]>

Merged back the original features and added more progress output

bf33f5d

Signed-off-by: Francisco Javier Arceo <[email protected]>

adding the set column

4a87143

Signed-off-by: Francisco Javier Arceo <[email protected]>

hbelmiro reviewed Jan 7, 2025

data/generate_raw_data.py (outdated, resolved)

Update data/generate_raw_data.py

0f73d75

Co-authored-by: Helber Belmiro <[email protected]>

RHRolun reviewed Feb 11, 2025
