Generate raw data #54

Open: franciscojavierarceo wants to merge 7 commits into main

Resolves #53

    transactions_before["transaction_timestamp"] < transactions_before["created_x"]
]
transactions_before["days_between_transactions"] = (
    transactions_before["transaction_timestamp"] - transactions_before["created_x"]
Contributor

Suggested change
- transactions_before["transaction_timestamp"] - transactions_before["created_x"]
+ abs(transactions_before["transaction_timestamp"] - transactions_before["created_x"])

Contributor

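Because the filter above keeps only rows where "transaction_timestamp" is earlier than "created_x", this difference is always negative. As a result, the "min" and "max" aggregations pick the opposite ends of the range, and "days_since_last_transaction" and "days_since_first_transaction" get swapped.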
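
A minimal sketch of the sign issue, using made-up timestamps. The column names follow the diff; the `days_since_last_transaction=("days_between_transactions", "min")` aggregation is an assumption, since only the `"max"` line is visible in this excerpt, and `summarize` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical data: one user created on 2024-01-10 with two earlier transactions.
transactions_before = pd.DataFrame(
    {
        "user_id": ["user_0", "user_0"],
        "transaction_timestamp": pd.to_datetime(["2024-01-01", "2024-01-08"]),
        "created_x": pd.to_datetime(["2024-01-10", "2024-01-10"]),
    }
)


def summarize(days: pd.Series) -> pd.DataFrame:
    """Aggregate per user as in the diff (the "min" line is assumed)."""
    return (
        transactions_before.assign(days_between_transactions=days)
        .groupby("user_id")
        .agg(
            days_since_last_transaction=("days_between_transactions", "min"),
            days_since_first_transaction=("days_between_transactions", "max"),
        )
        .reset_index()
    )


delta = transactions_before["transaction_timestamp"] - transactions_before["created_x"]

signed = summarize(delta.dt.days)         # day counts are -9 and -2
unsigned = summarize(abs(delta).dt.days)  # day counts are 9 and 2

print(signed)
print(unsigned)
```

With the signed difference, the most recent transaction (2 days before creation) ends up under `days_since_first_transaction`; with `abs(...)`, the two columns line up with their names.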

df = pd.concat([train, test, valid], axis=0).reset_index(drop=True)

df["user_id"] = [f"user_{i}" for i in range(df.shape[0])]
df["transaction_id"] = [f"txn_{i}" for i in range(df.shape[0])]
Contributor

This transaction_id does not seem to be used again later, and it is not part of the output parquet files.
Should it be added as a column in transactions_list in generate_random_transactions?
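
One possible shape for that change, as a sketch only: the body of generate_random_transactions is not shown in this excerpt, so `make_transactions_with_ids` below is a hypothetical stand-in for the loop that fills transactions_list, and the `txn_{i}` format is borrowed from the line under review.

```python
import pandas as pd


def make_transactions_with_ids(user_ids, n_per_user):
    """Hypothetical stand-in for the loop that builds transactions_list."""
    transactions_list = []
    for user_id in user_ids:
        for _ in range(n_per_user):
            transactions_list.append(
                {
                    "user_id": user_id,
                    # One id per generated transaction, not one per user row.
                    "transaction_id": f"txn_{len(transactions_list)}",
                }
            )
    return pd.DataFrame(transactions_list)


txns = make_transactions_with_ids(["user_0", "user_1"], n_per_user=2)
print(txns)
```

Assigning the id inside the generator keeps it unique per transaction, whereas the line under review assigns one id per row of the user-level dataframe.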


print("generating transaction level data...")
user_purchase_history = generate_random_transactions(
    users_df=df[df["repeat_retailer"] == 1].reset_index(drop=True),
Contributor

I'm curious: why are we only using data from users who have purchased multiple times from the same retailer?

days_since_first_transaction=("days_between_transactions", "max"),
)
.reset_index()
.fillna(0)
Contributor

Suggested change
- .fillna(0)

Contributor RHRolun, Feb 11, 2025

This still leaves NaNs in the dataframe. Applying fillna(0) to final_df fixes the issue.
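
One way this can happen, sketched with made-up frames: if final_df is built with a left merge (an assumption, since the merge is not visible in this excerpt), users with no transaction history get NaNs after the aggregated frame has already been filled. The names `users` and `history` are hypothetical.

```python
import pandas as pd

# Hypothetical frames: user_1 has no transaction history.
users = pd.DataFrame({"user_id": ["user_0", "user_1"]})
history = pd.DataFrame(
    {"user_id": ["user_0"], "days_since_first_transaction": [9.0]}
).fillna(0)  # filling here cannot help: there is no row for user_1 yet

# The left merge introduces a NaN for the user that is missing from history.
final_df = users.merge(history, on="user_id", how="left")
nans_after_merge = int(final_df["days_since_first_transaction"].isna().sum())

# Filling after the merge removes it.
final_df = final_df.fillna(0)
nans_after_fill = int(final_df["days_since_first_transaction"].isna().sum())

print(nans_after_merge, nans_after_fill)
```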

.reset_index(drop=True)
.drop("created_x", axis=1)
)

Contributor

Suggested change
+ final_df = final_df.fillna(0)



if __name__ == "__main__":
    main()
Contributor

Great PR!
Just so I understand this correctly - do you think this should come in as its own data prep section, or that the parquet files this code produces should exist ahead of time and just be used during training/inference?

Development

Successfully merging this pull request may close these issues.

Enhance underlying example data to showcase more complicated features
3 participants