
positive item sampling and fix infinite loop #7

Draft
wants to merge 1 commit into
base: main
from

Conversation

isipalma
Collaborator

  • Add an iteration limit to the sampling loop to prevent an infinite loop
  • Change the sampling method so that the positive item is always in the profile

@isipalma isipalma marked this pull request as draft June 19, 2021 16:37
Collaborator

@aaossa aaossa left a comment

I have some questions about the code, and I caught a possible bug.

Comment on lines +137 to +144
"# Mark interactions used for evaluation procedure if needed\n",
"if \"evaluation\" not in interactions_df:\n",
" print(\"\\nApply evaluation split...\")\n",
" interactions_df = mark_evaluation_rows(interactions_df)\n",
" # Check if new column exists and has boolean dtype\n",
" assert interactions_df[\"evaluation\"].dtype.name == \"bool\"\n",
" print(f\">> Interactions: {interactions_df.shape}\")\n",
"\n",
Collaborator

I forgot, why was this needed here?

Collaborator

I just noticed: this code was not present in this repository, but it was in mine, right?

Collaborator Author

Correct.

@@ -202,22 +210,27 @@
"metadata": {},
"outputs": [],
"source": [
-"def random_triplet_sampling(samples_per_user, hashes_container, desc=None):\n",
+"def random_triplet_sampling(samples_per_user, hashes_container, desc=None, limit_iteration=10000):\n",
Collaborator

Why 10000? Maybe we could use the number of interactions as the limit, or a proportion of that number. If I have a million records and need to sample a large share of them, a proportion of len(interactions_df) (or interactions_df.size, not sure which one is better) would be more appropriate than a fixed number.
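
A rough sketch of what a data-dependent limit could look like. For a pandas DataFrame, `len(df)` counts rows while `df.size` counts rows times columns, so `len(interactions_df)` is the value that matches "number of interactions" when each row is one interaction. The helper name `proportional_limit`, the default proportion, and the floor value are all assumptions for illustration; none of them are in this PR.

```python
def proportional_limit(n_interactions, proportion=0.1, minimum=1000):
    """Hypothetical helper: derive the iteration limit from the data size."""
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")
    # Keep a floor so tiny datasets still get a reasonable number of attempts
    return max(minimum, int(n_interactions * proportion))


# With a DataFrame this would be called as proportional_limit(len(interactions_df))
print(proportional_limit(1_000_000, proportion=0.5))  # 500000
print(proportional_limit(500))                        # 1000 (the floor applies)
```

The result could then be passed as `limit_iteration` instead of relying on the fixed default.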

Comment on lines +221 to +224
" aux_limit = limit_iteration\n",
" while n > 0:\n",
" if aux_limit == 0:\n",
" break\n",
Collaborator

aux_limit never changes its value. On line 247 we should use aux_limit instead of limit_iteration; that may be a fix.
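
For reference, a minimal sketch of the bounded-loop pattern with the counter actually decremented. `sample_pairs` is an illustrative stand-in, not the notebook's `random_triplet_sampling`, and the pair-sampling body is an assumption chosen only to make the loop runnable.

```python
import random


def sample_pairs(items, n, limit_iteration=10000, seed=None):
    """Illustrative stand-in for the sampling loop, not the notebook's function."""
    rng = random.Random(seed)
    samples = set()
    aux_limit = limit_iteration
    while n > 0:
        if aux_limit == 0:
            break  # give up instead of looping forever
        aux_limit -= 1  # the budget must shrink on every attempt
        pair = tuple(rng.sample(items, 2))
        if pair in samples:
            continue  # duplicate: costs an attempt but yields no sample
        samples.add(pair)
        n -= 1
    return samples


# Only 6 ordered pairs exist for 3 items, so asking for 10 can never succeed;
# the loop still terminates because aux_limit reaches zero.
result = sample_pairs([1, 2, 3], n=10, limit_iteration=500, seed=0)
print(len(result))
```

Note that when the budget runs out, the function returns fewer samples than requested, so callers need to handle a short result.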

Comment on lines -258 to -260
"assert len(samples_training) >= TOTAL_SAMPLES_TRAIN\n",
"assert len(samples_testing) >= TOTAL_SAMPLES_VALID\n",
"\n",
Collaborator

Why was this removed?

2 participants