Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to passthrough num_seqs in PostprocessingDataset #1677

Merged
merged 10 commits into from
Jan 19, 2025

Conversation

dorian-K
Copy link
Contributor

I'm open to adjusting the name of the variable

My use case for this is to get Laplace Ordering on a Dataset, so map_seq_stream wouldn't drop any sequences.

@dorian-K dorian-K requested review from albertz, NeoLegends and a team as code owners January 17, 2025 15:28
Copy link
Member

@NeoLegends NeoLegends left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, I implemented the exact same thing about a month ago. Convergent evolution at work!

I don't think that this is a very good idea though, because it invites buggy behavior if your mapper function doesn't actually cohere to the guarantees you made by specifying iterator_preserves_sequences. We could catch this using an assert once the underlying dataset has ran out of seqs (comparing actual vs. forwarded num_seqs), but in case there is a bug this assertion would trigger only at the very end of a subepoch, which would be quite annoying.

If you simply want to do laplace ordering there is the LaplaceOrdering combinator in this very module that you can compose with your postprocessor function using Sequential (also in this module) or by doing another PostprocessingDataset-wrap. It uses a small buffer to do the ordering instead. It's efficient enough though as long as the buffer size doesn't grow too large.

@albertz
Copy link
Member

albertz commented Jan 17, 2025

We could catch this using an assert once the underlying dataset has ran out of seqs (comparing actual vs. forwarded num_seqs), but in case there is a bug this assertion would trigger only at the very end of a subepoch, which would be quite annoying.

Yes, we should definitely have such a check.

I don't think it's too annoying. The user should anyway hopefully know whether the map_seq_stream has this property, and usually, the user would probably be correct about it. I expect this to be a quite rare case, that the user sets this, but it was wrong. And for this rare case, it's totally fine. On the other side, this is a very simple solution here, and actually even better than the passthrough of complete_frac (epoch_continuous) (#1673), because the complete_frac is more incorrect.

albertz

This comment was marked as resolved.

@albertz

This comment was marked as resolved.

@dorian-K dorian-K requested a review from albertz January 17, 2025 19:07
@albertz

This comment was marked as resolved.

@albertz

This comment was marked as resolved.

@dorian-K
Copy link
Contributor Author

I have also changed the message to show the actual value of _map_seq_stream_preserves_num_seqs, which may not necessarily be True if someone decided to mess with _num_seqs without setting the attribute

@albertz albertz merged commit d4343d7 into master Jan 19, 2025
62 checks passed
@albertz albertz deleted the doriank_postprocessing_passthrough_numseqs branch January 19, 2025 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants