Dedup module #83

Open, wants to merge 2 commits into base: master
Conversation

Ronaldho80

I implemented a "dedup" module to remove duplicated rows from a CSV.

It first sorts the CSV, then compares neighboring rows and keeps only those that are not duplicates.

It can be used with the select option to remove rows only if the selected columns are duplicated.

With the --no-case option, it ignores case when comparing the strings in the CSV.
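
For readers who want to see the idea in code, here is a minimal sketch of the sort-then-compare-neighbors approach described above. It is not the code from this PR: the names `row_key` and `dedup_sorted` are hypothetical, rows are assumed to be already parsed into `Vec<String>`, the selected columns are assumed to be given as zero-based indices, and keeping the first row of each run of duplicates is an assumption as well.

```rust
/// Build the comparison key for a row: the selected columns,
/// lowercased when `no_case` is set. Missing columns count as "".
/// (Hypothetical helper, not taken from the PR.)
fn row_key(row: &[String], select: &[usize], no_case: bool) -> Vec<String> {
    select
        .iter()
        .map(|&i| {
            let field = row.get(i).map(String::as_str).unwrap_or("");
            if no_case {
                field.to_lowercase()
            } else {
                field.to_string()
            }
        })
        .collect()
}

/// Sort the rows by key, then drop every row whose key equals the key
/// of the row kept just before it. The sort is stable, so the first
/// row of each group of duplicates (in input order) is the one kept.
fn dedup_sorted(
    mut rows: Vec<Vec<String>>,
    select: &[usize],
    no_case: bool,
) -> Vec<Vec<String>> {
    rows.sort_by_cached_key(|r| row_key(r, select, no_case));
    rows.dedup_by(|a, b| {
        row_key(a.as_slice(), select, no_case) == row_key(b.as_slice(), select, no_case)
    });
    rows
}
```

Because the whole input has to be sorted first, this approach needs all rows in memory and returns them in sorted order rather than input order.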

@Techcable

Is it significantly slower to use a seahash HashSet in order to support streaming?
If so, by how much?
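
To make the question concrete, a streaming variant could look roughly like the sketch below. It is only an illustration under stated assumptions: `dedup_streaming` is a hypothetical name, rows are assumed to arrive as an iterator of already parsed `Vec<String>` values, and the standard library's default hasher is used in place of seahash so that the example needs no external crate. It says nothing about relative speed, which would have to be measured.

```rust
use std::collections::HashSet;

/// Yield each row whose key has not been seen before, in input order.
/// The key is built from the selected (zero-based) columns, lowercased
/// when `no_case` is set. (Hypothetical function, not from the PR.)
fn dedup_streaming<I>(
    rows: I,
    select: Vec<usize>,
    no_case: bool,
) -> impl Iterator<Item = Vec<String>>
where
    I: Iterator<Item = Vec<String>>,
{
    // Uses the default std hasher; a different hasher is an open choice.
    let mut seen: HashSet<Vec<String>> = HashSet::new();
    rows.filter(move |row| {
        let key: Vec<String> = select
            .iter()
            .map(|&i| {
                let field = row.get(i).map(String::as_str).unwrap_or("");
                if no_case {
                    field.to_lowercase()
                } else {
                    field.to_string()
                }
            })
            .collect();
        // `insert` returns false when the key was already present.
        seen.insert(key)
    })
}
```

The trade-off is that rows can be emitted as they are read and keep their input order, while memory use grows with the number of distinct keys instead of with the total number of rows.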

@Ronaldho80
Author

The dedup function I implemented is a fairly simple solution: it sorts the entries and removes duplicate neighbors. I don't know how this compares to a hashing approach, in particular one using seahash.

But it sounds interesting. Why don't you try implementing it and measure the difference?

3 participants