Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch from 'cut' to 'head'; cut isn't doing what it should here #20

Merged
merged 1 commit into from
Jan 22, 2025

Conversation

compilingEntropy
Copy link
Contributor

Based on the comment on line 66-67, the cut command is supposed to just grab the first 1024 bytes of the file and pass that to awk. Instead, the cut does... something else. I didn't bother figuring out what, but here's some tests that paint the picture:

$ time cut -b -1024 ./Downloads/World\ War\ Hulk.pdf | wc -c
 403378749
cut -b -1024 ./Downloads/World\ War\ Hulk.pdf  32.49s user 0.28s system 99% cpu 32.899 total
wc -c  3.28s user 0.19s system 10% cpu 32.898 total
$ time cat ./Downloads/World\ War\ Hulk.pdf | wc -c
 421636877
cat ./Downloads/World\ War\ Hulk.pdf  0.02s user 0.21s system 6% cpu 3.516 total
wc -c  3.44s user 0.07s system 99% cpu 3.515 total
$ time head -c 1024 ./Downloads/World\ War\ Hulk.pdf | wc -c
    1024
head -c 1024 ./Downloads/World\ War\ Hulk.pdf  0.00s user 0.00s system 71% cpu 0.004 total
wc -c  0.00s user 0.00s system 67% cpu 0.004 total
  • The cut command chops out some of the file, sure, but not the intended portion (>95% of the file remains) and for my (large) file, it takes 33 seconds to do so.
  • A simple cat in the second command is 10x faster than the cut, completing in 3.5 seconds and containing the entire file.
  • If you use head instead, you get the intended first 1024 bytes and the command completes effectively instantly.

This commit moves away from cut to the head command, which accomplishes the intended file slice and improves performance (particularly on large files).

@aklomp
Copy link
Owner

aklomp commented Jan 19, 2025

Yikes, nice catch. According to cut's man page, the -b switch should select byte mode and the -1024 should select all bytes from the 0'th to the 1024'th. But as you pointed out, and as I was able to confirm locally, it appears to do no such thing.

I agree that replacing that command with head is the way to go here. It even fits the intent of the code better. Will merge this.

The question I have for you is if you'd like a mention in the copyright header at the top of the file, also in light of your other merge request? I can add one in the rebase.

@compilingEntropy
Copy link
Contributor Author

I don't really know why the cut isn't working either, I agree it should work based on the the man page and I played around with it for a minute but couldn't get it working.

I don't mind being left out of the copyright header; I waive all rights to the code contained in this commit.

@aklomp aklomp merged commit 92a92f2 into aklomp:master Jan 22, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants