-
Notifications
You must be signed in to change notification settings - Fork 930
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OSSFuzz Integration #949
OSSFuzz Integration #949
Conversation
Updated build Updated build Updated build
@pietermarsman Hey Pieter, pinging for visibility. Looking forward to getting this integrated and uncovering bugs! |
@pietermarsman Following up on this, I spent a good amount of time on this and would love to see it integrated! |
@goulu @jstockwin @pudo @tataganesh @pietermarsman I hope all is well, I would really appreciate a review of this PR |
Hi @capuanob, Thanks for your time on this. This repo is maintained on a very slow pace. But it is maintained, so your work won't go to waste. I haven't ran the code yet, will do so later today. But it looks good. Great that you already were able to find and fix some vulnerabilities. I've two initial comments:
|
Thank you for your reply!
- The testing files are used, in the `build.sh` script that is used by
ClusterFuzz (which will be a subsequent PR to google/oss-fuzz), the
directory is zipped to a directory within a Docker container that the
fuzzing environment expects to find its seeds. I could update this to copy
from the *samples *directory at run-time, rather than hosting them twice
- For the projects that I have done, I've either had fuzzing be a
top-level directory or a sub-directory of testing. I can adjust to
whichever you prefer
- While it is possible, I would discourage having it be a local test for
a few reasons. Since fuzzing isn't entirely deterministic, you won't get
the same kind of consistency as you would from unit tests. Secondly, there
isn't a defined end-point for fuzzing so it'd be difficult to set an
arbitrary 'timeout' and be able to definitively say that something has been
sufficiently tested
…On Mon, Jun 24, 2024 at 2:46 AM Pieter Marsman ***@***.***> wrote:
Hi @capuanob <https://github.com/capuanob>,
Thanks for your time on this. This repo is maintained on a very slow pace.
But it is maintained, so your work won't go to waste.
I haven't ran the code yet, will do so later today. But it looks good.
Great that you already were able to find and fix some vulnerabilities.
I've two initial comments:
- It looks like you have copied the testing pdf's. Are these even
used? Can you also use their equivalents from the *samples* directory?
- Is it common to have *fuzzing* as a top-level directory? I guess it
has the same status as the tests and tools, so it seems like the right
place. But I'm always reluctant to add top-level files and directories.
- Is it also possible and useful to run the tests locally? In that
case we can/should perhaps add the commands to the noxfile.py.
—
Reply to this email directly, view it on GitHub
<#949 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AHXFHTNNDBGGWS37IID6K63ZI66D3AVCNFSM6AAAAABEPH7OO6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOBVG4ZTKMRWGQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Yes, that is preferable. Maybe use a glob to specify them.
I guess I prefer to have it as a top-level directory. Since the tests directory is really the "pytest directory". So no change required here.
Ok, good to know. I've ran out of time for today, so will continue on this next Monday. I noticed that my understanding of fuzzing is very minimal, do you have any good resources that I can use to improve my understanding? Edit: |
@pietermarsman I've removed the corpus files from this commit. Now, instead, the Dockerfile that I will submit to Google's OSS-Fuzz project after this PR is integrated (you can see this Dockerfile here if you are interested) will glob the simple*.pdf files into a corpus. You've got the right idea on how it makes mutations to the input, since it typically requires gazillions of mutations. To give a bit more context, the With time, the fuzzing corpus evolves to be more and more robust. ClusterFuzz also provides great insights on current fuzz-blockers that can be overcome with future PRs to improve the fuzzers. Another thing I'd be interested in exploring in the future is grammar-based fuzzing. I've never implemented one myself, but am aware of the technique. You can use a grammar (say, the grammar for a PDF) to guide smarter mutations. Happy to provide more context if desired! |
@pietermarsman I just saw your question about resources on fuzz-testing. The LibFuzzer docs are great, I'd also suggest: https://google.github.io/oss-fuzz/ (More details on this specific program and ClusterFuzz - which you will get access to) The researchers at Trail of Bits have a good guide on fuzzing as well. https://appsec.guide/docs/fuzzing/python/ The Python section isn't fully built-out, but atheris also uses LibFuzzer (which is commonly used for C/C++ fuzzing). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Top! It is looking great!
Code is running smooth on my machine and I've already found a couple of more bugs. So I'm curious about the results from ClusterFuzz.
I've scattered the MR with a bunch of micro-management comments to get it into the same shape as the rest of the code base. Most importantly is adding the fuzzing
directory to the noxfile.py
testing dirs so that all the code quality checks run on it.
I can do a couple of small commits to help if you're ok with that.
Good morning,
Feel free to contribute any commits you would like, I want to make sure
this contribution integrates well into the larger project!
I will contribute the rest by the end of this weekend.
…On Thu, Jun 27, 2024 at 2:37 AM Pieter Marsman ***@***.***> wrote:
***@***.**** requested changes on this pull request.
Top! It is looking great!
Code is running smooth on my machine and I've already found a couple of
more bugs. So I'm curious about the results from ClusterFuzz.
I've scattered the MR with a bunch of micro-management comments to get it
into the same shape as the rest of the code base. Most importantly is
adding the fuzzing directory to the noxfile.py testing dirs so that all
the code quality checks run on it.
I can do a couple of small commits to help if you're ok with that.
------------------------------
In CHANGELOG.md
<#949 (comment)>
:
> @@ -8,6 +8,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
### Added
- Support for zipped jpeg's ([#938](#938))
+- Added fuzzing harnesses for integration into Google's OSS-Fuzz
⬇️ Suggested change
-- Added fuzzing harnesses for integration into Google's OSS-Fuzz
+- Fuzzing harnesses for integration into Google's OSS-Fuzz ([949](#949))
------------------------------
In fuzzing/extract_text_fuzzer.py
<#949 (comment)>
:
> @@ -0,0 +1,45 @@
+import sys
+
+import atheris
+
+from fuzz_helpers import EnhancedFuzzedDataProvider
+
+with atheris.instrument_imports():
+ from pdf_utils import PDFValidator, prepare_pdfminer_fuzzing
+ from pdfminer.high_level import extract_text
+
+from pdfminer.psparser import PSException
+
+
+def TestOneInput(data: bytes):
What do you think of using snake_case for function names?
I know the atheris docs use CamelCase for Python code, but the rest of
this repo uses snake_case.
------------------------------
In fuzzing/extract_text_fuzzer.py
<#949 (comment)>
:
> +
+
+def TestOneInput(data: bytes):
+ if not PDFValidator.is_valid_byte_stream(data):
+ # Not worth continuing with this test case
+ return -1
+
+ fdp = EnhancedFuzzedDataProvider(data)
+
+ try:
+ with fdp.ConsumeMemoryFile() as f:
+ max_pages = fdp.ConsumeIntInRange(0, 1000)
+ extract_text(
+ f,
+ maxpages=max_pages,
+ page_numbers=fdp.ConsumeIntList(fdp.ConsumeIntInRange(0, max_pages), 2),
The page_numbers can also be None. I'm wondering if we can test that as
well.
Also in the other fuzzers.
------------------------------
In fuzzing/extract_text_fuzzer.py
<#949 (comment)>
:
> +
+def TestOneInput(data: bytes):
+ if not PDFValidator.is_valid_byte_stream(data):
+ # Not worth continuing with this test case
+ return -1
+
+ fdp = EnhancedFuzzedDataProvider(data)
+
+ try:
+ with fdp.ConsumeMemoryFile() as f:
+ max_pages = fdp.ConsumeIntInRange(0, 1000)
+ extract_text(
+ f,
+ maxpages=max_pages,
+ page_numbers=fdp.ConsumeIntList(fdp.ConsumeIntInRange(0, max_pages), 2),
+ laparams=PDFValidator.generate_layout_parameters(fdp)
laparams can also be None.
Also in the other fuzzers.
------------------------------
In fuzzing/extract_text_fuzzer.py
<#949 (comment)>
:
> @@ -0,0 +1,45 @@
+import sys
+
+import atheris
+
+from fuzz_helpers import EnhancedFuzzedDataProvider
Would it be possible to start all imports from the root of the project?
Usually that makes it easier to get the imports working.
So that would require from fuzzing.fuzz_helpers import ... here.
Add a __init__.py file to the fuzzing dir should get it to work.
------------------------------
In fuzzing/extract_text_fuzzer.py
<#949 (comment)>
:
> @@ -0,0 +1,45 @@
+import sys
+
+import atheris
+
+from fuzz_helpers import EnhancedFuzzedDataProvider
+
+with atheris.instrument_imports():
+ from pdf_utils import PDFValidator, prepare_pdfminer_fuzzing
+ from pdfminer.high_level import extract_text
+
+from pdfminer.psparser import PSException
+
+
+def TestOneInput(data: bytes):
The rest of pdfminer is type checked with mypy. That would error on the
missing return type here. What do you think of adding the fuzzing
directory to the noxfile.py?
That would also enable black formatting and linting.
------------------------------
In pdfminer/pdfdocument.py
<#949 (comment)>
:
> @@ -977,7 +977,7 @@ def find_xref(self, parser: PDFParser) -> int:
else:
raise PDFNoValidXRef("Unexpected EOF")
log.debug("xref found: pos=%r", prev)
- assert prev is not None
+ assert prev is not None and prev.isdigit()
I'm assuming this is already the first bug that you find by using fuzzing.
While running the code myself I found a couple of others.
What do you think of separating fixing errors into separate MR's. I'm
hoping that we get some great statistics from ossfuzz on how many errors we
fixed this way.
—
Reply to this email directly, view it on GitHub
<#949 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AHXFHTLUGS64V4QMKDVZPLDZJOXJPAVCNFSM6AAAAABEPH7OO6VHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDCNBUGM2DINRXHE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
…ey are test cases
… except can catch all
I've fixed all my own comments and think this is now ready. One big change I did was to subclass all exceptions that pdfminer raises from If you confirm that the current code still works with ClusterFuzz I'll merge it. |
Awesome, thank you so much for helping out with this effort!
Also definitely a great addition to have a base class for raised
exceptions, I tend to look out for that to make the catch block clearer.
Looks good to me, whenever you merge I’ll contribute to Google’s upstream.
Thank You, Bailey Capuano
…On Thu, Jun 27, 2024 at 4:18 PM Pieter Marsman ***@***.***> wrote:
I've fixed all my own comments and think this is now ready.
One big change I did was to subclass all exceptions that pdfminer raises
from PSException. Such that the exception handling is now a bit easier.
Encapsulating this is also good for the package.
—
Reply to this email directly, view it on GitHub
<#949 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AHXFHTPGV66AMQ4G3VQQD43ZJRXQ5AVCNFSM6AAAAABEPH7OO6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJVGYYDANBZGY>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@pietermarsman Can confirm that all builds succeed with ClusterFuzz |
Done. Thanks for everything! 👏 |
This pull requests integrates the Dockerfile needed to build the fuzzers for pdfminer.six, as merged into upstream in this [PR](pdfminer/pdfminer.six#949).
Hi Pieter, I've got this on my docket to look into. I may have some time
this weekend, but I am in the middle of a move so my time is limited. If
you happen to find anything before then, please do let me know!
Best,
Bailey+
…On Wed, Jul 3, 2024 at 11:33 AM Pieter Marsman ***@***.***> wrote:
Hi @capuanob <https://github.com/capuanob> ,
I've checked out OSS-Fuzz <https://oss-fuzz.com/> and monorail
<https://bugs.chromium.org/p/oss-fuzz/issues/list?q=&can=2> but could not
find any useful output yet. The build seems to succeed. But there are no
issues opened yet. And the coverage suggests that nothing gets past the
is_valid_byte_stream check. Maybe the corpus is not loaded correctly.
—
Reply to this email directly, view it on GitHub
<#949 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AHXFHTL74RVNEQBFYNXOHBDZKQKVZAVCNFSM6AAAAABEPH7OO6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBWGU3TONJTGA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Take your time. Good luck with the move! Some thoughts:
|
Good morning,
From a quick review, I think you’re right- it looks like I just copied the
corpus without zipping it to the $OUT directory with the proper name. I’ll
fix that in a subsequent PR to pdfminer. I am undergoing the rewards
process as well. As for the GitHub issues, I will integrate those!
Thank You, Bailey Capuano
…On Sat, Jul 6, 2024 at 12:11 PM Pieter Marsman ***@***.***> wrote:
@capuanob <https://github.com/capuanob>
Take your time. Good luck with the move!
Some thoughts:
1. Not sure what changed, but somethings seems to be working now.
There are 3 issues
<https://oss-fuzz.com/testcases?project=pdfminersix&open=yes> now.
2. But the coverage
<https://storage.googleapis.com/oss-fuzz-introspector/pdfminersix/inspector-report/20240706/fuzz_report.html#High-level-conclusions>
is still very low.
3. I noticed here
<https://github.com/google/oss-fuzz/blob/master/projects/pdfminersix/Dockerfile>
the corpus is copied to $SRC/corpus but the working dir is
$SRC/pdfminer.six. I'm not sure if that is correct. Seems like the
corpus is outside the workdir.
4. I noticed there are integration rewards
<https://google.github.io/oss-fuzz/getting-started/integration-rewards/>.
Are you applying for those?
5. I noticed there is the option to [file GitHub issues] as well. Can
you set that up for pdfminer.six?
—
Reply to this email directly, view it on GitHub
<#949 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AHXFHTPW7NUUSIYMPYJTJVDZLAJKXAVCNFSM6AAAAABEPH7OO6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJRHAYDQNZSGM>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@pietermarsman Put in a PR with Google to add GitHub issues, feel free to follow here |
Thanks for integrating the GitHub issues. And 🤞 your latest PR fixes the coverage. As for the reward integration rewards, am I eligible as well? A reward for working on OSS, that almost seams to good to be true 😃 |
I’m unsure, I’ve only done this side of the house. You’d probably have to
reach out to OSSFuzz about the maintainers side of things!
Thank You, Bailey Capuano
…On Mon, Jul 8, 2024 at 2:00 AM Pieter Marsman ***@***.***> wrote:
Thanks for integrating the GitHub issues. And 🤞 your latest PR fixes the
coverage. As for the reward integration rewards, am I eligible as well? A
reward for working on OSS, that almost seams to good to be true 😃
—
Reply to this email directly, view it on GitHub
<#949 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AHXFHTMOSUUEUO5MPMMPF3LZLITGBAVCNFSM6AAAAABEPH7OO6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJTGA4TCOBZGQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
This feature was requested by the project maintainer, as seen [here](pdfminer/pdfminer.six#949 (comment))
Pull request
This Pull-Request includes the necessary changes to integrate fuzzing into pdfminer.six for OSS-Fuzz continuous fuzzing, as discussed in Issue 918.
In short, this PR adds atheris (the fuzzing framework) as a development dependency, and a new fuzzing directory containing a corpus, some initial harnesses, the necessary CI file to integrate the project into OSSFuzz, and a build script to be used by ClusterFuzz to prepare for nightly fuzzing.
In addition to the above, two simple bug-fixes are resolved to address crashes that were occurring too early into fuzzing, preventing progress.
How Has This Been Tested?
The fuzzing harnesses are tests in and of themselves, so they were tested via coverage analysis and allowing them to run.
NOTE: The CIFuzz.yml job will fail until Google merges the necessary pdfminer Dockerfile into their OSS-Fuzz repository. This can only be done after this PR is merged.
Checklist