OSSFuzz Integration #949

capuanob · 2024-03-10T20:44:21Z

Pull request

This Pull-Request includes the necessary changes to integrate fuzzing into pdfminer.six for OSS-Fuzz continuous fuzzing, as discussed in Issue 918.

In short, this PR adds atheris (the fuzzing framework) as a development dependency, and a new fuzzing directory containing a corpus, some initial harnesses, the necessary CI file to integrate the project into OSSFuzz, and a build script to be used by ClusterFuzz to prepare for nightly fuzzing.

In addition, two simple bugs are fixed to address crashes that occurred too early in fuzzing and prevented further progress.

How Has This Been Tested?

The fuzzing harnesses are tests in and of themselves, so they were tested via coverage analysis and allowing them to run.

NOTE: The CIFuzz.yml job will fail until Google merges the necessary pdfminer Dockerfile into their OSS-Fuzz repository. This can only be done after this PR is merged.

Checklist

I have read CONTRIBUTING.md.
I have added a concise human-readable description of the change to CHANGELOG.md.
I have tested that this fix is effective or that this feature works.
I have added docstrings to newly created methods and classes.
I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.


capuanob · 2024-03-15T00:29:28Z

@pietermarsman Hey Pieter, pinging for visibility. Looking forward to getting this integrated and uncovering bugs!

capuanob · 2024-05-19T18:28:32Z

@pietermarsman Following up on this, I spent a good amount of time on this and would love to see it integrated!

capuanob · 2024-06-10T00:36:52Z

@goulu @jstockwin @pudo @tataganesh @pietermarsman I hope all is well, I would really appreciate a review of this PR

pietermarsman · 2024-06-24T06:45:59Z

Hi @capuanob,

Thanks for your time on this. This repo is maintained at a very slow pace. But it is maintained, so your work won't go to waste.

I haven't run the code yet, will do so later today. But it looks good. Great that you already were able to find and fix some vulnerabilities.

I've three initial comments:

- It looks like you have copied the testing PDFs. Are these even used? Can you also use their equivalents from the samples directory?
- Is it common to have fuzzing as a top-level directory? I guess it has the same status as the tests and tools, so it seems like the right place. But I'm always reluctant to add top-level files and directories.
- Is it also possible and useful to run the tests locally? In that case we can/should perhaps add the commands to the noxfile.py.

capuanob · 2024-06-24T12:47:55Z

Thank you for your reply!

- The testing files are used: in the `build.sh` script that is used by ClusterFuzz (which will be a subsequent PR to google/oss-fuzz), the directory is zipped into a directory within a Docker container where the fuzzing environment expects to find its seeds. I could update this to copy from the *samples* directory at run-time, rather than hosting them twice.
- For the projects that I have done, I've either had fuzzing be a top-level directory or a sub-directory of testing. I can adjust to whichever you prefer.
- While it is possible, I would discourage having it be a local test for a few reasons. First, since fuzzing isn't entirely deterministic, you won't get the same kind of consistency as you would from unit tests. Second, there isn't a defined end-point for fuzzing, so it'd be difficult to set an arbitrary 'timeout' and be able to definitively say that something has been sufficiently tested.


pietermarsman · 2024-06-24T17:08:23Z

> I could update this to copy from the *samples* directory at run-time, rather than hosting them twice

Yes, that is preferable. Maybe use a glob to specify them.

> For the projects that I have done, I've either had fuzzing be a top-level directory or a sub-directory of testing. I can adjust to whichever you prefer.

I guess I prefer to have it as a top-level directory, since the tests directory is really the "pytest directory". So no change required here.

> While it is possible, I would discourage having it be a local test for a few reasons. Since fuzzing isn't entirely deterministic, you won't get the same kind of consistency as you would from unit tests. Secondly, there isn't a defined end-point for fuzzing so it'd be difficult to set an arbitrary 'timeout' and be able to definitively say that something has been sufficiently tested

Ok, good to know.

I've run out of time for today, so will continue on this next Monday. I noticed that my understanding of fuzzing is very minimal. Do you have any good resources that I can use to improve my understanding?

Edit:
I've found this tutorial which gave me a good understanding. I do not yet fully understand how the fuzzer can efficiently mutate the corpus pdf files to generate new valid ones though. But perhaps it just tries a gazillion times.

capuanob · 2024-06-27T01:37:20Z

@pietermarsman I've removed the corpus files from this commit.

Now, instead, the Dockerfile that I will submit to Google's OSS-Fuzz project after this PR is integrated (you can see this Dockerfile here if you are interested) will glob the simple*.pdf files into a corpus.

You've got the right idea on how it makes mutations to the input, since it typically requires gazillions of mutations. To give a bit more context, the atheris Python library will "instrument" the library with extra instructions in each code block (i.e., any branch) at runtime. The fuzzing framework strives to achieve depth and breadth in its coverage, and it analyzes which code blocks the current inputs reach to determine what kind of intelligent changes could be made to get deeper into the parsing code.

With time, the fuzzing corpus evolves to be more and more robust. ClusterFuzz also provides great insights on current fuzz-blockers that can be overcome with future PRs to improve the fuzzers.

Another thing I'd be interested in exploring in the future is grammar-based fuzzing. I've never implemented one myself, but am aware of the technique. You can use a grammar (say, the grammar for a PDF) to guide smarter mutations.

Happy to provide more context if desired!
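As a rough illustration of the coverage-guided loop described above, here is a toy sketch in plain Python. It is not how atheris or LibFuzzer are implemented: `toy_parser`, `mutate`, and `fuzz` are hypothetical names, and the "instrumentation" is faked by having the target report which branches it took. The point is only that a mutated input is kept in the corpus when it reaches a branch no earlier input reached.

```python
import random
from typing import List, Set, Tuple


def toy_parser(data: bytes) -> Set[str]:
    """Hypothetical target: returns the labels of the branches it executed."""
    covered = {"entry"}
    if data[:1] == b"%":
        covered.add("percent")
        if data[1:2] == b"P":
            covered.add("percent_P")
            if data[2:3] == b"D":
                covered.add("percent_PD")
    return covered


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte, or append one random byte."""
    buf = bytearray(data)
    pos = rng.randrange(len(buf) + 1)
    if pos == len(buf):
        buf.append(rng.randrange(256))
    else:
        buf[pos] = rng.randrange(256)
    return bytes(buf)


def fuzz(seed: bytes, rounds: int, rng: random.Random) -> Tuple[List[bytes], Set[str]]:
    """Keep every mutated input that reaches at least one new branch."""
    corpus = [seed]
    seen = toy_parser(seed)
    for _ in range(rounds):
        candidate = mutate(rng.choice(corpus), rng)
        covered = toy_parser(candidate)
        if not covered <= seen:  # candidate hit a branch not seen before
            corpus.append(candidate)
            seen |= covered
    return corpus, seen
```

Because each kept input is itself mutated later, progress made on the first byte is preserved while the second byte is explored, which is why the search is much cheaper than guessing all bytes at once.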

capuanob · 2024-06-27T01:42:32Z

@pietermarsman I just saw your question about resources on fuzz-testing. The LibFuzzer docs are great, I'd also suggest:

https://google.github.io/oss-fuzz/ (More details on this specific program and ClusterFuzz - which you will get access to)

The researchers at Trail of Bits have a good guide on fuzzing as well. https://appsec.guide/docs/fuzzing/python/

The Python section isn't fully built-out, but atheris also uses LibFuzzer (which is commonly used for C/C++ fuzzing).

pietermarsman

Top! It is looking great!

Code is running smoothly on my machine and I've already found a couple more bugs. So I'm curious about the results from ClusterFuzz.

I've scattered the MR with a bunch of micro-management comments to get it into the same shape as the rest of the code base. Most important is adding the fuzzing directory to the noxfile.py testing dirs so that all the code quality checks run on it.

I can do a couple of small commits to help if you're ok with that.

CHANGELOG.md

fuzzing/extract_text_fuzzer.py

pdfminer/pdfdocument.py

capuanob · 2024-06-27T13:49:53Z

Good morning, feel free to contribute any commits you would like. I want to make sure this contribution integrates well into the larger project! I will contribute the rest by the end of this weekend.

On Thu, Jun 27, 2024 at 2:37 AM Pieter Marsman wrote (review requesting changes on this pull request):

In CHANGELOG.md:

> @@ -8,6 +8,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
>  ### Added
>  - Support for zipped jpeg's ([#938](#938))
> +- Added fuzzing harnesses for integration into Google's OSS-Fuzz

Suggested change:

> -- Added fuzzing harnesses for integration into Google's OSS-Fuzz
> +- Fuzzing harnesses for integration into Google's OSS-Fuzz ([949](#949))

In fuzzing/extract_text_fuzzer.py:

> @@ -0,0 +1,45 @@
> +import sys
> +
> +import atheris
> +
> +from fuzz_helpers import EnhancedFuzzedDataProvider
> +
> +with atheris.instrument_imports():
> +    from pdf_utils import PDFValidator, prepare_pdfminer_fuzzing
> +    from pdfminer.high_level import extract_text
> +
> +from pdfminer.psparser import PSException
> +
> +
> +def TestOneInput(data: bytes):

What do you think of using snake_case for function names? I know the atheris docs use CamelCase for Python code, but the rest of this repo uses snake_case.

In fuzzing/extract_text_fuzzer.py:

> +
> +
> +def TestOneInput(data: bytes):
> +    if not PDFValidator.is_valid_byte_stream(data):
> +        # Not worth continuing with this test case
> +        return -1
> +
> +    fdp = EnhancedFuzzedDataProvider(data)
> +
> +    try:
> +        with fdp.ConsumeMemoryFile() as f:
> +            max_pages = fdp.ConsumeIntInRange(0, 1000)
> +            extract_text(
> +                f,
> +                maxpages=max_pages,
> +                page_numbers=fdp.ConsumeIntList(fdp.ConsumeIntInRange(0, max_pages), 2),

The page_numbers can also be None. I'm wondering if we can test that as well. Also in the other fuzzers.

In fuzzing/extract_text_fuzzer.py:

> +
> +def TestOneInput(data: bytes):
> +    if not PDFValidator.is_valid_byte_stream(data):
> +        # Not worth continuing with this test case
> +        return -1
> +
> +    fdp = EnhancedFuzzedDataProvider(data)
> +
> +    try:
> +        with fdp.ConsumeMemoryFile() as f:
> +            max_pages = fdp.ConsumeIntInRange(0, 1000)
> +            extract_text(
> +                f,
> +                maxpages=max_pages,
> +                page_numbers=fdp.ConsumeIntList(fdp.ConsumeIntInRange(0, max_pages), 2),
> +                laparams=PDFValidator.generate_layout_parameters(fdp)

laparams can also be None. Also in the other fuzzers.

In fuzzing/extract_text_fuzzer.py:

> @@ -0,0 +1,45 @@
> +import sys
> +
> +import atheris
> +
> +from fuzz_helpers import EnhancedFuzzedDataProvider

Would it be possible to start all imports from the root of the project? Usually that makes it easier to get the imports working. So that would require `from fuzzing.fuzz_helpers import ...` here. Adding an `__init__.py` file to the fuzzing dir should get it to work.

In fuzzing/extract_text_fuzzer.py:

> @@ -0,0 +1,45 @@
> +import sys
> +
> +import atheris
> +
> +from fuzz_helpers import EnhancedFuzzedDataProvider
> +
> +with atheris.instrument_imports():
> +    from pdf_utils import PDFValidator, prepare_pdfminer_fuzzing
> +    from pdfminer.high_level import extract_text
> +
> +from pdfminer.psparser import PSException
> +
> +
> +def TestOneInput(data: bytes):

The rest of pdfminer is type checked with mypy. That would error on the missing return type here. What do you think of adding the fuzzing directory to the noxfile.py? That would also enable black formatting and linting.

In pdfminer/pdfdocument.py:

> @@ -977,7 +977,7 @@ def find_xref(self, parser: PDFParser) -> int:
>              else:
>                  raise PDFNoValidXRef("Unexpected EOF")
>          log.debug("xref found: pos=%r", prev)
> -        assert prev is not None
> +        assert prev is not None and prev.isdigit()

I'm assuming this is already the first bug that you found by using fuzzing. While running the code myself I found a couple of others. What do you think of separating fixing errors into separate MRs? I'm hoping that we get some great statistics from ossfuzz on how many errors we fixed this way.
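One way to act on the review remarks that `page_numbers` and `laparams` can also be `None` is to spend one bit of fuzz input on the choice between `None` and a generated value. The sketch below is an assumption about how that could look, not the code that was merged: `ByteCursor` is a hypothetical stdlib-only stand-in for a fuzzed-data provider, and `optional_page_numbers` is an invented helper name.

```python
from typing import List, Optional


class ByteCursor:
    """Hypothetical stand-in for a fuzzed-data provider: hands out bytes in order."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self._pos = 0

    def consume_byte(self) -> int:
        """Return the next byte, or 0 once the input is exhausted."""
        if self._pos >= len(self._data):
            return 0
        value = self._data[self._pos]
        self._pos += 1
        return value

    def consume_bool(self) -> bool:
        """Interpret the lowest bit of the next byte as a boolean."""
        return self.consume_byte() % 2 == 1


def optional_page_numbers(cursor: ByteCursor, max_pages: int) -> Optional[List[int]]:
    """Let the fuzz input decide between None and a concrete list of pages."""
    if max_pages <= 0 or not cursor.consume_bool():
        return None
    count = cursor.consume_byte() % 3  # zero, one, or two page numbers
    return [cursor.consume_byte() % max_pages for _ in range(count)]
```

The same pattern would apply to `laparams`: consume a boolean first, and only build the layout parameters when it is true, so both code paths stay reachable from mutated inputs.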


pietermarsman · 2024-06-27T20:18:00Z

I've fixed all my own comments and think this is now ready.

One big change I made was to subclass all exceptions that pdfminer raises from PSException, so that the exception handling is now a bit easier. Encapsulating this is also good for the package.

If you confirm that the current code still works with ClusterFuzz I'll merge it.
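To make the benefit of a shared exception base class concrete, here is a small self-contained sketch. The local `PSException` is only a stand-in for the class named above, and the two subclasses, `hypothetical_parse`, and the return codes are invented for illustration; pdfminer.six's real hierarchy is not reproduced here.

```python
class PSException(Exception):
    """Stand-in for the shared base class named in the comment above."""


class HypotheticalSyntaxError(PSException):
    """Illustrative subclass; the name is invented for this sketch."""


class HypotheticalEncryptionError(PSException):
    """Illustrative subclass; the name is invented for this sketch."""


def hypothetical_parse(data: bytes) -> str:
    """Toy parser that raises library errors for bad input."""
    if not data.startswith(b"%PDF"):
        raise HypotheticalSyntaxError("missing header")
    if b"/Encrypt" in data:
        raise HypotheticalEncryptionError("encrypted input")
    if b"boom" in data:
        raise ValueError("unexpected failure")  # outside the hierarchy
    return "ok"


def fuzz_one_input(data: bytes) -> int:
    """Return 0 for parsed inputs and -1 for expected rejections."""
    try:
        hypothetical_parse(data)
    except PSException:
        # One except clause covers every error the library raises on purpose.
        return -1
    return 0
```

Anything that is not a `PSException` subclass, such as the `ValueError` here, still propagates out of the harness, which is what lets a fuzzer report it as a genuine crash.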

capuanob · 2024-06-28T00:47:29Z

Awesome, thank you so much for helping out with this effort! A base class for raised exceptions is definitely a great addition; I tend to look out for that to make the catch block clearer. Looks good to me. Whenever you merge, I’ll contribute to Google’s upstream. Thank You, Bailey Capuano


capuanob · 2024-06-28T11:51:57Z

@pietermarsman Can confirm that all builds succeed with ClusterFuzz

pietermarsman · 2024-06-28T15:32:23Z

Done. Thanks for everything! 👏

This pull request integrates the Dockerfile needed to build the fuzzers for pdfminer.six, as merged into upstream in this [PR](pdfminer/pdfminer.six#949).

pietermarsman · 2024-07-03T15:33:26Z

Hi @capuanob ,

I've checked out OSS-Fuzz (https://oss-fuzz.com/) and monorail (https://bugs.chromium.org/p/oss-fuzz/issues/list?q=&can=2) but could not find any useful output yet. The build seems to succeed. But there are no issues opened yet. And the coverage suggests that nothing gets past the is_valid_byte_stream check. Maybe the corpus is not loaded correctly.

capuanob · 2024-07-05T13:17:55Z

Hi Pieter, I've got this on my docket to look into. I may have some time this weekend, but I am in the middle of a move so my time is limited. If you happen to find anything before then, please do let me know! Best, Bailey


pietermarsman · 2024-07-06T16:11:01Z

@capuanob

Take your time. Good luck with the move!

Some thoughts:

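1. Not sure what changed, but something seems to be working now. There are 3 issues now (https://oss-fuzz.com/testcases?project=pdfminersix&open=yes).
2. But the coverage is still very low (https://storage.googleapis.com/oss-fuzz-introspector/pdfminersix/inspector-report/20240706/fuzz_report.html#High-level-conclusions).
3. I noticed here (https://github.com/google/oss-fuzz/blob/master/projects/pdfminersix/Dockerfile) the corpus is copied to $SRC/corpus but the working dir is $SRC/pdfminer.six. I'm not sure if that is correct. Seems like the corpus is outside the workdir.
4. I noticed there are integration rewards (https://google.github.io/oss-fuzz/getting-started/integration-rewards/). Are you applying for those?
5. I noticed there is the option to [file GitHub issues] as well. Can you set that up for pdfminer.six?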

capuanob · 2024-07-07T13:18:36Z

Good morning, from a quick review, I think you’re right: it looks like I just copied the corpus without zipping it to the $OUT directory with the proper name. I’ll fix that in a subsequent PR to pdfminer. I am undergoing the rewards process as well. As for the GitHub issues, I will integrate those! Thank You, Bailey Capuano
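For context on "zipping it to the $OUT directory with the proper name": OSS-Fuzz picks up a seed corpus when it is shipped as `<fuzz_target_name>_seed_corpus.zip` next to the fuzz target in `$OUT`. The sketch below shows that packaging step in Python rather than in the project's actual `build.sh`; the function name, the directory arguments, and the fuzzer name used in the usage note are assumptions for illustration, and the `simple*.pdf` glob is the one mentioned earlier in this thread.

```python
import zipfile
from pathlib import Path


def build_seed_corpus(samples_dir: Path, out_dir: Path, fuzzer_name: str,
                      pattern: str = "simple*.pdf") -> Path:
    """Zip files matching `pattern` into <out_dir>/<fuzzer_name>_seed_corpus.zip."""
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / f"{fuzzer_name}_seed_corpus.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for pdf in sorted(samples_dir.glob(pattern)):
            # Store files flat so the archive contains only the seed inputs.
            zf.write(pdf, arcname=pdf.name)
    return archive
```

A call such as `build_seed_corpus(Path("samples"), Path(os.environ["OUT"]), "extract_text_fuzzer")` would produce one archive per fuzz target; whether each harness should share the same seeds is a project decision this sketch does not settle.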


capuanob · 2024-07-08T01:21:12Z

@pietermarsman Put in a PR with Google to add GitHub issues, feel free to follow here

pietermarsman · 2024-07-08T05:59:38Z

Thanks for integrating the GitHub issues. And 🤞 your latest PR fixes the coverage. As for the integration rewards, am I eligible as well? A reward for working on OSS, that almost seems too good to be true 😃

capuanob · 2024-07-08T17:56:51Z

I’m unsure, I’ve only done this side of the house. You’d probably have to reach out to OSSFuzz about the maintainers side of things! Thank You, Bailey Capuano



capuanob added 8 commits March 5, 2024 11:37

Pushing progress

4f4ce5a

Initial fuzzing integration

aa73014

Two proposed simple bug-fixes that were preventing fuzzing from advancing

4063d56

Fixed build script

0f41ba6

Updated build Updated build Updated build

Added CIFuzz workflow

2df272e

Removed atheris from Python 3.12 dependencies

9dc4949

Removed atheris from Python 3.12 dependencies

ef620c3

Updated changelog

db33a38

Removed corpus files

8cbc4f9

pietermarsman requested changes Jun 27, 2024

View reviewed changes

pietermarsman added 11 commits June 27, 2024 18:17

Update CHANGELOG.md

a0b5e06

Added fuzzing directory to noxfile.py

ec70d62

Fix flake8

c095594

Fix mypy

a946a7e

Rename test_one_input to fuzz_one_input so PyCharm is not thinking they are test cases

a13bc78

Undo fixes, so we can monitor them

cd4d715

Also fuzz None values

9b55c8d

Use relative imports, just like pdfminer.six package

dc2d032

Fix imports

971f402

Simplify pdf_utils.py by removing class, just functions

067f94f

Subclassing all internal exceptions to PSException such that a single except can catch all

2c371db

pietermarsman enabled auto-merge June 28, 2024 15:24

pietermarsman disabled auto-merge June 28, 2024 15:24

Merge branch 'master' into master

0cb509d

pietermarsman approved these changes Jun 28, 2024

View reviewed changes

pietermarsman added this pull request to the merge queue Jun 28, 2024

Merged via the queue into pdfminer:master with commit ff359dc Jun 28, 2024
9 of 11 checks passed

capuanob mentioned this pull request Jun 28, 2024

PDFMiner.Six Initial Integration google/oss-fuzz#12139

Merged

This was referenced Jul 7, 2024

Feat/improve fuzz perf #975

Closed

Updated pdfminer yaml to file github issues for bugs google/oss-fuzz#12169

Merged

DavidKorczynski pushed a commit to google/oss-fuzz that referenced this pull request Jul 8, 2024

Updated pdfminer yaml to file github issues for bugs (#12169)

8e57283

This feature was requested by the project maintainer, as seen [here](pdfminer/pdfminer.six#949 (comment))
