In-memory database option #634
-
First of all, thank you for your detailed explanation and in-depth analysis @kamakazikamikaze! The results do look very promising, but I also agree with you that the complexity will increase. As for the implementation, I agree with you on all but one point, which is the extra Session arguments. The idea behind this is to not clutter […] Looking forward to your PR!
-
The idea
While waiting to get approval to submit a PR to close out #624, I've been working on another feature that allows the user to specify the use of an in-memory database with eventual persistence to disk.
Why?
A need for this arose when we found ourselves restricted to using a station with a slow HDD and a low-IOPS NAS. Since SQLite creates files to lock the database, calls over NFS/CIFS induced a lot of overhead, and the HDD exhibited high latency when we attempted to switch over to it. Tests slowed down considerably in both scenarios. Granted, this very specific situation is unlikely to be encountered by others, but even for SSDs a significant performance boost is possible if the API is not the choke point.
Overview of the implementation
I am unable to share the source code for now but the gist of the design is as follows:
boofuzz/sessions.py -> Session
- adds three new optional parameters:
  - `db_in_memory`: Boolean (default `False`) to indicate that an in-memory database is desired. Documentation notes that this is discouraged except in scenarios where IOPS for a local disk is a significant choke point.
  - `db_in_memory_flush_timer`: Int (default `180`) to indicate the number of seconds between flushes to disk. We will refer to this as 'X'.
  - `db_in_memory_check_timer`: Int (default `5`) to indicate the number of seconds the flushing thread should wait between checks for whether it is supposed to exit. We will refer to this as 'Y'.
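To illustrate the intended usage of the parameters above, here is a minimal sketch of how a fuzzing script might opt in. Note that these keyword arguments only exist in this proposal (nothing is merged yet); the `Target`/connection setup is just the usual Quickstart boilerplate:

```python
from boofuzz import Session, Target, TCPSocketConnection

# Hypothetical: these keyword arguments are part of the proposed change, not released boofuzz.
session = Session(
    target=Target(connection=TCPSocketConnection("127.0.0.1", 21)),
    db_in_memory=True,             # keep the run log in RAM instead of on disk
    db_in_memory_flush_timer=180,  # 'X': flush the in-memory DB to disk every 180 s
    db_in_memory_check_timer=5,    # 'Y': sync thread re-checks its stop flag every 5 s
)
```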
boofuzz/fuzz_logger_db.py -> FuzzLoggerDb
- adds the same three parameters (renamed) as above
- `Session` passes the parameters to `FuzzLoggerDb`
- `FuzzLoggerDb` creates the in-memory database and retains the target filename where the persistent database is saved
- `Session._main_fuzz_loop` makes a call to `self._db_logger` to begin the synchronization thread when fuzzing begins
- `FuzzLoggerDb` initializes a "token bucket" semaphore which can only be acquired when its token is replenished every X seconds. The thread enters a while-loop on condition of the termination flag being set to `False`.
  - uses `threading.Timer` to replenish the bucket
  - the `Timer` callback essentially replaces the `Timer` with a new instance
- `FuzzLoggerDb` tracks when the last disk flush took place and the time of the last written testcase. Flushing to disk is skipped in the loop if the latest flush occurred after the most recent testcase, to avoid unnecessary disk utilization.
- `FuzzLoggerDb` blocks while trying to acquire the semaphore, with a timeout of Y. If the timeout is reached, the loop starts back at the top.
- If the semaphore is acquired, a second in-memory database is created. `.backup()` releases the GIL, so this thread is suspended to allow testcases to continue running.
- `Session` will instruct `FuzzLoggerDb` to stop the sync thread when fuzzing completes (added to the `finally` statement for a graceful close)
- When fuzzing terminates, `FuzzLoggerDb` will stop the loop on the next `semaphore.acquire(...)` timeout and perform a final flush to disk
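Since the synchronization thread is the moving part here, a rough, self-contained sketch of the mechanism may help. This is not the actual patch: the class and attribute names (`InMemoryDbFlusher`, `_flush_loop`, `_stop_flag`, etc.) are made up for illustration, and for brevity it backs the live connection up straight to the on-disk file rather than through the intermediate in-memory copy mentioned above. Only the token-bucket timing, the "skip if nothing new" check, and the `sqlite3` `Connection.backup()` flush are meant to mirror the design:

```python
import sqlite3
import threading
import time


class InMemoryDbFlusher:
    """Illustrative stand-in for the proposed FuzzLoggerDb sync thread (not real boofuzz code)."""

    def __init__(self, db_filename, flush_timer=180, check_timer=5):
        self._db_filename = db_filename      # on-disk target for persistence
        self._flush_timer = flush_timer      # 'X': seconds between token replenishments
        self._check_timer = check_timer      # 'Y': acquire() timeout / stop-flag check interval
        self._connection = sqlite3.connect(":memory:", check_same_thread=False)
        self._token = threading.Semaphore(0)  # "token bucket", starts empty
        self._stop_flag = False
        self._last_flush = 0.0
        self._last_testcase = 0.0

    def log_testcase(self, *args):
        # The real logger writes testcase rows here; we only track the timestamp.
        self._last_testcase = time.time()

    def _replenish(self):
        # Timer callback: add a token, then replace the Timer with a new instance.
        self._token.release()
        self._timer = threading.Timer(self._flush_timer, self._replenish)
        self._timer.daemon = True
        self._timer.start()

    def _flush_loop(self):
        self._timer = threading.Timer(self._flush_timer, self._replenish)
        self._timer.daemon = True
        self._timer.start()
        while not self._stop_flag:
            # Block for at most 'Y' seconds so the stop flag is re-checked regularly.
            if not self._token.acquire(timeout=self._check_timer):
                continue
            # Skip the flush if nothing new was logged since the last one.
            if self._last_flush >= self._last_testcase:
                continue
            self._flush_to_disk()
        self._timer.cancel()
        self._flush_to_disk()  # final flush to disk on shutdown

    def _flush_to_disk(self):
        # sqlite3's backup() releases the GIL while copying, so fuzzing continues meanwhile.
        disk_db = sqlite3.connect(self._db_filename)
        try:
            self._connection.backup(disk_db)
        finally:
            disk_db.close()
        self._last_flush = time.time()

    def start(self):
        self._thread = threading.Thread(target=self._flush_loop, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop_flag = True
        self._thread.join()
```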
Performance
Using the same TCP echo server from #622 and the definitions for FTP from the Quickstart section of the documentation, I've compared ~20 minutes of runtime performance between on-disk and in-memory databases on Linux. This was done on an Ubuntu 22.04 VM with 4 i7-10810U cores, 8 GB RAM, and an SK Hynix PC711 SSD.
As you can see, there's nearly a 6x performance boost from using an in-memory database: we reach the same number of testcases in 3 minutes with an in-memory database as we do after almost 20 minutes with an on-disk database. There are certainly some oddities, though, with performance being rather volatile. Although I've attempted to minimize blocking of the fuzzing thread, some calls may be "stealing" a little too much time before releasing the GIL. Further testing and refactoring will be needed.
I should note that memory consumption grows linearly with the number of testcases, which is not ideal for long-running fuzzing sessions; RAM usage can easily surpass 1 GB after 10 minutes. I will look into minimizing the memory footprint by "rotating" the database like logfiles and combining the pieces at the end of the fuzzing session.
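To picture the rotation idea, here is a purely hypothetical sketch (not part of the current implementation): once the in-memory database grows past some threshold, it is backed up to a numbered partition file on disk and replaced with a fresh, empty in-memory database.

```python
import sqlite3

def rotate_partition(mem_db, base_filename, partition_index):
    """Flush the current in-memory DB to a numbered partition file and start a fresh one.

    Hypothetical helper; the names and the threshold policy are illustrative only.
    """
    partition_file = "{}.part{:04d}".format(base_filename, partition_index)
    disk_db = sqlite3.connect(partition_file)
    try:
        mem_db.backup(disk_db)  # persist everything logged so far
    finally:
        disk_db.close()
    mem_db.close()
    # Start over with an empty in-memory database for the next partition.
    return sqlite3.connect(":memory:", check_same_thread=False)
```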
Interest?
I hope to submit this as a separate PR once approval comes through. I'm unsure, though, whether the maintainers would see any value in it. While the increased rate of completed testcases may seem enticing, the additional threads mean more complexity to be aware of, and chunking out the database for the sake of reducing the memory footprint may make it unappealing to accept. If the maintainers aren't interested, I can still submit a PR for it to be rejected so others may reference/consider its use; otherwise I may put it into a gist.
Additional performance graphics
Average performance of in-memory database without partitioning
FTP with TCP echo server, 10 runs, 15 minutes, flush to disk every 60 seconds with checks every 15 seconds to terminate
Performance obviously suffers over time as the database grows to hundreds of thousands of testcases. This indicates that partitioning would be the better approach.
Average performance of in-memory database with partitioning to disk
FTP with TCP echo server, 10 runs, 15 minutes, flush to disk every 60 seconds with checks every 15 seconds to terminate
Performance is much more promising. It's still rather volatile with random dips to <100 testcases/sec.
Average performance of the various options, cumulative testcase view
Partitioning the database yields linear performance, whereas keeping it entirely in memory suffers as it grows. The drawback of partitioning is that stopping the session requires "zipping" all the partitions together, which essentially just defers our original issue to the end of the run. On an SSD, it took 4 minutes to combine all the partitions.
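For reference, the end-of-session merge I have in mind would look roughly like this sketch: attach each partition file to the final database and copy its rows over. The table names and partition naming scheme here are assumptions for illustration, and the final database is assumed to already contain the schema (e.g. created from the first partition):

```python
import glob
import sqlite3

def merge_partitions(final_filename, partition_glob, tables=("cases", "steps")):
    """Combine rotated partition files into a single results database.

    Hypothetical sketch: the real boofuzz schema/table names may differ.
    """
    final_db = sqlite3.connect(final_filename)
    try:
        for partition in sorted(glob.glob(partition_glob)):
            final_db.execute("ATTACH DATABASE ? AS part", (partition,))
            for table in tables:
                # Copy every row from the partition's table into the final table.
                final_db.execute("INSERT INTO {0} SELECT * FROM part.{0}".format(table))
            final_db.commit()
            final_db.execute("DETACH DATABASE part")
    finally:
        final_db.close()
```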
How does this translate to real-world performance?
I replaced the TCP echo server with the jtpereyda/boofuzz-ftp repo, using uFTP and `ProcessMonitorLocal` (and slight changes to `boofuzz/utils/debugger_thread_simple.py`). Four erroneous responses were caught at the start, which resulted in the initial delays that you see. We may want to make the wait time for process spawning more configurable, but as you can see, performance is still significantly better using an in-memory database than it would be on disk.