
Improve Pcode emulator performance and reduce memory fragmentation #5325

Open
wants to merge 1 commit into master

Conversation

mrexodia

Before this change I got ~150k instructions/second, after ~380k (2.5x).

The main bottleneck was the separate heap allocations for the VarnodeData and PcodeOpRaw objects, but there was also an unnecessary vector in PcodeOpRaw (since the maximum number of inputs is 3, this can be a fixed-size array).

This does introduce a potential issue with iterator invalidation. The data for VarnodeData and PcodeOpRaw is now stored directly in a vector, so you cannot cache the pointers or use them as unique identifiers. This could be solved by replacing the vector with a custom deque-like container, but I could not find any code that relies on pointer stability, and it looks like the getter methods are always used to access the pointers...
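To illustrate the trade-off, here is a minimal sketch (not the actual Ghidra code; Varnode, Op, OldCache and FlatCache are hypothetical stand-ins for the real types) of the difference between allocating each op separately and storing ops contiguously, and why the latter invalidates cached pointers:

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for VarnodeData / PcodeOpRaw.
struct Varnode { uint64_t offset; uint32_t size; };

struct Op {
    Varnode out;
    std::array<Varnode, 3> in;  // fixed array: an op has at most 3 inputs
    int numInputs = 0;
};

// Before: one heap allocation per op. Fragmentation and poor locality,
// but the returned pointers stay valid for the lifetime of the cache.
struct OldCache {
    std::vector<Op *> ops;
    Op *emit() { ops.push_back(new Op()); return ops.back(); }
};

// After: ops live directly in the vector, so everything sits in one
// contiguous block, but a push_back that reallocates storage invalidates
// every previously returned Op pointer.
struct FlatCache {
    std::vector<Op> ops;
    Op *emit() { ops.emplace_back(); return &ops.back(); }
    Op *get(std::size_t i) { return &ops[i]; }  // re-fetch through the getter
};

With the flat layout, callers have to re-fetch pointers through the getters (or keep indices) instead of caching an Op* across emissions, which is exactly the caveat described above.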

@astrelsky
Contributor

@mrexodia did you profile the decompiler and if so, how? I've always wanted to profile it to figure out why it is so slow sometimes but never really knew where to start.

@mrexodia
Author

@mrexodia did you profile the decompiler and if so, how? I've always wanted to profile it to figure out why it is so slow sometimes but never really knew where to start.

I didn't profile the decompiler; in fact, I didn't compile or run Ghidra with this change at all. I am using the sleigh project and use the EmulatePcodeCache class directly: mrexodia/sleigh@295a7ac#diff-dda293caa87bc61aebe36b487a9cfb17d4d601a7a02dfd7cc1ee84d248404b5fR433.

The shellcode is the following C++:

#include <cstdint>  // uint32_t
#include <cstddef>  // size_t

__forceinline void
xtea_encipher(uint32_t v[2], uint32_t const key[4])
{
	const unsigned int num_rounds = 64;
	unsigned int i;
	uint32_t v0 = v[0], v1 = v[1], sum = 0, delta = 0x9E3779B9;
	for (i = 0; i < num_rounds; i++)
	{
		v0 += (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + key[sum & 3]);
		sum += delta;
		v1 += (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + key[(sum >> 11) & 3]);
	}
	v[0] = v0; v[1] = v1;
}

__forceinline void
xtea_decipher(uint32_t v[2], uint32_t const key[4])
{
	const unsigned int num_rounds = 64;
	unsigned int i;
	uint32_t v0 = v[0], v1 = v[1], delta = 0x9E3779B9, sum = delta * num_rounds;
	for (i = 0; i < num_rounds; i++)
	{
		v1 -= (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + key[(sum >> 11) & 3]);
		sum -= delta;
		v0 -= (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + key[sum & 3]);
	}
	v[0] = v0; v[1] = v1;
}

uint32_t shellcode() {
	uint32_t block[2] = { 0x676E6962, 0x61657263 };
	uint32_t key[4] = { 0x1337, 0x69, 0x420, 0x666 };
	for (size_t i = 0; i < 10000000; i++)
	{
		xtea_encipher(block, key);
		xtea_decipher(block, key);
	}
	return block[0] ^ block[1];
}

Then I simply counted the number of instructions executed per second and used Intel VTune to profile and find the bottlenecks. You could likely build a similar setup, for example by decompiling extremely large functions or by decompiling the same few functions in a loop.
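The measurement itself is nothing more than a timing loop around executeInstruction(). A rough sketch of what I mean (it assumes an EmulatePcodeCache that is already wired up with a translator, memory state and break table as in the linked commit; the header name and the stop-address convention are assumptions, not something from this PR):

#include <chrono>
#include <cstdint>
#include <cstdio>

#include "emulate.hh"  // EmulatePcodeCache / Address from the sleigh sources (assumed include path)

// Runs the already-configured emulator until it reaches `stop` and
// reports how many instructions were executed per second.
static void benchmark(EmulatePcodeCache &emulator, const Address &stop)
{
    uint64_t executed = 0;
    auto begin = std::chrono::steady_clock::now();

    while (emulator.getExecuteAddress() != stop) {
        emulator.executeInstruction();  // lift (or reuse cached) p-code, then execute it
        ++executed;
    }

    double seconds = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - begin).count();
    std::printf("%.0f instructions/second\n", executed / seconds);
}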

My plan was to reuse Ghidra's processor definitions to write a (good) alternative to Unicorn, but unfortunately this would require a complete rewrite to reach any kind of meaningful speed (Unicorn does ~15M instructions/second with a trace callback for every instruction). The comparison isn't fair, since Unicorn JITs the instructions, but Ghidra's lifting to Pcode is still too slow...

@astrelsky
Contributor

I see, thank you for the explanation.

@XVilka
Contributor

XVilka commented Jun 2, 2023

I apologize for the intrusion, but since this doesn't exactly relate to Ghidra's main purpose, I think nobody will be offended by such a shameless plug.

@mrexodia since you don't need decompilation facilities for this kind of emulation, have you considered using RzIL instead (based on Core Theory, the Binary Analysis Platform's new IL)? We test/verify it against QEMU (and other emulators) using rz-tracetest.

Currently Rizin has implemented RzIL lifting for these platforms:

  • ARMv7 & AARCH64
  • x86 (most common instructions, not everything yet)
  • PowerPC
  • AVR
  • SuperH
  • 8051
  • 6502
  • Brainfuck

Plus there are two more that are currently work in progress.

On a side note, I don't think anything can beat the TCG used by QEMU, which is what Unicorn is based on.

@mrexodia
Author

mrexodia commented Jun 6, 2023

Thanks for the info @XVilka! I'm aware of ESIL, but it was my understanding that instruction support for RzIL was still limited. I'll take a look at the test frameworks though!

For my personal project I switched to using https://github.com/icicle-emu/icicle-emu, which seems to be able to match QEMU’s performance (with the added advantage of being maintainable and readable code).

@jobermayr
Contributor

This change leads to an error with the example from #5872 (comment):

[decomp]> decompile
Decompiling reallocInit
Low-level ERROR: Illegal p-code in executable snippet
Unable to proceed with function: reallocInit

Without the change:

[decomp]> decompile
Decompiling reallocInit
Decompilation complete
[decomp]> print C
...

Labels: Feature: Decompiler · Status: Triage (Information is being gathered)