
Improve Pcode emulator performance and reduce memory fragmentation #5325

Open
wants to merge 1 commit into master

Conversation

mrexodia

Before this change I got ~150k instructions/second, after ~380k (2.5x).

The main bottleneck was the separate heap allocations for the VarnodeData and PcodeOpRaw objects, but there was also an unnecessary vector in PcodeOpRaw (since the maximum number of inputs is 3, this can be a fixed-size array).

This does introduce a potential issue with iterator invalidation. The data for VarnodeData and PcodeOpRaw is now stored directly in a vector, so you cannot cache the pointers or use them as unique identifiers. This could be solved by replacing the vector with a custom deque-like container, but I could not find any code that relies on pointer stability, and it looks like the getter methods are always used to access the pointers...
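To illustrate the trade-off, here is a minimal sketch (not the actual Ghidra code; Varnode, Op, OldCache and FlatCache are hypothetical stand-ins for the real types) of the difference between allocating each op separately and storing ops contiguously, and why the latter invalidates cached pointers:

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for VarnodeData / PcodeOpRaw.
struct Varnode { uint64_t offset; uint32_t size; };

struct Op {
    Varnode out;
    std::array<Varnode, 3> in;  // fixed array: an op has at most 3 inputs
    int numInputs = 0;
};

// Before: one heap allocation per op. Fragmentation and poor locality,
// but the returned pointers stay valid for the lifetime of the cache.
struct OldCache {
    std::vector<Op *> ops;
    Op *emit() { ops.push_back(new Op()); return ops.back(); }
};

// After: ops live directly in the vector, so everything sits in one
// contiguous block, but a push_back that reallocates storage invalidates
// every previously returned Op pointer.
struct FlatCache {
    std::vector<Op> ops;
    Op *emit() { ops.emplace_back(); return &ops.back(); }
    Op *get(std::size_t i) { return &ops[i]; }  // re-fetch through the getter
};

With the flat layout, callers have to re-fetch pointers through the getters (or keep indices) instead of caching an Op* across emissions, which is exactly the caveat described above.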

@astrelsky
Contributor

@mrexodia did you profile the decompiler and if so, how? I've always wanted to profile it to figure out why it is so slow sometimes but never really knew where to start.

@mrexodia
Author

@mrexodia did you profile the decompiler and if so, how? I've always wanted to profile it to figure out why it is so slow sometimes but never really knew where to start.

I didn't profile the decompiler; in fact, I didn't compile or run Ghidra with this change at all. I am using the sleigh project and use the EmulatePcodeCache class directly: mrexodia/sleigh@295a7ac#diff-dda293caa87bc61aebe36b487a9cfb17d4d601a7a02dfd7cc1ee84d248404b5fR433.

The shellcode is the following C++:

#include <cstdint>  // uint32_t
#include <cstddef>  // size_t

__forceinline void
xtea_encipher(uint32_t v[2], uint32_t const key[4])
{
	const unsigned int num_rounds = 64;
	unsigned int i;
	uint32_t v0 = v[0], v1 = v[1], sum = 0, delta = 0x9E3779B9;
	for (i = 0; i < num_rounds; i++)
	{
		v0 += (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + key[sum & 3]);
		sum += delta;
		v1 += (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + key[(sum >> 11) & 3]);
	}
	v[0] = v0; v[1] = v1;
}

__forceinline void
xtea_decipher(uint32_t v[2], uint32_t const key[4])
{
	const unsigned int num_rounds = 64;
	unsigned int i;
	uint32_t v0 = v[0], v1 = v[1], delta = 0x9E3779B9, sum = delta * num_rounds;
	for (i = 0; i < num_rounds; i++)
	{
		v1 -= (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + key[(sum >> 11) & 3]);
		sum -= delta;
		v0 -= (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + key[sum & 3]);
	}
	v[0] = v0; v[1] = v1;
}

uint32_t shellcode() {
	uint32_t block[2] = { 0x676E6962, 0x61657263 };
	uint32_t key[4] = { 0x1337, 0x69, 0x420, 0x666 };
	for (size_t i = 0; i < 10000000; i++)
	{
		xtea_encipher(block, key);
		xtea_decipher(block, key);
	}
	return block[0] ^ block[1];
}

Then I simply counted the number of instructions executed per second and used Intel VTune to profile and find the bottlenecks. You could likely build a similar setup, for example by decompiling extremely large functions or by decompiling the same few functions in a loop.
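The measurement itself is nothing more than a timing loop around executeInstruction(). A rough sketch of what I mean (it assumes an EmulatePcodeCache that is already wired up with a translator, memory state and break table as in the linked commit; the header name and the stop-address convention are assumptions, not something from this PR):

#include <chrono>
#include <cstdint>
#include <cstdio>

#include "emulate.hh"  // EmulatePcodeCache / Address from the sleigh sources (assumed include path)

// Runs the already-configured emulator until it reaches `stop` and
// reports how many instructions were executed per second.
static void benchmark(EmulatePcodeCache &emulator, const Address &stop)
{
    uint64_t executed = 0;
    auto begin = std::chrono::steady_clock::now();

    while (emulator.getExecuteAddress() != stop) {
        emulator.executeInstruction();  // lift (or reuse cached) p-code, then execute it
        ++executed;
    }

    double seconds = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - begin).count();
    std::printf("%.0f instructions/second\n", executed / seconds);
}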

My plan was to reuse Ghidra's processor definitions to write a (good) alternative to Unicorn, but unfortunately this would require a complete rewrite to reach any kind of meaningful speed (Unicorn does ~15M instructions/second with a trace callback for every instruction). The comparison isn't fair, since Unicorn JITs the instructions, but Ghidra's lifting to Pcode is still too slow...

@astrelsky
Contributor

I see, thank you for the explanation.

@XVilka
Contributor

XVilka commented Jun 2, 2023

I apologize for the intrusion, but since this doesn't exactly relate to Ghidra's main purpose, I think nobody will be offended by such a shameless plug.

@mrexodia since you don't need decompilation facilities for this kind of emulation, have you considered using RzIL instead (based on Core Theory, the Binary Analysis Platform's new IL)? We test/verify it against QEMU (and other emulators) using rz-tracetest.

Currently Rizin has implemented RzIL lifting for these platforms:

  • ARMv7 & AARCH64
  • x86 (most common instructions, not everything yet)
  • PowerPC
  • AVR
  • SuperH
  • 8051
  • 6502
  • Brainfuck

Plus there are two more that are currently work in progress.

On a side note, I don't think anything can beat the TCG used by QEMU, which is what Unicorn is based on.

@mrexodia
Author

mrexodia commented Jun 6, 2023

Thanks for the info @XVilka! I'm aware of ESIL, but it was my understanding that instruction support for RzIL was still limited. I'll take a look at the test frameworks though!

For my personal project I switched to using https://github.com/icicle-emu/icicle-emu, which seems to be able to match QEMU’s performance (with the added advantage of being maintainable and readable code).

@jobermayr
Contributor

This change leads to an error with the example from #5872 (comment):

[decomp]> decompile
Decompiling reallocInit
Low-level ERROR: Illegal p-code in executable snippet
Unable to proceed with function: reallocInit

Without the change:

[decomp]> decompile
Decompiling reallocInit
Decompilation complete
[decomp]> print C
...

Labels: Feature: Decompiler · Status: Triage (Information is being gathered)