-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Pcode emulator performance and reduce memory fragmentation #5325
base: master
Are you sure you want to change the base?
Improve Pcode emulator performance and reduce memory fragmentation #5325
Conversation
Before this change I got ~150k instructions/second, after ~380k (2.5x)
@mrexodia did you profile the decompiler and if so, how? I've always wanted to profile it to figure out why it is so slow sometimes but never really knew where to start. |
I didn't profile the decompiler, in fact I didn't compile or run Ghidra with this change at all. I am using the sleigh project and directly use the The shellcode is the following C++: __forceinline void
xtea_encipher(uint32_t v[2], uint32_t const key[4])
{
const unsigned int num_rounds = 64;
unsigned int i;
uint32_t v0 = v[0], v1 = v[1], sum = 0, delta = 0x9E3779B9;
for (i = 0; i < num_rounds; i++)
{
v0 += (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + key[sum & 3]);
sum += delta;
v1 += (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + key[(sum >> 11) & 3]);
}
v[0] = v0; v[1] = v1;
}
__forceinline void
xtea_decipher(uint32_t v[2], uint32_t const key[4])
{
const unsigned int num_rounds = 64;
unsigned int i;
uint32_t v0 = v[0], v1 = v[1], delta = 0x9E3779B9, sum = delta * num_rounds;
for (i = 0; i < num_rounds; i++)
{
v1 -= (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + key[(sum >> 11) & 3]);
sum -= delta;
v0 -= (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + key[sum & 3]);
}
v[0] = v0; v[1] = v1;
}
uint32_t shellcode() {
uint32_t block[2] = { 0x676E6962, 0x61657263 };
uint32_t key[4] = { 0x1337, 0x69, 0x420, 0x666 };
for (size_t i = 0; i < 10000000; i++)
{
xtea_encipher(block, key);
xtea_decipher(block, key);
}
return block[0] ^ block[1];
} Then simply count the amount of instructions executed per second and used Intel VTune to profile and find bottlenecks. Likely you could find a similar setup, maybe by decompiling extremely large functions or decompiling the same few functions in a loop. My plan was to reuse Ghidra's processor definitions to write a (good) alternative to unicorn, but unfortunately this will require a complete rewrite to reach any kind of meaningful speeds (unicorn does ~15M/second with a trace callback for every instruction). The comparison isn't fair since unicorn JITs the instructions, but still Ghidra's lifting to Pcode is too slow... |
I see, thank you for the explanation. |
I apologize for the intrusion, but since it doesn't exactly related to the Ghidra main purpose, I think nobody would be offended by such a shameless plug. @mrexodia since you don't need decompilation facilities for this kind of emulation, have you considered using RzIL instead (based on the Binary Analysis Platform's new IL called Core Theory)? And we test/verify it against QEMU (or other emulators) using rz-tracetest. Currently Rizin implemented RzIL uplifting for these platforms:
Plus there are two currently work-in-progress: On a side note, I don't think anything can beat TCG used by QEMU, which is the base for Unicorn. |
Thanks for the info @XVilka! I’m aware of ESIL, but it was my understanding that instruction support for RZIL was still limited. I’ll take a look at the test frameworks though! For my personal project I switched to using https://github.com/icicle-emu/icicle-emu, which seems to be able to match QEMU’s performance (with the added advantage of being maintainable and readable code). |
This change leads to an error with example from (#5872 (comment))
Without:
|
Before this change I got ~150k instructions/second, after ~380k (2.5x).
The main bottleneck was caused by separate allocations of the
VarnodeData
andPcodeOpRaw
objects, but there was also an unnecessary vector inPcodeOpRaw
(since the maximum amount of input is 3 this can be an array).This does introduce a potential issue with iterator invalidation. The data for
VarnodeData
andPcodeOpRaw
is now stored directly in a vector, so you cannot cache the pointers or use them as a unique identifier. This could be solved by replacing vector with a customdeque
-like container, but I was not able to find any uses of the code and it looks like the getter methods are always used to access the pointers...