Change: Use simdjson's ondemand API #5
e8089ca to 5117fa2
Also sorry for the late response, I had a lot of stuff to do for school. I really appreciate your help.
Thanks. The functions that have to do with simdjson don’t have to be declared in your header file, do they? Or, at least, they don’t have to be in a public header file?
If you are compiling under Windows, I recommend using Clang if possible. See |
Thanks for the read. I've just run my benchmarks using Clang-CL 16 (clang-cl from the LLVM downloads). It has increased the runtime speed of the fastgltf cases, often fourfold, with both master and this branch. This PR's branch is still consistently lagging behind, though.
I see something like a factor of two... but yeah, Visual Studio has disappointing performance in some cases I care about.
ca17e96 to fc5b3c2
@lemire I've revisited this PR because I want to add support for #41, which I think I want to do using raw JSON tokens, as provided by the on-demand API. Using verbose logging I already managed to restructure my code to have no skips at all during the entire parsing process. I've looked through some things using Xcode Instruments and have found some issues. I fixed some of the issues with allocations and vector resizing where it wasn't needed, but this one stumps me (there's more than just the preview shown, and ignore the return at the top of the function, which I accidentally committed): lines 1330 to 1400 in fc5b3c2.
This lambda function takes up 36% of the entire average runtime of the parseAccessors function. It essentially just reads a number from an array, checks whether it's a double, float, or integer, and then converts it appropriately. Is there some better way to do this with simdjson, so that its runtime is not so excessive? Do you have any suggestions on improving that function?
@spnda Having a look.
src/fastgltf.cpp
    break;
}
case ondemand::number_type::unsigned_integer: {
    // Note that the glTF spec doesn't care about any integer larger than 32-bits, so
It should only trigger unsigned_integer if the integer is larger than or equal to 2**63, so maybe this case is in error for you, or needs to be treated as a double?
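To make the threshold concrete, here is a minimal sketch of the rule as described in this comment. `IntKind` and `classify` are hypothetical names for illustration only; this is not simdjson's implementation.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical categories mirroring the rule described above: a non-negative
// integer only needs the unsigned category once it no longer fits in int64_t.
enum class IntKind { Signed, Unsigned };

constexpr IntKind classify(std::uint64_t value) {
    constexpr std::uint64_t two_pow_63 = std::uint64_t{1} << 63;
    return value >= two_pow_63 ? IntKind::Unsigned : IntKind::Signed;
}

// A 32-bit component can never come close to the threshold.
static_assert(classify(std::numeric_limits<std::uint32_t>::max()) == IntKind::Signed);
static_assert(classify(std::uint64_t{1} << 63) == IntKind::Unsigned);
```

Under that rule, values that fit a 32-bit glTF component would always land in the signed category.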
The data included in the arrays can be at most 2**32 (or 32-bit floats), so that case is probably just never reached. If it's only ever taken with values larger than or equal to 2**63, I guess I would be safe to just remove it. Though for correctness, would it really be bad to keep it? Or are you saying that you think I should always error on this case because such large numbers shouldn't exist?
would it really be bad to keep it?
I recommend that you simplify your code prior to optimizing it. It is much easier to optimize simpler code.
I tried dumping the assembly from this function and I got several pages of instructions. It is just not possible to understand that much code at once. If you can trim it down to something simpler, then it might be much easier to understand where the performance challenges are.
    for (auto element : array) {
I am not sure I understand this code, but I would expect that you could do...
if (accessor.componentType == ComponentType::Double) {
    // Take a reference so the values land in the variant's vector, not in a copy.
    auto& v = std::get<double_vec>(variant);
    for (auto element : array) {
        // error handling as needed
        v.push_back(element.get_double());
    }
}
And so forth. The number type is a fallback for complicated scenarios, where you need to branch according to the number type, but if you know that a double is good enough in some instances, then just grab that.
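A self-contained sketch of the same idea, with the branch on the component type hoisted out of the loop. `ComponentType`, `NumberVariant` and `convert` are hypothetical stand-ins rather than the fastgltf types, and a `std::vector<double>` stands in for the JSON array so the example runs without simdjson.

```cpp
#include <cassert>
#include <cstdint>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the real types.
enum class ComponentType { Float, Int };
using NumberVariant = std::variant<std::vector<float>, std::vector<std::int64_t>>;

// `raw` stands in for the JSON array; real code would pull each value from
// the parser instead. Range checks on the integer cast are omitted.
NumberVariant convert(ComponentType type, const std::vector<double>& raw) {
    if (type == ComponentType::Float) {
        // One tight loop per component type: no per-element branching.
        std::vector<float> out;
        out.reserve(raw.size());
        for (double value : raw) {
            out.push_back(static_cast<float>(value));
        }
        return out;
    }
    std::vector<std::int64_t> out;
    out.reserve(raw.size());
    for (double value : raw) {
        out.push_back(static_cast<std::int64_t>(value));
    }
    return out;
}
```

Each loop body then contains a single conversion, which is usually easier for the compiler to optimize and for a reader to inspect in the generated assembly.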
Okay, so a few things first:
- I should actually be downcasting to single precision in all cases and never use double directly: "For floating-point components, JSON-stored minimum and maximum values represent single precision floats and SHOULD be rounded to single precision before usage to avoid any potential boundary mismatches."
- JSON does not make the distinction between integers and floating point numbers, as you know. Therefore, based on accessor.componentType, I have to cast either to float or to an integer type. The glTF spec specifically says this: "Array elements MUST be treated as having the same data type as accessor’s componentType." From that, I understand that if the component type specifies an integer but the value is 3.5 for whatever reason in the JSON, I have to just cast that to an integer. The specification makes no further clarifications on how the values should be treated as far as I can tell.
As I don't know how simdjson parses or identifies the numbers internally, would you say it'd be safe to just have this instead to avoid the ondemand::number object?
switch (accessor.componentType) {
    case ComponentType::Double:
        v.emplace_back(element.get_double());
        break;
    case ComponentType::Float:
        ...
}
Oh, and, most importantly, because I am doing order-agnostic parsing there's a chance the componentType field hasn't been parsed yet when this function executes, which means I have to rely on the number type identification from simdjson. So the proposed code isn't a possibility, and it also reveals an issue with the current implementation, because I shouldn't be using anything from the accessor object while parsing.
Converting to float from double is typically done with a single instruction (e.g., cvtsd2ss). It is a high latency operation, but not overly expensive. So if you don't need to immediately check the result, it should be quite cheap.
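As a small illustration of that narrowing step, assuming the parsed value is already available as a double (`round_to_single` and `round_all_to_single` are hypothetical helper names):

```cpp
#include <cassert>
#include <vector>

// Narrow a parsed double to single precision. With default rounding the
// result is the nearest representable float.
inline float round_to_single(double value) {
    return static_cast<float>(value);
}

// Apply the same rounding to a whole array, for example min/max bounds.
inline std::vector<float> round_all_to_single(const std::vector<double>& values) {
    std::vector<float> out;
    out.reserve(values.size());
    for (double value : values) {
        out.push_back(round_to_single(value));
    }
    return out;
}
```

For instance, 16777217.0 (2**24 + 1) has no exact float representation and rounds to 16777216.0f.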
would you say it'd be safe to just have this
I wouldn't use a switch/case to handle individual elements if I could avoid it. I would have distinct code paths based on componentType. It depends on the context, but repeated calls to a switch/case can be detrimental to performance. Not always: it depends on how the compiler decides to handle it and how precisely you wrote the code. But to be safe, assume that a switch/case adds significant overhead, because it might. It is not the switch/case per se that's the problem; it is that it prevents some useful optimizations by the compiler.
Of course, trying to parse "3.5" as an integer will cause an error, but you could handle this with an exceptional case (using a cold path if needed).
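One way to picture the hot path with a cold fallback, using only the standard library instead of simdjson. `parse_integer` and `parse_integer_slow` are hypothetical names, and the sketch assumes the default "C" locale and omits range checks on the fallback cast.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

// Cold path: the token is not a plain integer (for example "3.5"), so parse
// it as a double and truncate toward zero. Kept in its own function so the
// common path stays small.
std::optional<std::int64_t> parse_integer_slow(std::string_view token) {
    const std::string copy(token);  // strtod needs a null-terminated string
    char* end = nullptr;
    const double value = std::strtod(copy.c_str(), &end);
    if (end == copy.c_str() || *end != '\0') {
        return std::nullopt;  // not a number at all
    }
    return static_cast<std::int64_t>(value);
}

// Hot path: try the integer parse first and fall back only when it fails.
std::optional<std::int64_t> parse_integer(std::string_view token) {
    std::int64_t value = 0;
    const char* first = token.data();
    const char* last = first + token.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec == std::errc{} && ptr == last) {
        return value;  // the whole token was an integer
    }
    return parse_integer_slow(token);
}
```

With this shape, well-formed integer input never reaches the floating-point code.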
So the proposed code isn't a possibility
My proposal is that you try it and benchmark. I submit to you that you need to run experiments to find the right design, even if it means temporarily having code that does not solve the right problem. You need to identify the bottleneck and that's damn difficult if you are just staring at high-level code.
I wrote a small benchmark that parses the same JSON array, which contains integers and doubles (50-50 distribution). It turns out get_number is ever so slightly faster in my little experiment, even with the switch-case covering those three cases. The difference is so small (0.1us to 0.3us) that I think it makes no difference in real-world scenarios, and it is probably within the margin of error. I'll try some other experiments and see what I can find...
The whole function does a lot of complicated work. If you turn it into something like ...
You should be pretty close to being limited by the number parsing speed. If not, then you have other overhead. My recommendation is to try to break down your code into simpler functions. There is a lot going on. It may also help to include a standard benchmark as part of the project. If I could run a benchmark easily, I might be able to give you more insights. Just looking at high-level code buys only so much. I need to know what the code is doing in practice.
This part is important: I recommend including a standard benchmark as part of your project. If I could just do...
That would be tremendously useful. Not just to me but to anyone who cares about performance.
Feel free to steal some of my benchmarking code: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2023/11/28/benchmarks
My recommendation is to rely more on experiments and repeated high-quality benchmarks than on profiling. Profilers mislead us all the time. They may point in a potential direction, but that's all. There are exceptions. If you do pure numerical analysis, you may find 3 lines of code where your code is spending 99% of its time, and then you rewrite them in assembly and you are done... but for a problem such as this one, experiments are very likely to be the way forward. One needs to gain insights into what is costing what. For example, how much does the get_number() function cost as opposed to get_double()? It takes just one experiment to find out.
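A minimal sketch of such a repeated measurement, using only `std::chrono`. `best_of_ns` is a hypothetical helper, not part of fastgltf, simdjson or the linked benchmark code; taking the minimum over several runs is one common way to reduce noise, assumed here for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Run `fn` `repeats` times and return the fastest wall-clock time in
// nanoseconds. The minimum filters out some scheduler and cache noise.
template <typename Fn>
std::int64_t best_of_ns(Fn&& fn, int repeats) {
    assert(repeats > 0);
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int i = 0; i < repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        const auto elapsed =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        best = std::min<std::int64_t>(best, elapsed);
    }
    return best;
}
```

Calling it once with a loop built on get_number() and once with a loop built on get_double(), over the same input, would give the two numbers to compare. The measured callable should produce a result that is actually used, otherwise the compiler may remove the work.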
I already include benchmarks (which I am also using right now locally) which use the Catch2 testing framework. I should probably add something about how to build tests and run them in the README... Though this is how you'd build and run the benchmarks. Note that these do require external downloads of sample assets because I currently just bench an entire run over a JSON asset. I mostly just wrote them to work well for me locally and for the CI.
(note that the quotation marks in the last line might just be a requirement of zsh which I'm using locally, so you might have to remove those).
@spnda Here is my recommendation. Take your current code, do all the parsing and error handling, but do not touch your data structures. Do not call reserve, emplace_back, or push_back, at least as far as the numbers are concerned. Next, do the reverse. Do not call get_number... iterate through the array, but do not parse the values. Just add random numbers to your data structures (say random numbers between 0 and 1). And then compare with the current code. Lay out the three results. This should tell you a lot about where the time goes. My general point is that you need to run experiments and collect data.
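The three variants could be laid out roughly as follows. This is a sketch under stated assumptions: `Tokens` is a hypothetical stand-in for the JSON array, `std::strtod` stands in for the simdjson number parsing, and all function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <random>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;  // stand-in for the JSON array

// Variant 1: parse every value but never touch the output containers.
// The sum is returned so the parsing work cannot be optimized away.
double parse_only(const Tokens& tokens) {
    double checksum = 0.0;
    for (const std::string& token : tokens) {
        checksum += std::strtod(token.c_str(), nullptr);
    }
    return checksum;
}

// Variant 2: walk the array without parsing and store random numbers
// between 0 and 1 instead.
std::vector<double> store_only(const Tokens& tokens) {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> out;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        out.push_back(dist(rng));
    }
    return out;
}

// Variant 3: the current behaviour, parse and store.
std::vector<double> parse_and_store(const Tokens& tokens) {
    std::vector<double> out;
    for (const std::string& token : tokens) {
        out.push_back(std::strtod(token.c_str(), nullptr));
    }
    return out;
}
```

Timing each variant on the same input with whatever benchmark harness the project already uses gives the three results to lay side by side.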
Well, big thanks for taking the time to look over my project. I'll do some testing along the lines of what you suggested. I'm not knowledgeable about these low-level optimizations, so much of this is still new to me. Using profiling I managed to find a few issues in my code apart from the |
@spnda The reason I raise these questions is that software performance is complex and requires hard numbers and experimentation. For example, just by tracing the code without too much attention, I noticed that you are often enough storing a single integer in a std::vector which is itself inside a std::variant. This is fine if that's what you need, but constructing these data structures is expensive. And the parsing routines themselves may not carry much weight. But it is impossible to tell without experiments, and profiling won't tell you.
In 0fe58d7 we switched from on-demand -> DOM API as it was slightly faster. This switches back to the on-demand API while using a different approach to parsing to hopefully increase speeds over the current code using the DOM API.
Design
The first iteration of the new parsing interface also helped with the DOM API, which is why that was merged separately already. However, this patch also tries to port that idea of iterating over all fields linearly to every function. Currently, we only iterate over fields for the root of the JSON document and use hashed strings to speed up the switch. For every other function, the fields may still appear in the exact opposite order (worst case) to the one we read them in, which is why I want to use this idea everywhere in fastgltf.
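As a rough illustration of the hashed-string switch, here is a standalone sketch. It uses FNV-1a purely because it is short, which is an assumption and not necessarily the hash fastgltf uses; `hash_key`, `Field` and `identify` are hypothetical names.

```cpp
#include <cstdint>
#include <string_view>

// FNV-1a over the key bytes, usable at compile time for the case labels.
constexpr std::uint32_t hash_key(std::string_view key) {
    std::uint32_t hash = 2166136261u;
    for (char c : key) {
        hash ^= static_cast<unsigned char>(c);
        hash *= 16777619u;
    }
    return hash;
}

enum class Field { BufferView, ComponentType, Count, Unknown };

// Identify a field in whatever order the keys arrive. The string comparison
// after the hash match guards against collisions.
constexpr Field identify(std::string_view key) {
    switch (hash_key(key)) {
        case hash_key("bufferView"):
            return key == "bufferView" ? Field::BufferView : Field::Unknown;
        case hash_key("componentType"):
            return key == "componentType" ? Field::ComponentType : Field::Unknown;
        case hash_key("count"):
            return key == "count" ? Field::Count : Field::Unknown;
        default:
            return Field::Unknown;
    }
}
```

A parsing function would then loop over the object's fields once and call something like `identify` on each key, so the cost no longer depends on the order in which the fields appear.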
Performance
Performance has gotten worse, which is why this is a draft PR; I will look into this drawback later on. From a quick profiling run, it seems like the ondemand::parser::iterate function takes a considerable amount of time, while the rest executes within at most 500 microseconds.