
Pipweights #1 (Open)

wants to merge 21 commits into linearbinning
Conversation

@adematti (Member) commented Sep 28, 2021

Attempt to implement PIP / angular weights into Corrfunc, for DESI usage.
We first focus on theory DD fallback kernel.
PIP weights require passing integer weights. We will require that these weights occupy the same number of bytes as the coordinate arrays (i.e. we request int32 arrays if the coordinate arrays are float32, int64 if float64), which may help with SIMD operations.
Then we can keep most of the C code as is.
We add the flag num_integer_weights to weight_struct to keep track of the number of integer bitwise weights. The other weights are treated as individual weights (floating-point). A list of weight arrays should then be provided to the Python API.
This all goes fine with the fallback kernel, but I could not find an obvious correspondence in terms of SIMD instructions; e.g. a vectorized popcount seems to be provided only by AVX-512... Any idea @lgarrison, @manodeep?
In addition we need to apply angular weights, i.e. weights depending on the cosine of the angle between the two galaxies. I also drafted an implementation for those, tab_weight_struct. This requires a bit of plumbing, however. The good news is that, compared to the linearbinning branch, there is no real increase in computing time for weight_type = None, and only about +5% for pair_product.

@lgarrison, @manodeep, any opinion on all of this? These changes are somewhat specific, so I'd understand if you would prefer to keep them separate from the main repo/master.

@lgarrison

This looks great; PIP weights are obviously a common use case, so it's definitely worth considering supporting them natively.

I've only skimmed this, but I'm not entirely sure I follow the angular weights. I would naively expect this could just be implemented as a custom weighting function, without passing down new arrays from the Python wrappers. But is the idea that one needs to specify a function of costheta, and it needs to be set at runtime?

On the 5% increase in runtime, is that with PIP weights or angular weights (or both)? Does it scale with number of PIP weights?

@adematti (Member Author) commented Sep 29, 2021

But is the idea that one needs to specify a function of costheta, and it needs to be set at runtime?

Yes, one should be able to provide an arbitrary function of costheta, hence passed as tabulated values. In practice, we want to input DD_parent(costheta)/DD_pip(costheta), see e.g. eq. 9 of https://arxiv.org/pdf/2007.09005.pdf

On the 5% increase in runtime, is that with PIP weights or angular weights (or both)? Does it scale with number of PIP weights?

The 5% increase in runtime was w.r.t. the previous implementation using weight_type = "pair_product". This represents the overhead of the more flexible weight structure (increased number of "if" statements/larger memory footprint). The runtime does scale with the number of PIP weights, indeed. Typically, for a box size of 1000 Mpc/h, linear binning up to 200 Mpc/h, and 1e5 objects:

  • 4 s with weight_type = "pair_product"
  • 4.4 s with weight_type = "inverse_bitwise", 1 individual weight and no PIP weight
  • 4.8 s with weight_type = "inverse_bitwise", 1 individual weight and one 64-bit PIP weight
  • 9.1 s with weight_type = "inverse_bitwise", 1 individual weight and 8 64-bit PIP weights (typically the maximum we'll use)
  • 10 s with weight_type = "inverse_bitwise", 1 individual weight and 8 64-bit PIP weights and angular weights

That would still make it the fastest pair counter with PIP + angular weights on the market, I think ;)

@adematti (Member Author) commented Oct 7, 2021

Hi @lgarrison, @manodeep,
would you have any advice regarding the implementation of https://github.com/adematti/Corrfunc/blob/e4183eeda155e740cc0fd45d577c5754b05847c8/utils/weight_functions.h.src#L118 with SIMD instructions? E.g. a vectorized popcount seems to be provided only by AVX-512 here.
Thanks!

@lgarrison

This is an interesting question. Most of the implementations online (e.g. https://github.com/WojciechMula/sse-popcount) are focused on the case where you have a long vector of ints and want to popcnt the whole thing, while this data is "transposed" such that you want the individual popcnt result of each 64-bit int in the vector.

I've only skimmed the benchmarks from that repo, but they suggest that it might be sufficient to just always use SSE, e.g. this implementation often seems faster than AVX for 64 bytes: https://github.com/WojciechMula/sse-popcount/blob/6feb3dba32c526b17de01e931c116900e3a23104/popcnt-cpu.cpp#L59

But also, these implementations are summing the results, which we don't want. So that may affect which method is fastest for us.

For AVX-512, vpopcnt* is only provided in Ice Lake and newer, as it's not actually part of AVX-512F.

So I would actually start by trying a scalar popcnt even in the vector kernels, like __builtin_popcount() from GCC. It may be that the data movement (loading from many weight vectors) is more costly than the popcnt, anyway. If the performance is still lacking, I would try an SSE version next.

@adematti (Member Author) commented Oct 7, 2021

Thank you very much for the feedback! I will start with the simple __builtin_popcount() from GCC, have everything else properly implemented, and we can come back to this later as you suggest if performance is lacking.

@manodeep commented Oct 8, 2021

Right - forgot to comment here. @adematti Since I am not involved in the DESI stuff, I don't really know what you are attempting. Could you please outline some pseudo-code for the actual operation you are trying to code up?

In terms of merging with the main repo, it would depend on the complexity of the code involved. I have not gotten any request for a custom weighting scheme, but that's not to say that such a feature couldn't be broadly useful. However, you might get better mileage out of keeping this as a separate fork - that would potentially allow you to create a pair-counter optimised for your specific use-case.

Hardware popcnt is a special CPU instruction (popcnt32, popcnt64) and has been supported for a while now. Officially you need to compile with -mpopcnt to enable that feature, but I have seen -march=native enable it (though it's been a while since I tested that). As Lehman mentioned, popcnt on vector registers is only available on very recent CPUs, so it might be worthwhile to pursue the older popcnt32/popcnt64 operations.

I am curious - how important is performance for this PIP pair-counter? And what are the typical particle numbers in the cells?

@adematti (Member Author) commented Oct 8, 2021

Thanks for the feedback @manodeep! For the PIP part I am simply trying to do:
weight = popcount(weight1 & weight2)
where weight1 is a 32/64-bit integer weight for particle 1, and similarly for particle 2. The goal is to compute, for each pair of particles, eq. 6 of https://arxiv.org/pdf/2007.09005.pdf
This is straightforward to implement in the fallback kernel, and I was hoping a similar popcnt was available for vector registers, but as you say it is not - so I will just loop over the vector registers.
Actually performance is not very important, since such PIP operations are only required for DD pair counts, not DR and RR which will be dominant in the computing time.
Thanks!
