
Pipweights #1 (Open)

wants to merge 21 commits into linearbinning
Conversation

@adematti (Member) commented Sep 28, 2021

Attempt to implement PIP / angular weights into Corrfunc, for DESI usage.
We first focus on theory DD fallback kernel.
PIP weights require passing integer weights. We will require that these weights occupy the same number of bytes as the coordinate arrays (i.e. we request int32 arrays if the coordinate arrays are float32, int64 if float64), which may help with SIMD operations.
Then we can keep most of the C code as is.
We add the flag num_integer_weights to weight_struct to keep track of the number of integer bitwise weights. The other weights are treated as individual weights (floating-point). A list of weight arrays should then be provided to the Python API.
This all goes fine with the fallback kernel, but I could not find an obvious correspondence in terms of SIMD instructions; e.g. a vectorized popcount seems to be provided only by AVX-512... Any idea @lgarrison, @manodeep?
In addition we need to apply angular weights, i.e. weights depending on the cosine of the angle between the two galaxies. I also drafted an implementation for those, tab_weight_struct. This requires a bit of plumbing, however. The good news is that, compared to the linearbinning branch, there is no real increase in computing time for weight_type = None, and only about +5% for pair_product.

@lgarrison, @manodeep, any opinion on all of this? These changes are somewhat specific, so I'd understand if you would prefer to keep them separate from the main repo/master.

@lgarrison

This looks great; PIP weights are obviously a common use case, so it's definitely worth considering supporting them natively.

I've only skimmed this, but I'm not entirely sure I follow the angular weights. I would naively expect this could just be implemented as a custom weighting function, without passing down new arrays from the Python wrappers. But is the idea that one needs to specify a function of costheta, and it needs to be set at runtime?

On the 5% increase in runtime, is that with PIP weights or angular weights (or both)? Does it scale with number of PIP weights?

@adematti (Member Author) commented Sep 29, 2021

But is the idea that one needs to specify a function of costheta, and it needs to be set at runtime?

Yes, one should be able to provide an arbitrary function of costheta, hence passed as tabulated values. In practice, we want to input DD_parent(costheta)/DD_pip(costheta), see e.g. eq. 9 of https://arxiv.org/pdf/2007.09005.pdf

On the 5% increase in runtime, is that with PIP weights or angular weights (or both)? Does it scale with number of PIP weights?

The 5% increase in runtime was w.r.t. the previous implementation using weight_type = "pair_product". This represents the overhead of the more flexible weight structure (increased number of "if" statements/larger memory footprint). The runtime does scale with the number of PIP weights, indeed. Typically, for a box size of 1000 Mpc/h, linear binning up to 200 Mpc/h, and 1e5 objects:

  • 4 s with weight_type = "pair_product"
  • 4.4 s with weight_type = "inverse_bitwise", 1 individual weight and no PIP weight
  • 4.8 s with weight_type = "inverse_bitwise", 1 individual weight and one 64-bit PIP weight
  • 9.1 s with weight_type = "inverse_bitwise", 1 individual weight and 8 64-bit PIP weights (typically the maximum we'll use)
  • 10 s with weight_type = "inverse_bitwise", 1 individual weight and 8 64-bit PIP weights and angular weights

That would still make it the fastest pair counter with PIP + angular weights on the market, I think ;)

@adematti (Member Author) commented Oct 7, 2021

Hi @lgarrison, @manodeep,
would you have any advice regarding the implementation of https://github.com/adematti/Corrfunc/blob/e4183eeda155e740cc0fd45d577c5754b05847c8/utils/weight_functions.h.src#L118 with SIMD instructions? E.g. a vectorized popcount seems to be provided only by AVX-512 here.
Thanks!

@lgarrison

This is an interesting question. Most of the implementations online (e.g. https://github.com/WojciechMula/sse-popcount) are focused on the case where you have a long vector of ints and want to popcnt the whole thing, while this data is "transposed" such that you want the individual popcnt result of each 64-bit int in the vector.

I've only skimmed the benchmarks from that repo, but they suggest that it might be sufficient to just always use SSE, e.g. this implementation often seems faster than AVX for 64 bytes: https://github.com/WojciechMula/sse-popcount/blob/6feb3dba32c526b17de01e931c116900e3a23104/popcnt-cpu.cpp#L59

But also, these implementations are summing the results, which we don't want. So that may affect which method is fastest for us.

For AVX-512, vpopcnt* is only provided in Ice Lake and newer, as it's not actually part of AVX-512F.

So I would actually start by trying a scalar popcnt even in the vector kernels, like __builtin_popcount() from GCC. It may be that the data movement (loading from many weight vectors) is more costly than the popcnt, anyway. If the performance is still lacking, I would try an SSE version next.

@adematti (Member Author) commented Oct 7, 2021

Thank you very much for the feedback! I will start with the simple __builtin_popcount() from GCC, have everything else properly implemented, and we can come back to this later as you suggest if performance is lacking.

@manodeep commented Oct 8, 2021

Right - forgot to comment here. @adematti Since I am not involved in the DESI stuff, I don't really know what you are attempting. Could you please outline some pseudo-code for the actual operation you are trying to code up?

In terms of merging with the main repo, it would depend on the complexity of the code involved. I have not gotten any request for a custom weighting scheme, but that's not to say that such a feature couldn't be broadly useful. However, you might get better mileage out of keeping this as a separate fork - that would potentially allow you to create a pair-counter optimised for your specific use-case.

Hardware popcnt is a special CPU instruction (popcnt32, popcnt64) and has been supported for a while now. Officially you need to compile with -mpopcnt to enable that feature, but I have seen -march=native enable it (though it's been a while since I tested that). As Lehman mentioned, popcnt on vector registers is only available on very recent CPUs, so it might be worthwhile to pursue the older popcnt32/popcnt64 operations.

I am curious - how important is performance for this PIP pair-counter? And what are the typical particle numbers in the cells?

@adematti (Member Author) commented Oct 8, 2021

Thanks for the feedback @manodeep! For the PIP part I am simply trying to do:
weight = popcount(weight1 & weight2)
where weight1 is a 32/64-bit integer weight for particle 1, and similarly for particle 2. The goal is to compute, for each pair of particles, eq. 6 of https://arxiv.org/pdf/2007.09005.pdf
This is straightforward to implement in the fallback kernel, and I was hoping a similar popcnt was available for vector registers, but as you say it is not - so I will just loop over the vector registers.
Actually performance is not very important, since such PIP operations are only required for DD pair counts, not DR and RR which will be dominant in the computing time.
Thanks!
