AHF halo ID fix #88

mtremmel · 2018-12-06T16:10:40Z

Change the AHF input handler to generate halo IDs based on the position within the file, making all versions of AHF consistent with one another. This is needed because AHF with MPI generates random ID numbers for all halos, which is currently incompatible with, e.g., pynbody's AHF catalog reader. This was done by changing the iter_rows_raw method. Also deleted an unnecessary (and unreachable) line in the filename method for AHFStatFile.

… the ID rather than the raw file ID to be consistent in meaning between MPI and non-MPI AHF versions. Also, remove spurious extra line in filename method.

apontzen · 2018-12-06T22:30:58Z

This is problematic because it copy/pastes from HaloStatFile, so duplicating functionality

Would it be possible to reimplement within the HaloStatFile base class, with the AHF child class just setting a flag to activate the new behaviour?

mtremmel · 2018-12-07T15:26:02Z

makes sense. I'll do that

…thin the file rather than the value in the ID column of the file. Now each handler defines a flag when this needs to occur
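A minimal sketch of the flag-based approach described in this commit, assuming a simplified whitespace-separated stat file. The class names `StatFileSketch` and `AHFStatFileSketch` and the method `iter_ids` are hypothetical and are not the tangos classes; only the attribute names `_id_offset` and `_id_from_file_pos` are taken from this thread.

```python
import io


class StatFileSketch:
    """Hypothetical stand-in for a halo stat file reader (not the tangos class)."""

    _id_offset = 0             # added to every ID before it is returned
    _id_from_file_pos = False  # if True, ignore the ID column and count rows

    def __init__(self, stream):
        self._stream = stream

    def iter_ids(self):
        """Yield one ID per data row, honouring the two class-level flags."""
        position = 0
        for line in self._stream:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and the header
            raw_id = int(line.split()[0])
            chosen = position if self._id_from_file_pos else raw_id
            yield chosen + self._id_offset
            position += 1


class AHFStatFileSketch(StatFileSketch):
    # AHF-like behaviour described in the thread: position in file, offset by 1
    _id_offset = 1
    _id_from_file_pos = True


# Made-up rows imitating MPI output: the ID column is arbitrary.
MPI_LIKE = "#ID npart\n9182736455 120\n1029384756 950\n5647382910 40\n"

base_ids = list(StatFileSketch(io.StringIO(MPI_LIKE)).iter_ids())
ahf_ids = list(AHFStatFileSketch(io.StringIO(MPI_LIKE)).iter_ids())
```

With this layout the child class only sets two attributes, and the row-reading logic lives in one place.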

apontzen · 2018-12-07T19:41:04Z

Beautiful! Could we just add a test?

apontzen · 2019-03-12T22:00:04Z

Ping on adding a test so that we can merge this in

apontzen · 2019-04-09T09:56:07Z

Looking good... I have a few questions

(1) Is there any way to disable this behaviour?
(2) Does the test really check for anything substantive? Given that the halo numbering is already sequential in the test file, it seems to me like it is not really checking your changes.
(3) Will this break e.g. @TobiBu's work flow and if so what do we do about it (see 1)
(4) On a minor point, please make the test a separate function rather than bundle it in with the IDL stat file test.

…m ones from MPI runs. Add a separate function to test this conversion specifically.

mtremmel · 2019-04-09T16:57:34Z

Ok I have fixed the test to make it a bit more separate and a bit more specific to the issues surrounding MPI outputs from AHF.

As far as turning it off... this is where I'm uncertain what the best course of action is. Currently (and really, from the beginning) it was assumed that each halo catalog could have their own format in terms of finder_id. For AHF the finder_id was originally always offset by 1 from the "raw" group ID (rather than starting at 0 it started at 1) and now it also specifies that the finder_id should be equal to the position within the catalog (still offset by 1). This makes the most sense when combined with how pynbody reads AHF (and it is consistent with the default AHF grp IDs when it isn't run with MPI). However, if we want to make TANGOS as adaptable as possible to any analysis tool, we could consider making all of these defaults able to be overridden by the user. I would think the best way to do that would be either

a runtime argument when the simulation is first added that specifies how finder_id and halo_number translate to the raw group IDs in the halo catalog.

or

something in the config file that can override the relevant class attributes (_id_offset and _id_from_file_pos)

thoughts? I'm not 100% sure how to do this at the moment but it seems doable. I think I lean toward option 2. The argument against this (I think) would be that translating the default tangos ordering to whatever tools the user wants as their output handlers should occur at the level of the output handlers themselves. If your tools read in AHF halos by their raw ID number, then your handler needs to translate the TANGOS finder_id to what is needed. I'm really not sure what the right answer is here.

mtremmel · 2019-04-09T17:07:28Z

as far as whether this will affect current work on merger trees, I looked at the test files and they seem to be approximating AHF files run without MPI. If this is the case, then the finder_id and halo_numbers generated by tangos should be the same as they were before. That is because the non-mpi AHF already makes grp numbers equal to their ordering by n_particles (and TANGOS already assumes that AHF ID numbers are shifted by 1 relative to their raw values)

TobiBu · 2019-04-09T17:13:39Z

> Looking good... I have a few questions
>
> (1) Is there any way to disable this behaviour?
> (2) Does the test really check for anything substantive? Given that the halo numbering is already sequential in the test file, it seems to me like it is not really checking your changes.
> (3) Will this break e.g. @TobiBu's work flow and if so what do we do about it (see 1)

I think this should work for me. During my implementation @apontzen pointed me already to this pull request and I integrated the new AHF IDs in my testing branch. However, I can pull this particular branch and check if something breaks.

TobiBu · 2019-04-09T17:34:18Z

Although, I am not sure what an AHF mtree file would look like if the halo catalogue was run with MPI.
My workflow relies on the finder_id. How is that impacted by the changes?

cheers
Tobias

mtremmel · 2019-04-09T17:39:39Z

So the changes made here would mean that if you were to initialize a tangos database with output from MPI, the finder_id would correspond to the position of the halo within the catalog (1 = first, 2 = second, etc) while halo_number would be the order of the halo in terms of particle number in the simulation (1 = most particles, 2 = second most, etc). The difference with MPI is that neither the halo catalog nor the ID numbers are ordered by particle number. The grp numbers in the catalog itself are large randomly generated unique numbers. TANGOS essentially ignores these numbers right now since they are meaningless. A normal halo catalog run with pynbody can be examined using either number (finder_id by default and halo_number if you run s.halos(dosort=True))
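To make the distinction concrete, here is a small sketch with made-up rows (the raw IDs and particle counts are invented for illustration). It assigns finder_id from the position in the file and halo_number from the rank by particle count, as described above; it is not the tangos implementation.

```python
# Made-up (raw_grp_id, n_particles) rows in file order, imitating MPI output.
rows = [(9182736455, 120), (1029384756, 950), (5647382910, 40)]

# finder_id: 1-based position within the catalog; the raw ID is ignored.
finder_ids = list(range(1, len(rows) + 1))

# halo_number: 1-based rank by particle count, largest first.
order = sorted(range(len(rows)), key=lambda i: rows[i][1], reverse=True)
halo_numbers = [0] * len(rows)
for rank, row_index in enumerate(order, start=1):
    halo_numbers[row_index] = rank

for (raw_id, npart), fid, hnum in zip(rows, finder_ids, halo_numbers):
    print(f"raw={raw_id} npart={npart} finder_id={fid} halo_number={hnum}")
```

Here the second halo in the file has the most particles, so it gets finder_id 2 but halo_number 1.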

apontzen · 2019-04-12T21:15:19Z

So to be clear, there are basically three numbers

The halo_number, which is a tangos construction which we can ignore in this discussion;
The finder_id, which was supposed to be the identity of the halo in the raw halo finder output;
The offset of the halo within finder catalogues (which we could notionally call offset_id). This is stored as finder_id in @mtremmel's patch

My guess is that the AHF mtrees will use a true finder_id rather than an offset_id, and therefore this patch could break importing mtrees. @TobiBu can you confirm, and do either of you have a good way to proceed from here? Very frustrating...

Do I understand correctly by the way that pynbody expects an offset_id rather than a finder_id?

What a mess... I really don't want to start storing an offset_id and a finder_id, but am beginning to wonder whether it might be necessary.

mtremmel · 2019-04-15T14:47:33Z

I think all we really need is that the Stat file input handler class information is saved at the database level when the information is initiated. So, as a user, I should be able to easily back out what the finder_id offset is. Similarly, whether halo_number was a re-ordering or not (though the former is more important at the moment). If this is true, then the user can easily convert finder_id and/or halo_number to whatever they need, right?

TobiBu · 2019-04-15T19:51:21Z

I think that is the best option. In the case of the AHF input handler it would be totally sufficient to either save the original AHF_id or specify how finder_id would translate into the original AHF_id.

mtremmel · 2019-04-15T19:52:40Z

out of curiosity though, does your code use the actual AHF_id from the file or just the order within the file?

TobiBu · 2019-04-15T19:56:22Z

It needs the actual AHF_id, since the merger trees of AHF are stored such that between consecutive outputs the halos of given AHF_id are connected. But now that I type, it might actually be that it's simply the order within the file. I would need to check the source code of AHF_mtree. Or write a mail to Alexander Knebe.

mtremmel · 2019-04-15T20:11:39Z

ah I see. So, as it is right now, for non-MPI runs, the AHF_id is directly related to the order within the file... it might be worth checking that AHF_id is still the important thing when it is run with MPI, but it makes sense that it would be I think.

Maybe the best solution to all of this is, in fact, to add one more property: the original halo_number (call it, say, orig_grp_number). This is easy enough to implement I believe. @apontzen what do you think? This way TANGOS will have all of the important "original" information (a value directly related to placement within file and the original group numbers assigned) as well as the "modified value" of halo_number which is related to the order by particle number by default. I think this would cover all the bases. Everyone should be happy then... so long as the offset value is accessible, the user will know the position within the halo catalog, the original grp number, and the modified grp number ordered by particle count. From these, any code that needs to interface with the database should be able to do so with ease, I think.

TobiBu · 2019-04-15T20:21:50Z

Sounds good to me.

BTW, I checked the source code of the AHF merger tree code, and at the point where it writes the _mtree files there is a comment that says: // this is the case where we use the haloid as found in *_particles
Whatever that means for this discussion... I am happy as long as I can reconstruct the original AHF_id as written in the _halos file.

mtremmel · 2020-06-30T14:47:59Z

Hey @apontzen just pinging this. Did we ever come to a consensus on this?

apontzen · 2020-07-09T20:36:02Z

I think that changing the meaning of finder_id is dangerous, and having re-read the whole thread am still a bit unclear/confused what the main issues at play are. To confirm, is there actually a bug now, where MPI-AHF catalogues return the wrong halos because of confusion between AHF's notion of an ID versus just indexing the halos in sequence within its output file?

mtremmel · 2020-07-10T13:00:11Z

Here is what I remember about this issue after looking back on my code changes and the current master branch. The problem comes up with AHF + pynbody specifically. Pynbody's AHF catalog object assumes that when you give it a halo number to extract, that number corresponds to the placement within the file (1 = first halo in the catalog, 2 = second, etc). This is the "finder_id" in tangos, which is the "raw" value taken directly from the stat file's first column (assumed to be the halo ID column) plus some offset value which is defined specifically for each halo catalog class. The assumption is that the ID in AHF is exactly the list index of the halo in the catalog (the first is ID = 0, second is 1, etc). The finder_id is then made to be this value + 1 (1, 2, 3, 4, etc). However, the new version of AHF has nonsense numbers for the IDs in no way connected with their position within the file.

The changes I've made here ensure that a particular halo stat file class can determine if it wants the finder_id (the number passed to the input handler to extract halos) to represent the raw ID number from the stat file or the position of the halo in the catalog. As it is now, for AHF at least the default is for finder_id to represent the location of the halo in the catalog in order to smoothly interface with pynbody. However, I've tried to set it up such that each stat file class can determine individually how it wants finder_id to work, using the _id_from_file_pos class attribute. If true, finder_id = 0, 1, 2, 3. If false, finder_id = raw data from stat file.

In contrast, the halo_number is always meant to represent the ranking of that halo in terms of particle number.

apontzen · 2020-07-10T13:07:07Z

Yes, I think I get it. I am wondering whether a simpler and more transparent fix would be for pynbody to use the AHF-provided IDs instead of assuming that they correspond to particular positions in a file? This prevents the mind-boggling situation of having three separate IDs stored by tangos, with the user trying to figure out what on earth they all mean.

I haven't seen this happen - is it an AHF version issue or something to do with running AHF with MPI?

Thoughts welcome...

mtremmel · 2020-07-10T13:15:18Z

I believe that more recent versions of AHF have randomly generated IDs regardless of whether MPI is used, but I could be wrong. I think either way this is difficult. My feeling is that if we want TANGOS to be easily adaptable to any analysis tool, it should be able to handle whatever that tool needs. There is nothing more fundamental about using halo ID numbers (which are themselves meaningless) versus the order in the catalog, in my opinion (at least, not for AHF). Therefore, it makes sense that individual analysis tools may select one or the other and TANGOS must be ok with that.

mtremmel · 2020-07-10T13:18:48Z

In this regard, I think the best option is to try to have everything, so that people can create their own input handlers for their specific software. So, that means having halo_number (ranking of the halo by npart), finder_id (raw ID taken directly from the catalog) and position_id (position of the halo in the catalog, i.e. 1, 2, 3, ...). With these three, one can then decide to translate to their analysis tool of choice. I think the "offset" values in the stat file class are more problematic. Rather, I'd prefer all stat file readers to generate the same ID values for halos and then require individual input_handlers to translate that as they need

mtremmel · 2020-07-16T19:47:52Z

Ok let me put out there my idea for this @apontzen and if you think it is worthwhile I can try to code it up.

TANGOS should aim to be as modular as possible to allow different analysis scripts and halo finder combos to be used with the system. If a stat file reader is used, it should always provide the same info regardless of the handler class (i.e. AHF + pynbody should give you the same info as Subfind+Gadget). In order to work with all halo catalogs while also providing as much meaningful info to the user as possible, I think this info should be (in addition to NDM, etc)

some raw output ID read from the stat file ("finder_id")
the indexed position within the list of halos provided by the catalog (0, 1, 2, ...; the "catalog_index")
the rank order of the halo (1,2,3 for most Npart, second most, etc; the "halo_number")

If the handler function load_object() then passes all three of these, each handler can individually determine how it uses them to load in a halo. I think the easiest approach would then be to implement a handler function that takes in these three numbers and translates them into a halo catalog identifier usable by that specific handler+catalog combo. This can be called from load_object and it would be an easy function to override as part of a new handler class.

# Example of the new function for the AHF+pynbody handler
def _halo_identifier(halo_number, finder_id, catalog_index):
    # pynbody's AHF catalog counts halos from 1, while catalog_index starts at 0
    return catalog_index + 1
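A slightly fuller sketch of how such a hook might be wired into load_object, assuming the catalog behaves like a mapping from identifier to halo. The classes `HandlerSketch` and `AHFPynbodyHandlerSketch` are hypothetical, and the default of returning finder_id in the base class is an assumption made only for this illustration.

```python
class HandlerSketch:
    """Hypothetical base handler; only the identifier hook is modelled."""

    def __init__(self, catalog):
        self._catalog = catalog  # any mapping from identifier to halo

    def _halo_identifier(self, halo_number, finder_id, catalog_index):
        # Default assumption for this sketch: the catalog is keyed by raw ID.
        return finder_id

    def load_object(self, halo_number, finder_id, catalog_index):
        # Translate the three stored numbers into whatever the catalog expects.
        key = self._halo_identifier(halo_number, finder_id, catalog_index)
        return self._catalog[key]


class AHFPynbodyHandlerSketch(HandlerSketch):
    def _halo_identifier(self, halo_number, finder_id, catalog_index):
        # 0-based catalog_index, 1-based catalog keys
        return catalog_index + 1


catalog = {1: "first halo in file", 2: "second halo in file"}
handler = AHFPynbodyHandlerSketch(catalog)
loaded = handler.load_object(halo_number=2, finder_id=9182736455, catalog_index=0)
```

A new handler+catalog combination would then only need to override the one small method.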

mtremmel · 2020-07-16T20:40:34Z

yup! But if you are using AHF it is catalog_index+1

apontzen · 2020-07-16T20:44:09Z

OK. What a giant mess (not entirely of our own making). I guess you sold me on 2 then. But hold off just a while because I am looking at addressing #117. Should fix that one first, otherwise it might become a nightmare to merge.

mtremmel · 2020-07-16T20:45:45Z

wait did I sell you on 2 or 3?? I was trying for 2! :-)

apontzen · 2020-07-16T20:46:42Z

2, yes! Gah

…the catalog of each halo. Include this in enumerate and iter_rows functions. Ensure that n_total is used to determine whether a halo satisfies min_halo_particles to ensure that ordering remains robust.

…catalog_id_offset to these values. To remain consistent, children calculations will use only the raw halo ID number with no offset applied

… ID numbers. Fix the ID grp stat file reader to have catalog indices equal to the grp number so that these values can be used to access halo catalog correctly via pynbody.

…nclude both finder_id and catalog_index as inputs

…utput_testing input handlers, and Halo class functions

Fix tracker and phantom halo initialization to include a catalog_index input
Fix bh halo linking to now use catalog_index
Update ahf trees test to use catalog_index

… statfile
- Update ahf_trees to use the raw finder_id values from tree data files
- Perform correct halo object initialization in manager.py
- If unset, make halo_number equal to the catalog index rather than finder_id
- Update crosslink to utilize catalog_index rather than finder_id
- Update property importer to expect correct number of outputs from stat file reader
- Update object cache to use catalog_index rather than finder_id

make sure that the adapted enumerate returns the expected default result structure

…r_id with an additional argument in resolve
Include additional argument when dealing with timestep cache for Proxy Objects

…l particle count (Gas + Stars + DM) rather than just dark matter

mtremmel · 2021-03-01T19:38:22Z

Ok so I finally got around to making these sweeping changes. The tests all seem to have passed ok! I've outlined in words the important bits of what has changed.

The basic overall idea is that halos have three different identifiers:

finder_id = the raw ID given to the halo by the halo finder

catalog_index = the position of the halo within the catalog (this may have separate definitions based on the input handler being used. For pynbody AHF for example, this starts at 1 rather than 0)

halo_number = by default the rank of the halo in terms of the number of particles (1 = most number of particles, etc)

In general, the catalog_index is what is used to interact with the halo catalog readers in pynbody and yt. As an option, at the time of adding the simulation, the user can turn off the renumber parameter. In this case, the halo_number becomes equal to the catalog_index. For AmigaIDL halos, catalog_index = finder_id. In this case, it would be best to run with add_simulation renumber=False so that the halo_number is just made to equal catalog_index.

If one is not loading from a stat file but just with pynbody, then finder_id = catalog_index. I figured this was the best option that did not rely too much on what kind of catalog pynbody was reading. However, I am happy to fix this to use the actual properties stored in the halo catalog object (e.g. "#ID" for the finder_id in the case of AHF) but my concern is that this will just require a bunch of if statements for different catalogs that we probably want to avoid? I'm open to suggestions here.

This has also required changes to the nature of timestep caches. TimestepObjectCache now creates two different maps for catalog and finder numbers. By default, resolving this cache uses the catalog_index but there are instances where the finder_id is required (e.g. when loading in subhalo information from the stat file). An extra variable can be given in the resolve() function if one wants to specify "catalog" vs "finder".
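As a rough illustration of the two-map idea, here is a sketch of a cache that can be resolved by either number. The class `TimestepCacheSketch`, the dict-based halo records, and the `mode` keyword are hypothetical; the real TimestepObjectCache and its resolve() signature may differ.

```python
class TimestepCacheSketch:
    """Hypothetical two-map cache; not the real TimestepObjectCache."""

    def __init__(self, halos):
        # halos: iterable of dicts with 'catalog_index' and 'finder_id' keys
        halos = list(halos)
        self._by_catalog = {h["catalog_index"]: h for h in halos}
        self._by_finder = {h["finder_id"]: h for h in halos}

    def resolve(self, identifier, mode="catalog"):
        """Return the halo for identifier, or None if it is not cached."""
        if mode == "catalog":
            return self._by_catalog.get(identifier)
        if mode == "finder":
            return self._by_finder.get(identifier)
        raise ValueError(f"unknown mode: {mode!r}")


cache = TimestepCacheSketch([
    {"catalog_index": 1, "finder_id": 9182736455, "halo_number": 2},
    {"catalog_index": 2, "finder_id": 1029384756, "halo_number": 1},
])
by_catalog = cache.resolve(2)
by_finder = cache.resolve(9182736455, mode="finder")
```

Keeping the default on the catalog map matches the description above, with the finder map used only where raw IDs are what the stat file provides.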

One somewhat unrelated thing I've changed is that the min_halo_particles variable is used to compare the TOTAL number of particles rather than just the dark matter. However, I realize this may be more controversial than I think so I wanted to gauge what is best here. It just seems more natural from a user perspective that such a variable takes into account all halo particles. Further, I believe that AHF, etc uses the total number of particles when determining rank. In any case, I think I'd prefer a separate variable min_dm_particles or something for that.

Anyway, @apontzen, let me know what you think!

apontzen

Amazing to have got through all this! I have some comments though, sorry... would be great just to tidy things up a tiny bit before we merge. ~~Also, you should probably merge master into this branch before I merge it back into master (to resolve the conflicts).~~ -- sorry you already did that

apontzen · 2021-03-01T20:46:17Z

tangos/core/halo.py

@@ -120,12 +122,23 @@ def load(self, mode=None):
        handler this can be None or 'partial' (in a normal session) and, when running inside an MPI session,
        'server' or 'server-partial'. See https://pynbody.github.io/tangos/mpi.html.
        """
-        if self.finder_id is None:
+        halo_number = self.halo_number
+        if not hasattr(self, "finder_id"):


Why has this line changed?

This line is basically still there, but my thought was to first check to see if the object even has finder_id and catalog_index columns to begin with for backwards compatibility. Maybe this is all overkill?

I've now simplified this. The previous checks were deprecated anyway I think

apontzen · 2021-03-01T20:46:33Z

tangos/core/halo.py

        self.timestep = timestep
        self.halo_number = int(halo_number)
        self.finder_id = int(finder_id)
+        self.catalog_index = int(catalog_index)


Could we add comments to these definitions to explain what the three different things are?

Also, is catalog_index the best name for it? Would finder_offset or something like that be more descriptive?

Yup I can put in comments. The rationale for catalog_index is that this is the value passed to the input handler's halo catalog reader to extract a halo. In other words, this is the index pointing to this halo within that catalog object. Note that it is not at all a constant offset necessarily from the finder_id. In fact, there is no reason why it is related to the finder_id at all. Rather, h[catalog_index] will return this halo from halo catalog h

I understand that yes... but somehow finder_offset sounds right to me, it's the offset in the finder output of this halo from the start? Whereas catalog_index draws attention to finder vs catalog - is that the right thing to contrast? ~~As another alternative, could finder_id become catalog_id?~~ (bad idea, breaks backward compatibility)

haha I guess we might just have a difference in opinion here? to me your description of "offset" just sounds like an "index" (0 = start, 1 = start +1, etc). I'm ok going with whatever you think is the most descriptive. it shouldn't be very hard to change. Maybe I've been thinking about it too long and find finder_offset confusing.

apontzen · 2021-03-01T20:47:32Z

tangos/core/halo.py

+            if self.catalog_index is None:
+                catalog_index = finder_id
+            else:
+                catalog_index = self.catalog_index


This logic is a bit baffling, is it possible to add a comment explaining what is going on?

This is really just an attempt at backwards compatibility with a database that does not have these columns already. Maybe this is all too much and it would be better to simply provide a tool for the user to update their database accordingly?

I think simple is better here. I believe sqlalchemy takes care of adding a column of NULLs (will appear as Nones) when opening an old database. Certainly, I don't think the attribute will ever be missing
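A minimal sketch of the simplified fallback, assuming (as suggested above) that rows from an older database present catalog_index as None rather than lacking the attribute. The helper name `effective_catalog_index` is hypothetical.

```python
def effective_catalog_index(finder_id, catalog_index):
    """Fall back to finder_id when catalog_index was never stored (None)."""
    return finder_id if catalog_index is None else catalog_index


# Row from an older database: the new column reads back as None.
old_row_index = effective_catalog_index(finder_id=5, catalog_index=None)

# Row written after the change: the stored catalog_index wins, even if it is 0.
new_row_index = effective_catalog_index(finder_id=9182736455, catalog_index=0)
```

The explicit `is None` test matters here because a catalog_index of 0 is a valid value and must not trigger the fallback.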

apontzen · 2021-03-01T20:49:55Z