-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AHF halo ID fix #88
AHF halo ID fix #88
Conversation
… the ID rather than the raw file ID to be consistent in meaning between MPI and non-MPI AHF versions. Also, remove spurious extra line in filename method.
This is problematic because it copy/pastes from HaloStatFile, so duplicating functionality Would it be possible to reimplement within the HaloStatFile base class, with the AHF child class just setting a flag to activate the new behaviour? |
makes sense. I'll do that |
…thin the file rather than the value in the ID column of the file. Now each handler defines a flad when this needs to occur
Beautiful! Could we just add a test? |
Ping on adding a test so that we can merge this in |
Looking good... I have a few questions (1) Is there any way to disable this behaviour? |
…m ones from MPI runs. Add a separate function to test this conversion specifically.
Ok I have fixed the test to make it a bit more separate and a bit more specific to the issues surrounding MPI outputs from AHF. As far as turning it off... this is where I'm uncertain what the best course of action is. Currently (and really, from the beginning) it was assumed that each halo catalog could have their own format in terms of finder_id. For AHF the finder_id was originally always offset by 1 from the "raw" group ID (rather than starting at 0 it started at 1) and now it also specifies that the finder_id should be equal to the position within the catalog (still offset by 1). This makes the most sense when combined with how pynbody reads AHF (and it is consistent with the default AHF grp IDs when it isn't run with MPI). However, if we want to make TANGOS as adaptable as possible to any analysis tool, we could consider making all of these defaults able to be overridden by the user. I would think the best way to do that would be either
or
thoughts? I'm not 100% sure how to do this at the moment but it seems doable. I think I lean toward option 2. The argument against this (I think) would be that translating the default tangos ordering to whatever tools the user wants as their output handlers should occur at the level of the output handlers themselves. If your tools read in AHF halos by their raw ID number, then your handler needs to translate the TANGOS finder_id to what is needed. I'm really not sure what the right answer is here. |
as far as whether this will affect current work on merger trees, I looked at the test files and they seem to be approximating AHF files run without MPI. If this is the case, then the finder_id and halo_numbers generated by tangos should be the same as they were before. That is because the non-mpi AHF already makes grp numbers equal to their ordering by n_particles (and TANGOS already assumes that AHF ID numbers are shifted by 1 relative to their raw values) |
I think this should work for me. During my implementation @apontzen pointed me already to this pull request and I integrated the new AHF IDs in my testing branch. However, I can pull this particular branch and check if something breaks. |
Although, I am not sure how an AHF mtree file would look like if the halo catalogue was run with mpi. cheers |
So the changes made here would mean that if you were to initialize a tangos database with output from MPI, the |
So to be clear, there are basically three numbers
My guess is that the AHF mtrees will use a true Do I understand correctly by the way that What a mess... I really don't want to start storing an |
I think all we really need is that the Stat file input handler class information is saved at the database level when the information is initiated. So, as a user, I should be able to easily back out what the |
I think that is the best option. In case of the AHF input handler it would be totally sufficient to either save the original |
out of curiosity tough, does your code use the actual AHF_id from the file or just the order within the file? |
It needs the actual |
ah I see. So, as it is right now, for non-MPI runs, the AHF_id is directly related to the order within the file... it might be worth checking that AHF_id is still the important thing when it is run with MPI, but it makes sense that it would be I think. Maybe the best solution to all of this is, in fact, to add one more property: the original halo_number (call it, say, |
Sounds good to me. BTW, I checked the source code of the AHF merger tree code and it says somewhere when writing the |
Hey @apontzen just pinging this. Did we ever come to a consensus on this? |
I think that changing the meaning of |
Here is what I remember about this issue after looking back on my code changes and the current master branch. The problem comes up with AHF + pynbody specifically. Pynbody's AHF catalog object assumes that when you give it a halo number to extract, that number corresponds to the placement within the file (1 = first halo in the catalog, 2 = second, etc). This is the "finder_id" in tangos, which is the "raw" value taken directly from the stat file's first column (assumed to be the halo ID column) plus some offset value which is defined specifically for each halo catalog class. The assumption is that the ID in AHF is exactly the list index of the halo in the catalog (the first is ID = 0, second is 1, etc). The finder_id is then made to be this value + 1 (1, 2, 3, 4, etc). However, the new version of AHF has nonsense numbers for the IDs in no way connected with their position within the file. The changes I've made here, ensures that a particular halo stat file class can determine if it wants the finder_id (the number passed to the input handler to extract halos) to represent the raw ID number from the stat file or the position of the halo in the catalog. As it is now, for AHF at least the default is for finder_id to represent the location of the halo in the catalog in order to smoothly interface with pynbody. However, the way I've tried to set it up has it such that each stat file class can determine individually how it wants finder_id to work with the _id_from_file_pos class attribute. If true, finder_id = 0, 1, 2, 3. If false, finder_id = raw data from stat file. In contrast, the halo_number is always meant to represent the ranking of that halo in terms of particle number. |
Yes, I think I get it. I am wondering whether a simpler and more transparent fix would be for pynbody to use the AHF-provided IDs instead of assuming that they correspond to particular positions in a file? This prevents the mind-boggling situation of having three separate IDs stored by tangos, with the user trying to figure out what on earth they all mean. I haven't seen this happen - is it an AHF version issue or something to do with running AHF with MPI? Thoughts welcome... |
I believe more recently that AHF has randomly generated IDs regardless of whether mpi is used but I could be wrong. I think either way this is difficult. My feeling is that if we want TANGOS to be easily adaptable to any analysis tool, it should be able to handle whatever that tool needs. There is nothing more fundamental about using halo ID numbers (which are themselves meaningless) versus the order in the catalog, in my opinion (at least, not for AHF). Therefore, it makes sense that individual analysis tools may select one or the other and TANGOS must be ok with that. |
In this regard, I think the best option is to try to have everything, so that people can create their own input handlers for their specific software. So, that means having halo_number (ranking of the halo by npart), finder_id (raw ID taken directly from the catalog) and position_id (position of the halo in the catalog, i.e. 1, 2, 3, ...). With these three, one can then decide to translate to their analysis tool of choice. I think the "offset" values in the stat file class are more problematic. Rather, I'd prefer all stat file readers to generate the same ID values for halos and then require individual input_handlers to translate that as they need |
Ok let me put out there my idea for this @apontzen and if you think it is worthwhile I can try to code it up. TANGOS should aim to be as modular as possible to allow different analysis scripts and halo finder combos to be used with the system. If a stat file reader is used, it should always provide the same info regardless of the handler class (i.e. AHF + pynbody should give you the same info as Subfind+Gadget). In order to work with all halo catalogs while also providing as much meaningful info to the user as possible, I think this info should be (in addition to NDM, etc)
If the handler function load_object() then passes all three of these, each handler can individually determine how it uses them to load in a halo. I think then implementing a handler function that can take in these three numbers and tarnslate that into a halo catalog identifier usable by that specific handler+catalog combo would be easiest. This can be callef from load_object and it would be an easy function to override as a part of a new handler class.
|
yup! But if you are using AHF it is catalog_index+1 |
OK. What a giant mess (not entirely of our own making). I guess you sold me on 2 then. But hold off just a while because I am looking at addressing #117. Should fix that one first, otherwise it might become a nightmare to merge. |
wait did I sell you on 2 or 3?? I was trying for 2! :-) |
2, yes! Gah |
…the catalog of each halo. Include this in enumerate and iter_rows functions. Ensure that n_total is used to determine whether a halo satisfies min_halo_particles to ensure that ordering remains robust.
…catalog_id_offset to these values. To remain consistent, children calculations will use only the raw halo ID number with no offset applied
… ID numbers. Fix the ID grp stat file reader to have catalog indices equal to the grp number so that these values can be used to access halo catalog correctly via pynbody.
…nclude both finder_id and catalog_index as inputs
…utput_testing input handlers, and Halo class functions
Fix tracker and phantom halo initialization to include a catalog_index input Fix bh halo linking to now use catalog_index update ahf trees test to use catalog_index
… statfile -Update ahf_trees to use the raw finder_id values from tree data files -Perform correct halo object initialization in manager.py -If unset, make halo_number equal to the catalog index rather than finder_id -Update crosslink to utilize catalog_index rather than finder_id -Update property importer to expect correct number if outputs from stat file reader -Update object cache to use catalog_index rather than finder_id
make sure that the adapted enumerate returns the expected default result structure
…r_id with an additional argument in resolve Include additional argument when dealing with timestep cache for Proxy Objects
…l particle count (Gas + Stars + DM) rather than just dark matter
Ok so I finally got around to making these sweeping changes. The tests all seem to have passed ok! I've outlined in words the important bits of what has changed. The basic overall idea is that halos have have three different identifiers: finder_id = the raw ID given to the halo by the halo finder catalog_index = the position of the halo within the catalog (this may have separate definitions based on the input handler being used. For pynbody AHF for example, this starts at 1 rather than 0) halo_number = by default the rank of the halo in terms of the number of particles (1 = most number of particles, etc) in general, the catalog_index is what is used to interact with the halo catalog readers in pynbody and yt. As an option, at the time of adding the simulation, the user can turn off the renumber parameter. In this case, the halo_number becomes equal to the catalog_index. For AmigaIDL halos, catalog_index = finder_id. In this case, it would be best to run with add_simulation If one is not loading from a stat file but just with pynbody, then finder_id = catalog_index. I figured this was the best option that did not rely too much on what kind of catalog pynbody was reading. However, I am happy to fix this to use the actual properties stored in the halo catalog object (e.g. "#ID" for the finder_id in the case of AHF) but my concern is that this will just require a bunch of if statements for different catalogs that we probably want to avoid? I'm open to suggestions here. This has also required changes to the nature of timestep caches. One somewhat unrelated thing I've changed is that the Anyway, @apontzen, let me know what you think! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing to have got through all this! I have some comments though, sorry... would be great just to tidy things up a tiny bit before we merge. Also, you should probably merge -- sorry you already did thatmaster
into this branch before I merge it back into master
(to resolve the conflicts).
@@ -120,12 +122,23 @@ def load(self, mode=None): | |||
handler this can be None or 'partial' (in a normal session) and, when running inside an MPI session, | |||
'server' or 'server-partial'. See https://pynbody.github.io/tangos/mpi.html. | |||
""" | |||
if self.finder_id is None: | |||
halo_number = self.halo_number | |||
if not hasattr(self, "finder_id"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why has this line changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This like is basically still there, but my thought was to first check to see if the object even has finder_id and catalog_index columns to begin with for backwards compatibility. Maybe this is all overkill?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've now simplified this. The previous checks were deprecated anyway I think
tangos/core/halo.py
Outdated
self.timestep = timestep | ||
self.halo_number = int(halo_number) | ||
self.finder_id = int(finder_id) | ||
self.catalog_index = int(catalog_index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add comments to these definitions to explain what the three different things are?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is catalog_index
the best name for it? Would finder_offset
or something like that be more descriptive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup I can put in comments. The rationale for catalog_index is that this is the value passed to the input handler's halo catalog reader to extract a halo. In other words, this is the index pointing to this halo within that catalog object. Note that it is not at all a constant offset necessarily from the finder_id. In fact, there is no reason why it is related to the finder_id at all. Rather, h[catalog_index]
will return this halo from halo catalog h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that yes... but somehow finder_offset
sounds right to me, it's the offset in the finder output of this halo from the start? Whereas catalog_index
draws attention to finder
vs catalog
- is that the right thing to contrast? As another alternative, could (bad idea, breaks backward compatibility)finder_id
become catalog_id
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha I guess we might just have a difference in opinion here? to me your description of "offset" just sounds like an "index" (0 = start, 1 = start +1, etc). I'm ok going with whatever you think is the most descriptive. it shouldn't be very hard to change. Maybe I've been thinking about it too long and find finder_offset
confusing.
tangos/core/halo.py
Outdated
if self.catalog_index is None: | ||
catalog_index = finder_id | ||
else: | ||
catalog_index = self.catalog_index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is a bit baffling, is it possible to add a comment explaining what is going on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really just an attempt at backwards compatibility with a database that does not have these columns already. Maybe this is all too much and it would be better simply provide a tool for the user to update their database accordingly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think simple is better here. I believe sqlalchemy takes care of adding a column of NULLs (will appear as Nones) when opening an old database. Certainly, I don't think the attribute will ever be missing
@@ -11,7 +11,8 @@ | |||
|
|||
class HaloStatFile(object): | |||
"""Manages and reads a halo stat file of unspecified format.""" | |||
_id_offset = 0 | |||
_catalog_index_offset = 0 | |||
# _id_from_file_pos = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this commented out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should probably just be deleted. The old version had a flag whether the finder_id was derived from the position within the catalog file, but now finder_id and catalog_index are both saved separately
@@ -56,22 +57,29 @@ def all_columns(self): | |||
def iter_rows_raw(self, *args): | |||
""" | |||
Yield the halo ID and requested columns from each line of the stat file, without any emulation. | |||
|
|||
If _id_from_file_pos is True, the ID is not taken from the stat file, but rather from the order in which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarify that _id_from_file_pos
is a class attribute, not something passed to the function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to update this description because this is inaccurate
|
||
def _calculate_children_if_required(self): | ||
if not hasattr(self, "_children_map"): | ||
self._calculate_children() | ||
|
||
def _child_halo_entry(self, this_id_raw): | ||
self._calculate_children_if_required() | ||
this_id = this_id_raw + self._id_offset | ||
children = self._children_map.get(this_id, []) | ||
#this_id = this_id_raw + self._id_offset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove code if it's not needed
@@ -208,7 +214,7 @@ def filename(cls, timestep_filename): | |||
return "CannotComputeRockstarFilename" | |||
|
|||
class AmigaIDLStatFile(HaloStatFile): | |||
_id_offset = 0 | |||
# _id_offset = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
:param args: strings for the column names | ||
:return: id, arg1, arg2, arg3 where ID is the halo ID and argN is the value of the Nth named column | ||
""" | ||
with open(self.filename) as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this copy-paste code appear here? Shouldn't it be inherited from the base class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code expects from the stat file enumerator output of the form catalog_index, finder_id, arg1, arg2, arg3, ...
but for this particular stat file catalog_index (the number that the halo catalog associates with a given halo) is always equal to the finder_id. That's why there needs to be a different function here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the docstring incorrect in that case? (I think this is also true for the base class method above)
As for avoiding copy-paste, would something like the following work? (NB I didn't actually try it at all! Just speculating)
def iter_rows_raw(self, *args):
for row in super(self).iter_rows_raw(*args):
row[0] = row[1] # sequential catalog index not right in this case; overwrite to match finder id
yield row
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh yeah I think that's a good solution. I can code that up and see. I'll double check the docstrings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've implemented this and tested (no "self" in the super call, but otherwise exactly as you have it) and it seems to work. I've pushed this over just now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also double checked and the iter_rows_raw
docstrings were accurate but iter_rows
one was not. That has been updated.
tangos/tools/crosslink.py
Outdated
@@ -56,7 +56,7 @@ def _generate_timestep_pairs(self): | |||
raise NotImplementedError("No implementation found for generating the timestep pairs") | |||
|
|||
def get_halo_entry(self, ts, halo_number): | |||
h = ts.halos.filter_by(finder_id=halo_number).first() | |||
h = ts.halos.filter_by(halo_number=halo_number).first() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This certainly looks more sensible, but is there any chance it is breaking something? Looks like there was previously a bug, which perhaps was cancelled out by a bug elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe, but when I checked (and I just checked again) this function is never called anywhere as far as I can tell
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest deleting the function in that case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I've just committed a version with this function deleted. it passes all the nosetests
tangos/util/timestep_object_cache.py
Outdated
self._initialise_cache() | ||
|
||
def resolve(self, finder_id, typetag): | ||
self._ensure_cache() | ||
def resolve(self, identifier, typetag, type='catalog'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type
shouldn't be used as a variable name (it is a reserved keyword).
Anyway... wonder whether it would be neater here to have two separate methods, maybe retrieve_from_finder_id
and retrieve_from_catalog_index
or something like that? The method name resolve
is cryptic anyway in this context (my fault, I know) and now with this weird type
parameter it is really hard to figure out what is going on.
…x, finder_id, halo_number)
to provide backwards compatibility but are now deprecated remove unneeded commented-out code re-write docstrings for iter_rows_raw functions to actually reflect the nature of the outputs.
… that identifies halos by either finder_id or catalog_index separately. Include this change in both the proxy object classes and property importer where appropriate.
make the AmigaIDLStatFile iter_rows_raw function more streamlined
hey just a ping @apontzen to have a look and let me know what you think about the variable names (finder_id, catalog_index and whether there are better, more intuitive options). As for the larger issue of backwards compatibility, I think this still runs the risk of breaking in older versions... we might need some script that can update old databases to fit this format. I think that's possible using sqlalchemy? |
and change all other variables and function names to be consistent with this change where appropriate
Thank you Michael! Heroic multi-year effort! |
change the AHF input handler to generate halo IDs based on the position within the file, making all versions of AHF consistent with one another. This is needed because AHF with MPI generates random ID numbers for all halos, which is currently incompatible with, e.g., pynbody's AHF catalog reader. This was done by changing the iter_rows_raw method. Further, deleted unnecessary (and unreachable) line in the filename method for AHFStatFile.