
demos for cosmosCatalog don't apply any cuts #812

Closed
esheldon opened this issue Oct 13, 2016 · 19 comments

@esheldon
Contributor

The demos that use cosmosCatalog do not apply any cuts on the size or flux of the galaxies.

I think it would be useful to include cuts in the demos to instruct end users.

In the deeper catalog, there is clear evidence of measurement issues that result in oddities such as zero size and zero flux. These would not be present in an actual sample at the intended depth.

Also, the size distribution "rolls off" at the small end rather than continuing the power law trend one would expect in a real sample. Essentially this is an additional selection effect that the end user is probably not expecting.

Mike pointed out that these cuts can be applied in the constructor.

@rmandelb
Member

> The demos that use cosmosCatalog do not apply any cuts on the size or flux of the galaxies.

Actually, the default "marginal" settings do impose cuts on a number of aspects of the fits to avoid obviously bad parametric fits, including cuts on size / Sersic n / flux.

My suggestion is that (a) we can update the comments to mention this, (b) include a single cut on size or flux to demonstrate that one can do additional cuts in the constructor, with comments explaining that one can apply multiple cuts in this way. Sound OK?

> Also, the size distribution "rolls off" at the small end rather than continuing the power law trend one would expect in a real sample.

Sorry - I don't entirely understand this comment. If you have a flux limit, and you have a size-flux relation that's a power law with some scatter, then as a rule you DO expect the size distribution to flatten out / roll off... or at least I do...?
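This expectation is easy to demonstrate with a toy model (pure numpy; the power-law slopes, scatter, and flux limit below are invented for illustration and have nothing to do with the actual COSMOS numbers):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Parent population: power-law flux distribution, p(F) ~ F^-2.5 for F >= 1.
flux = (1.0 - rng.random(n)) ** (-1.0 / 1.5)

# Power-law size-flux relation with lognormal scatter: hlr ~ F^0.3.
hlr = flux ** 0.3 * np.exp(rng.normal(0.0, 0.3, n))

# Impose a flux limit, as any flux-limited parent sample does.
kept = flux > 2.0

# The small-size end is depleted in the flux-limited sample:
frac_small_all = np.mean(hlr < 1.0)
frac_small_kept = np.mean(hlr[kept] < 1.0)
print(frac_small_all, frac_small_kept)
```

Because size correlates with flux, the flux cut preferentially removes small objects, so the small end of the size distribution flattens out even though the parent distribution is a pure power law.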

@rmjarvis
Member

I think on the first point, it would be helpful to showcase the min_hlr, min_flux, etc. options to COSMOSCatalog to point out that these are available (and often desired). The cgc.yaml example does use some of them, but that may be a little too hidden for most users.
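For anyone reading along, a rough sketch of what the relevant config fragment might look like (the key names follow the COSMOSCatalog keywords discussed in this thread, and the cut values are invented for illustration; check the cgc.yaml example and the COSMOSCatalog docstring for the actual syntax and units):

```yaml
input:
    cosmos_catalog:
        # 'marginal' is the default exclusion level; listed here for visibility.
        exclusion_level: marginal
        # Hypothetical extra cuts (arcsec for hlr, catalog units for flux):
        min_hlr: 0.1
        min_flux: 10.

gal:
    type: COSMOSGalaxy
    gal_type: parametric
```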

@esheldon
Contributor Author

On 10/14/16, Rachel Mandelbaum [email protected] wrote:

> Actually, the default "marginal" settings do impose cuts on a number of aspects of the fits to avoid obviously bad parametric fits, including cuts on size / Sersic n / flux.
>
> My suggestion is that (a) we can update the comments to mention this, (b) include a single cut on size or flux to demonstrate that one can do additional cuts in the constructor, with comments explaining that one can apply multiple cuts in this way. Sound OK?

I ran a sim with the defaults and got galaxies that were smaller and fainter
than I would have expected.

> Sorry - I don't entirely understand this comment. If you have a flux limit, and you have a size-flux relation that's a power law with some scatter, then as a rule you DO expect the size distribution to flatten out / roll off... or at least I do...?

Yes, of course, in a measurement. But a simulation should not build that in before noise is added and before a measurement is done; that should come when measuring the simulation. The simulation should represent a true flux and size distribution, before measurement.

If the goal is intentionally not to produce such a catalog, but rather to represent everything beyond some very basic cuts, then that is fine, but I would be grateful to see a demo detailing how to produce the more vetted sim.

@rmandelb
Member

> The simulation should represent a true flux and size distribution, before measurement.

I agree it would be nice if we could make a sim without any cuts on the input/parent population at all, but we cannot do that unless we deliberately choose to extrapolate beyond what we observe. That's a legitimate thing to do if you want to (see issue #808), but it's not what this catalog was ever claimed to be. It is a COSMOS-based catalog with a flux limit imposed on it. Once you impose a flux limit, there's an implicit size limit as well.

Unless I am misunderstanding what you are asking about? I feel we are talking past each other here...

Re: the defaults - one thing to check is the flux normalization compared to any observations you are trying to mimic. That's been a sticking point in some of my sims.

Another issue that has come up when making HSC sims is that the COSMOS catalog has had deblending done. But in a ground-based observation there will be ambiguous blends, so a subset of the galaxies will appear brighter and larger as a result.

@rmjarvis
Member

rmjarvis commented Oct 16, 2016

Erin said:

> I ran a sim with the defaults and got galaxies that were smaller and fainter than I would have expected.

Rachel said:

> Unless I am misunderstanding what you are asking about? I feel we are talking past each other here...

I think some numbers will help get you two on the same page. I think this is the kind of thing that Erin is having trouble with:

import numpy
import galsim

cat = galsim.COSMOSCatalog()  # Use default exclusion level == 'marginal' and no other cuts
print('Total objects passing marginal exclusion =', cat.getNTot())
# Total objects passing marginal exclusion = 87798

# Get the raw data about the parametric fits
param = cat.param_cat
flux_radius = param['flux_radius']  # The SExtractor radius
mag_auto = param['mag_auto']        # The SExtractor magnitude
hlr = param['hlr']                  # The fitted half light radius from the Sersic or B+D fit
flux = param['flux']                # The fitted flux from the Sersic or B+D fit

good_sersic = (param['fit_status'][:,4] > 0) & (param['fit_status'][:,4] < 5)
good_bulge = (param['fit_status'][:,0] > 0) & (param['fit_status'][:,0] < 5)

print('Total objects with good Sersic fit =', numpy.sum(good_sersic))
print('Total objects with good bulge fit =', numpy.sum(good_bulge))
# Total objects with good Sersic fit = 87798
# Total objects with good bulge fit = 86958

print('Range of flux_radius is', min(flux_radius), max(flux_radius))
print('Range of mag_auto is', min(mag_auto), max(mag_auto))
print('Range of hlr for Sersic fits is', min(hlr[:,0][good_sersic]), max(hlr[:,0][good_sersic]))
print('Range of flux for Sersic fits is', min(flux[:,0][good_sersic]), max(flux[:,0][good_sersic]))
print('Range of hlr for bulge+disk fits is', min(hlr[:,1][good_bulge]), max(hlr[:,1][good_bulge]))
print('Range of flux for bulge+disk fits is', min(flux[:,1][good_bulge]), max(flux[:,1][good_bulge]))
# Range of flux_radius is 2.7419731617 59.2641944885
# Range of mag_auto is 16.1177158356 25.1999912262
# Range of hlr for Sersic fits is 0.0 2241245.8131
# Range of flux for Sersic fits is 0.0 256551120611.0
# Range of hlr for bulge+disk fits is 0.0 22.7148077404
# Range of flux for bulge+disk fits is 0.0 4958.57789615

I think both the zeros and the extremely large values of flux and hlr are the issue. It seems like they should have been excluded by the 'marginal' (i.e. most conservative) exclusion level.

@rmandelb
Member

The number 87798 is the total number in the catalog; not all of those objects pass the marginal cuts. I think you mean to be using cat.getNObjects(), not cat.getNTot()? getNObjects() gives the ~82k objects that actually pass the cuts. And then when you calculate ranges of fluxes/radii/etc., you should only be using the ones that pass the cuts, not all of them (as you are currently doing).

It's definitely the case that when you make Sersic fits to galaxies down to 25.2, you will get some nonsense/crap. The marginal cuts are meant to exclude those, but you are not using the marginal cuts in what's above. But what I'm not sure about is this: are you and Erin concerned about some of the weird fits you're seeing? If so, you should make sure to use the "marginal" cuts, which will help (but may not remove all bad fits, just a few classes of them). I originally thought Erin's concern was about weird fit parameters, but his later comments seemed to indicate that he does not want any cuts applied before making sims. And if that's the case, you have to deal with some bad fits. I don't see any way around this when doing parametric fits to noisy data.

@rmandelb
Member

To be clear, you should be using cat.orig_index to get a list of indices into the original catalog for those that pass the marginal cuts, and only look at flux / radius / etc. for those objects.

@rmjarvis
Member

Sorry, you're right. I wasn't applying that properly. When I do, they all list as having good fits for both Sersic and B+D.

But it looks like there's still a bunch of garbage objects left even when I do apply that. (Unless I'm still doing it wrong.)

import numpy
import galsim

cat = galsim.COSMOSCatalog()  # Use default exclusion level == 'marginal' and no other cuts
print('Total objects passing marginal exclusion =', cat.getNObjects())
# Total objects passing marginal exclusion = 81499

# Get the raw data about the parametric fits
orig_index = cat.orig_index         # Only the objects that the catalog will use
param = cat.param_cat[orig_index]
flux_radius = param['flux_radius']  # The SExtractor radius
mag_auto = param['mag_auto']        # The SExtractor magnitude
hlr = param['hlr']                  # The fitted half light radius from the Sersic or B+D fit
flux = param['flux']                # The fitted flux from the Sersic or B+D fit
use_bulge = param['use_bulgefit']   # Whether the B+D fit is preferred

good_sersic = (param['fit_status'][:,4] > 0) & (param['fit_status'][:,4] < 5)
good_bulge = (param['fit_status'][:,0] > 0) & (param['fit_status'][:,0] < 5)

print('Total objects with good Sersic fit =', numpy.sum(good_sersic))
print('Total objects with good bulge fit =', numpy.sum(good_bulge))
# Total objects with good Sersic fit = 81499
# Total objects with good bulge fit = 81499

print('Number with Sersic hlr == 0 is', numpy.sum(hlr[:,0][good_sersic]==0.))
print('Number with Sersic flux == 0 is', numpy.sum(flux[:,0][good_sersic]==0.))
print('Number with bulge+disk hlr == 0 is', numpy.sum(hlr[:,1][good_bulge]==0.))
print('Number with bulge+disk flux == 0 is', numpy.sum(flux[:,1][good_bulge]==0.))
print('Number with Sersic hlr < 0.01 is', numpy.sum(hlr[:,0][good_sersic]<0.01))
print('Number with Sersic flux < 0.01 is', numpy.sum(flux[:,0][good_sersic]<0.01))
print('Number with bulge+disk hlr < 0.01 is', numpy.sum(hlr[:,1][good_bulge]<0.01))
print('Number with bulge+disk flux < 0.01 is', numpy.sum(flux[:,1][good_bulge]<0.01))
# Number with Sersic hlr == 0 is 1
# Number with Sersic flux == 0 is 1
# Number with bulge+disk hlr == 0 is 58074
# Number with bulge+disk flux == 0 is 58074
# Number with Sersic hlr < 0.01 is 50
# Number with Sersic flux < 0.01 is 11
# Number with bulge+disk hlr < 0.01 is 58237
# Number with bulge+disk flux < 0.01 is 58074

I'm probably still doing something wrong with the B+D, since most objects claim to have a good bulge fit, but over half of them have zero hlr and flux.

@rmjarvis
Member

BTW, when I run the same script with sample='23.5', there are no objects with Sersic hlr or flux == 0. Although there are some <0.01 (23 and 1 respectively).

@rmandelb
Member

There's a flag in there (use_bulgefit) that says whether the b+d fit is preferred over Sersic from a statistical perspective. If the flag says that b+d is not preferred, then the flux and hlr values were not precomputed and stored, so you should get 0. (The b+d fit parameters are still available in the files, so you can use them if you want; I just didn't have it store hlr and flux in these arrays for these cases.) So when you are doing this test, you should check that use_bulgefit is 1 before you use the precomputed hlr and flux values for b+d fits.

@rmjarvis
Member

rmjarvis commented Oct 16, 2016

OK, that makes sense. I should only look at the hlr or flux for the version that COSMOSCatalog will actually use to build the parametric object.

Here's some more analysis, where I do that:

# use_bulgefit determines whether or not to use the bulge plus disk fit rather than the Sersic fit:
bpd = param['use_bulgefit'] > 0
sersic = param['use_bulgefit'] == 0
flux_s = flux[:,0][sersic]
hlr_s = hlr[:,0][sersic]
flux_b = flux[:,1][bpd]
flux_d = flux[:,2][bpd]
hlr_b = hlr[:,1][bpd]
hlr_d = hlr[:,2][bpd]
print('Range of flux for Sersic fits: ', min(flux_s), max(flux_s))
print('Range of hlr for Sersic fits: ', min(hlr_s), max(hlr_s))
print('Range of bulge flux for B+D fits: ', min(flux_b), max(flux_b))
print('Range of bulge hlr for B+D fits: ', min(hlr_b), max(hlr_b))
print('Range of disk flux for B+D fits: ', min(flux_d), max(flux_d))
print('Range of disk hlr for B+D fits: ', min(hlr_d), max(hlr_d))
flux_bpd = flux_b + flux_d
print('Range of total flux for B+D fits: ', min(flux_bpd), max(flux_bpd))
hlr_bpd = (flux_b * hlr_b + flux_d * hlr_d) / flux_bpd  # Not really right, but maybe informative.
print('Range of weighted mean hlr for B+D fits: ', min(hlr_bpd), max(hlr_bpd))
# Range of flux for Sersic fits:  0.995139179246 6515.65858411
# Range of hlr for Sersic fits:  0.00250998009563 2.89718829899
# Range of bulge flux for B+D fits:  0.161742284199 4409.48231374
# Range of bulge hlr for B+D fits:  0.000330867582465 2.36956959043
# Range of disk flux for B+D fits:  0.168974105316 2525.0374368
# Range of disk hlr for B+D fits:  0.00383311030692 4.74676887459
# Range of total flux for B+D fits:  1.02917610909 4974.45895504
# Range of weighted mean hlr for B+D fits:  0.00403643465316 2.24292184918

So it looks like the fluxes should all be reasonable. (1 to several thousand seems like the right dynamic range.) But there are still some objects with half-light radii of only a few milliarcseconds. That's 1/10 of the HST pixel scale, so those still seem like indications of bad fits to me.

@esheldon Does this seem to match up with the weirdnesses you were seeing? I thought you said you were getting some objects with crazy small hlr, like 1.e-8 or so. But maybe that was from the ngmix fits of the drawn objects. That is plausible: if all the flux is in a single pixel, the fit may dive relentlessly toward very small hlr values.
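A quick sanity check on the single-pixel intuition, for a round Gaussian profile (the hlr values below are hypothetical; 0.03 arcsec is the HST pixel scale mentioned above):

```python
import math

def central_pixel_fraction(hlr, pixel_scale=0.03):
    # Fraction of a round Gaussian's flux landing in the central pixel,
    # with the object centered on the pixel.
    # For a Gaussian, hlr = sqrt(2 ln 2) * sigma.
    sigma = hlr / math.sqrt(2.0 * math.log(2.0))
    a = (pixel_scale / 2.0) / (math.sqrt(2.0) * sigma)
    return math.erf(a) ** 2

# A few-milliarcsecond hlr puts essentially all the flux in one pixel,
# while hlr ~ 0.1 arcsec spreads it over many pixels.
print(central_pixel_fraction(0.003))   # ~1.0
print(central_pixel_fraction(0.1))     # ~0.02
```

So for a fit that has collapsed to milliarcsecond scales, the pixelized image is effectively a point source, and the likelihood is nearly flat in hlr below the pixel scale.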

@esheldon
Contributor Author

On 10/15/16, Mike Jarvis [email protected] wrote:

> @esheldon Does this seem to match up with the weirdnesses you were seeing? I thought you said you were getting some objects with crazy small hlr, like 1.e-8 or so. But maybe that was from the ngmix fits of the drawn objects.

Yes, looking back, it was really the smallness of the objects, not the low flux, that was the concern. I saw a class of objects that had essentially zero size.

Rachel, I was actually arguing the opposite, sorry I wasn't clear.

I would like a more curated catalog, one that shows fewer obvious signs of selection effects and bad fits. (Maybe this isn't what most people want.)

For example, I mentioned that the default cuts result in a "rolloff" of size, but in real data there is no such rolloff before observations and measurements are made. With the rolloff there, there will effectively be two selections applied: the initial one and then the one the user makes.

I can give my use case: I would like to be able to generate a catalog that looks like what we see in DES at a very low s/n cut, say 2 or 3 in the coadd data. In the end we select s/n > 5, but we can only test the effect of this cut if we start with a lower s/n catalog, including the small objects. But if the simulation catalog has signs of incompleteness at that level then I can't use it for this purpose. Also, if the fits are bad it will confuse the tests.

So I'd like to generate a version of this catalog that to first order doesn't show any selection effects, and then decide if it is deep enough to do the tests I want to do. If it isn't deep enough, I would attempt to extrapolate.

The fine details aren't incredibly important, but it is important, for example, that the size and flux distributions continue to be power-law-like, because dealing with selections on power laws is challenging, while selecting on rolled-off distributions is relatively easy.

@rmandelb
Member

> But there are still some objects with half-light radii of only a few milliarcseconds.

Yes, that's true. If you use numpy.percentile on hlr_s you will see that the 1st percentile of the Sersic half-light radii is 0.0369 arcsec, so a bit under 1% of the objects have a half-light radius around 1 pixel or below. My take on this is that there is going to be a rare subset of objects with very small half-light radii, and in general the uncertainty on their hlr values is high (i.e., for a fixed detection S/N, the uncertainty on hlr rises as the hlr shrinks because there isn't much information on it). So I don't believe that their hlr are actually in the milliarcsecond range for real, and I think that if we had quoted uncertainties on the hlr, those uncertainties would reflect the fact that we shouldn't take these tiny numbers at face value. And when we draw those galaxies as they would appear in a ground-based survey, that uncertainty shouldn't matter: whether we got the "right" hlr of 0.04 arcsec or whatever it really is, or these wrong values of 0.003 arcsec, either way it should look basically like a point source.

So basically, yes, I don't believe those fit values in detail, but I don't think it matters too much either.
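The numpy.percentile check described above can be sketched on mock data (a lognormal stand-in for the real hlr_s array; the distribution parameters here are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
# Mock half-light radii in arcsec (lognormal stand-in for hlr_s).
hlr_s = rng.lognormal(mean=np.log(0.3), sigma=0.8, size=100_000)

# 1st percentile of the size distribution, and the fraction of objects
# smaller than one HST pixel (0.03 arcsec).
p1 = np.percentile(hlr_s, 1.0)
frac_sub_pixel = np.mean(hlr_s < 0.03)
print(p1, frac_sub_pixel)
```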

BTW part of my thinking here is that people who care in great detail what the galaxies will look like won't want Sersic fits, and will use the images themselves. There are many other things wrong with using Sersic profiles, besides these tiny hlr values for 1% of the objects.

@rmandelb
Member

Erin:

> I would like a more curated catalog, one that shows fewer obvious signs of selection effects and bad fits. (Maybe this isn't what most people want.)

Thanks very much for this explanation. I think I understand what you are getting at much better now, and in fact it's something that comes up for HSC as well. Our data are deep enough that this flux limit in the catalog means some galaxies are not represented, so we have to either (a) extrapolate the catalog to represent all galaxies past our flux limit or (b) do science with a brighter subset of galaxies that are well-represented by this catalog (and that can be obtained by simulating with this catalog, then applying cuts). Ultimately, I think the answer is that COSMOS is not a sufficient parent catalog to use on its own for simulations for DES or HSC; we need a deeper HST dataset for that purpose, or we have to do some kind of clever tricks on COSMOS.

So, I do see a use case for what you are suggesting. I don't have a great answer for how to do it with the existing catalog. If you have suggestions for how to achieve this, I would be happy to help make it easier to do what you want using GalSim (e.g., a setting in COSMOSCatalog() constructor or whatever). I should note that in #808, the idea is to define the joint distribution of Sersic parameters using the part of the catalog that isn't too close to the flux limit, so as to avoid the realm where selection effects are important. Then we use those joint distributions for the extrapolation, so I think that the extrapolated catalog might be just what you are requesting?

Francois has not implemented this in GalSim yet, but he has working code for it that he has been playing with, so in principle this could end up in GalSim before long. If you post on there requesting it, that might help. :) Unless you think it's not what you want, after all.

@esheldon
Contributor Author

On 10/16/16, Rachel Mandelbaum [email protected] wrote:

> I should note that in #808, the idea is to define the joint distribution of Sersic parameters using the part of the catalog that isn't too close to the flux limit, so as to avoid the realm where selection effects are important. Then we use those joint distributions for the extrapolation, so I think that the extrapolated catalog might be just what you are requesting?

Yes, that might be what I need.

Looking at the joint distribution of the fitted Sersic parameters, it is not clear to me how exactly to extrapolate. I don't know how much I can trust the fits, but even taken at face value, there are multiple populations, and I'm not sure where each of these populations is complete. Surface brightness may be an important thing to look at.


@rmandelb
Member

Yes, surface brightness is definitely relevant here. Let me send you some more info offline.

@rmjarvis
Member

Is this conversation resolved to everyone's satisfaction? Just wondering if the issue can be closed or if there is some action item still to be done here?

@esheldon esheldon reopened this Nov 12, 2016
@esheldon
Contributor Author

I think sims using this catalog will be central going forward. It would be good to converge on a set of reliable cuts.

@esheldon
Contributor Author

I think we can see how things go with #808

@rmjarvis rmjarvis modified the milestone: Will not fix Feb 16, 2017