
November 18


Design Discussion

What should the recommender return?

  • We may want to label one thing or lots of things
    • These are different scenarios
  • We should be able to request two kinds of lists of (ID, score) tuples
    • In one scenario, the ID represents a single object we want to label
    • In the other scenario, the ID represents a number of objects in some kind of box
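
A minimal sketch of how these two return shapes could be typed, assuming hypothetical names (Recommendation, BatchRecommendation) that are not part of any existing API:

from typing import List, NamedTuple, Tuple


class Recommendation(NamedTuple):
    """A single object to label, with the recommender's score."""
    instance_id: int
    score: float


class BatchRecommendation(NamedTuple):
    """A group of objects (e.g. everything in some box), with one score."""
    instance_ids: Tuple[int, ...]
    score: float


# Scenario 1: label objects one at a time.
single: List[Recommendation] = [Recommendation(42, 0.91)]

# Scenario 2: label a whole box of objects in one go.
batched: List[BatchRecommendation] = [BatchRecommendation((42, 43, 44), 0.87)]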

How Chris Does Things

  • FITS table of query set, FITS table of label set
  • Former has features, latter has features and labels
  • These tables contain information irrelevant to the predictions, but that information still needs to be carried along anyway
  • Information that's returned should be able to go back into the database
  • The classifier Chris used was a high-level-language wrapper around a C-based Bayesian classifier, which read a file and wrote its predictions back into the query FITS table
  • The C code loaded the whole set of feature columns into memory, but the query table is read row by row (see the sketch after this list)
    • The query table is potentially unlimited in size, but the model table is assumed to be small (~200,000 rows) and can therefore be buffered in memory
    • Each query row is processed independently of the others
  • Chris feels very strongly that the user shouldn't have to modify their data files themselves
  • Config files are a very good idea (or config files with command line override)
  • Chris wants to change the feature set and the training set very often
    • Feature set = choosing different combinations of features
    • Training set = different FITS file, different subset of instances, different set of instances/labels
  • Example of the above:
    • Obtaining labels in the interesting parameter range is really expensive
    • We don't have lots of representative labels available at the frontier, but that's where we're doing our research
    • ∴ There's no perfect training set
    • ∴ We often build predictive models from an abstract model (a simulated set of feature/label combinations calculated from a set of equations which you hope describe the universe but are empirically untested)
    • These are empirically motivated but not empirically observed
    • e.g. a model for how a galaxy would evolve through time assuming different metallicities
    • These abstract models may differ because they are unobserved
    • We may be able to constrain these models using predictors trained on them and tested against the evidence
    • "Template model" = simulated model. "Empirical model" = training set.
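
A minimal sketch of the buffering strategy described above, using astropy; the file names and chunk size are illustrative assumptions, not details of Chris's actual pipeline:

from astropy.io import fits

# Assumed file names, for illustration only.
MODEL_FILE = "model.fits"   # small (~200,000 rows): features + labels
QUERY_FILE = "query.fits"   # potentially unlimited in size: features only

# The model (label) table is small enough to buffer entirely in memory.
with fits.open(MODEL_FILE) as hdul:
    model_table = hdul[1].data

# The query table is streamed in chunks rather than loaded whole;
# memmap=True stops astropy from pulling the full file into memory up front.
with fits.open(QUERY_FILE, memmap=True) as hdul:
    query_table = hdul[1].data
    chunk_size = 10_000
    for start in range(0, len(query_table), chunk_size):
        chunk = query_table[start:start + chunk_size]
        # Each query row is independent, so prediction can happen
        # chunk-by-chunk and results written straight back out.
        print(f"processing rows {start} to {start + len(chunk) - 1}")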

Uncertainties

  • Not only do we need features, we also need measurement errors
  • The errors affect the predictions
  • All existing astronomical predictors break if you don't have error columns
  • These errors are usually axis-parallel (and we tend to ignore covariances)
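
A sketch of what "features plus axis-parallel errors" might look like as arrays; the numbers and column meanings are made up for illustration:

import numpy as np

# Five instances, three features (say, two colour indices and a redshift).
features = np.array([
    [0.31, 0.12, 1.8],
    [0.45, 0.09, 2.1],
    [0.28, 0.15, 0.9],
    [0.52, 0.11, 1.3],
    [0.39, 0.08, 2.7],
])

# Axis-parallel errors: one standard deviation per feature per instance,
# i.e. a diagonal covariance matrix; off-diagonal covariances are ignored.
errors = np.array([
    [0.02, 0.01, 0.3],
    [0.03, 0.01, 0.4],
    [0.02, 0.02, 0.2],
    [0.04, 0.01, 0.3],
    [0.03, 0.01, 0.5],
])

assert features.shape == errors.shape  # one error column per feature column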

Unlabelled Instances

  • A lot of people just want a best estimate, with no subtleties
    • Chris is not one of these people
  • A training sample might contain objects where we know what 90% of them are, but for the other 10% something about them made it impossible to assign a label.
    • This is important to know! These 10% may not be evenly distributed over feature space, so there is not only a quantifiable ambiguity between the known classes but also an ambiguity about whether an object belongs to any known class at all.
    • When an empirical training set explicitly contains (in the correct proportions) the objects that should have been labelled but couldn't be, we would like the predictions to include an estimate for an unknown class (see the sketch after this list).
  • Cheng: Can I distinguish between an object where I can't tell whether it's a star, galaxy, or quasar, and an object that is none of those?
    • Chris: Practically, it's more likely to be one of the existing objects.
    • There are two types of incompleteness: either you haven't pointed the telescope at them (a random reason, with no bias), or we have observed them as well as we can but still can't distinguish them.
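
One way to get predictions that include an estimate for an unknown class is to keep the unlabellable objects in the training set under an explicit extra label, as sketched below with scikit-learn on toy data; this is an illustrative assumption, not a decision made here:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training set: two features per object.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9],
              [0.5, 0.5], [0.45, 0.55]])

# Most objects get a real class; the rest keep an explicit "unknown" label
# (in their correct proportions) instead of being dropped from the sample.
y = np.array(["star", "star", "galaxy", "galaxy", "unknown", "unknown"])

clf = LogisticRegression().fit(X, y)

# Predictions now include a probability for the "unknown" class.
print(dict(zip(clf.classes_, clf.predict_proba([[0.5, 0.5]])[0])))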

Command-line Interface

  • Default should generate a plot (or data for such a plot): the active learning curve
  • Should have subcommands for each component so that they can be run
    • Manually
    • As part of a shell script
    • As a subprocess in another language
  • Should have a subcommand for just the non-labeller part of the program (because getting a label is slow)
    • These subcommands could be super complicated — e.g. do we want to output the predictor itself? The predictions? Something else? This is unclear, so let's leave it for now.
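
A minimal sketch of how the subcommand structure could be wired up with argparse; the subcommand names other than recommend, and the choice of arguments, are placeholders:

import argparse


def main():
    parser = argparse.ArgumentParser(prog="acton")
    subparsers = parser.add_subparsers(dest="command")

    # Each component gets its own subcommand so it can be run manually,
    # from a shell script, or as a subprocess in another language.
    recommend = subparsers.add_parser("recommend")
    recommend.add_argument("--data", help="FITS table of instances")
    recommend.add_argument("--recommendation-count", type=int, default=10)

    # Placeholder subcommand for the non-labeller part of the pipeline
    # (getting a label is slow, so prediction should be runnable on its own).
    predict = subparsers.add_parser("predict")
    predict.add_argument("--data", help="FITS table of instances")

    args = parser.parse_args()
    print(args)


if __name__ == "__main__":
    main()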

Generate active learning curve

$ acton --data vla.fits \
	-f ci_1_2 -f ci_2_3 -f z \
	-l has_property \
	--epochs 100 \
	--diversity 0.7 \
	--recommendation-count 10 \
	--labeller-accuracy 0.85 \
	--predictor logistic-regression \
	--recommender qbc \
	-o active_learning_accuracy.h5

Generate recommendations

$ acton recommend --data vla.fits \
	-f ci_1_2 -f ci_2_3 -f z \
	-l has_property \
	--diversity 0.7 \
	--recommendation-count 300 \
	--predictor logistic-regression \
	--recommender qbc \
	-o recommendations.fits