
November 18


Design Discussion

What should the recommender return?

  • We may want to label one thing or lots of things
    • These are different scenarios
  • We should be able to request two kinds of lists of (ID, score) tuples
    • In one scenario, the ID represents a single object we want to label
    • In the other scenario, the ID represents a number of objects in some kind of box
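
A minimal sketch of how these two return shapes could be typed, assuming hypothetical names (Recommendation, BatchRecommendation) that are not part of any existing API:

from typing import List, NamedTuple, Tuple


class Recommendation(NamedTuple):
    """A single object to label, with the recommender's score."""
    instance_id: int
    score: float


class BatchRecommendation(NamedTuple):
    """A group of objects (e.g. everything in some box), with one score."""
    instance_ids: Tuple[int, ...]
    score: float


# Scenario 1: label objects one at a time.
single: List[Recommendation] = [Recommendation(42, 0.91)]

# Scenario 2: label a whole box of objects in one go.
batched: List[BatchRecommendation] = [BatchRecommendation((42, 43, 44), 0.87)]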

How Chris Does Things

  • FITS table of query set, FITS table of label set
  • Former has features, latter has features and labels
  • These tables contain information irrelevant to the predictions, but that information still needs to be carried along anyway
  • Information that's returned should be able to go back into the database
  • The classifier Chris used was a high-level-language wrapper around a C-based Bayesian classifier, which read a file and wrote its predictions back into the query FITS table
  • The C code loaded the whole set of feature columns into memory, but the query table is read row by row (see the sketch after this list)
    • The query table is potentially unlimited in size, but the model table is assumed to be small (~200,000 rows) and can therefore be buffered in memory
    • Each query row is processed independently of the others
  • Chris feels very strongly that the user shouldn't have to modify their data files themselves
  • Config files are a very good idea (or config files with command line override)
  • Chris wants to change the feature set and the training set very often
    • Feature set = choosing different combinations of features
    • Training set = different FITS file, different subset of instances, different set of instances/labels
  • Example of the above:
    • Obtaining labels in the interesting parameter range is really expensive
    • We don't have lots of representative labels available at the frontier, but that's where we're doing our research
    • ∴ There's no perfect training set
    • ∴ We often build predictive models from an abstract model (a simulated set of feature/label combinations calculated from a set of equations which you hope describe the universe but are empirically untested)
    • These are empirically motivated but not empirically observed
    • e.g. a model for how a galaxy would evolve through time assuming different metallicities
    • These abstract models may differ because they are unobserved
    • We may be able to constrain these models using predictors trained on them and tested against the evidence
    • "Template model" = simulated model. "Empirical model" = training set.
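
A minimal sketch of the buffering strategy described above, using astropy; the file names and chunk size are illustrative assumptions, not details of Chris's actual pipeline:

from astropy.io import fits

# Assumed file names, for illustration only.
MODEL_FILE = "model.fits"   # small (~200,000 rows): features + labels
QUERY_FILE = "query.fits"   # potentially unlimited in size: features only

# The model (label) table is small enough to buffer entirely in memory.
with fits.open(MODEL_FILE) as hdul:
    model_table = hdul[1].data

# The query table is streamed in chunks rather than loaded whole;
# memmap=True stops astropy from pulling the full file into memory up front.
with fits.open(QUERY_FILE, memmap=True) as hdul:
    query_table = hdul[1].data
    chunk_size = 10_000
    for start in range(0, len(query_table), chunk_size):
        chunk = query_table[start:start + chunk_size]
        # Each query row is independent, so prediction can happen
        # chunk-by-chunk and results written straight back out.
        print(f"processing rows {start} to {start + len(chunk) - 1}")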

Uncertainties

  • Not only do we need features, we also need measurement errors
  • The errors affect the predictions
  • All existing astronomical predictors break if you don't have error columns
  • These errors are usually axis-parallel (and we tend to ignore covariances)
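
A sketch of what "features plus axis-parallel errors" might look like as arrays; the numbers and column meanings are made up for illustration:

import numpy as np

# Five instances, three features (say, two colour indices and a redshift).
features = np.array([
    [0.31, 0.12, 1.8],
    [0.45, 0.09, 2.1],
    [0.28, 0.15, 0.9],
    [0.52, 0.11, 1.3],
    [0.39, 0.08, 2.7],
])

# Axis-parallel errors: one standard deviation per feature per instance,
# i.e. a diagonal covariance matrix; off-diagonal covariances are ignored.
errors = np.array([
    [0.02, 0.01, 0.3],
    [0.03, 0.01, 0.4],
    [0.02, 0.02, 0.2],
    [0.04, 0.01, 0.3],
    [0.03, 0.01, 0.5],
])

assert features.shape == errors.shape  # one error column per feature column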

Unlabelled Instances

  • A lot of people just want a best estimate, with no subtleties
    • Chris is not one of these people
  • A training sample might contain objects where we know what 90% of them are, but for the other 10% something about them made it impossible to assign a label.
    • This is important to know! These 10% may not be evenly distributed over feature space, so there is not only a quantifiable ambiguity between the known classes but also an ambiguity about whether an object belongs to any known class at all.
    • When an empirical training set explicitly contains (in the correct proportions) the objects that should have been labelled but couldn't be, we would like the predictions to include an estimate for an unknown class (see the sketch after this list).
  • Cheng: Can I distinguish between an object where I can't tell whether it's a star, galaxy, or quasar, and an object that is none of those?
    • Chris: Practically, it's more likely to be one of the existing objects.
    • There are two types of incompleteness: either you haven't pointed the telescope at them (a random reason, with no bias), or we have observed them as well as we can but still can't distinguish them.
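
One way to get predictions that include an estimate for an unknown class is to keep the unlabellable objects in the training set under an explicit extra label, as sketched below with scikit-learn on toy data; this is an illustrative assumption, not a decision made here:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training set: two features per object.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9],
              [0.5, 0.5], [0.45, 0.55]])

# Most objects get a real class; the rest keep an explicit "unknown" label
# (in their correct proportions) instead of being dropped from the sample.
y = np.array(["star", "star", "galaxy", "galaxy", "unknown", "unknown"])

clf = LogisticRegression().fit(X, y)

# Predictions now include a probability for the "unknown" class.
print(dict(zip(clf.classes_, clf.predict_proba([[0.5, 0.5]])[0])))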

Command-line Interface

  • Default should generate a plot (or data for such a plot): the active learning curve
  • Should have subcommands for each component so that they can be run
    • Manually
    • As part of a shell script
    • As a subprocess in another language
  • Should have a subcommand for just the non-labeller part of the program (because getting a label is slow)
    • These subcommands could be super complicated — e.g. do we want to output the predictor itself? The predictions? Something else? This is unclear, so let's leave it for now.
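
A minimal sketch of how the subcommand structure could be wired up with argparse; the subcommand names other than recommend, and the choice of arguments, are placeholders:

import argparse


def main():
    parser = argparse.ArgumentParser(prog="acton")
    subparsers = parser.add_subparsers(dest="command")

    # Each component gets its own subcommand so it can be run manually,
    # from a shell script, or as a subprocess in another language.
    recommend = subparsers.add_parser("recommend")
    recommend.add_argument("--data", help="FITS table of instances")
    recommend.add_argument("--recommendation-count", type=int, default=10)

    # Placeholder subcommand for the non-labeller part of the pipeline
    # (getting a label is slow, so prediction should be runnable on its own).
    predict = subparsers.add_parser("predict")
    predict.add_argument("--data", help="FITS table of instances")

    args = parser.parse_args()
    print(args)


if __name__ == "__main__":
    main()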

Generate active learning curve

$ acton --data vla.fits \
	-f ci_1_2 -f ci_2_3 -f z \
	-l has_property \
	--epochs 100 \
	--diversity 0.7 \
	--recommendation-count 10 \
	--labeller-accuracy 0.85 \
	--predictor logistic-regression \
	--recommender qbc \
	-o active_learning_accuracy.h5

Generate recommendations

$ acton recommend --data vla.fits \
	-f ci_1_2 -f ci_2_3 -f z \
	-l has_property \
	--diversity 0.7 \
	--recommendation-count 300 \
	--predictor logistic-regression \
	--recommender qbc \
	-o recommendations.fits