oasis.OASISSampler
- class oasis.OASISSampler(alpha, predictions, scores, oracle, proba=False, epsilon=0.001, opt_class=None, prior_strength=None, decaying_prior=True, strata=None, record_inst_hist=False, max_iter=None, identifiers=None, debug=False, **kwargs)
Optimal Asymptotic Sequential Importance Sampling (OASIS) for estimation of the weighted F-measure.
Estimates the quantity:
TP / (alpha * (TP + FP) + (1 - alpha) * (TP + FN))
on a finite pool by sampling items according to an adaptive instrumental distribution that minimises asymptotic variance. See reference [Marchant2017] for details.
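As a quick illustration of the quantity above, the following sketch computes it directly from confusion-matrix counts. The function name `weighted_f_measure` and the example counts are hypothetical and are not part of the oasis package.

```python
def weighted_f_measure(tp, fp, fn, alpha):
    """Weighted F-measure from true positive, false positive and false negative counts."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    denominator = alpha * (tp + fp) + (1 - alpha) * (tp + fn)
    if denominator == 0:
        return float("nan")  # the measure is undefined when the denominator vanishes
    return tp / denominator


# alpha = 1 reduces to precision, alpha = 0 to recall,
# and alpha = 0.5 gives the balanced F-measure.
precision = weighted_f_measure(tp=8, fp=2, fn=4, alpha=1.0)
recall = weighted_f_measure(tp=8, fp=2, fn=4, alpha=0.0)
balanced = weighted_f_measure(tp=8, fp=2, fn=4, alpha=0.5)
```

The sampler estimates this quantity without labelling the whole pool; the function above only shows what is being estimated.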
- Parameters
- alpha : float
Weight for the F-measure. Valid weights are on the interval [0, 1]. alpha == 1 corresponds to precision, alpha == 0 corresponds to recall, and alpha == 0.5 corresponds to the balanced F-measure.
- predictions : array-like, shape=(n_items, n_class)
Predicted labels for the items in the pool. Rows represent items and columns represent different classifiers under evaluation (i.e. more than one classifier may be evaluated in parallel). Valid labels are 0 or 1.
- scores : array-like, shape=(n_items, n_class)
Scores which quantify the confidence in the classifiers’ predictions. Rows represent items and columns represent different classifiers under evaluation. High scores indicate a high confidence that the true label is 1 (and vice versa for label 0). It is recommended that the scores be scaled to the interval [0, 1]. If the scores lie outside [0, 1], they will be automatically re-scaled by applying the logistic function.
- oracle : function
Function that returns ground truth labels for items in the pool. The function should take an item identifier as input (i.e. its corresponding row index) and return the ground truth label. Valid labels are 0 or 1.
- proba : array-like, dtype=bool, shape=(n_class,), optional, default None
Indicates whether the scores are probabilistic, i.e. on the interval [0, 1] for each classifier under evaluation. If proba is False for a classifier, then the corresponding scores will be re-scaled by applying the logistic function. If None, proba will default to False for all classifiers.
- epsilon : float, optional, default 1e-3
Epsilon-greedy parameter. Valid values are on the interval [0, 1]. At each iteration, the sampler draws from the “asymptotically optimal” distribution with probability 1 - epsilon and from the passive distribution with probability epsilon. The sampling is close to “optimal” for small epsilon.
- prior_strength : float, optional, default None
Quantifies the strength of the prior. May be interpreted as the number of pseudo-observations.
- max_iter : int, optional, default None
Maximum number of iterations to expect for pre-allocating arrays. Once this limit is reached, sampling can no longer continue. If no value is given, defaults to n_items.
- strata : Strata instance, optional, default None
Describes how to stratify the pool. If not given, the stratification will be done automatically based on the scores given. Additional keyword arguments may be passed to control this automatic stratification (see below).
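The inputs described above can be prepared roughly as follows. This is a minimal sketch in plain Python: the helper names `logistic` and `make_oracle` and the toy data are illustrative assumptions, and the rescaling shown is the standard logistic function rather than the package's own code.

```python
import math


def logistic(score):
    """Map a real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def make_oracle(true_labels):
    """Return a function mapping a row index to its ground truth label (0 or 1)."""
    def oracle(index):
        return true_labels[index]
    return oracle


# Toy pool of four items scored by a single classifier (hypothetical data).
raw_scores = [-2.0, -0.5, 0.5, 3.0]          # e.g. signed distances from a decision boundary
scores = [logistic(s) for s in raw_scores]    # rescaled to lie in (0, 1)
predictions = [int(s >= 0.5) for s in scores] # threshold at 0.5 to obtain 0/1 labels
oracle = make_oracle([0, 1, 1, 1])
```

In practice the oracle would usually wrap an expensive labelling step, such as a human annotator, which is why the sampler tries to query it as few times as possible.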
- Other Parameters
- opt_class : array-like, dtype=bool, shape=(n_class,), optional, default None
Indicates which classifiers to use in calculating the optimal distribution (and prior and strata). If opt_class is False for a classifier, then its predictions and scores will not be used in calculating the optimal distribution; however, estimates of its performance will still be calculated.
- decaying_prior : bool, optional, default True
Whether to make the prior strength decay as 1/n_k, where n_k is the number of items sampled from stratum k at the current iteration. This is a greedy strategy which may yield faster convergence of the estimate.
- record_inst_hist : bool, optional, default False
Whether to store the instrumental distribution used at each iteration. This requires extra memory, but can be useful for assessing convergence.
- identifiers : array-like, optional, default None
Unique identifiers for the items in the pool. Must match the row order of the “predictions” parameter. If no value is given, defaults to [0, 1, …, n_items - 1].
- debug : bool, optional, default False
Whether to print out verbose debugging information.
- **kwargs :
Optional keyword arguments. Includes ‘stratification_method’, ‘stratification_n_strata’, and ‘stratification_n_bins’.
References
- Marchant2017
N. G. Marchant and B. I. P. Rubinstein, In Search of an Entity Resolution OASIS: Optimal Asymptotic Sequential Importance Sampling, arXiv:1703.00617 [cs.LG], Mar 2017.
- Attributes
- estimate_ : numpy.ndarray
F-measure estimates for each iteration.
- queried_oracle_ : numpy.ndarray
Records whether the oracle was queried at each iteration (True) or whether a cached label was used (False).
- cached_labels_ : numpy.ndarray, shape=(n_items,)
Previously sampled ground truth labels for the items in the pool. Items which have not had their labels queried are recorded as NaNs. The order of the items matches the row order for the “predictions” parameter.
- t_ : int
Iteration index.
- inst_pmf_ : numpy.ndarray, shape=(n_strata,) or (n_strata, max_iter)
Epsilon-greedy instrumental pmf used for sampling. If record_inst_hist == False, only the most recent pmf is returned; otherwise the entire history of pmfs is returned in a 2D array.
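The epsilon-greedy instrumental pmf described above can be pictured as a mixture of two distributions over strata. The sketch below is an illustration only: the function name `epsilon_greedy_pmf` and the example pmfs are assumptions, and how the package itself computes the optimal and passive distributions is not shown here.

```python
import random


def epsilon_greedy_pmf(optimal_pmf, passive_pmf, epsilon):
    """Mix two pmfs over strata: weight 1 - epsilon on optimal, epsilon on passive."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    return [(1 - epsilon) * o + epsilon * p
            for o, p in zip(optimal_pmf, passive_pmf)]


# Hypothetical pmfs over four strata.
optimal_pmf = [0.7, 0.2, 0.1, 0.0]
passive_pmf = [0.25, 0.25, 0.25, 0.25]
inst_pmf = epsilon_greedy_pmf(optimal_pmf, passive_pmf, epsilon=0.1)

# Draw one stratum index according to the mixed pmf.
rng = random.Random(0)
stratum = rng.choices(range(len(inst_pmf)), weights=inst_pmf, k=1)[0]
```

Sampling from the mixture is equivalent to choosing the optimal distribution with probability 1 - epsilon and the passive one with probability epsilon. Note that the last stratum keeps a small positive probability even though the optimal pmf assigns it zero.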
Methods
- reset()
Resets the sampler to its initial state.
- sample(n_to_sample, **kwargs)
Sample a sequence of items from the pool.
- sample_distinct(n_to_sample, **kwargs)
Sample a sequence of items from the pool until a minimum number of distinct items are queried.
- __init__(alpha, predictions, scores, oracle, proba=False, epsilon=0.001, opt_class=None, prior_strength=None, decaying_prior=True, strata=None, record_inst_hist=False, max_iter=None, identifiers=None, debug=False, **kwargs)
Initialize self. See help(type(self)) for accurate signature.
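Putting the constructor, methods, and attributes together, a typical session might look like the sketch below. It assumes that the oasis package and NumPy are installed. The wrapper function `estimate_f_measure` and its arguments are hypothetical; only the constructor arguments, the `sample` method, and the F-measure estimates attribute documented on this page are used.

```python
def estimate_f_measure(predictions, scores, true_labels, alpha=0.5, n_to_sample=100):
    """Run an OASIS sampler and return its per-iteration F-measure estimates."""
    # Imported lazily so this sketch can be defined without the packages installed.
    import numpy as np
    import oasis

    def oracle(index):
        # Stand-in for a human annotator: look up the stored ground truth label.
        return true_labels[index]

    sampler = oasis.OASISSampler(alpha, np.asarray(predictions),
                                 np.asarray(scores), oracle)
    sampler.sample(n_to_sample)
    return sampler.estimate_
```

Since max_iter defaults to n_items, n_to_sample should not exceed the pool size unless max_iter is passed explicitly.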