Bilkent University
Department of Computer Engineering


Active Learning by Statistical Leverage Scores


Cem Orhan
MS Student
Computer Engineering Department
Bilkent University

Label scarcity is a serious problem in many machine-learning tasks. The active learning framework addresses this challenge by carefully selecting which examples to label. In the pool-based active learning framework for classification, the active learner is provided with a large set of unlabeled examples together with a few labeled instances. Through effective queries, the active learner aims to obtain a highly accurate classifier with fewer label requests than passive learning requires. Many querying strategies have been developed for the pool-based setting over the past two decades, in which examples are selected based on their informativeness or representativeness. We present a novel querying method based on statistical leverage scores computed on the kernel matrix of the examples. The statistical leverage score of a row of a matrix is the squared norm of the corresponding row of the matrix whose columns are the top k eigenvectors, as defined in [2], and it can be used as a measure of the influence of that row on the matrix. Leverage scores have been used to detect highly influential points in regression diagnostics [1] and have recently been shown to be useful in randomized low-rank matrix approximation algorithms [2,3]. In our querying strategy, ALEVS, labels are requested iteratively for examples chosen based on their leverage scores. Our experiments on several binary classification benchmark datasets demonstrate that ALEVS is an effective querying strategy.
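
As a rough illustration of the leverage-score definition above, the following Python sketch computes rank-k leverage scores of a kernel matrix with NumPy. The RBF kernel, the values of gamma and k, the synthetic data, and the rule of querying the single highest-scoring example are all assumptions made for this example; the abstract does not specify these choices, and the helper names rbf_kernel and leverage_scores are hypothetical.

```python
import numpy as np


def rbf_kernel(X, gamma=1.0):
    """RBF kernel matrix of the rows of X (kernel choice is an assumption)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def leverage_scores(K, k):
    """Squared row norms of the top-k eigenvectors of a symmetric matrix K."""
    # eigh returns eigenvalues in ascending order, so the top-k
    # eigenvectors are the last k columns.
    _, eigvecs = np.linalg.eigh(K)
    U_k = eigvecs[:, -k:]
    return np.sum(U_k ** 2, axis=1)


# Synthetic pool of 50 unlabeled examples with 3 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

K = rbf_kernel(X, gamma=0.5)
scores = leverage_scores(K, k=5)

# Assumed selection rule: query the example with the largest leverage score.
query_index = int(np.argmax(scores))
```

Because the top-k eigenvectors are orthonormal, each score lies between 0 and 1 and the scores sum to k, which gives a quick sanity check on the computation.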

[1] S. Chatterjee and A. S. Hadi. Influential observations, high leverage points, and outliers in linear regression. Statist. Sci., 1(3):379–393, 1986.

[2] A. Gittens and M. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. In Proceedings of the 30th International Conference on Machine Learning, 2013.

[3] M. W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.


DATE: 28 March, 2016, Monday @ 16:50