%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
colt92.ps.Z:
Learning with a Slowly Changing Distribution
Peter L. Bartlett
In this paper, we consider the problem of learning a subset of a domain
from randomly chosen examples when the probability distribution of the
examples changes slowly but continually throughout the learning
process. We give upper and lower bounds on the best achievable
probability of misclassification after a given number of examples. If
$d$ is the VC-dimension of the target function class, $t$ is the number
of examples, and $\gamma$ is the amount by which the distribution is
allowed to change (measured by the largest change in the probability of
a subset of the domain), the upper bound decreases as $d/t$ initially,
and settles to $O(d^{2/3}\gamma^{1/3})$ for large $t$. The general
lower bound on the probability of misclassification again decreases as
$d/t$ initially, but settles to $\Omega(d^{1/2}\gamma^{1/2})$ for large
$t$. These bounds give necessary and sufficient conditions on
$\gamma$, the rate of change of the distribution of examples, to ensure
that some learning algorithm can produce an acceptably small
probability of misclassification. We also consider the case of
learning a near-optimal subset of the domain when the examples and
their labels are generated by a joint probability distribution on the
example and label spaces. We give an upper bound on $\gamma$ that
ensures learning is possible from a finite number of examples.
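The drift measure above — the largest change in the probability of any subset of the domain — is the total variation distance between successive distributions. A minimal sketch for finite domains (the function name and example distributions are illustrative, not from the paper):

```python
def total_variation(p, q):
    """Largest change in probability over any subset of a finite domain.

    For distributions p, q given as dicts mapping outcomes to
    probabilities, sup_A |p(A) - q(A)| equals half the L1 distance.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)
```

In the paper's setting, requiring total_variation between consecutive example distributions to be at most $\gamma$ is exactly the constraint on the rate of change.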
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
colt93.ps.Z:
Lower Bounds on the Vapnik-Chervonenkis Dimension of Multi-layer
Threshold Networks
Peter L. Bartlett
We consider the problem of learning in multi-layer feed-forward
networks of linear threshold units. We show that the Vapnik-
Chervonenkis dimension of the class of functions that can be
computed by a two-layer threshold network with real inputs is
at least proportional to the number of weights in the network.
This result also holds for a large class of two-layer networks
with binary inputs, and a large class of three-layer networks
with real inputs. In Valiant's probably approximately correct
learning framework, this implies that the number of examples
necessary for learning in these networks is at least linear in
the number of weights. This bound is within a log factor of
the upper bound.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR93a.ps.Z:
The VC-Dimension and Pseudodimension of Two-Layer Neural
Networks with Discrete Inputs
Peter L. Bartlett and Robert C. Williamson
We give upper bounds on the Vapnik-Chervonenkis dimension and
pseudodimension of two-layer neural networks that use the
standard sigmoid function or radial basis function and have
inputs from {-D,...,D}. In Valiant's probably approximately
correct (pac) learning framework for pattern classification,
and in Haussler's generalization of this framework to nonlinear
regression, the results imply that the number of training
examples necessary for satisfactory learning performance grows
no more rapidly than W log(WD), where W is the number of
weights. The previous best bound for these networks was O(W^4).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR93b.ps.Z (full paper):
colt94b.ps.Z (extended abstract):
Fat-shattering and the learnability of real-valued functions
Peter L. Bartlett, Philip M. Long, and Robert C. Williamson
We consider the problem of learning real-valued functions from
random examples when the function values are corrupted with
noise. With mild conditions on independent observation noise,
we provide characterizations of the learnability of a real-valued
function class in terms of a generalization of the Vapnik-
Chervonenkis dimension, the fat-shattering function, introduced
by Kearns and Schapire. We show that, given some restrictions
on the noise, a function class is learnable in our model if and
only if its fat-shattering function is finite. With different
(also quite mild) restrictions, satisfied for example by gaussian
noise, we show that a function class is learnable from
polynomially many examples if and only if its fat-shattering
function grows polynomially. We prove analogous results in an
agnostic setting, where there is no assumption of an underlying
function class.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR95b.ps.Z (full paper):
colt94a.ps.Z (extended abstract):
Exploiting Random Walks for Learning
Peter L. Bartlett, Paul Fischer, and Klaus-Uwe Hoeffgen
ACM Conference on Computational Learning Theory, 1994.
In this paper we consider an approach to passive learning.
In contrast to the classical PAC model, we do not assume that
the examples are independently drawn according to an underlying
distribution, but that they are generated by a time-driven process.
We define deterministic and probabilistic learning models of this
sort and investigate the relationships between them and with other
models. The fact that successive examples are related can often
be used to gain additional information similar to the information
gained by membership queries. We show that this can be used to
design on-line prediction algorithms. In particular, we present
efficient algorithms for exactly identifying Boolean threshold
functions, 2-term RSE, and 2-term DNF, when the examples are
generated by a random walk on $\{0,1\}^n$.
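The example-generating process can be sketched as a single-bit-flip random walk on the hypercube; this is an assumed reading of "random walk on $\{0,1\}^n$" (the function name is hypothetical), chosen because it makes the relatedness of successive examples explicit:

```python
import random

def hypercube_walk(n, steps, seed=None):
    """Generate a sequence of examples by a random walk on {0,1}^n:
    each step flips one uniformly chosen coordinate of the previous
    example, so consecutive examples differ in exactly one bit."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    walk = [tuple(x)]
    for _ in range(steps):
        i = rng.randrange(n)
        x[i] ^= 1  # flip coordinate i
        walk.append(tuple(x))
    return walk
```

Because consecutive examples differ in a single known coordinate, observing the labels of both reveals how the target depends on that coordinate — the source of the membership-query-like information mentioned above.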
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
colt94c.ps.Z:
Lower bounds on the VC-dimension of smoothly parametrized
function classes
Wee Sun Lee, Peter L. Bartlett, and Robert C. Williamson
We examine the relationship between the VC-dimension and the
number of parameters of a smoothly parametrized function class.
We show that the VC-dimension of such a function class is at
least $k$ if there exists a $k$-dimensional differentiable
manifold in the parameter space such that each member of the
manifold corresponds to a different decision boundary. Using
this result, we are able to obtain lower bounds on the
VC-dimension proportional to the number of parameters for
several function classes including two-layer neural networks
with certain smooth activation functions and radial basis
functions with a gaussian basis. These lower bounds hold even
if the magnitudes of the parameters are restricted to be
arbitrarily small. In Valiant's probably approximately correct
learning framework, this implies that the number of examples
necessary for learning these function classes is at least
linear in the number of parameters.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR94a.ps.Z:
Valid Generalisation from Approximate Interpolation
Martin Anthony, Peter Bartlett, Yuval Ishai, and John Shawe-Taylor
Let H and C be sets of functions from domain X to R. We say that
H validly generalises C from approximate interpolation if and only
if for each eta>0 and epsilon, delta in (0,1) there is an
m_0(eta, epsilon, delta) such that for any function t in C and any
probability distribution P on X, if m >= m_0 then with
P^m-probability at least 1-delta, a sample x = (x_1, x_2, ..., x_m)
in X^m satisfies
forall h in H, if |h(x_i) - t(x_i)| < eta for i=1,...,m, then
P{x: |h(x) -t(x)| >= eta} < epsilon.
We find conditions that are necessary and sufficient
for H to validly generalise C from approximate interpolation, and
we obtain bounds on the sample length m_0(eta, epsilon, delta)
in terms of various parameters describing the expressive power of
H.
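The definition above has a direct operational reading; a minimal sketch (function names are hypothetical, and the true probability P is replaced here by an empirical estimate over test points):

```python
def approximately_interpolates(h, t, sample, eta):
    """The premise of the definition: |h(x_i) - t(x_i)| < eta
    on every point of the sample."""
    return all(abs(h(x) - t(x)) < eta for x in sample)

def error_estimate(h, t, test_points, eta):
    """Empirical estimate of P{x : |h(x) - t(x)| >= eta}."""
    bad = sum(1 for x in test_points if abs(h(x) - t(x)) >= eta)
    return bad / len(test_points)
```

Valid generalisation asks that, with high probability over the sample, every h in H passing the first check also has a small value of the second.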
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR94b.ps.Z:
Efficient Agnostic Learning of Neural Networks with Bounded Fan-in
Wee Sun Lee, Peter Bartlett, and Robert C. Williamson
We show that the class of two-layer neural networks with bounded
fan-in is efficiently learnable in a realistic extension to the
Probably Approximately Correct (PAC) learning model. In this model,
a joint probability distribution is assumed to exist on the
observations and the learner is required to approximate the neural
network which minimizes the expected quadratic error. As special
cases, the model allows learning real-valued functions with bounded
noise, learning probabilistic concepts and learning the best
approximation to a target function that cannot be well approximated
by the neural network. The networks we consider have real-valued
inputs and outputs, an unlimited number of threshold hidden units
with bounded fan-in, and a bound on the sum of the absolute values
of the output weights. The number of computation steps of the
learning algorithm is bounded by a polynomial in $1/\epsilon$,
$1/\delta$, $n$ and $B$ where $\epsilon$ is the desired accuracy,
$\delta$ is the probability that the algorithm fails, $n$ is the
input dimension and $B$ is the bound on both the absolute value of
the target (which may be a random variable) and the sum of the
absolute values of the output weights. In obtaining the result, we
also extend some results on iterative approximation of functions
in the closure of the convex hull of a function class and on the
sample complexity of agnostic learning with the quadratic loss
function.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR94c.ps.Z (full paper):
eurocolt95.ps.Z (extended abstract):
Function Learning from Interpolation
Martin Anthony, Peter Bartlett
In this paper, we study a statistical property of classes of
real-valued functions that we call approximation from
interpolated examples. We derive a characterization of function
classes that have this property, in terms of their
`fat-shattering function', a notion that has proven useful in
computational learning theory. We discuss the implications for
function learning of approximation from interpolated examples.
Specifically, it can be interpreted as a problem of learning
real-valued functions from random examples in which we require
satisfactory performance from every algorithm that returns a
function which approximately interpolates the training examples.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR94d.ps.Z:
Sample Complexity versus Approximation Error
Peter L. Bartlett, Robert C. Williamson
We consider the problem of learning an unknown real-valued
function from a sequence of values of the function at randomly
chosen points when the estimates are constrained to some class
of functions called the hypothesis class. Ideally, the
hypothesis class should be able to approximate a wide variety
of target functions, yet not need an excessive number of
examples to ensure that a learning algorithm can choose a
near-optimal approximation to the target function. We show
that these two objectives are incompatible, in the sense that
as the approximation error of the hypothesis class decreases,
a lower bound on the sample complexity must increase. The
approximation error is the largest distance between an element
of the class of candidate target functions and its best
approximation in the hypothesis class. The sample complexity
is the number of examples necessary for a near-optimal estimate
of any target function. The relationship between approximation
error and sample complexity depends on a combinatorial
dimension of the class of candidate target functions which are
to be approximated. We give an example of this relationship
for the case of Lipschitz spaces.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR95a.ps.Z (full paper):
colt95a.ps.Z (extended abstract):
More Theorems about Scale-sensitive Dimensions and Learning
Peter L. Bartlett, Philip M. Long
We present a general-purpose algorithm for learning classes of
$[0,1]$-valued functions in a generalization of the prediction
model, and prove a general upper bound on the expected absolute
error of this algorithm in terms of a scale-sensitive
generalization of the Vapnik dimension proposed by Alon, Ben-David,
Cesa-Bianchi and Haussler. We give lower bounds implying that our
upper bounds cannot be improved by more than a constant factor in
general. We apply this result, together with techniques due to
Haussler, to obtain new upper bounds on packing numbers in terms
of this scale-sensitive notion of dimension. Using a different
technique, we obtain new bounds on packing numbers in terms of
Kearns and Schapire's fat-shattering function. We show how to
apply both packing bounds to obtain improved general bounds on
the sample complexity of agnostic learning. In our analysis, we
additionally pay attention to the question of at what scale a
scale-sensitive dimension needs to be finite in order for a class
to be agnostically learnable to a particular relative accuracy.
Toward this end, we analyze the usual ``minimize empirical loss''
family of algorithms, but our best bounds on the scale at which
the finiteness of the notions of dimension is sufficient for
agnostic learning are proved using a new general-purpose
algorithm. We strengthen the best known necessary conditions for
agnostic learning to a particular relative accuracy. We also
present weaker sufficient and stronger necessary conditions in
terms of scale-sensitive dimensions for a class of $[0,1]$-valued
functions to be an $\epsilon$-uniform Glivenko-Cantelli class,
and to be agnostically learnable to relative accuracy $\epsilon$
by an algorithm using only hypotheses from the class.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
colt95b.ps.Z:
On Efficient Agnostic Learning of Linear Combinations of Basis Functions
Wee Sun Lee, Peter L. Bartlett, Robert C. Williamson
We consider efficient agnostic learning of linear combinations of
basis functions when the sum of absolute values of the weights of
the linear combinations is bounded. With the quadratic loss function,
we show that the class of linear combinations of a set of basis
functions is efficiently agnostically learnable if and only if the
class of basis functions is efficiently agnostically learnable.
We also show that the sample complexity for learning the linear
combinations grows polynomially if and only if a combinatorial
property of the class of basis functions, called the fat-shattering
function, grows at most polynomially. We also relate the problem
to agnostic learning of $\{0,1\}$-valued function classes by showing
that if a class of $\{0,1\}$-valued functions is efficiently
agnostically learnable (using the same function class) with the
discrete loss function, then the class of linear combinations of
functions from the class is efficiently agnostically learnable with
the quadratic loss function.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR95c.ps.Z:
The Importance of Convexity in Learning with Squared Loss
Wee Sun Lee, Peter L. Bartlett, Robert C. Williamson
We show that if the closure of a function class $F$ under the metric
induced by some probability distribution is not convex, then the sample
complexity for agnostically learning $F$ with squared loss (using only
hypotheses in $F$) is $\Omega(\ln(1/\delta)/\epsilon^2)$ where
$1-\delta$ is the probability of success and $\epsilon$ is the
required accuracy. In comparison, if the class $F$ is convex and has
finite pseudo-dimension, then the sample complexity is
$O((1/\epsilon)(\log(1/\epsilon) + \log(1/\delta)))$. If a non-convex
class $F$ has finite pseudo-dimension, then the sample complexity for
agnostically learning the closure of the convex hull of $F$ is
$O((1/\epsilon)((1/\epsilon)\log(1/\epsilon) + \log(1/\delta)))$. Hence, for
agnostic learning, learning the convex hull provides better
approximation capabilities with little sample complexity penalty.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR95d.ps.Z:
Correction to ``Lower Bounds on the VC-Dimension of Smoothly
Parametrized Function Classes''
Wee Sun Lee, Peter L. Bartlett, Robert C. Williamson
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR95e.ps.Z:
Covering numbers for real-valued function classes
Peter L. Bartlett, Sanjeev R. Kulkarni, S. Eli Posner
We find tight upper and lower bounds on the growth
rate for the covering numbers of functions of bounded variation
in the L1 metric in terms of all the relevant constants.
We also find upper and lower bounds on covering numbers for
general function classes over the family of L1(dP) metrics,
in terms of a scale-sensitive combinatorial dimension of the
function class.
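For a finite class, a covering number in the $L1(dP_n)$ (empirical) metric can be upper-bounded by a greedy cover; a minimal sketch (names and the greedy strategy are illustrative, not the paper's method):

```python
def l1_distance(f, g, points):
    """Empirical L1(dP_n) distance: average of |f(x) - g(x)| over points."""
    return sum(abs(f(x) - g(x)) for x in points) / len(points)

def greedy_cover_size(functions, points, eps):
    """Size of a greedily constructed eps-cover of `functions` in the
    empirical L1 metric; upper-bounds the covering number N(eps, F, L1(dP_n))."""
    centers = []
    for f in functions:
        if all(l1_distance(f, c, points) > eps for c in centers):
            centers.append(f)
    return len(centers)
```

The paper's bounds control how such covering numbers grow with $1/\epsilon$ in terms of a scale-sensitive dimension, rather than computing them per instance.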
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR95f.ps.Z:
Examples of learning curves from a modified VC-formalism
Adam Kowalczyk, Jacek Szymanski,
Peter L. Bartlett, Robert C. Williamson
We examine the issue of evaluation of model specific parameters
in a modified VC-formalism. Two examples are analyzed: the
2-dimensional homogeneous perceptron and the 1-dimensional
higher order neuron. Both models are solved theoretically,
and their learning curves are compared against true learning
curves. It is shown that the formalism has the potential to
generate a variety of learning curves, including ones
displaying ``phase transitions.''
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR98a.ps.Z (full paper):
colt96b.ps.Z (extended abstract):
Learning Changing Concepts by Exploiting the Structure of Change
Peter L. Bartlett, Shai Ben-David, Sanjeev R. Kulkarni
This paper examines learning problems in which the target function
is allowed to change. The learner sees a sequence of random examples,
labelled according to a sequence of functions, and must provide an
accurate estimate of the target function sequence. We consider a
variety of restrictions on how the target function is allowed to
change, including infrequent but arbitrary changes, sequences that
correspond to slow walks on a graph whose nodes are functions, and
changes that are small on average, as measured by the probability
of disagreements between consecutive functions. We first study
estimation, in which the learner sees a batch of examples and is
then required to give an accurate estimate of the function sequence.
Our results provide bounds on the sample complexity and allowable
drift rate for these problems. We also study prediction, in which
the learner must produce online a hypothesis after each labelled
example and the average misclassification probability over this
hypothesis sequence should be small. Using a deterministic analysis
in a general metric space setting, we provide a technique for
constructing a successful prediction algorithm, given a successful
estimation algorithm. This leads to sample complexity and drift rate
bounds for the prediction of changing concepts.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR96b.ps.Z (full paper):
Structural Risk Minimization over Data-Dependent Hierarchies
colt96a.ps.Z (extended abstract):
A Framework for Structural Risk Minimisation
John Shawe-Taylor, Peter Bartlett, Robert Williamson, Martin Anthony
The paper introduces some generalizations of Vapnik's method of
structural risk minimisation (SRM). As well as making explicit some
of the details on SRM, it provides a result that allows one to trade
off errors on the training sample against improved generalization
performance. It then considers the more general case when the hierarchy
of classes is chosen in response to the data. A result is presented
on the generalization performance of classifiers with a ``large margin''.
This theoretically explains the impressive generalization performance
of the maximal margin hyperplane algorithm of Vapnik and co-workers
(which is the basis for their support vector machines). The paper
concludes with a more general result in terms of ``luckiness''
functions, which provides a quite general way for exploiting
serendipitous simplicity in observed data to obtain better prediction
accuracy from small training sets. Four examples are given of such
functions, including the VC dimension measured on the sample.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR96c.ps.Z:
Exponential convergence of a gradient descent algorithm
for a class of recurrent neural networks
Peter Bartlett, Soura Dasgupta
We investigate the convergence properties of a gradient descent
learning algorithm for a class of recurrent neural networks.
The networks compute an affine combination of nonlinear (sigmoidal)
functions of the outputs of biased linear dynamical systems.
The learning algorithm performs approximate gradient descent to
minimize the squared error on a training sequence of input-output data.
We consider the convergence of the parameter estimates produced by
this algorithm when the data sequence is generated by a network in
this class. We assume that the sigmoid is analytic and bounded
everywhere, and that its singularities in the complex plane satisfy
certain conditions. We also assume that the target network is stable.
Under these conditions, we show that, for almost all values of the
true parameters and almost all periodic input sequences with period
at least as large as the number of parameters, the parameter
estimates exhibit asymptotic exponential convergence in a
neighborhood of the true parameters.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR96d.ps.Z (full paper):
nips96.ps.Z (extended abstract):
The sample complexity of pattern classification
with neural networks: the size of the weights is more important than
the size of the network
Peter Bartlett
Sample complexity results from computational learning theory,
when applied to neural network learning for pattern classification
problems, suggest that for good generalization performance the
number of training examples should grow at least linearly with
the number of adjustable parameters in the network. Results in
this paper give misclassification probability bounds for neural
networks that are qualitatively different. They show that, for
classification problems to which these networks are well suited,
the magnitude of the weights is more important, and the number
of parameters may be irrelevant.
Typically, the parameters in a network are adjusted during learning
to attempt to minimize the total squared error on the training
examples. If the output of the network takes values in the range
$[-1,1]$, and the classification of a point corresponds to the sign
of the output, then small squared error on the training examples
implies not just that most examples are correctly classified, but
also that the value of the output of the network is close to $-1$ or
$1$. Using this quantity (the proportion of ``distinctly correct''
training examples) one of the two main results of the paper gives
general bounds on misclassification probability. These bounds improve
the standard VC-theory bounds, which use only the proportion of
correct training examples to estimate misclassification probability.
Rather than depending on the VC-dimension of the function class,
the bounds depend on a smaller quantity, a scale-sensitive version
of the VC-dimension. For neural networks, the resulting bound can
be significantly better, particularly in classification problems for
which there is a network with small squared error.
The second main result of the paper gives bounds on this
scale-sensitive dimension for neural networks. Combining these results
gives bounds on the generalization performance of a number of neural
network classes, including multi-layer sigmoid networks of arbitrary
size in which the total of the magnitudes of weights associated with
each unit is bounded. In this case, the number of training examples
needs to grow only as $A^{2\ell}$ to avoid overfitting (ignoring log
factors), where $A$ is the bound on the total weight magnitude for a
unit and $\ell$ is the depth of the network. Unless $A$ is large, the
number of parameters is irrelevant.
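The quantity central to the first result — the proportion of "distinctly correct" training examples — can be sketched as follows (the function name and margin convention are illustrative): with outputs in $[-1,1]$ and labels in $\{-1,+1\}$, an example is distinctly correct at margin gamma when label times output is at least gamma.

```python
def distinctly_correct_fraction(outputs, labels, gamma):
    """Fraction of examples classified correctly with margin >= gamma.

    `outputs` are network outputs in [-1, 1]; `labels` are in {-1, +1}.
    An example counts as 'distinctly correct' when label * output >= gamma,
    i.e. the sign is right and the output is at least gamma from zero.
    """
    assert len(outputs) == len(labels)
    hits = sum(1 for f, y in zip(outputs, labels) if y * f >= gamma)
    return hits / len(outputs)
```

Small squared training error forces this fraction to be large for a moderate gamma, which is what the misclassification bounds exploit.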
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR97a.ps.Z:
cdc96.ps.Z:
The Complexity of Model Classes and Smoothing Noisy Data
Peter L. Bartlett, Sanjeev R. Kulkarni
We consider the problem of smoothing a sequence of
noisy observations using a fixed class of models.
Via a deterministic analysis, we obtain necessary and
sufficient conditions on the noise sequence and model class
that ensure that a class of natural estimators gives near optimal
smoothing. In the case of i.i.d. random noise,
we show that the accuracy of these estimators depends on a measure of
complexity of the model class involving covering numbers.
Our formulation and results are quite general and are related to
a number of problems in learning, prediction, and estimation.
As a special case, we consider an application to output
smoothing for certain classes of linear and nonlinear systems.
The performance of output smoothing is given in terms
of natural complexity parameters of the model class, such
as bounds on the order of linear systems, the $l_1$-norm
of the impulse response of stable linear systems, or
the memory of a Lipschitz nonlinear system satisfying
a fading memory condition.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR96a.ps.Z:
eurocolt97a.ps.Z:
A minimax lower bound for empirical quantizer design
Peter Bartlett, Tamas Linder, Gabor Lugosi
We obtain a minimax lower bound for the expected distortion
of empirically designed vector quantizers. We show that
the mean squared distortion of any empirically designed
vector quantizer is at least $\Omega\left(n^{-1/2}\right)$
away from the optimal distortion for some distribution
on a bounded subset of $R^d$, where $n$ is the number
of i.i.d.\ data points that is used to train the
empirical quantizer.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
eurocolt97b.ps.Z:
A result relating convex n-widths to covering numbers with some
applications to neural networks
Jonathan Baxter, Peter Bartlett
In general, approximating classes of functions defined over
high-dimensional input spaces by linear combinations of a fixed
set of basis functions or ``features'' is known to be hard.
Typically, the worst-case error of the best basis set decays
only as fast as $n^{-1/d}$, where n is the number of basis
functions and d is the input dimension. However, there are many
examples of high-dimensional pattern recognition problems (such
as face recognition) where linear combinations of small sets of
features do solve the problem well. Hence these function classes
do not suffer from the `curse of dimensionality' associated with
more general classes. It is natural, then, to look for
characterizations of high-dimensional function classes that
nevertheless are approximated well by linear combinations of
small sets of features. In this paper we give a general result
relating the error of approximation of a function class to the
covering number of its `convex core.' For one-hidden-layer neural
networks, covering numbers of the class of functions computed
by a single hidden node upper bound the covering numbers of the
convex core. Hence, using standard results we obtain upper bounds
on the approximation rate of neural network classes.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
nips97a.ps.Z:
Generalization in decision trees and DNF:
Does size matter?
Mostefa Golea, Peter L. Bartlett, Wee Sun Lee, and Llew Mason
Recent theoretical results for pattern classification with thresholded
real-valued functions (such as support vector machines, sigmoid
networks, and boosting) give bounds on misclassification probability
that do not depend on the size of the classifier, and hence can be
considerably smaller than the bounds that follow from the VC theory. In
this paper, we show that these techniques can be more widely applied, by
representing other boolean functions as two-layer neural networks
(thresholded convex combinations of boolean functions). For example, we
show that with high probability any decision tree of depth no more than
$d$ that is consistent with $m$ training examples has misclassification
probability no more than $O\left(\left(\frac{1}{m}\left(\Neff \,
\VCdim(\U)\, \log^2 m \log d\right)\right)^{1/2}\right)$, where
$\U$ is the class of node decision functions, and $\Neff\le N$ can be
thought of as the effective number of leaves (it becomes small as the
distribution on the leaves induced by the training data gets far from
uniform). This bound is qualitatively different from the VC bound
and can be considerably smaller.
We use the same technique to give similar results for DNF formulae.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR97b.ps.Z:
The Minimax Distortion Redundancy in Empirical Quantizer Design
Peter Bartlett, Tamas Linder and Gabor Lugosi
We obtain minimax lower and upper bounds for the expected distortion
redundancy of empirically designed vector quantizers. We show that
the mean squared distortion of a vector quantizer designed from $n$
i.i.d. data points using any design algorithm is at least
$\Omega\left(n^{-1/2}\right)$ away from the optimal distortion for
some distribution on a bounded subset of $R^d$. Together with
existing upper bounds this result shows that the minimax distortion
redundancy for empirical quantizer design, as a function of the size
of the training data, is asymptotically on the order of $n^{-1/2}$.
We also derive a new upper bound for the performance of the
empirically optimal quantizer.
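The distortion being bounded is the expected squared distance to the nearest codepoint; a minimal sketch of evaluating it on data (the function name is hypothetical, and codebook design itself is not shown):

```python
def distortion(data, codebook):
    """Mean squared distortion of a vector quantizer: the average squared
    Euclidean distance from each data point to its nearest codepoint."""
    def sqdist(x, c):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return sum(min(sqdist(x, c) for c in codebook) for x in data) / len(data)
```

The empirically optimal quantizer minimizes this quantity on the $n$ training points; the theorem says its distortion on the true distribution can still be $\Omega(n^{-1/2})$ above optimal in the worst case.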
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR97c.ps.Z:
Almost linear VC dimension bounds for piecewise polynomial networks
Peter L. Bartlett, Vitaly Maiorov, and Ron Meir
We compute upper and lower bounds on the VC-dimension and pseudo
dimension of feedforward neural networks composed of piecewise
polynomial activation functions. We show that if the number of
layers is fixed, then the VC-dimension and pseudo-dimension grow
as $W\log W$, where $W$ is the number of parameters in the network.
This result contrasts with the case where the number
of layers is unbounded, in which case the VC-dimension and
pseudo-dimension grow as $W^2$. We combine our results with
recently established approximation error rates, and determine error
bounds for the problem of regression estimation by piecewise
polynomial networks.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR97d.ps.Z:
Efficient neural network learning
Peter L. Bartlett
(Note submitted to collection of open problems)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR98b.ps.Z:
Generalization Performance of Support Vector Machines and
Other Pattern Classifiers
Peter L. Bartlett and John Shawe-Taylor
The aim of this chapter is to summarise results that have been
obtained for high confidence generalization error bounds for the
Support Vector Machine (SVM) and other pattern classifiers related
to the SVM. As a by-product of the analysis we argue that the
margin and number of support vectors are both estimators of the
degree to which the distribution generating the inputs assists
identification of the target hyperplane. Along the way, we give an
improved upper bound on the fat-shattering dimension of the
class of bounded linear functions.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR98c.ps.Z:
An inequality for uniform deviations of sample averages from
their means
Peter L. Bartlett and Gabor Lugosi
We derive a new inequality for uniform deviations of
averages from their means. The inequality is a common
generalization of previous results of Vapnik and Chervonenkis (1974)
and Pollard (1986).
Using the new inequality we obtain tight bounds
for empirical loss minimization learning.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR98e.ps.Z:
Boosting the Margin: A New Explanation for the Effectiveness
of Voting Methods
Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee
One of the surprising recurring phenomena observed in experiments
with boosting is that the test error of the generated classifier
usually does not increase as its size becomes very large, and
often is observed to decrease even after the training error reaches
zero. In this paper, we show that this phenomenon is related to the
distribution of margins of the training examples with respect to the
generated voting classification rule, where the margin of an example
is simply the difference between the number of correct votes and the
maximum number of votes received by any incorrect label. We show that
techniques used in the analysis of Vapnik's support vector classifiers
and of neural networks with small weights can be applied to voting
methods to relate the margin distribution to the test error. We also
show theoretically and experimentally that boosting is especially
effective at increasing the margins of the training examples. Finally,
we compare our explanation to those based on the bias-variance
decomposition.
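The margin of a single example, as defined above, can be sketched directly (the function name and dict representation are illustrative):

```python
def voting_margin(votes, correct_label):
    """Margin of one example under a voting classifier.

    `votes` maps each label to its total (non-negative) vote weight.
    The margin is the vote weight of the correct label minus the largest
    vote weight received by any incorrect label; it is positive exactly
    when the example is classified correctly.
    """
    correct = votes.get(correct_label, 0.0)
    wrong = max(w for lbl, w in votes.items() if lbl != correct_label)
    return correct - wrong
```

Dividing by the total vote weight gives a normalized margin in $[-1,1]$; the analysis relates the distribution of these margins over the training set to the test error.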
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TR98d.ps.Z:
Direct optimization of margins improves generalization in
combined classifiers
Llew Mason, Peter L. Bartlett and Jonathan Baxter
Recent theoretical results have shown that the generalization
performance of thresholded convex combinations of base classifiers is
greatly improved if the underlying convex combination has large
margins on the training data (correct examples are classified well
away from the decision boundary). Neural network algorithms and
AdaBoost have been shown to implicitly maximize margins, thus
providing some theoretical justification for their remarkably good
generalization performance. In this paper we are concerned with
maximizing the margin explicitly. In particular, we prove a theorem
bounding the generalization performance of convex combinations in
terms of general cost functions of the margin (previous results were
stated in terms of the particular cost function $\sgn(\text{margin} -
\theta)$). We then present an algorithm (DOOM) for directly optimizing
a piecewise-linear family of cost functions satisfying the conditions
of the theorem. Experiments on several of the datasets in the UC
Irvine database are presented in which AdaBoost was used to generate
a set of base classifiers and then DOOM was used to find the optimal
convex combination of those classifiers. In all but one case the
convex combination generated by DOOM had lower test error than
AdaBoost's combination. In many cases DOOM achieves these lower test
errors by sacrificing training error, in the interests of reducing the
new cost function. The margin plots also show that the size of the
minimum margin is not relevant to generalization performance.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
eurocolt99.ps.Z:
Hardness results for neural network approximation problems
Peter L. Bartlett and Shai Ben-David
We consider the problem of efficiently learning in two-layer neural
networks. We show that it is NP-hard to find a linear threshold network
of a fixed size that approximately minimizes the proportion of
misclassified examples in a training set, even if there is a network that
correctly classifies all of the training examples.
In particular, for a training set that is correctly classified by
some two-layer linear threshold network with k hidden units, it is
NP-hard to find such a network that makes mistakes on a proportion
smaller than c/k^3 of the examples, for some constant c.
We prove a similar result for the problem of approximately minimizing
the quadratic loss of a two-layer network with a sigmoid output unit.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%