Center for Scientific Workshops in All Disciplines

Current Workshop | Overview | Back | Home | Search | | ||||||||||

## Online Algorithms and Learning |

We study the following balls and bins stochastic game
between a player and an adversary: there are B bins and a sequence of ball
arrival and extraction events. In an arrival event a ball is stored in an empty
bin chosen by the adversary and discarded if no bin is empty. In an extraction
event, an algorithm selects a bin, clears it, and gains its content. We are
interested in analyzing the gain of an algorithm which serves in the dark
without any feedback at all, i.e., does not see the sequence, the content of
the bins, and even the content of the cleared bins (i.e. an oblivious
algorithm). We compare that gain to the gain of an optimal, open eyes, strategy
that gets the same online sequence. We name this gain ratio "the loss of
serving in the dark". The randomized algorithm that was previously analyzed
is choosing a bin independently and uniformly at random, which resulted in a competitive
ratio of about 1.69. We show that although no information is ever provided to
the algorithm, using non-uniform probability distribution reduces the
competitive ratio. Specifically, we design a $1.56$-competitive algorithm and
establish a lower bound of $1.5$. We also prove a lower bound of 2 against any determinstic algorithm. This matches the preformace of the round robin 2-competitive strategy.
Finally, we present an application relating to a prompt mechanism for bounded
capacity auction.
Beyond stochastic gradient descent for large-scale machine learning
Many machine learning and
signal processing problems are traditionally cast as convex optimization
problems. A common difficulty in solving these problems is the size of the
data, where there are many observations ("large n") and each of
these is large ("large p"). In this setting, online
algorithms such as stochastic gradient descent which pass
over the data only once, are usually preferred over batch algorithms, which
require multiple passes over the data. Given n observations/iterations, the
optimal convergence rates of these algorithms are O(1/\sqrt{n})
for general convex functions and reaches O(1/n) for strongly-convex functions.
In this talk, I will show how the smoothness of loss functions may be used to
design novel algorithms with improved behavior, both in theory and practice: in
the ideal infinite-data setting, an efficient novel
Newton-based stochastic approximation algorithm leads to a
convergence rate of O(1/n) without strong convexity assumptions, while in the
practical finite-data setting, an appropriate combination of batch and online
algorithms leads to unexpected behaviors, such as a linear convergence rate for
strongly convex problems, with an iteration cost similar
to stochastic gradient descent. (joint work with Nicolas Le
Roux, Eric Moulines and Mark Schmidt).
Agnostic Active Learning of Linear
Separators
In many practical settings one can sequentially and
adaptively guide the collection of future data, based on information extracted
from data already collected, in what is known as sequential experimental design,
active learning, or adaptive sensing/sampling (depending on the context). The intricate relation between data analysis
and acquisition in adaptive sensing paradigms is extremely powerful, and allows
for reliable estimation in situations where non-adaptive sensing would fail
dramatically. In this talk I consider
estimation and detection of high-dimensional structured sparse signals in
noise, and present near-optimal adaptive sensing procedures that provably outperform
the best possible inference methods based on non-adaptive sensing. The methods
developed can also be used also for adaptive compressive sensing and for
detection of (sparse) correlations.
Joint work with: Claudio Gentile (Varese) and Yishay Mansour (Tel-Aviv).
We
introduce a new technique to this area, using reductions from problems that are
hard on average. We put forward a natural generalization of Feige's
assumption about the complexity of refuting random K-SAT instances. Under this
assumption we show: 1.
Learning DNFs is hard. 2.
Learning an intersection of super logarithmic number of halfspaces
is hard. In
addition, the same assumption implies the hardness of virtually all learning
problems that were previously shown hard under cryptographic assumptions. Joint
work with Nati Linial and Shai Shelev-Shwartz
Joint work with Shipra
Agrawal, Microsoft Research.
Our main result is an $O(\log^{3/2} n)$-competitive
deterministic algorithm for the special case of a line of $n$ nodes, using
buffers of size at least 5 and assuming link capacity at least 5. The best previous
algorithm was $O(\log^5 n)$-competitive. We note that if the algorithm is allowed
to accept packets within $O(\log n)$ steps (not immediately), then the
competitive ratio drops to $O(\log n)$.
A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of holdout sets and sophisticated cross-validation techniques, to procedures for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of science: the theory assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas science is by definition an adaptive process, in which data are shared and re-used, and hypotheses and new studies are generated on the basis of data exploration and previous outcomes. Surprisingly, the challenges of adaptivity can be addressed using insights from differential privacy, a field of study supporting a definition of privacy tailored to private data analysis. As a corollary we show how to safely reuse a holdout set a great many times without undermining its validation power, even when hypotheses and computations are chosen adaptively. Armed with this technique, the analyst is free to explore the data ad libitum, generating and evaluating hypotheses, verifying results on the holdout, and backtracking as needed. Joint work with Cynthia Dwork, Vitaly Feldman, Toni Pitassi, Omer Reingold and Aaron Roth
In this talk, we introduce a general scheduling
problem that we term the Packing Scheduling problem (PSP). In this problem, jobs
can have different arrival times and sizes, and the rates at which a scheduler
can process jobs are subject to arbitrary packing constraints. The PSP
framework captures a variety of scheduling problems, including the classical
problems of heterogeneous machines scheduling, broadcast scheduling, and
scheduling jobs of different parallelizability. It also captures scheduling
constraints arising in diverse modern environments ranging from individual
computer architectures to data centers. More concretely, PSP models
multidimensional resource requirements and parallelizability, as well as
network bandwidth requirements found in data center scheduling. We design
non-clairvoyant online algorithms for PSP and its special cases, meaning that
they make scheduling decisions without knowing the future jobs, or even
outstanding jobs sizes until their completion. Our algorithms, which are
inspired by equilibria in game theory, minimize the average completion time
approximately while processing jobs fairly.
joint work with Virginia Smith, Martin Takáč, Jonathan Terhorst,
Sanjay Krishnan, Thomas Hofmann, Michael I. Jordan
We present a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes one of K actions in response to the observed context, and observes the reward only for that chosen action. Our method assumes access to an oracle for solving fully supervised cost-sensitive classication problems and achieves the statistically optimal regret guarantee with only O*(sqrt{KT}) oracle calls across all T rounds. By doing so, we obtain the most practical contextual bandit learning algorithm amongst approaches that work for general policy classes.
Non-clairvoyant Scheduling to Minimize Weighted
Flow-time. In
this talk, I will describe the recent progress made on the problem of
minimizing weighted flow-time in the online non-clairvoyant scheduling model,
arguably the most general model of machine scheduling. In this problem, a
set of jobs $J$ arrive over time to be scheduled on a set of $M$ machines. Each
job $j$ has processing length $p_j$, weight $w_j$, and is processed at a rate of $\ell_{ij}$ when scheduled on machine $i$. The online scheduler
knows the values of $w_j$ and $\ell_{ij}$ upon arrival of the job, but is not aware of the
quantity $p_j$. We present the first online
algorithm that is scalable ($(1+\eps)$-speed $O(\frac{1}{\epsilon^2})$-competitive
for any constant $\eps > 0$) for the total weighted flow-time objective. No
non-trivial results were known for this setting, except for the most basic case
of identical machines. This result resolves a major open problem in online
scheduling theory. Moreover, we also show that no job needs more than a
logarithmic number of migrations. Next, I will sketch how the framework can be
easily extended to the objective of minimizing total weighted flow-time plus
energy cost for the case of unrelated machines. The
key algorithmic idea is to let jobs migrate selfishly until they converge to an
equilibrium. Towards this end, we define a game where each job's utility which
is closely tied to the instantaneous increase in the objective the job is
responsible for, and each machine declares a policy that assigns priorities to
jobs based on when they migrate to it, and the execution speeds. This has a
spirit similar to coordination mechanisms that attempt to achieve near optimum
welfare in the presence of selfish agents (jobs). The
work was done in collaboration with Kamesh Munagala, Kirk Purhs and Sungjin Im and appeared in FOCS
2014.
Specifically, I will discuss recent results about
simultaneous or robust adversarial and stochastic allocation, and
time-permitting describe a recent result about online submodular
welfare maximization on a random order.
constraint matrix are presented in random order. This
problem has received much attention: the main open question is to figure out
how large the right-hand sides of the LPs have to be (compared to the entries
on the left-hand side of the constraint) to get (1 + eps)-approximations
online? It is known that the RHS has to be Omega(eps^(-2) log m) times the
left-hand sides, where m is the number of constraints. We give a primal-dual algorithm to achieve this bound
for all packing LPs, and also for a class of mixed packing/covering LPs. Our
algorithms construct dual solutions using a regret-minimizing online learning
algorithm in a black-box fashion, and use them to construct primal solutions.
The adversarial guarantee that holds for the constructed duals help us to take care
of most of the correlations that arise in the algorithm; the remaining
correlations are handled via martingale concentration and maximal inequalities.
These ideas lead to conceptually simple and modular algorithms, which we hope
will be useful in other contexts. This is joint work with Anupam
Gupta.
In this work, we show how to significantly generalize
and simplify this result using techniques from online learning. We give a
simple, decentralized tatonnement process (which can
be used to adjust tolls independently at each edge, without the need for
communication) that quickly causes the induced flow to converge to the target.
Moreover, our method requires almost no assumptions on the form of the latency
functions: they do not need to be polynomials, do not need to be convex (nor
indeed have any concise representation), or even be differentiable (or Lipschitz!) -- we require only that they be increasing
functions. Our solution frames the problem as solving for an equilibrium in a
game that is strategically equivalent to a zero sum game, which allows us to
leverage the powerful properties of no-regret algorithms to converge to the minmax solutions of such games, without the need to know
any details about the payoff function.
We
consider online algorithms for learning to rank objects (such as documents in
information retrieval or products in recommender systems). In a setting with
covariates/features, we derive a perceptron-like algorithm that updates only on
rounds where a non-zero loss in incurred. Under a margin condition, we show
that the cumulative ranking loss of the perceptron-like algorithm is finite.
Our result holds for ranking losses derived from popular ranking measures such
as NDCG (Normalized Discounted Cumulative Gain) and MAP (Mean Average
Precision). We also explore a partial monitoring setting, without
covariates/features, where the learner is only given feedback on the top item
in the rankings it produces. We give regret guarantees for unnormalized
versions of NDCG, AUC, and MAP. Surprisingly, we can also show that the
normalized versions do not admit no-regret algorithms under the top-1 feedback
model.
In PwEA, recent work has
focused on second-order bounds, which depend on the variance of some quantity
on the data. To exploit small variance and get fast rates, it has been found
that *very large* learning rates are necessary. In OCO, research has focused on the degree of strong
convexity of the objective function. To exploit more convexity and get fast
rates, it has been found that *very small* learning rates are necessary. Thus, there is an apparent contradiction between the
learning rates that we need for fast rates in PwEA
and in OCO. But there cannot be a real contradiction, because PwEA is a special case of OCO (with a linear objective
function)! So it must be possible to reconcile both worlds and obtain a unified
understanding that takes both the variance and the convexity into account. I would like to raise the
question of how to proceed towards such a unified understanding. [Back] |