Talk Schedule

First Order Methods for Linear Programming: Theory, Computation, and Applications

Date

Friday, February 25, 2022

Speaker

Haihao Lu
(University of Chicago Booth School of Business)

Time

**Talk Information: ** Linear programming (LP) is a fundamental tool in operations research with wide applications in practice. The state-of-the-art LP solvers are essentially based on either simplex method or barrier method, which are quite mature and reliable at delivering highly accurate solutions. However, it is highly challenging to further scale up these two methods. The computational bottleneck of both methods is the matrix factorization when solving linear equations, preventing them from using the modern computing resources, i.e., distributed computing and/or GPUs, which are the workhorses in the big data era. In contrast, first-order methods (FOMs) only require matrix-vector multiplications, which work very well on these modern computing infrastructures and have massively accelerated the machine learning training process during the last 15 years. In this talk, I'll present new FOMs for solving LP. On the computational side, I'll present extensive numerical studies and compare the proposed FOMs with existing LP solvers on standard benchmark sets and on larger instances. On the theory side, I'll present new techniques that improve the existing complexity of FOMs for LP (and other problems) and show that the proposed algorithms achieve the optimal convergence rate in a wide class of FOMs. If time permitted, I'll characterize the convergence behaviors of our algorithms for solving (primal and/or dual) infeasible problems, which play central roles in LP, but have not been overlooked in the FOM literature. I'll conclude the talk with open questions and new directions on this line of research. Part of this research was done at Google.

** Bio: ** Haihao (Sean) Lu is an assistant professor of Operations Management at the University of Chicago Booth School of Business. His research interests are in extending the computational and mathematical boundaries of methods for solving the large-scale optimization problems that arise in data science, machine learning, and operations research. Before joining Booth, he was a visiting researcher at Google Research large-scale optimization team, where he primarily worked on designing and implementing a huge-scale linear programming solver. He obtained his Ph.D degree in Operations Research and Applied Mathematics at MIT in 2019. He is the winner of INFORMS Optimization Society Young Researcher Prize (2021).

Does Understanding Deep Learning Require Rethinking Linear Regression?

Date

Friday, March 4, 2022

Speaker

Wei Hu
(University of California, Berkeley)

Time

**Talk Information: **The empirical success of deep learning has posed significant challenges to classical learning theory, due to the excess capacity of over-parameterized neural networks and the complex nature of the loss landscapes and learning dynamics. An emerging type of theory proceeds by approximating deep neural networks by linearized models (a.k.a. neural tangent kernels). This talk will motivate and examine the study of over-parameterized linear models, using several real-world settings (such as the empirical neural tangent kernels corresponding to a pre-trained ResNet applied to CIFAR-10) as a testbed. We find that, perhaps surprisingly, even in these linear settings, most existing theoretical analyses fail to qualitatively capture the empirical generalization phenomena. On the other hand, a random matrix theory perspective gives rise to an estimator that accurately predicts the generalization.

** Bio: **Wei Hu is a postdoctoral scholar at the University of California, Berkeley and an incoming assistant professor at the University of Michigan (starting in Fall 2022). He obtained his Ph.D. degree from Princeton University in 2021, and his B.Eng. degree from Tsinghua University in 2016, both in Computer Science. He is broadly interested in the theoretical and scientific foundations of modern machine learning, with a particular focus on deep learning.

Causal inference for socio-economic and engineering systems

Date

Friday, March 11, 2022

Speaker

Anish Agarwal
(Simons Institute at UC Berkeley)

Time

**Talk Information: **What will happen to Y if we do A?A variety of meaningful socio-economic and engineering questions can be formulated this way. To name a few: What will happen to a patient’s health if they are given a new therapy? What will happen to a country’s economy if policy-makers legislate a new tax? What will happen to a company’s revenue if a new discount is introduced? What will happen to a data center’s latency if a new congestion control protocol is used? In this talk, we will explore how to answer such counterfactual questions using observational data---which is increasingly available due to digitization and pervasive sensors---and/or very limited experimental data. The two key challenges in doing so are: (i) counterfactual prediction in the presence of latent confounders; (ii) estimation with modern datasets which are high-dimensional, noisy, and sparse.Towards this goal, the key framework we introduce is connecting causal inference with tensor completion, a very active area of research across a variety of fields. In particular, we show how to represent the various potential outcomes (i.e., counterfactuals) of interest through an order-3 tensor; here, rows index units (e.g., individuals, sub-populations, geographic regions), columns index measurements (e.g. outcomes over time), and matrix slices index interventions (e.g., discounts, health therapies, socio-economic policies). The key theoretical results presented are: (i) Formal identification results establishing under what missingness patterns, latent confounding, and structure on the tensor is recovery of unobserved potential outcomes possible. (ii) Introducing novel estimators to recover these unobserved potential outcomes and proving they are finite-sample consistent and asymptotically normal.The efficacy of the proposed estimators is shown on high-impact real-world applications. These include working with: (i) TaurRx Therapeutics to correct for bias from patient dropouts and proposing novel clinical trial designs to reduce the number of patients recruited for a trial. (ii) Uber Technologies on evaluating the impact of certain driver engagement policies without having to run an A/B test. (iii) U.S. and Indian policy-makers to evaluate the impact of mobility restrictions on COVID-19 mortality outcomes. (iv) The Poverty Action Lab (J-PAL) at MIT to make personalized policy recommendations to improve childhood immunization rates across different villages in Haryana, India.Finally, we discuss connections between causal inference, tensor completion, and offline reinforcement learning.

** Bio: **Anish is currently a postdoctoral fellow at the Simons Institute at UC Berkeley. He did his PhD at MIT in electrical engineering and computer science where he was advised by Alberto Abadie, Munther Dahleh, and Devavrat Shah. His research focuses on designing and analyzing methods for causal machine learning, and applying it to critical problems in social and engineering systems. He currently serves as a technical consultant to TauRx Therapeutics and Uber Technologies on questions related to experiment design and causal inference. Prior to the PhD, he was a management consultant at Boston Consulting Group. He received his BSc and MSc at Caltech.

What Happens after SGD Reaches Zero Loss? --A Mathematical Framework

**Talk Information: **Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function L can form a manifold. Intuitively, with a sufficiently small learning rate η, SGD tracks Gradient Descent (GD) until it gets close to such manifold, where the gradient noise prevents further convergence. In such a regime, Blanc et al. (2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of loss, tr[∇2L]. In this talk I will present a general framework for such analysis by adapting ideas from Katzenberger (1991). It allows in principle a complete characterization for the regularization effect of SGD around such manifold -- i.e., the "implicit bias" -- using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a global analysis of the implicit bias valid for η−2 steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for η^−1.6 steps and (2) allowing arbitrary noise covariance. As an application, we show with arbitrary large initialization, label noise SGD can always escape the kernel regime and only requires O(κlnd) samples for learning an κ-sparse overparametrized linear model in ℝ^d, while GD initialized in the kernel regime requires Ω(d) samples. This upper bound is minimax optimal and improves the previous Õ (κ2) upper bound (HaoChen et al., 2020).

** Bio: ** Zhiyuan Li is a PhD candidate in the Department of Computer Science at Princeton University, advised by Sanjeev Arora. Previously, he obtained his bachelor’s degree in Computer Science from Tsinghua University. He has also spent time as a research intern at Google Research. His current research goal is to develop a mathematical theory towards a better understanding of modern deep learning, as well as to design more efficient and principled machine learning methods using theoretical insights. He is a recipient of Microsoft Research PhD Fellowship in 2020.

Theoretical Foundations of Multiagent Reinforcement Learning

**Talk Information: **Reinforcement learning (RL) has made substantial empirical progress in solving hard AI challenges in the past few years. A large fraction of these progresses—Go, Dota 2, Starcraft 2, economic simulation, social behavior learning, and so on—come from multi-agent RL, that is, sequential decision making involving more than one agent. While the theoretical study of single-agent RL has a long history and a vastly growing recent interest, Multi-Agent RL (MARL) theory is arguably a newer and less developed field. In this talk, I will present our recent theoretical developments in MARL theory. Starting with basic formulations, we present several provably efficient algorithms for finding equilibria in MARL problems, addressing unique challenges in MARL such as the curse of multiagent, and design of decentralized algorithms.

** Bio: ** Chi Jin is an assistant professor at the Electrical and Computer Engineering department of Princeton University. He obtained his PhD degree in Computer Science at University of California, Berkeley, advised by Michael I. Jordan. His research mainly focuses on theoretical machine learning, with special emphasis on nonconvex optimization and reinforcement learning. My representative work includes proving noisy gradient descent escape saddle points efficiently and proving the efficiency of Q-learning and least-squares value iteration when combined with optimism in reinforcement learning.

Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias

**Talk Information: **The generalization mystery of overparametrized deep nets has motivated efforts to understand how gradient descent (GD) converges to low-loss solutions that generalize well. A recent sequence of results (Lyu and Li, 2020; Chizat and Bach, 2020; Ji and Telgarsky, 2020) provide theoretical evidence that GD may converge to the "max-margin" solution with zero loss, which presumably generalizes well. However, the global optimality of margin is proved only in some settings where neural nets are infinitely or exponentially wide. In this talk, I will present our recent paper that establishes this global optimality for two-layer Leaky ReLU nets trained with gradient flow on linearly separable and symmetric data, regardless of the width. The analysis also gives some theoretical justification for recent empirical findings (Kalimeris et al., 2019) on the so-called simplicity bias of GD towards linear or other "simple" classes of solutions, especially early in training. On the pessimistic side, the paper suggests that such results are fragile. A simple data manipulation can make gradient flow converge to a linear classifier with a suboptimal margin.

** Bio: ** Kaifeng Lyu is a first-year PhD student in the Department of Computer Science at Princeton University, advised by Sanjeev Arora. Previously, he obtained his bachelor’s degree in Computer Science from Tsinghua University. His current research interest lies in building a better theoretical foundation for deep learning that is capable of explaining the various optimization and generalization mysteries in deep learning experiments.

When does SGD work

**Talk Information: **Multi-epoch, small-batch, Stochastic Gradient Descent (SGD) is the method of choice for learning with large over-parameterized non-convex models. However, our understanding of when and why SGD succeeds in learning is still quite rudimentary. Implicit regularization has emerged as a popular paradigm to explain the empirical success of SGD. In this talk, I will discuss why implicit regularization is not a sufficient explanation, and discuss a new framework for understanding when SGD works.

** Bio: ** Ayush is a PhD student in the Computer Science department at Cornell University, advised by Professor Karthik Sridharan and Professor Robert D. Kleinberg. His research interests span across optimization, online learning, reinforcement learning and control, and the interplay between them. Before coming to Cornell, he spent a year at Google as a part of the Brain residency program. Before Google, he completed his undergraduate studies in computer science from IIT Kanpur in India where he was awarded the President's gold medal. Ayush will be starting a postdoc at MIT with Sasha Rakhlin in Fall 2022.

What Can Conformal Inference Offer To Statistics?

**Talk Information: **Valid uncertainty quantification is crucial for high-stakes decision-making. Conformal inference provides a powerful framework that can wrap around any black-box prediction algorithm, like random forests or deep neural networks, and generate prediction intervals with distribution-free coverage guarantees. In this talk, I will describe how conformal inference can be adapted to handle more complicated inferential tasks in statistics, with the focus on two important statistical problems: counterfactual inference and time-to-event analysis. Unlike standard prediction problems, the predictive targets are only partially observable owing to selection and censoring. In addition, I will briefly talk about generalizing conformal inference to other statistical problems, including election, outlier detection, and risk-calibrated predictions.

** Bio: ** Lihua Lei obtained his Ph.D. in statistics at UC Berkeley and is currently a postdoctoral researcher in Statistics at Stanford University. He will be joining the Stanford Graduate School of Business as an Assistant Professor of Economics in Fall 2022. His research interests include conformal inference, causal inference, multiple hypothesis testing, network clustering, stochastic optimization, and econometrics.

GANs vs. Maximum Likelihood: A Theoretical Comparison

Date

Friday, April 22, 2022

Speaker

Farzan Farnia
(The Chinese University of Hong Kong)

Time

**Talk Information: **Generative adversarial networks (GANs) represent a zero-sum
game played between a generator and a discriminator to learn the
distribution of observed data. Since their introduction in 2014, GANs
have achieved
impressive results on a wide array of machine learning tasks and
managed to outperform maximum likelihood methods over benchmark image
datasets. In this seminar, we compare and contrast the GAN and maximum
likelihood frameworks, and provide evidence that the game-based design
of GANs could contribute to a superior generalization performance from
training examples to unseen test data in comparison to deep maximum
likelihood methods. Furthermore, we demonstrate that the common
divergence/distance measures targeted by GANs are more suitable for
learning multi-modal distributions than the KL-divergence optimized by
maximum likelihood learners. We discuss numerical results supporting
our theoretical comparison of the GAN and maximum likelihood approaches.

** Bio: ** Farzan Farnia is an Assistant Professor of Computer Science and
Engineering at The Chinese University of Hong Kong. Prior to joining
CUHK, he was a postdoctoral research associate at the Laboratory for
Information and Decision Systems, Massachusetts Institute of
Technology, from 2019-2021. He received his master’s and PhD degrees
in electrical engineering from Stanford University and his bachelor’s
degrees in electrical engineering and mathematics from Sharif
University of Technology. At Stanford, he was a graduate research
assistant at the Information Systems Laboratory advised by Professor
David Tse. Farzan's research interests span statistical learning
theory, information theory, and convex optimization. He has been the
recipient of the Stanford Graduate Fellowship (Sequoia Capital
Fellowship) between 2013-2016 and the Numerical Technology Founders
Prize as the second top performer of Stanford’s electrical engineering
PhD qualifying exams in 2014.

Proximal-primal-dual algorithms for nonconvex optimization problems and landscape analysis for narrow neural network

Date

Friday, April 29, 2022

Speaker

Jiawei Zhang
(Massachusetts Institute of Technology)

Time

**Talk Information: **This talk consists of two parts. In the first part, we will introduce proximal-primal-dual framework for solving constrained nonconvex optimization problems.
Consider the minimization of a nonconvex differentiable function over a convex, closed set under some linear equality constraints. It is well-known that the lower bound of the iteration complexity of first-order algorithms to attain an $\epsilon$-solution for nonconvex optimization is $\Omega(1/\epsilon^2)$. In this talk, we discuss how to design first-order algorithms for solving nonconvex problems to achieve this lower bound. A popular approach for solving constrained optimization problems is the augmented Lagrangian method (ALM) and alternating direction method of multiplier (ADMM). However, these methods are not guaranteed to converge theoretically for constrained nonconvex problems because numerical examples show that they can exhibit ``oscillation" and may not converge. In this talk, we introduce a smoothed proximal augmented Lagrangian method and its multi-block version. A distinctive feature of this method is the introduction of a ``smoothed" (i.e., exponentially weighted) sequence of primal iterates, and the inclusion, at each iteration, to the augmented Lagrangian function a quadratic proximal term centered at the current smoothed primal iterate. The resulting proximal augmented Lagrangian function is inexactly minimized at each iteration while the dual multiplier vector is updated using a standard dual ascent. Using a proximal-primal-dual framework and some novel error bounds, we prove that our algorithm can converge with an $\mathcal{O}(1/\epsilon^2)$ iteration complexity for the problems with polyhedral constraints and achieve the same complexity for problems with compact, convex constraints under some common regularity assumptions.
In addition, we show that our algorithmic framework can lead to distributed algorithms, which are well-suited for large-scale problems. We prove the $\mathcal{O}(1/\epsilon^2)$ iteration complexity of these algorithms and apply them to efficiently train machine models with distributed samples and distributed features. Moreover, we extend the framework to solve min-max problems and we show the $\mathcal{O}(1/\epsilon^2)$ iteration complexity for solving a family of nonconvex-concave problems -- minimizing the point-wise maximum of a finite collection of nonconvex functions, which covers a wide range of applications in machine learning.
In the second part, we will talk about landscape analysis for the loss surface of narrow neural network.
First, we prove that as long as the width $m \ge 2n/d$ (where $d$ is the input dimension), its expressivity is strong, i.e., there exists at least one global minimizer with zero training loss.
Second,
we identify a nice local region with no local-min or
saddle points.
Nevertheless, it is not clear whether gradient
descent can stay in this nice region.
Third, we consider a constrained optimization formulation where the feasible region is the nice local region, and prove that every KKT point is a nearly global minimizer.
It is expected that projected gradient methods
converge to KKT points under mild technical conditions,
but we leave the rigorous convergence analysis to future work.
Thorough numerical results show that projected gradient methods
on this constrained formulation significantly
outperform SGD for training narrow neural nets.

** Bio: **Jiawei Zhang is a postdoc associate in Laboratory for Information & Decision Systems At MIT, working with Prof. Asuman Ozdaglar. He attained his Ph.D degree from the Chinese University of Hong Kong (Shenzhen) advised by Professor Zhi-quan (Tom) Luo.
Previously, he attained B.Sc in Mathematics (Hua Loo-Keng talent program) from University of Science and technology of China.
His research interest includes optimization theory and algorithms for nonconvex problems.

Efficient Optimization Methods for Machine Learning

**Talk Information: **In this talk, I will first talk about some classical optimization algorithms/results and my contributions of designing efficient algorithms for convex and nonconvex optimization. Then I will introduce some state-of-the-art algorithms/results for communication-efficient distributed/federated learning.

** Bio: ** Zhize Li is a Research Scientist at Carnegie Mellon University since 2022. He received his PhD degree in Computer Science from the Institute for Interdisciplinary Information Sciences at Tsinghua University in 2019. His research mainly focuses on mathematical optimization, federated learning, and machine learning.

Treatment effect estimation with noisy conditioning variables

**Talk Information: **In this paper, I develop a new identification strategy for treatment effects when one observes noisy measurements of unobserved confounding factors. I use proxy variables to construct a random variable conditional on which treatment variables become exogenous. The key idea is that, under appropriate conditions, there exists a one-to-one mapping between the distribution of unobserved confounding factors and the distribution of proxies. To ensure sufficient variation in the constructed control variable, I use an additional variable, termed excluded variable, which satisfies certain exclusion restrictions and relevance conditions. I establish asymptotic distributional results for flexible parametric and nonparametric estimators of the average structural function. I illustrate empirical relevance and usefulness of my results by estimating causal effects of college selectivity on wages.

** Bio: ** Kenichi Nagasawa is an Assistant Professor of Economics at the University of Warwick. His main research interest lies in theoretical econometrics. He has worked on bootstrap-based inference procedures for estimators with non-standard asymptotics and identification of causal effects in selection models in economics. He obtained a Ph.D. in Economics from the University of Michigan in 2019.

How well do deep neural networks (DNNs) learn causal effect, and how can we certify the DNN-based causal effect estimator using a new stable higher-order influence function (HOIF) estimator?

**Talk Information: **Under causal sufficiency, causal effect estimation is by now a quite well-understood problem in both the infinite-dimensional nonparametric regime and the high-dimensional sparse linear model regime. Yet, as DNNs start to replace the “old” statistical techniques in practice, it is important to study how well we estimate causal effect of interest with DNNs.
In the first part of my talk, I will present a new set of theoretical properties of DNN-based causal effect estimators that improve the results of Farrell, Liang and Misra, ECTA, 2022 in several aspects, based on the metric entropy and a new approximation property of DNNs. In particular, our result does not require sparse DNN architecture yet allow for sparsity in the feature space, both of which are desirable in many applied settings. However, via a carefully crafted numerical experiment, we found a significant gap between theory and practice in such statistical guarantees, not only in our work but also in almost all other works on using DNNs to estimate causal effects.
To certify if the DNN-based causal effect estimators really work in practice, we use an assumption-lean hypothesis testing framework developed in the past in two of our recent papers (Liu et al., Statistical Sciences 2020; Liu et al. 2022+), based on the so-called Higher-Order Influence Functions (HOIFs). However, HOIF-based approach has been viewed as being “impractical” due to (1) the use of infinite-order U-statistics and (2) the numerical instability that we observed in practice. To deal with the first issue, we introduce an early-stopping criterion that only requires the computation of low-order U-statistics. To deal with the second issue, we develop a new stable HOIF that is provably both (statistical) optimal and numerically stable. This new HOIF can also be shown to be “the” HOIF under the fixed-design setting, answering a question raised by Edward Kennedy, Sivaraman Balakrishnan and Larry Wasserman. We also demonstrate our approach in practice by revisiting the experiments from the first part of my talk.
The talk is based on joint works with Siqi Xu and Zhonghua Liu (University of Hong Kong), and Chang Li (senior at SJTU --> PhD in statistics at UVA).

** Bio: ** Lin Liu is an Assistant Professor in the Institute of Natural Sciences, School of Mathematical Sciences and SJTU-Yale Joint Center for Biostatistics and Data Science at SJTU. He obtained his PhD in biostatistics from Harvard University, supervised by Franziska Michor and Jamie Robins. His research focuses on non- and semi-parametric theory, causal inference, machine learning theory, statistical computing and applications in biomedical sciences.

TBD

Date

Friday, June 3, 2022

Speaker

Charlie Yun
(KAIST Kim Jaechul Graduate School of AI)

Time

**Talk Information: **

** Bio: **

TBD

Date

Friday, June 10, 2022

Speaker

Andre Milzarek
(The Chinese University of Hong Kong, Shenzhen)

Time

**Talk Information: **

** Bio: **