First Order Methods for Linear Programming: Theory, Computation, and Applications
Talk Information: Linear programming (LP) is a fundamental tool in operations research with wide applications in practice. The state-of-the-art LP solvers are based on either the simplex method or the barrier method, both of which are quite mature and reliable at delivering highly accurate solutions. However, further scaling up these two methods is highly challenging: the computational bottleneck of both is the matrix factorization required to solve linear systems, which prevents them from exploiting modern computing resources, such as distributed platforms and GPUs, the workhorses of the big data era. In contrast, first-order methods (FOMs) only require matrix-vector multiplications, which run very well on these modern infrastructures and have massively accelerated machine learning training over the last 15 years. In this talk, I'll present new FOMs for solving LP. On the computational side, I'll present extensive numerical studies comparing the proposed FOMs with existing LP solvers on standard benchmark sets and on larger instances. On the theory side, I'll present new techniques that improve the existing complexity bounds of FOMs for LP (and other problems) and show that the proposed algorithms achieve the optimal convergence rate within a wide class of FOMs. If time permits, I'll characterize the convergence behavior of our algorithms on (primal and/or dual) infeasible problems, which play a central role in LP but have been largely overlooked in the FOM literature. I'll conclude the talk with open questions and new directions for this line of research. Part of this research was done at Google.
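The abstract does not pin down which FOM is meant, but a representative factorization-free method for LP is the primal-dual hybrid gradient (PDHG) iteration, which touches the constraint matrix only through products with A and its transpose. A minimal NumPy sketch for the standard-form problem min c^T x s.t. Ax = b, x >= 0 (the step size and iteration count are illustrative choices, not the talk's):

```python
import numpy as np

def pdhg_lp(c, A, b, iters=10000):
    """Vanilla PDHG for  min c^T x  s.t.  Ax = b, x >= 0.

    A bare-bones sketch; practical FOM-based LP solvers add restarts,
    adaptive step sizes, and preconditioning on top of this iteration.
    """
    m, n = A.shape
    x, y = np.zeros(n), np.zeros(m)
    eta = 0.9 / np.linalg.norm(A, 2)  # tau = sigma = eta < 1/||A|| ensures convergence
    for _ in range(iters):
        x_new = np.maximum(x - eta * (c - A.T @ y), 0.0)  # projected primal step
        y = y + eta * (b - A @ (2 * x_new - x))           # dual step with extrapolation
        x = x_new
    return x, y

# Tiny example: min -x1 - 2*x2  s.t.  x1 + x2 + s = 1, all variables >= 0
c = np.array([-1.0, -2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
x, y = pdhg_lp(c, A, b)
print(x)  # approaches the optimum [0, 1, 0]
```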
Bio: Haihao (Sean) Lu is an assistant professor of Operations Management at the University of Chicago Booth School of Business. His research interests are in extending the computational and mathematical boundaries of methods for solving the large-scale optimization problems that arise in data science, machine learning, and operations research. Before joining Booth, he was a visiting researcher on the large-scale optimization team at Google Research, where he primarily worked on designing and implementing a huge-scale linear programming solver. He obtained his Ph.D. in Operations Research and Applied Mathematics at MIT in 2019. He is the winner of the 2021 INFORMS Optimization Society Young Researchers Prize.
Does Understanding Deep Learning Require Rethinking Linear Regression?
Talk Information: The empirical success of deep learning has posed significant challenges to classical learning theory, due to the excess capacity of over-parameterized neural networks and the complex nature of the loss landscapes and learning dynamics. An emerging type of theory proceeds by approximating deep neural networks by linearized models (a.k.a. neural tangent kernels). This talk will motivate and examine the study of over-parameterized linear models, using several real-world settings (such as the empirical neural tangent kernels corresponding to a pre-trained ResNet applied to CIFAR-10) as a testbed. We find that, perhaps surprisingly, even in these linear settings, most existing theoretical analyses fail to qualitatively capture the empirical generalization phenomena. On the other hand, a random matrix theory perspective gives rise to an estimator that accurately predicts the generalization.
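For concreteness, here is a minimal sketch of the over-parameterized linear regression setting the talk studies: the minimum-l2-norm interpolator, which fits noisy training data exactly yet can still generalize. Synthetic Gaussian features stand in for the empirical NTK features used in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 1000                  # over-parameterized: many more features than samples
X = rng.standard_normal((n, d))
beta_true = rng.standard_normal(d) / np.sqrt(d)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Minimum-l2-norm interpolator: beta = X^+ y (fits the training data exactly)
beta_hat = np.linalg.pinv(X) @ y
print("train error:", np.mean((X @ beta_hat - y) ** 2))   # ~0: interpolation

# Generalization, measured on fresh data from the same distribution
X_test = rng.standard_normal((1000, d))
y_test = X_test @ beta_true
print("test error:", np.mean((X_test @ beta_hat - y_test) ** 2))
```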
Bio: Wei Hu is a postdoctoral scholar at the University of California, Berkeley and an incoming assistant professor at the University of Michigan (starting in Fall 2022). He obtained his Ph.D. degree from Princeton University in 2021, and his B.Eng. degree from Tsinghua University in 2016, both in Computer Science. He is broadly interested in the theoretical and scientific foundations of modern machine learning, with a particular focus on deep learning.
Causal inference for socio-economic and engineering systems
Talk Information: What will happen to Y if we do A? A variety of meaningful socio-economic and engineering questions can be formulated this way. To name a few: What will happen to a patient's health if they are given a new therapy? What will happen to a country's economy if policy-makers legislate a new tax? What will happen to a company's revenue if a new discount is introduced? What will happen to a data center's latency if a new congestion control protocol is used? In this talk, we will explore how to answer such counterfactual questions using observational data---which is increasingly available due to digitization and pervasive sensors---and/or very limited experimental data. The two key challenges in doing so are: (i) counterfactual prediction in the presence of latent confounders; (ii) estimation with modern datasets which are high-dimensional, noisy, and sparse.

Towards this goal, the key framework we introduce connects causal inference with tensor completion, a very active area of research across a variety of fields. In particular, we show how to represent the various potential outcomes (i.e., counterfactuals) of interest through an order-3 tensor; here, rows index units (e.g., individuals, sub-populations, geographic regions), columns index measurements (e.g., outcomes over time), and matrix slices index interventions (e.g., discounts, health therapies, socio-economic policies). The key theoretical results presented are: (i) formal identification results establishing under what missingness patterns, latent confounding, and structure on the tensor the recovery of unobserved potential outcomes is possible; (ii) novel estimators to recover these unobserved potential outcomes, together with proofs that they are finite-sample consistent and asymptotically normal.

The efficacy of the proposed estimators is shown on high-impact real-world applications. These include working with: (i) TauRx Therapeutics to correct for bias from patient dropouts and to propose novel clinical trial designs that reduce the number of patients recruited for a trial; (ii) Uber Technologies on evaluating the impact of certain driver engagement policies without having to run an A/B test; (iii) U.S. and Indian policy-makers to evaluate the impact of mobility restrictions on COVID-19 mortality outcomes; (iv) The Poverty Action Lab (J-PAL) at MIT to make personalized policy recommendations to improve childhood immunization rates across different villages in Haryana, India.

Finally, we discuss connections between causal inference, tensor completion, and offline reinforcement learning.
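As a toy illustration of the causal-inference-as-tensor-completion viewpoint, the sketch below imputes missing potential outcomes in a single (units x measurements) slice via truncated SVD with inverse-propensity scaling. This is a simplified stand-in, not the talk's estimators, which operate on the full order-3 tensor and come with identification and inference guarantees:

```python
import numpy as np

def svt_complete(Y, mask, rank):
    """Impute missing entries of an (approximately) low-rank outcomes matrix.

    Y    : (units x measurements), arbitrary values where mask == False
    mask : True where the outcome was observed under this intervention
    rank : assumed rank of the underlying mean matrix
    """
    p_hat = mask.mean()                                 # estimated observation propensity
    U, s, Vt = np.linalg.svd(Y * mask / p_hat, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]         # rank-truncated reconstruction

rng = np.random.default_rng(1)
units, times, r = 200, 50, 3
M = rng.standard_normal((units, r)) @ rng.standard_normal((r, times))  # rank-3 truth
mask = rng.random((units, times)) < 0.5                                # 50% observed
Y_obs = np.where(mask, M + 0.1 * rng.standard_normal(M.shape), 0.0)

M_hat = svt_complete(Y_obs, mask, rank=r)
print("imputation RMSE:", np.sqrt(np.mean((M_hat[~mask] - M[~mask]) ** 2)))
```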
Bio: Anish is currently a postdoctoral fellow at the Simons Institute at UC Berkeley. He did his PhD at MIT in electrical engineering and computer science, where he was advised by Alberto Abadie, Munther Dahleh, and Devavrat Shah. His research focuses on designing and analyzing methods for causal machine learning, and applying them to critical problems in social and engineering systems. He currently serves as a technical consultant to TauRx Therapeutics and Uber Technologies on questions related to experiment design and causal inference. Prior to his PhD, he was a management consultant at Boston Consulting Group. He received his BSc and MSc at Caltech.
What Happens after SGD Reaches Zero Loss? --A Mathematical Framework
Talk Information: Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function L can form a manifold. Intuitively, with a sufficiently small learning rate η, SGD tracks Gradient Descent (GD) until it gets close to such a manifold, where the gradient noise prevents further convergence. In this regime, Blanc et al. (2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of the loss, tr[∇²L]. In this talk I will present a general framework for such analysis by adapting ideas from Katzenberger (1991). It allows, in principle, a complete characterization of the regularization effect of SGD around such a manifold -- i.e., the "implicit bias" -- using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a global analysis of the implicit bias valid for η^−2 steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for η^−1.6 steps, and (2) allowing arbitrary noise covariance. As an application, we show that with arbitrarily large initialization, label-noise SGD can always escape the kernel regime and only requires O(κ ln d) samples for learning a κ-sparse overparametrized linear model in ℝ^d, while GD initialized in the kernel regime requires Ω(d) samples. This upper bound is minimax optimal and improves the previous Õ(κ²) upper bound (HaoChen et al., 2020).
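A minimal sketch of label-noise SGD in the sparse-recovery application above, using the standard quadratic parametrization f(u, x) = <u∘u, x> (all hyperparameters and the horizon are illustrative assumptions): fresh label noise is injected at every step, and the theory predicts that over the η^−2 timescale the effective weights u∘u drift along the zero-loss manifold toward the sparse solution, even from large initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                           # overparametrized: d >> n
X = rng.standard_normal((n, d))
w_true = np.zeros(d); w_true[:3] = 1.0   # 3-sparse ground truth
y = X @ w_true

# Quadratically parametrized linear model: predictions are <u*u, x>.
u = np.ones(d)          # large initialization: plain GD would stay in the kernel regime
eta, sigma = 1e-3, 0.5
for step in range(500_000):
    i = rng.integers(n)
    y_noisy = y[i] + sigma * rng.standard_normal()   # fresh label noise each step
    resid = X[i] @ (u * u) - y_noisy
    u -= eta * resid * 2 * u * X[i]                  # SGD step on the squared loss
# Effective weights; expected to concentrate on the first 3 coordinates over long horizons
print(np.round(u * u, 2)[:6])
```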
Bio: Zhiyuan Li is a PhD candidate in the Department of Computer Science at Princeton University, advised by Sanjeev Arora. Previously, he obtained his bachelor's degree in Computer Science from Tsinghua University. He has also spent time as a research intern at Google Research. His current research goal is to develop a mathematical theory towards a better understanding of modern deep learning, as well as to design more efficient and principled machine learning methods using theoretical insights. He is a recipient of the Microsoft Research PhD Fellowship (2020).
Theoretical Foundations of Multiagent Reinforcement Learning
Talk Information: Reinforcement learning (RL) has made substantial empirical progress in solving hard AI challenges in the past few years. A large fraction of this progress—Go, Dota 2, Starcraft 2, economic simulation, social behavior learning, and so on—comes from multi-agent RL, that is, sequential decision making involving more than one agent. While the theoretical study of single-agent RL has a long history and vastly growing recent interest, multi-agent RL (MARL) theory is arguably a newer and less developed field. In this talk, I will present our recent theoretical developments in MARL. Starting with basic formulations, we present several provably efficient algorithms for finding equilibria in MARL problems, addressing challenges unique to MARL such as the curse of multiple agents and the design of decentralized algorithms.
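The simplest instance of the "finding equilibria" problem that the talk generalizes to sequential settings: two no-regret (multiplicative-weights) learners playing a zero-sum matrix game, whose average strategies converge to a Nash equilibrium. A minimal sketch with illustrative step size and horizon:

```python
import numpy as np

def self_play_nash(A, T=5000, eta=0.05):
    """Self-play with multiplicative weights on a zero-sum matrix game.

    The row player minimizes p^T A q, the column player maximizes it;
    the time-averaged strategies approximate a Nash equilibrium.
    """
    m, n = A.shape
    p, q = np.ones(m) / m, np.ones(n) / n
    p_avg, q_avg = np.zeros(m), np.zeros(n)
    for _ in range(T):
        p *= np.exp(-eta * (A @ q));  p /= p.sum()    # row player: lower loss -> more weight
        q *= np.exp(eta * (A.T @ p)); q /= q.sum()    # column player: higher gain -> more weight
        p_avg += p; q_avg += q
    return p_avg / T, q_avg / T

A = np.array([[0., 1., -1.], [-1., 0., 1.], [1., -1., 0.]])  # rock-paper-scissors
p, q = self_play_nash(A)
print(p, q)  # both approach the uniform equilibrium (1/3, 1/3, 1/3)
```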
Bio: Chi Jin is an assistant professor in the Electrical and Computer Engineering department of Princeton University. He obtained his PhD degree in Computer Science at the University of California, Berkeley, advised by Michael I. Jordan. His research mainly focuses on theoretical machine learning, with special emphasis on nonconvex optimization and reinforcement learning. His representative work includes proving that noisy gradient descent escapes saddle points efficiently and proving the efficiency of Q-learning and least-squares value iteration when combined with optimism in reinforcement learning.
Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias
Talk Information: The generalization mystery of overparametrized deep nets has motivated efforts to understand how gradient descent (GD) converges to low-loss solutions that generalize well. A recent sequence of results (Lyu and Li, 2020; Chizat and Bach, 2020; Ji and Telgarsky, 2020) provides theoretical evidence that GD may converge to the "max-margin" solution with zero loss, which presumably generalizes well. However, the global optimality of the margin has been proved only in settings where the neural nets are infinitely or exponentially wide. In this talk, I will present our recent paper establishing this global optimality for two-layer Leaky ReLU nets trained with gradient flow on linearly separable and symmetric data, regardless of width. The analysis also gives some theoretical justification for recent empirical findings (Kalimeris et al., 2019) on the so-called simplicity bias of GD towards linear or other "simple" classes of solutions, especially early in training. On the pessimistic side, the paper suggests that such results are fragile: a simple data manipulation can make gradient flow converge to a linear classifier with a suboptimal margin.
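A toy replication of the paper's setting, with assumed hyperparameters (the paper analyzes gradient flow, approximated here by full-batch GD with a small step size): a two-layer Leaky ReLU net trained with logistic loss on linearly separable, symmetric data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, width = 50, 2, 32
X = rng.standard_normal((n, d)); y = np.sign(X[:, 0])
X, y = np.vstack([X, -X]), np.concatenate([y, -y])   # symmetric data: (-x, -y) for each (x, y)

W = 0.1 * rng.standard_normal((width, d))            # first-layer weights
a = 0.1 * rng.standard_normal(width)                 # second-layer weights
leaky = lambda z: np.where(z > 0, z, 0.1 * z)        # Leaky ReLU with slope 0.1
dleaky = lambda z: np.where(z > 0, 1.0, 0.1)

eta = 0.1
for _ in range(5000):                                # full-batch GD on the logistic loss
    pre = X @ W.T
    out = leaky(pre) @ a
    g = -y / (1 + np.exp(np.minimum(y * out, 30))) / len(y)  # d(mean loss)/d(out), clipped for stability
    a -= eta * leaky(pre).T @ g
    W -= eta * (dleaky(pre) * np.outer(g, a)).T @ X
print("train acc:", np.mean(np.sign(leaky(X @ W.T) @ a) == y))  # should reach 1.0
```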
Bio: Kaifeng Lyu is a first-year PhD student in the Department of Computer Science at Princeton University, advised by Sanjeev Arora. Previously, he obtained his bachelor’s degree in Computer Science from Tsinghua University. His current research interest lies in building a better theoretical foundation for deep learning that is capable of explaining the various optimization and generalization mysteries in deep learning experiments.
When does SGD work?
Talk Information: Multi-epoch, small-batch Stochastic Gradient Descent (SGD) is the method of choice for learning with large over-parameterized non-convex models. However, our understanding of when and why SGD succeeds in learning is still quite rudimentary. Implicit regularization has emerged as a popular paradigm to explain the empirical success of SGD. In this talk, I will discuss why implicit regularization is not a sufficient explanation, and present a new framework for understanding when SGD works.
Bio: Ayush is a PhD student in the Computer Science department at Cornell University, advised by Professor Karthik Sridharan and Professor Robert D. Kleinberg. His research interests span optimization, online learning, reinforcement learning and control, and the interplay between them. Before coming to Cornell, he spent a year at Google as part of the Brain Residency program. Before Google, he completed his undergraduate studies in computer science at IIT Kanpur in India, where he was awarded the President's Gold Medal. Ayush will be starting a postdoc at MIT with Sasha Rakhlin in Fall 2022.
What Can Conformal Inference Offer To Statistics?
Talk Information: Valid uncertainty quantification is crucial for high-stakes decision-making. Conformal inference provides a powerful framework that can wrap around any black-box prediction algorithm, like random forests or deep neural networks, and generate prediction intervals with distribution-free coverage guarantees. In this talk, I will describe how conformal inference can be adapted to handle more complicated inferential tasks in statistics, with a focus on two important statistical problems: counterfactual inference and time-to-event analysis. Unlike standard prediction problems, the predictive targets are only partially observable owing to selection and censoring. In addition, I will briefly talk about generalizing conformal inference to other statistical problems, including elections, outlier detection, and risk-calibrated predictions.
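For readers new to the framework, here is a minimal sketch of basic split conformal prediction, the building block that the talk adapts to counterfactuals and censored outcomes. A random forest plays the black box; the data-generating process and split sizes are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def split_conformal(X_train, y_train, X_cal, y_cal, X_test, alpha=0.1):
    """Split conformal prediction intervals around any black-box regressor.

    Guarantees P(y in interval) >= 1 - alpha under exchangeability,
    with no assumptions on the model or the data distribution.
    """
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
    scores = np.abs(y_cal - model.predict(X_cal))        # calibration residuals
    n = len(scores)
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    preds = model.predict(X_test)
    return preds - q, preds + q

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(3000, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(3000)
lo, hi = split_conformal(X[:1000], y[:1000], X[1000:2000], y[1000:2000], X[2000:])
print("empirical coverage:", np.mean((y[2000:] >= lo) & (y[2000:] <= hi)))  # ~0.9
```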
Bio: Lihua Lei obtained his Ph.D. in statistics at UC Berkeley and is currently a postdoctoral researcher in Statistics at Stanford University. He will be joining the Stanford Graduate School of Business as an Assistant Professor of Economics in Fall 2022. His research interests include conformal inference, causal inference, multiple hypothesis testing, network clustering, stochastic optimization, and econometrics.
GANs vs. Maximum Likelihood: A Theoretical Comparison
Talk Information: Generative adversarial networks (GANs) represent a zero-sum game played between a generator and a discriminator to learn the distribution of observed data. Since their introduction in 2014, GANs have achieved impressive results on a wide array of machine learning tasks and have managed to outperform maximum likelihood methods on benchmark image datasets. In this seminar, we compare and contrast the GAN and maximum likelihood frameworks, and provide evidence that the game-based design of GANs could contribute to a superior generalization performance from training examples to unseen test data in comparison to deep maximum likelihood methods. Furthermore, we demonstrate that the common divergence/distance measures targeted by GANs are more suitable for learning multi-modal distributions than the KL-divergence optimized by maximum likelihood learners. We discuss numerical results supporting our theoretical comparison of the GAN and maximum likelihood approaches.
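The claim about divergence measures can be seen in a small worked example: when a single Gaussian is fit to a bimodal target, the KL divergence that maximum likelihood minimizes penalizes missing a mode far more than the Jensen-Shannon divergence targeted by the original GAN. A minimal numerical sketch (the grid and mixture are illustrative):

```python
import numpy as np

# Bimodal target: mixture of N(-4, 1) and N(4, 1). Candidate model: N(mu, 1).
xs = np.linspace(-10, 10, 2001)
p = 0.5 * (np.exp(-(xs + 4)**2 / 2) + np.exp(-(xs - 4)**2 / 2)) / np.sqrt(2 * np.pi)

def kl(p, q):   # KL(p || q): the divergence maximum likelihood minimizes
    return np.trapz(p * np.log((p + 1e-12) / (q + 1e-12)), xs)

def jsd(p, q):  # Jensen-Shannon divergence: targeted by the original GAN
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

for mu in [0.0, 4.0]:
    q = np.exp(-(xs - mu)**2 / 2) / np.sqrt(2 * np.pi)
    print(f"mu={mu}: KL={kl(p, q):.2f}  JSD={jsd(p, q):.3f}")
# KL strongly prefers mu=0 (covering both modes, poorly), while JSD prefers
# mu=4 (matching one mode well) -- a toy version of the talk's comparison.
```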
Bio: Farzan Farnia is an Assistant Professor of Computer Science and Engineering at The Chinese University of Hong Kong. Prior to joining CUHK, he was a postdoctoral research associate at the Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, from 2019 to 2021. He received his master's and PhD degrees in electrical engineering from Stanford University and his bachelor's degrees in electrical engineering and mathematics from Sharif University of Technology. At Stanford, he was a graduate research assistant at the Information Systems Laboratory, advised by Professor David Tse. Farzan's research interests span statistical learning theory, information theory, and convex optimization. He was a recipient of the Stanford Graduate Fellowship (Sequoia Capital Fellowship) from 2013 to 2016 and the Numerical Technology Founders Prize as the second top performer in Stanford's electrical engineering PhD qualifying exams in 2014.
Proximal-primal-dual algorithms for nonconvex optimization problems and landscape analysis for narrow neural networks
Talk Information: This talk consists of two parts. In the first part, we introduce a proximal-primal-dual framework for solving constrained nonconvex optimization problems. Consider the minimization of a nonconvex differentiable function over a convex, closed set under linear equality constraints. It is well known that the lower bound on the iteration complexity of first-order algorithms to attain an $\epsilon$-solution for nonconvex optimization is $\Omega(1/\epsilon^2)$. In this talk, we discuss how to design first-order algorithms that achieve this lower bound. Popular approaches for solving constrained optimization problems are the augmented Lagrangian method (ALM) and the alternating direction method of multipliers (ADMM). However, these methods are not guaranteed to converge for constrained nonconvex problems: numerical examples show that they can exhibit "oscillation" and may fail to converge. We introduce a smoothed proximal augmented Lagrangian method and its multi-block version. A distinctive feature of this method is the introduction of a "smoothed" (i.e., exponentially weighted) sequence of primal iterates, and the inclusion, at each iteration, of a quadratic proximal term centered at the current smoothed primal iterate in the augmented Lagrangian function. The resulting proximal augmented Lagrangian function is inexactly minimized at each iteration, while the dual multiplier vector is updated by standard dual ascent (a code sketch of this update scheme follows the abstract). Using a proximal-primal-dual framework and some novel error bounds, we prove that our algorithm converges with an $\mathcal{O}(1/\epsilon^2)$ iteration complexity for problems with polyhedral constraints, and achieves the same complexity for problems with compact, convex constraints under common regularity assumptions. In addition, we show that our algorithmic framework leads to distributed algorithms, which are well suited for large-scale problems; we prove their $\mathcal{O}(1/\epsilon^2)$ iteration complexity and apply them to efficiently train machine learning models with distributed samples and distributed features. Moreover, we extend the framework to min-max problems and show the $\mathcal{O}(1/\epsilon^2)$ iteration complexity for a family of nonconvex-concave problems -- minimizing the point-wise maximum of a finite collection of nonconvex functions -- which covers a wide range of applications in machine learning.

In the second part, we discuss landscape analysis for the loss surface of narrow neural networks. First, we prove that as long as the width $m \ge 2n/d$ (where $n$ is the number of samples and $d$ is the input dimension), the network's expressivity is strong, i.e., there exists at least one global minimizer with zero training loss. Second, we identify a nice local region with no local minima or saddle points; nevertheless, it is not clear whether gradient descent stays in this region. Third, we consider a constrained optimization formulation whose feasible region is this nice local region, and prove that every KKT point is a nearly global minimizer. It is expected that projected gradient methods converge to KKT points under mild technical conditions, but we leave the rigorous convergence analysis to future work. Thorough numerical results show that projected gradient methods on this constrained formulation significantly outperform SGD for training narrow neural nets.
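Returning to the first part: a minimal sketch of the smoothed proximal ALM updates described above, for min f(x) s.t. Ax = b, x in X. The inner loop inexactly minimizes the proximal augmented Lagrangian by projected gradient steps; all parameter values here are illustrative guesses, whereas the talk's analysis prescribes choices that guarantee the O(1/eps^2) complexity.

```python
import numpy as np

def smoothed_prox_alm(grad_f, proj, A, b, x0, outer=500, inner=20,
                      rho=10.0, p=10.0, alpha=1.0, beta=0.5, step=1e-2):
    """Smoothed proximal ALM sketch for  min f(x)  s.t.  Ax = b, x in X."""
    x, z, y = x0.copy(), x0.copy(), np.zeros(A.shape[0])
    for _ in range(outer):
        for _ in range(inner):  # inexactly minimize the proximal augmented Lagrangian in x
            g = grad_f(x) + A.T @ (y + rho * (A @ x - b)) + p * (x - z)
            x = proj(x - step * g)
        y = y + alpha * (A @ x - b)   # standard dual ascent
        z = z + beta * (x - z)        # exponentially weighted (smoothed) primal sequence
    return x, y

# Toy nonconvex problem: min sum(cos(x))  s.t.  sum(x) = 1, 0 <= x <= 2
grad_f = lambda x: -np.sin(x)
proj = lambda x: np.clip(x, 0.0, 2.0)
A, b = np.ones((1, 4)), np.array([1.0])
x, y = smoothed_prox_alm(grad_f, proj, A, b, x0=np.array([0.1, 0.2, 0.3, 0.4]))
print(x, A @ x - b)  # approximately feasible stationary point
```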
Bio: Jiawei Zhang is a postdoctoral associate in the Laboratory for Information and Decision Systems (LIDS) at MIT, working with Prof. Asuman Ozdaglar. He obtained his Ph.D. from the Chinese University of Hong Kong (Shenzhen), advised by Professor Zhi-Quan (Tom) Luo. Previously, he obtained a B.Sc. in Mathematics (Hua Loo-Keng Talent Program) from the University of Science and Technology of China. His research interests include optimization theory and algorithms for nonconvex problems.
Efficient Optimization Methods for Machine Learning
Talk Information: In this talk, I will first discuss some classical optimization algorithms and results, along with my contributions to designing efficient algorithms for convex and nonconvex optimization. Then I will introduce some state-of-the-art algorithms and results for communication-efficient distributed and federated learning.
Bio: Zhize Li has been a Research Scientist at Carnegie Mellon University since 2022. He received his PhD in Computer Science from the Institute for Interdisciplinary Information Sciences at Tsinghua University in 2019. His research mainly focuses on mathematical optimization, federated learning, and machine learning.
Treatment effect estimation with noisy conditioning variables
Talk Information: In this paper, I develop a new identification strategy for treatment effects when one observes noisy measurements of unobserved confounding factors. I use proxy variables to construct a random variable conditional on which treatment variables become exogenous. The key idea is that, under appropriate conditions, there exists a one-to-one mapping between the distribution of unobserved confounding factors and the distribution of proxies. To ensure sufficient variation in the constructed control variable, I use an additional variable, termed the excluded variable, which satisfies certain exclusion restrictions and relevance conditions. I establish asymptotic distributional results for flexible parametric and nonparametric estimators of the average structural function. I illustrate the empirical relevance and usefulness of my results by estimating the causal effects of college selectivity on wages.
Bio: Kenichi Nagasawa is an Assistant Professor of Economics at the University of Warwick. His main research interest lies in theoretical econometrics. He has worked on bootstrap-based inference procedures for estimators with non-standard asymptotics and identification of causal effects in selection models in economics. He obtained a Ph.D. in Economics from the University of Michigan in 2019.
How well do deep neural networks (DNNs) learn causal effects, and how can we certify DNN-based causal effect estimators using a new stable higher-order influence function (HOIF) estimator?
Talk Information: Under causal sufficiency, causal effect estimation is by now a quite well-understood problem in both the infinite-dimensional nonparametric regime and the high-dimensional sparse linear model regime. Yet, as DNNs start to replace the "old" statistical techniques in practice, it is important to study how well we can estimate causal effects of interest with DNNs. In the first part of my talk, I will present a new set of theoretical properties of DNN-based causal effect estimators that improve the results of Farrell, Liang and Misra (ECTA, 2022) in several aspects, based on the metric entropy and a new approximation property of DNNs. In particular, our result does not require a sparse DNN architecture yet allows for sparsity in the feature space, both of which are desirable in many applied settings. However, via a carefully crafted numerical experiment, we found a significant gap between theory and practice in such statistical guarantees, not only in our work but also in almost all other work on using DNNs to estimate causal effects.

To certify whether DNN-based causal effect estimators really work in practice, we use an assumption-lean hypothesis testing framework developed in two of our recent papers (Liu et al., Statistical Science, 2020; Liu et al., 2022+), based on so-called Higher-Order Influence Functions (HOIFs). However, the HOIF-based approach has been viewed as "impractical" due to (1) the use of infinite-order U-statistics and (2) the numerical instability that we observed in practice. To address the first issue, we introduce an early-stopping criterion that only requires the computation of low-order U-statistics. To address the second, we develop a new stable HOIF that is provably both statistically optimal and numerically stable. This new HOIF can also be shown to be "the" HOIF under the fixed-design setting, answering a question raised by Edward Kennedy, Sivaraman Balakrishnan, and Larry Wasserman. We also demonstrate our approach in practice by revisiting the experiments from the first part of my talk. The talk is based on joint works with Siqi Xu and Zhonghua Liu (University of Hong Kong), and Chang Li (senior at SJTU --> PhD in statistics at UVA).
Bio: Lin Liu is an Assistant Professor in the Institute of Natural Sciences, School of Mathematical Sciences and SJTU-Yale Joint Center for Biostatistics and Data Science at SJTU. He obtained his PhD in biostatistics from Harvard University, supervised by Franziska Michor and Jamie Robins. His research focuses on non- and semi-parametric theory, causal inference, machine learning theory, statistical computing and applications in biomedical sciences.
Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond
Talk Information: In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods. Most existing analyses of these methods assume independent and unbiased gradient estimates obtained via with-replacement sampling. In contrast, we study shuffling-based variants: minibatch and local Random Reshuffling, which draw stochastic gradients without replacement and are thus closer to practice. For smooth functions satisfying the Polyak-Łojasiewicz condition, we obtain convergence bounds (in the large epoch regime) which show that these shuffling-based variants converge faster than their with-replacement counterparts. Moreover, we prove matching lower bounds showing that our convergence analysis is tight. Finally, we propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
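A single-machine sketch of the sampling distinction at the heart of the talk: Random Reshuffling draws each epoch as a random permutation (without replacement), versus i.i.d. with-replacement sampling. On a toy PL objective (here, a strongly convex quadratic), the shuffling-based variant typically ends each run closer to the optimum, consistent with the large-epoch-regime theory. Hyperparameters are illustrative; the talk's minibatch/local variants distribute this across machines:

```python
import numpy as np

def sgd_epochs(grads, x0, eta, epochs, shuffle, rng):
    """Run SGD for several epochs over n component functions.

    shuffle=True  -> Random Reshuffling (sample without replacement each epoch)
    shuffle=False -> with-replacement sampling (i.i.d. indices)
    """
    n, x = len(grads), x0
    for _ in range(epochs):
        idx = rng.permutation(n) if shuffle else rng.integers(n, size=n)
        for i in idx:
            x = x - eta * grads[i](x)
    return x

# Toy objective: f(x) = (1/n) sum_i (x - a_i)^2 / 2, minimized at mean(a)
rng = np.random.default_rng(0)
a = rng.standard_normal(100)
grads = [lambda x, ai=ai: x - ai for ai in a]

for shuffle in (True, False):
    x = sgd_epochs(grads, x0=5.0, eta=0.01, epochs=50, shuffle=shuffle,
                   rng=np.random.default_rng(1))
    print("RR" if shuffle else "with-repl", abs(x - a.mean()))
```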
Bio: Chulhee “Charlie” Yun is an assistant professor at KAIST Kim Jaechul Graduate School of AI, where he directs the Optimization & Machine Learning Laboratory. He recently finished his PhD from the Laboratory for Information and Decision Systems and the Department of Electrical Engineering & Computer Science at MIT, under the joint supervision of Prof. Suvrit Sra and Prof. Ali Jadbabaie. Charlie’s research spans optimization and machine learning theory, with the driving goal of bridging the gap between theory and practice. His recent research topics include the analysis of shuffling-based optimization algorithms, distributed/federated learning, and neural network optimization.
Exploring asymptotic convergence properties of stochastic optimization methods
Talk Information: In this talk, we explore and discuss novel convergence properties and analysis techniques for a variety of stochastic optimization methods. We first present a fundamental unified convergence theorem that can be used to derive expected and almost sure convergence results for a broad class of stochastic optimization approaches. Our unified theorem only requires verifying several representative conditions and is not tailored to specific algorithms. As a direct application, we recover expected and almost sure convergence results for the stochastic gradient method (SGD) and random reshuffling (RR) under general settings. Moreover, our unified plug-in-type analysis allows us to derive new expected and almost sure convergence results for the stochastic proximal gradient method (prox-SGD) and stochastic model-based methods (SMM) for nonsmooth nonconvex optimization problems. In the second part of the talk, we discuss several recent extensions of the well-known Kurdyka-Łojasiewicz (KL) framework to stochastic optimization. The standard KL-inequality-based analysis applies only to algorithms with a certain descent property, which typically limits its applicability to deterministic settings. Here, we present novel convergence analyses for the non-descent methods RR and a normal-map-based prox-SGD (nor-SGD) with diminishing step sizes, based on the KL inequality. Specifically, we establish strong limit-point convergence results: with appropriate diminishing step sizes, the whole sequence of iterates generated by RR or nor-SGD converges to a single stationary point in an almost sure sense. In addition, asymptotic rates of convergence are obtainable, depending on the KL exponent and the selected diminishing step sizes. Our results provide new and general tools for analyzing the convergence behavior of stochastic non-descent methods for nonsmooth and nonconvex optimization.
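As a concrete instance of the prox-SGD scheme covered by the unified theorem, here is a minimal sketch for a composite problem f(x) + lam*||x||_1, with diminishing step sizes eta_k proportional to 1/sqrt(k) and the proximal step realized by soft-thresholding. The problem instance and all constants are illustrative choices, not the talk's:

```python
import numpy as np

def prox_sgd_l1(grad_f_i, n, x0, lam, steps, rng):
    """prox-SGD for  min (1/n) sum_i f_i(x) + lam * ||x||_1
    with diminishing step sizes eta_k = 0.05 / sqrt(k)."""
    x = x0.copy()
    for k in range(1, steps + 1):
        eta = 0.05 / np.sqrt(k)
        g = grad_f_i(x, rng.integers(n))                          # stochastic gradient of f
        z = x - eta * g
        x = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)   # soft-thresholding prox
    return x

rng = np.random.default_rng(0)
n, d = 200, 20
A = rng.standard_normal((n, d))
x_star = np.zeros(d); x_star[:3] = 1.0
b = A @ x_star + 0.01 * rng.standard_normal(n)
grad_f_i = lambda x, i: (A[i] @ x - b[i]) * A[i]   # gradient of f_i(x) = (a_i.x - b_i)^2 / 2
x = prox_sgd_l1(grad_f_i, n, np.zeros(d), lam=0.1, steps=20000, rng=rng)
print(np.round(x, 2))  # should be approximately 3-sparse
```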
Bio: Andre Milzarek is an assistant professor at the School of Data Science (SDS) at the Chinese University of Hong Kong, Shenzhen. He received his master's degree with honours and his doctoral degree in mathematics from the Technical University of Munich in Germany, under the supervision of Michael Ulbrich, in 2013 and 2016, respectively. He was a postdoctoral researcher at the Beijing International Center for Mathematical Research at Peking University from 2017 to March 2019. His main research directions and interests cover nonsmooth optimization, large-scale and stochastic optimization, and second-order methods and theory. From 2010 to 2012, he was supported by the Max Weber Program of the state of Bavaria, and in 2017 he received the Boya Postdoctoral Fellowship at Peking University. Dr. Milzarek's work has been published in top-tier journals including Mathematical Programming, SIAM Journal on Optimization, and SIAM Journal on Matrix Analysis and Applications. His research has been funded by the research institutes SRIBD and AIRS and by the National Natural Science Foundation of China (NSFC).