Vatsal Sharan


Assistant Professor,
Thomas Lord Department of Computer Science,
University of Southern California

Email: vsharan at usc dot edu

Office: SAL 220

About

I'm an assistant professor of computer science at USC. I am part of the Theory Group, the Machine Learning Center, and the Center for AI in Society at USC. Previously, I was a postdoc at MIT hosted by Ankur Moitra. I obtained my Ph.D. from Stanford, advised by Greg Valiant.

I work on the foundations of machine learning, and my interests mostly lie at the intersection of machine learning, theoretical computer science, and statistics. The goal of my research is to study and discover the underlying principles which govern learning, and to leverage this understanding to build practical machine learning systems which are more efficient, fair, and robust. A large part of my work examines questions that arise from modern applications and challenges of machine learning. If you're interested in learning more, some of my representative work is highlighted below.

My research is supported by an NSF CAREER award (2023), and Amazon Research Awards (2022 and 2024). This support is very gratefully acknowledged.

I'm also part of the Learning Theory Alliance (LeT-All), a community-building and mentorship initiative for the learning theory community. Here is my CV, which has a more complete list of activities.

Some Representative Publications | All Publications

Using algorithms to understand Transformers, and using Transformers to understand algorithms

Theory can provide a bird's-eye view of the landscape of information and computation, and of how they interact in learning problems. Can we use this computational and information-theoretic understanding to understand Transformers? On the flip side, can we use Transformers to explore this landscape, and to understand and discover algorithms and data structures?
Here is a talk which covers some of this work (slides).
  • Transformers Learn to Achieve Second-Order Convergence Rates for In-Context Linear Regression
    Deqing Fu, Tian-Qi Chen, Robin Jia, Vatsal Sharan
    NeurIPS, 2024
    SoCal NLP Symposium 2023 Best Paper Award
    arXiv | pdf

  • Pre-trained Large Language Models Use Fourier Features to Compute Addition
    Tianyi Zhou, Deqing Fu, Vatsal Sharan, Robin Jia
    NeurIPS, 2024
    arXiv | pdf

  • Simplicity Bias of Transformers to Learn Low Sensitivity Functions
    Bhavya Vasudeva, Deqing Fu, Tianyi Zhou, Elliott Kau, Youqi Huang, Vatsal Sharan
    arXiv, 2024
    arXiv | pdf

  • Discovering Data Structures: Nearest Neighbor Search and Beyond
    Omar Salemohamed, Laurent Charlin, Shivam Garg, Vatsal Sharan, Gregory Valiant
    arXiv, 2024
    arXiv | pdf

Some more work along these lines
  • Mitigating Simplicity Bias in Deep Learning for Improved OOD Generalization and Robustness
    Bhavya Vasudeva, Kameron Shahabi, Vatsal Sharan
    TMLR, 2024
    arXiv | pdf

  • One Network Fits All? Modular versus Monolithic Task Formulations in Neural Networks Learning
    Atish Agarwala, Abhimanyu Das, Brendan Juba, Rina Panigrahy, Vatsal Sharan, Xin Wang, Qiuyi Zhang
    ICLR, 2021
    abstract | arXiv | pdf | video

A multi-group perspective to go beyond loss minimization in ML

Minimizing some loss function on average across all datapoints is the dominant paradigm in ML, but applications of ML in societal systems often involve more complex considerations. Different individuals participating in the system may have their own loss functions, it may not be possible to make decisions for these individuals in isolation, and we may care about the model’s behavior on various groups of individuals and not just on average across all of them. Some of my work here uses a multigroup perspective --- which examines the model's predictions on a large number of groups within the population --- to provide solutions to some of the above problems.
Here is a talk which covers some of this work (slides).
  • When is Multicalibration Post-Processing Necessary?
    Dutch Hansen, Siddartha Devic, Preetum Nakkiran, Vatsal Sharan
    NeurIPS, 2024
    arXiv | pdf

  • Stability and Multigroup Fairness in Ranking with Uncertain Predictions
    Siddartha Devic, Aleksandra Korolova, David Kempe, Vatsal Sharan
    ICML, 2024
    Non-archival at FORC, 2024
    arXiv | pdf | video

  • Omnipredictors
    Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, Udi Wieder
    ITCS, 2022
    arXiv | pdf

Some more work along these lines
  • Optimal Multiclass U-Calibration Error and Beyond
    Haipeng Luo, Spandan Senapati, Vatsal Sharan
    NeurIPS, 2024
    arXiv | pdf

  • Fairness in Matching under Uncertainty
    Siddartha Devic, David Kempe, Vatsal Sharan, Aleksandra Korolova
    ICML, 2023
    Non-archival at EAAMO, 2023
    arXiv | pdf

  • KL Divergence Estimation with Multi-group Attribution
    Parikshit Gopalan, Nina Narodytska, Omer Reingold, Vatsal Sharan, Udi Wieder
    arXiv, 2022
    arXiv | pdf

  • Multicalibrated Partitions for Importance Weights
    Parikshit Gopalan, Omer Reingold, Vatsal Sharan, Udi Wieder
    ALT, 2022
    arXiv | pdf

Memory as a lens to understand efficient learning and optimization

Classical learning theory mainly focuses on the number of operations performed by the algorithm as a proxy for its running time. However, since growth in available processing power has outpaced growth in available memory by many orders of magnitude, memory rather than compute has become the primary bottleneck in many applications. Despite this, and even though memory has traditionally been one of the most fundamental computational resources in theoretical computer science, very little is known about its role in solving learning tasks. My work investigates the role of memory in learning, and whether memory can serve as a clear dividing line between 'efficient' and 'expensive' techniques.
Here is a talk which covers some of this work (slides).
  • Efficient Convex Optimization Requires Superlinear Memory
    Annie Marsden, Vatsal Sharan, Aaron Sidford, Gregory Valiant
    COLT, 2022
    Best Paper Award
    Invited to IJCAI 2023 Sister Conference Notable Paper Track
    Journal version in JACM, 2024
    arXiv | pdf

  • Big-Step-Little-Step: Efficient Gradient Methods for Objectives with Multiple Scales
    Jonathan Kelner, Annie Marsden, Vatsal Sharan, Aaron Sidford, Gregory Valiant, Honglin Yuan
    COLT, 2022
    arXiv | pdf

  • Memory-Sample Tradeoffs for Linear Regression with Small Error
    Vatsal Sharan, Aaron Sidford, Gregory Valiant
    STOC, 2019
    abstract | pdf | arXiv

  • What is the role of memory in continuous optimization and learning? Are there inherent trade-offs between the available memory and the data requirement? Is it possible to achieve the sample complexity of second-order optimization methods with significantly less memory? We raise these questions, and prove the first nontrivial lower bounds for linear regression in the super-linear memory regime: there is a gap between the sample complexity of algorithms with quadratic memory and that of any algorithm with sub-quadratic memory.

Some more work along these lines
  • NeuroSketch: A Neural Network Method for Fast and Approximate Evaluation of Range Aggregate Queries
    Sepanta Zeighami, Vatsal Sharan, Cyrus Shahabi
    SIGMOD, 2023
    arXiv | pdf

  • Compressed Factorization: Fast and Accurate Low-Rank Factorization of Compressively-Sensed Data
    Vatsal Sharan, Kai Sheng Tai, Peter Bailis, Gregory Valiant
    ICML, 2019
    abstract | pdf | arXiv | code | spotlight video

  • What learning algorithms can be run directly on compressively-sensed data? If we compress a high dimensional matrix via a random projection, then what is the relationship between the factors of the original matrix and the factors of the compressed matrix? While it is well-known that random projections preserve a number of geometric properties of a dataset, this work shows that random projections can also preserve certain solutions to non-convex, NP-Hard problems like non-negative matrix factorization---both in theory and in practice.

    Jump to 53:20 in this video.
  • Efficient Anomaly Detection via Matrix Sketching
    Vatsal Sharan, Parikshit Gopalan, Udi Wieder
    NeurIPS, 2018
    abstract | pdf | arXiv

  • PCA-based anomaly scores are commonly used for finding anomalies in high-dimensional data. The naive algorithms for computing these scores explicitly compute the PCA of the covariance matrix, which uses space quadratic in the dimensionality of the data. Using matrix sketching tools and new matrix perturbation inequalities, we give the first streaming algorithms for computing these scores that use space that is linear or sublinear in the dimension. (A toy illustration of these scores appears after this list.)

  • Moment-Based Quantile Sketches for Efficient High Cardinality Aggregation Queries
    Edward Gan, Jialin Ding, Kai Sheng Tai, Vatsal Sharan, Peter Bailis
    VLDB, 2018
    abstract | pdf | arXiv | code

  • We propose a compact and efficiently mergeable sketch for answering quantile queries on a dataset. This data structure, which we refer to as the moments sketch, operates with a small memory footprint (200 bytes) and achieves computationally efficient (50ns) merges by tracking only a set of summary statistics, notably the sample moments. (A toy code sketch of this structure appears after this list.)

  • Prediction with a Short Memory
    Vatsal Sharan, Sham Kakade, Percy Liang, Gregory Valiant
    STOC, 2018
    abstract | pdf | arXiv | blog | video

  • When is it necessary to remember significant information from the distant past to make good predictions about the future in sequential prediction tasks? How do we consolidate and reference memories about the past in order to predict the future? This work studies these questions, and the findings are perhaps counterintuitive: we show that a simple (Markov) model, which bases its predictions only on the most recent observations and the statistics of short windows, can achieve small error on average with respect to a large class of sequences, such as those generated by Hidden Markov Models (HMMs). This shows that long-term memory may not be necessary for making good predictions on average, even on sequences generated by complex processes that have long-range dependencies. (A toy example of such a short-memory predictor appears after this list.)

  • Sketching Linear Classifiers over Data Streams
    Kai Sheng Tai, Vatsal Sharan, Peter Bailis, Gregory Valiant
    SIGMOD, 2018
    abstract | pdf | arXiv | code

  • We introduce a new sub-linear space sketch---the Weight-Median Sketch---for learning compressed linear classifiers over data streams while supporting the efficient recovery of large-magnitude weights in the classifier. The Weight-Median Sketch adopts the core data structure used in the Count-Sketch, but, instead of sketching counts, it captures sketched gradient updates to the model parameters. We establish recovery guarantees for our sketch and demonstrate empirical improvements in memory-accuracy trade-offs over alternative memory-budgeted methods. (A toy version of the core data structure appears after this list.)
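
    For readers who want a concrete picture of the quantities in "Efficient Anomaly Detection via Matrix Sketching", here is a small illustrative Python snippet (not from the paper) that computes the two PCA-based anomaly scores in the naive, non-streaming way via a full SVD; the paper's contribution is approximating these scores in a stream with memory linear or sublinear in the dimension. The function name and the planted-outlier example are purely illustrative.

    ```python
    # Naive computation of PCA-based anomaly scores: rank-k leverage scores and
    # rank-k projection distances. This keeps the full data matrix in memory,
    # which is exactly the cost the sketching-based streaming algorithms avoid.
    import numpy as np

    def pca_anomaly_scores(X: np.ndarray, k: int):
        """Return (rank-k leverage score, rank-k projection distance) for each row of X."""
        Xc = X - X.mean(axis=0)
        U, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        leverage = np.sum(U[:, :k] ** 2, axis=1)        # rank-k leverage scores
        proj = Xc @ Vt[:k].T @ Vt[:k]                   # projection onto the top-k subspace
        residual = np.sum((Xc - proj) ** 2, axis=1)     # squared distance from that subspace
        return leverage, residual

    # Toy example: low-dimensional signal plus noise, with one planted anomaly.
    rng = np.random.default_rng(0)
    n, d, k = 500, 100, 10
    X = rng.standard_normal((n, k)) @ (5.0 * rng.standard_normal((k, d)))
    X += 0.1 * rng.standard_normal((n, d))
    X[17] += 3.0 * rng.standard_normal(d)               # anomaly off the signal subspace
    leverage, residual = pca_anomaly_scores(X, k)
    print(int(np.argmax(residual)))                     # expected: 17
    ```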
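
    The following toy Python class is my own illustrative sketch (not the released implementation) of the structure behind the moments sketch from "Moment-Based Quantile Sketches for Efficient High Cardinality Aggregation Queries", and of why merges are cheap: each sketch keeps only min, max, count, and the first k power sums, so two sketches over disjoint data merge by adding their statistics. The quantile-estimation step, which solves a maximum-entropy problem given the moments, is omitted here.

    ```python
    # Toy moments-style sketch: a constant-size summary (min, max, count, first k
    # power sums) with O(k) inserts and O(k) merges, independent of the data size.
    import math

    class MomentsSketchToy:
        def __init__(self, k: int = 4):
            self.k = k
            self.count = 0
            self.minimum = math.inf
            self.maximum = -math.inf
            self.power_sums = [0.0] * k          # sum of x, x^2, ..., x^k

        def add(self, x: float) -> None:
            """Insert one value."""
            self.count += 1
            self.minimum = min(self.minimum, x)
            self.maximum = max(self.maximum, x)
            for i in range(self.k):
                self.power_sums[i] += x ** (i + 1)

        def merge(self, other: "MomentsSketchToy") -> None:
            """Merge a sketch built over disjoint data by adding summary statistics."""
            self.count += other.count
            self.minimum = min(self.minimum, other.minimum)
            self.maximum = max(self.maximum, other.maximum)
            self.power_sums = [a + b for a, b in zip(self.power_sums, other.power_sums)]

        def mean(self) -> float:
            return self.power_sums[0] / self.count

    # Sketch two data shards independently, then merge.
    left, right = MomentsSketchToy(), MomentsSketchToy()
    for x in range(1000):
        (left if x % 2 == 0 else right).add(float(x))
    left.merge(right)
    print(left.count, left.minimum, left.maximum, left.mean())   # 1000 0.0 999.0 499.5
    ```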
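
    As a toy illustration of the kind of short-memory predictor that "Prediction with a Short Memory" analyzes (purely illustrative, not code from the paper), the snippet below builds an order-ell Markov model that predicts the next symbol from empirical counts over length-ell windows; the class name and the periodic example sequence are my own choices.

    ```python
    # Toy order-ell Markov predictor: it conditions only on the last `ell` symbols
    # and predicts from empirical counts of short windows.
    from collections import Counter, defaultdict

    class ShortMemoryPredictor:
        def __init__(self, ell: int = 3):
            self.ell = ell
            self.counts = defaultdict(Counter)    # context tuple -> counts of next symbol

        def fit(self, sequence) -> None:
            """Record next-symbol counts for every length-ell window."""
            for i in range(self.ell, len(sequence)):
                context = tuple(sequence[i - self.ell:i])
                self.counts[context][sequence[i]] += 1

        def predict(self, recent):
            """Predict the next symbol from only the most recent ell observations."""
            context = tuple(recent[-self.ell:])
            if context not in self.counts:
                return None                        # unseen context
            return self.counts[context].most_common(1)[0][0]

    # Example: a sequence with period 6, predicted well from windows of length 3.
    seq = list("abcabd" * 40)
    model = ShortMemoryPredictor(ell=3)
    model.fit(seq)
    print(model.predict(list("cab")))              # prints 'd'
    ```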
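
    To make the core idea of the Weight-Median Sketch concrete, here is a small toy version in Python (my own simplification, not the released code from "Sketching Linear Classifiers over Data Streams"): a Count-Sketch-style table accumulates signed, hashed gradient updates, and an individual weight is estimated as a median across hash repetitions. The class name, the table sizes, and the synthetic gradient stream are illustrative assumptions; in a real deployment the gradients would come from streaming training examples.

    ```python
    # Toy Weight-Median-style sketch: hash each coordinate of the gradient update
    # (with a random sign) into a small table, and recover an individual weight as
    # the median of its signed buckets across rows.
    import numpy as np

    class WeightMedianSketchToy:
        def __init__(self, dim: int, rows: int = 5, width: int = 128, seed: int = 0):
            rng = np.random.default_rng(seed)
            self.table = np.zeros((rows, width))
            self.bucket = rng.integers(0, width, size=(rows, dim))   # hash: feature -> bucket
            self.sign = rng.choice([-1.0, 1.0], size=(rows, dim))    # random +/-1 signs

        def update(self, grad: np.ndarray, lr: float = 0.01) -> None:
            """Apply one gradient step to the sketched weight vector."""
            for r in range(self.table.shape[0]):
                np.add.at(self.table[r], self.bucket[r], -lr * self.sign[r] * grad)

        def estimate(self, j: int) -> float:
            """Estimate weight j as a median of its signed bucket values."""
            rows = np.arange(self.table.shape[0])
            return float(np.median(self.sign[rows, j] * self.table[rows, self.bucket[rows, j]]))

    # Synthetic gradient stream: small dense noise plus a persistent signal on feature 7,
    # so the true (unsketched) weight vector would end up large only at coordinate 7.
    rng = np.random.default_rng(1)
    d = 200
    sketch = WeightMedianSketchToy(dim=d)
    for _ in range(200):
        grad = 0.05 * rng.standard_normal(d)
        grad[7] -= 1.0
        sketch.update(grad)
    print(round(sketch.estimate(7), 2), round(sketch.estimate(3), 2))  # roughly 2.0 vs roughly 0.0
    ```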

Learning with limited (or synthetic) data

Modern ML techniques are data-hungry, but data is an expensive resource. Current ML applications have brought to light several interesting information-theoretic questions around learning with limited data. Some of my work provides a formulation to study the basic statistical task of synthetically augmenting a given set of samples, and explores notions of regularization --- a classical technique for data-efficient learning whose role still remains mysterious in deep learning.
  • Regularization and Optimal Multiclass Learning
    Julian Asilis, Siddartha Devic, Shaddin Dughmi, Vatsal Sharan, Shang-Hua Teng
    COLT, 2024
    arXiv | pdf

  • On the Statistical Complexity of Sample Amplification
    Brian Axelrod, Shivam Garg, Yanjun Han, Vatsal Sharan, Gregory Valiant
    Annals of Statistics, 2024
    arXiv | preprint pdf

Some more work along these lines
  • Transductive Sample Complexities Are Compact
    Julian Asilis, Siddartha Devic, Shaddin Dughmi, Vatsal Sharan, Shang-Hua Teng
    NeurIPS, 2024
    arXiv | pdf

  • Open Problem: Can Local Regularization Learn All Multiclass Problems?
    Julian Asilis, Siddartha Devic, Shaddin Dughmi, Vatsal Sharan, Shang-Hua Teng
    Open Problem @ COLT, 2024
    pdf

  • Sample Amplification: Increasing Dataset Size even when Learning is Impossible
    Brian Axelrod, Shivam Garg, Vatsal Sharan, Gregory Valiant
    ICML, 2020
    abstract | pdf | arXiv | video

  • Is learning a distribution always necessary for generating new samples from the distribution? We introduce the problem of "sample amplification": given n independent draws from a distribution D, to what extent is it possible to output a set of m > n datapoints that are indistinguishable from m i.i.d. draws from D? Curiously, we show that nontrivial amplification is often possible in the regime where n is too small to learn D to any nontrivial accuracy.

    Jump to 50:00 (or topic #27) in this video.

Students

I am very lucky to advise the following amazing Ph.D. students:

And the following amazing undergrad/Masters students:

  • Dutch Hansen
  • Anish Jayant
  • Jung Whan Lee
  • You Qi Huang (SURE program intern in Summer'23, mentored by Bhavya Vasudeva)
  • Natalie Abreu (graduated in Fall'23, now Ph.D. student at Harvard)
  • Aditya Prasad (graduated in Fall'24, now Ph.D. student at the University of Chicago)
  • Kameron Shahabi (graduated in Fall'24, now Ph.D. student at the University of Washington)
  • Qilin Ye (joint with Robin Jia, graduated in Fall'24, now M.S. student at Duke)
  • Devin Martin (SURE program intern in Summer'22, mentored by Bhavya Vasudeva)

Teaching