Vatsal Sharan

About

I'm an assistant professor of computer science at USC. I am a part of the Theory Group, Machine Learning Center and the Center for AI in Society at USC. Previously, I was a postdoc at MIT hosted by Ankur Moitra. I obtained my Ph.D. from Stanford advised by Greg Valiant.

I work on the foundations of machine learning, and my interests mostly lie in the intersection of machine learning, theoretical computer science and statistics. The goal of my research is to study and discover the underlying principles which govern learning, and to leverage this understanding to build practical machine learning systems which are more efficient, fair and robust. A large part of my work aims to inspect questions which arise from modern applications and challenges of machine learning. If you're interested in learning more, some of my representative work is highlighted below.

My research is supported by an NSF CAREER award (2023), and Amazon Research Awards (2022 and 2024). This support is very gratefully acknowledged.

I'm also a part of the Learning Theory Alliance (LeT-All), a community building and mentorship initiative for the learning theory community. Here is my CV, which has a more complete list of activities.

Some Representative Publications | All Publications

Using algorithms to understand Transformers, and using Transformers to understand algorithms

Theory can provide a bird's eye view of the landscape of information, computation and how they interact for learning problems. Can we use some of this computational and information theoretic understanding to understand Transformers? On the flip side, can we use Transformers to explore this landscape, and understand and discover algorithms and data structures?

Here is a talk which covers some of this work (slides).

Transformers Learn to Achieve Second-Order Convergence Rates for In-Context Linear Regression

Deqing Fu, Tian-Qi Chen, Robin Jia, Vatsal Sharan
Neurips, 2024
SoCal NLP Symposium 2023 Best Paper Award
arXiv | pdf
Pre-trained Large Language Models Use Fourier Features to Compute Addition

Tianyi Zhou, Deqing Fu, Vatsal Sharan, Robin Jia
Neurips, 2024
arXiv | pdf
Transformers Learn Low Sensitivity Functions: Investigations and Implications

Bhavya Vasudeva, Deqing Fu, Tianyi Zhou, Elliott Kau, Youqi Huang, Vatsal Sharan
ICLR, 2025
arXiv | pdf
Discovering Data Structures: Nearest Neighbor Search and Beyond

Omar Salemohamed, Laurent Charlin, Shivam Garg, Vatsal Sharan, Gregory Valiant
arXiv, 2024
arXiv | pdf

Some more work along these lines

Mitigating Simplicity Bias in Deep Learning for Improved OOD Generalization and Robustness

Bhavya Vasudeva, Kameron Shahabi, Vatsal Sharan
TMLR, 2024
arXiv | pdf
One Network Fits All? Modular versus Monolithic Task Formulations in Neural Networks Learning

Atish Agarwala, Abhimanyu Das, Brendan Juba, Rina Panigrahy, Vatsal Sharan, Xin Wang, Qiuyi Zhang
ICLR, 2021
abstract | arXiv | pdf | video

A multi-group perspective to go beyond loss minimization in ML

Minimizing some loss function on average across all datapoints is the dominant paradigm in ML, but applications of ML in societal systems often involve more complex considerations. Different individuals participating in the system may have their own loss functions, it may not be possible to make decisions for these individuals in isolation, and we may care about the model’s behavior on various groups of individuals and not just on average across all of them. Some of my work here uses a multigroup perspective --- which examines the model's predictions on a large number of groups within the population --- to provide solutions to some of the above problems.

Here is a talk which covers some of this work (slides).

When is Multicalibration Post-Processing Necessary?

Dutch Hansen, Siddartha Devic, Preetum Nakkiran, Vatsal Sharan
Neurips, 2024
arXiv | pdf
Stability and Multigroup Fairness in Ranking with Uncertain Predictions

Siddartha Devic, Aleksandra Korolova, David Kempe, Vatsal Sharan
ICML, 2024
Non-archival at FORC, 2024
arXiv | pdf | video
Omnipredictors

Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, Udi Wieder
ITCS, 2022
arXiv | pdf

Some more work along these lines

Optimal Multiclass U-Calibration Error and Beyond

Haipeng Luo, Spandan Senapati, Vatsal Sharan
Neurips, 2024
arXiv | pdf
Fairness in Matching under Uncertainty

Siddartha Devic, David Kempe, Vatsal Sharan, Aleksandra Korolova
ICML, 2023
Non-archival at EAAMO, 2023
arXiv | pdf
KL Divergence Estimation with Multi-group Attribution

Parikshit Gopalan, Nina Narodytska, Omer Reingold, Vatsal Sharan, Udi Wieder
arXiv, 2022
arXiv | pdf
Multicalibrated Partitions for Importance Weights

Parikshit Gopalan, Omer Reingold, Vatsal Sharan, Udi Wieder
ALT, 2022
arXiv | pdf

Memory as a lens to understand efficient learning and optimization

Classical learning theory mainly focuses on the number of operations performed by the algorithm as the proxy for the algorithm’s running time. However, since growth in the available processing power has outpaced the growth in the available memory by many orders of magnitude, memory rather than compute has become the primary bottleneck in many applications. Despite this, and even though memory has traditionally been one of the most fundamental computational resources in theoretical computer science, very little is known about the role of memory in solving learning tasks. My work investigates the role of memory in learning, and if memory could be a useful discerning factor to provide a clear separation between 'efficient' and 'expensive' techniques.

Here is a talk which covers some of this work (slides).

Efficient Convex Optimization Requires Superlinear Memory

Annie Marsden, Vatsal Sharan, Aaron Sidford, Gregory Valiant
COLT, 2022
Best Paper Award
Invited to IJCAI 2023 Sister Conference Notable Paper Track
JACM, 2024
arXiv | pdf
Big-Step-Little-Step: Efficient Gradient Methods for Objectives with Multiple Scales

Jonathan Kelner, Annie Marsden, Vatsal Sharan, Aaron Sidford, Gregory Valiant, Honglin Yuan
COLT, 2022
arXiv | pdf
Memory-Sample Tradeoffs for Linear Regression with Small Error

Vatsal Sharan, Aaron Sidford, Gregory Valiant
STOC, 2019
abstract | pdf | arXiv

What is the role of memory in continuous optimization and learning? Are there inherent trade-offs between the available memory and the data requirement? Is it possible to achieve the sample complexity of second-order optimization methods with significantly less memory? We raise these questions, and show the first nontrivial lower bounds for linear regression with super-linear memory: we show that there is a gap between the sample complexity of algorithms with quadratic memory and that of any algorithm with sub-quadratic memory.

Some more work along these lines

NeuroSketch: A Neural Network Method for Fast and Approximate Evaluation of Range Aggregate Queries

Sepanta Zeighami, Vatsal Sharan, Cyrus Shahabi
SIGMOD, 2023
arXiv | pdf
Compressed Factorization: Fast and Accurate Low-Rank Factorization of Compressively-Sensed Data

Vatsal Sharan, Kai Sheng Tai, Peter Bailis, Gregory Valiant
ICML, 2019
abstract | pdf | arXiv | code | spotlight video

What learning algorithms can be run directly on compressively-sensed data? If we compress a high dimensional matrix via a random projection, then what is the relationship between the factors of the original matrix and the factors of the compressed matrix? While it is well-known that random projections preserve a number of geometric properties of a dataset, this work shows that random projections can also preserve certain solutions to non-convex, NP-Hard problems like non-negative matrix factorization---both in theory and in practice.

Jump to 53:20 in this video.

Efficient Anomaly Detection via Matrix Sketching

Vatsal Sharan, Parikshit Gopalan, Udi Wieder
NeurIPS, 2018
abstract | pdf | arXiv

PCA based anomaly scores are commonly used for finding anomalies in high-dimensional data. The naive algorithms for computing these scores explicitly compute the PCA of the covariance matrix which uses space quadratic in the dimensionality of the data. Using matrix sketching tools and new matrix perturbation inequalities, we give the first streaming algorithms for computing these scores that use space that is linear or sublinear in the dimension.

Moment-Based Quantile Sketches for Efficient High Cardinality Aggregation Queries

Edward Gan, Jialin Ding, Kai Sheng Tai, Vatsal Sharan, Peter Bailis
VLDB, 2018
abstract | pdf | arXiv | code

We propose a compact and efficiently mergeable sketch for answering quantile queries on a dataset. This data structure, which we refer to as the moments sketch, operates with a small memory footprint (200 bytes) and achieves computationally efficient (50ns) merges by tracking only a set of summary statistics, notably the sample moments.

Prediction with a Short Memory

Vatsal Sharan, Sham Kakade, Percy Liang, Gregory Valiant
STOC, 2018
abstract | pdf | arXiv | blog | video

When is it necessary to remember significant information from the distant past to make good predictions about the future in sequential prediction tasks? How do we consolidate and reference memories about the past in order to predict the future? This work studies these questions, and the findings are perhaps counterintuitive: we show that a simple (Markov) model which bases its predictions on only the most recent observations and the statistics of short windows, can achieve small error on average with respect to a large class of sequences, such as those generated by Hidden Markov Models (HMMs). This shows that long-term memory may not be necessary for making good predictions on average, even on sequences generated by complex processes that have long-range dependencies.

Sketching Linear Classifiers over Data Streams

Kai Sheng Tai, Vatsal Sharan, Peter Bailis, Gregory Valiant
SIGMOD, 2018
abstract | pdf | arXiv | code

We introduce a new sub-linear space sketch---the Weight-Median Sketch---for learning compressed linear classifiers over data streams while supporting the efficient recovery of large-magnitude weights in the classifier. The Weight-Median Sketch adopts the core data structure used in the Count-Sketch, but, instead of sketching counts, it captures sketched gradient updates to the model parameters. We establishes recovery guarantees for our sketch and demonstrate empirical improvements in memory-accuracy trade-offs over alternative memory-budgeted methods.

Learning with limited (or synthetic) data

Modern ML techniques are data-hungry, but data is an expensive resource. Current ML applications have brought to light several interesting information-theoretic questions around learning with limited data. Some of my work provides a formulation to study the basic statistical task of synthetically augmenting a given set of samples, and explores notions of regularization --- a classical technique for data-efficient learning whose role still remains mysterious in deep learning.

Proper Learnability and the Role of Unlabeled Data

Julian Asilis, Siddartha Devic, Shaddin Dughmi, Vatsal Sharan, Shang-Hua Teng
ALT, 2025
arXiv | pdf
Regularization and Optimal Multiclass Learning

Julian Asilis, Siddartha Devic, Shaddin Dughmi, Vatsal Sharan, Shang-Hua Teng
COLT, 2024
arXiv | pdf
On the Statistical Complexity of Sample Amplification

Brian Axelrod, Shivam Garg, Yanjun Han, Vatsal Sharan, Gregory Valiant
Annals of Statistics, 2024
arXiv | preprint pdf

Some more work along these lines

Transductive Sample Complexities Are Compact

Julian Asilis, Siddartha Devic, Shaddin Dughmi, Vatsal Sharan, Shang-Hua Teng
Neurips, 2024
arXiv | pdf
Open Problem: Can Local Regularization Learn All Multiclass Problems?

Julian Asilis, Siddartha Devic, Shaddin Dughmi, Vatsal Sharan, Shang-Hua Teng
Open Problem @ COLT, 2024
pdf
Sample Amplification: Increasing Dataset Size even when Learning is Impossible

Brian Axelrod, Shivam Garg, Vatsal Sharan, Gregory Valiant
ICML, 2020
abstract | pdf | arXiv | video

Is learning a distribution always necessary for generating new samples from the distribution? We introduce the problem of "sample amplification": given n independent draws from a distribution, D , to what extent is it possible to output a set of m > n datapoints that are indistinguishable from m i.i.d. draws from D ? Curiously, we show that nontrivial amplification is often possible in the regime where n is too small to learn D to any nontrivial accuracy.

Jump to 50:00 (or topic #27) in this video.

Students

I am very lucky to advise an amazing group of students, including the following amazing Ph.D. students:

Siddartha Devic (co-advised with Aleksandra Korolova)

Bhavya Vasudeva
Julian Asilis
Deqing Fu (co-advised with Robin Jia)
Devansh Gupta (co-advised with Meisam Razaviyayn)
Spandan Senapati (co-advised with Haipeng Luo)
Tianyi Zhou (co-advised with Robin Jia)

And the following amazing undergrad/Masters students:

Dutch Hansen
Anish Jayant
Jung Whan Lee
You Qi Huang (SURE program intern in Summer'23, mentored by Bhavya Vasudeva)
Natalie Abreu (graduated in Fall'23, now Ph.D. student at Harvard)
Aditya Prased (graduated in Fall'24, now Ph.D. student at the University of Chicago)
Kameron Shahabi (graduated in Fall'24, now Ph.D. student at the University of Washington)
Qilin Ye (joint with Robin Jia, graduated in Fall'24, now M.S. student at Duke)
Devin Martin (SURE program intern in Summer'22, mentored by Bhavya Vasudeva)

Teaching

CSCI 567: Machine Learning (Spring 2024)
CSCI 699: Theory of Machine Learning (Fall 2023)
CSCI 699: Seminar on Computational Perspectives on the Frontiers of ML (Spring 2023)
CSCI 567: Machine Learning (Fall 2022)
CSCI 699: Theory of Machine Learning (Fall 2021)