Leshem Choshen

Profile

I'm an Open and Collaborative Natural Language Processing researcher at MIT & IBM, currently working on making large language model research more efficient, collaborative, and achievable by anyone. I work a lot on evaluation (check out Unitxt), co-created model merging, and built ZipNN for compressing models (no, not quantization; lossless compression :-)). My work focuses on democratizing AI through open-science initiatives like the BabyLM Challenge, which I co-organize to promote sample-efficient language model training. I am passionate about collaborative and accessible research. My recent projects include ComPEFT for compressing fine-tuned models, ShareLM for sharing human-model conversations with the community, and tinyBenchmarks for efficient model evaluation. I've also worked extensively on model-merging techniques like TIES-Merging and ColD Fusion to enable model recycling.

I believe some technologies are more beneficial to the world than others and that science can be fun.
My research emphasizes making AI systems more accessible for broader communities to use, build on, tweak, and understand.
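Model merging, mentioned above, can be sketched in a few lines: fine-tuned checkpoints that share an architecture are combined directly in weight space. This is a minimal illustration only, assuming toy checkpoints stored as plain dicts of NumPy arrays; the function name is hypothetical, and real methods like TIES-Merging additionally trim and sign-align parameters before averaging.

```python
import numpy as np

def merge_models(state_dicts, weights=None):
    """Average parameter tensors across fine-tuned checkpoints.

    A bare-bones weight-space merge: with uniform weights this is
    simple parameter averaging, the starting point that methods
    like ColD Fusion and TIES-Merging build on.
    """
    if weights is None:
        # Uniform average unless per-model weights are given.
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Two toy "checkpoints" sharing a single parameter tensor.
a = {"layer.weight": np.array([1.0, 3.0])}
b = {"layer.weight": np.array([3.0, 5.0])}
print(merge_models([a, b])["layer.weight"])  # [2. 4.]
```

Weighted averaging (non-uniform `weights`) lets one checkpoint dominate, which is sometimes useful when merging a strong generalist with narrow specialists.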

Publications

Sloth: scaling laws for LLM skills to predict multi-benchmark performance across families

Felipe Maia Polo, Seamus Somerstep, Leshem Choshen, Yuekai Sun, M. Yurochkin

arXiv.org 2024

Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

Michael Y. Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Ryan Cotterell, Leshem Choshen, A. Warstadt, E. Wilcox

arXiv.org 2024

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Shivalika Singh, Angelika Romanou, Clémentine Fourrier, D. Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Wei-Yin Ko, Madeline Smith, Antoine Bosselut, Alice Oh, Andre F. T. Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadaee, B. Ermiş, Sara Hooker

arXiv.org 2024

Holmes ⌕ A Benchmark to Assess the Linguistic Competence of Language Models

Andreas Waldis, Yotam Perlitz, Leshem Choshen, Yufang Hou, Iryna Gurevych

Transactions of the Association for Computational Linguistics 2024

ZipNN: Lossless Compression for AI Models

Moshik Hershcovitch, Andrew Wood, Leshem Choshen, Guy Girmonsky, Roy Leibovitz, Ilias Ennmouri, Michal Malka, Peter Chin, S. Sundararaman, Danny Harnik

arXiv.org 2024

Model merging with SVD to tie the Knots

George Stoica, Pratik Ramesh, B. Ecsedi, Leshem Choshen, Judy Hoffman

arXiv.org 2024

A Hitchhiker's Guide to Scaling Law Estimation

Leshem Choshen, Yang Zhang, Jacob Andreas

arXiv.org 2024

LiveXiv - A Multi-Modal Live Benchmark Based on Arxiv Papers Content

Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh, Wei Lin, M. Mirza, Leshem Choshen, M. Yurochkin, Yuekai Sun, Assaf Arbelle, Leonid Karlinsky, Raja Giryes

arXiv.org 2024

Unforgettable Generalization in Language Models

Eric Zhang, Leshem Choshen, Jacob Andreas

arXiv.org 2024

Can You Trust Your Metric? Automatic Concatenation-Based Tests for Metric Validity

Ora Nova Fandina, Leshem Choshen, E. Farchi, George Kour, Yotam Perlitz, Orna Raz

arXiv.org 2024

Beneath the Surface of Consistency: Exploring Cross-lingual Knowledge Representation Sharing in LLMs

Maxim Ifergan, Leshem Choshen, Roee Aharoni, Idan Szpektor, Omri Abend

arXiv.org 2024

The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community

Shachar Don-Yehiya, Leshem Choshen, Omri Abend

arXiv.org 2024

The Future of Open Human Feedback

Shachar Don-Yehiya, Ben Burtenshaw, Ramon Fernandez Astudillo, Cailean Osborne, Mimansa Jaiswal, Tzu-Sheng Kuo, Wenting Zhao, Idan Shenfeld, Andi Peng, Mikhail Yurochkin, Atoosa Kasirzadeh, Yangsibo Huang, Tatsunori Hashimoto, Yacine Jernite, Daniel Vila-Suero, Omri Abend, Jennifer Ding, Sara Hooker, Hannah Rose Kirk, Leshem Choshen

arXiv.org 2024

A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning

Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Lucas Caccia, Haokun Liu, Tian-Xiang Chen, Mohit Bansal, Leshem Choshen, Alessandro Sordoni

arXiv.org 2024

Data Contamination Report from the 2024 CONDA Shared Task

Oscar Sainz, Iker García-Ferrero, Alon Jacovi, Jon Ander Campos, Yanai Elazar, Eneko Agirre, Yoav Goldberg, Wei-Lin Chen, Jenny Chim, Leshem Choshen, Luca D'Amico-Wong, Melissa Dell, Run-Ze Fan, Shahriar Golchin, Yucheng Li, Pengfei Liu, Bhavish Pahwa, Ameya Prabhu, Suryansh Sharma, Emily Silcock, Kateryna Solonko, David Stap, M. Surdeanu, Yu-Min Tseng, Vishaal Udandarao, Zengzhi Wang, Ruijie Xu, Jinglin Yang

CONDA 2024

Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench

Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen

Learning from Naturally Occurring Feedback

Shachar Don-Yehiya, Leshem Choshen, Omri Abend

arXiv.org 2024

Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead

Rickard Brüel Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen, K. Greenewald, M. Yurochkin, Justin Solomon

arXiv.org 2024

Efficient multi-prompt evaluation of LLMs

Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, M. Yurochkin

arXiv.org 2024

Elements of World Knowledge (EWOK): A cognition-inspired framework for evaluating basic world knowledge in language models

Anna A. Ivanova, Aalok Sathe, Benjamin Lipkin, Unnathi Kumar, S. Radkani, T. H. Clark, Carina Kauf, Jennifer Hu, R. T. Pramod, Gabriel Grand, Vivian C. Paulun, Maria Ryskina, Ekin Akyürek, E. Wilcox, Nafisa Rashid, Leshem Choshen, Roger Levy, Evelina Fedorenko, Josh Tenenbaum, Jacob Andreas

arXiv.org 2024

[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus

Leshem Choshen, Ryan Cotterell, Michael Y. Hu, Tal Linzen, Aaron Mueller, Candace Ross, Alex Warstadt, E. Wilcox, Adina Williams, Chengxu Zhuang

arXiv.org 2024

Lossless and Near-Lossless Compression for Foundation Models

Moshik Hershcovitch, Leshem Choshen, Andrew Wood, Ilias Enmouri, Peter Chin, S. Sundararaman, Danny Harnik

arXiv.org 2024

NumeroLogic: Number Encoding for Enhanced LLMs’ Numerical Reasoning

Eli Schwartz, Leshem Choshen, J. Shtok, Sivan Doveh, Leonid Karlinsky, Assaf Arbelle

Conference on Empirical Methods in Natural Language Processing 2024

Asymmetry in Low-Rank Adapters of Foundation Models

Jiacheng Zhu, K. Greenewald, Kimia Nadjahi, Haitz Sáez de Ocáriz Borde, Rickard Brüel Gabrielsson, Leshem Choshen, Marzyeh Ghassemi, M. Yurochkin, Justin Solomon

International Conference on Machine Learning 2024

tinyBenchmarks: evaluating LLMs with fewer examples

Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, M. Yurochkin

International Conference on Machine Learning 2024

Label-Efficient Model Selection for Text Generation

Shir Ashury-Tahan, B. Sznajder, Leshem Choshen, L. Ein-Dor, Eyal Shnarch, Ariel Gera

Annual Meeting of the Association for Computational Linguistics 2024

Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI

Elron Bandel, Yotam Perlitz, Elad Venezian, Roni Friedman-Melamed, Ofir Arviv, Matan Orbach, Shachar Don-Yehiya, D. Sheinwald, Ariel Gera, Leshem Choshen, Michal Shmueli-Scheuer, Yoav Katz

North American Chapter of the Association for Computational Linguistics 2024

Genie: Achieving Human Parity in Content-Grounded Datasets Generation

Asaf Yehudai, Boaz Carmeli, Y. Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, Leshem Choshen

arXiv.org 2024

Deductive Closure Training of Language Models for Coherence, Accuracy, and Updatability

Afra Feyza Akyürek, Ekin Akyürek, Leshem Choshen, Derry Tanti Wijaya, Jacob Andreas

Annual Meeting of the Association for Computational Linguistics 2024

ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization

Prateek Yadav, Leshem Choshen, Colin Raffel, Mohit Bansal

arXiv.org 2023

Human Learning by Model Feedback: The Dynamics of Iterative Prompting with Midjourney

Shachar Don-Yehiya, Leshem Choshen, Omri Abend

Conference on Empirical Methods in Natural Language Processing 2023

Fuse to Forget: Bias Reduction and Selective Memorization through Model Fusion

Kerem Zaman, Leshem Choshen, Shashank Srivastava

Conference on Empirical Methods in Natural Language Processing 2023

Efficient Benchmarking (of Language Models)

Yotam Perlitz, Elron Bandel, Ariel Gera, Ofir Arviv, L. Ein-Dor, Eyal Shnarch, N. Slonim, Michal Shmueli-Scheuer, Leshem Choshen

North American Chapter of the Association for Computational Linguistics 2023

TIES-Merging: Resolving Interference When Merging Models

Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, Mohit Bansal

Neural Information Processing Systems 2023

MuLER: Detailed and Scalable Reference-based Evaluation

Taelin Karidi, Leshem Choshen, Gal Patel, Omri Abend

Conference on Computational Natural Language Learning 2023

Jump to Conclusions: Short-Cutting Transformers with Linear Transformations

Alexander Yom Din, Taelin Karidi, Leshem Choshen, Mor Geva

International Conference on Language Resources and Evaluation 2023

Knowledge is a Region in Weight Space for Fine-tuned Language Models

Almog Gueta, Elad Venezian, Colin Raffel, N. Slonim, Yoav Katz, Leshem Choshen

Conference on Empirical Methods in Natural Language Processing 2023

Call for Papers - The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus

Alex Warstadt, Leshem Choshen, Aaron Mueller, Adina Williams, Ethan Gotlieb Wilcox, Chengxu Zhuang

arXiv.org 2023

ColD Fusion: Collaborative Descent for Distributed Multitask Finetuning

Shachar Don-Yehiya, Elad Venezian, Colin Raffel, N. Slonim, Yoav Katz, Leshem Choshen

Annual Meeting of the Association for Computational Linguistics 2022

DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering

Ella Neeman, Roee Aharoni, Or Honovich, Leshem Choshen, Idan Szpektor, Omri Abend

Annual Meeting of the Association for Computational Linguistics 2022

Where to start? Analyzing the potential value of intermediate models

Leshem Choshen, Elad Venezian, Shachar Don-Yehiya, N. Slonim, Yoav Katz

Conference on Empirical Methods in Natural Language Processing 2022

Reinforcement Learning with Large Action Spaces for Neural Machine Translation

Asaf Yehudai, Leshem Choshen, Lior Fox, Omri Abend

International Conference on Computational Linguistics 2022

Label Sleuth: From Unlabeled Text to a Classifier in a Few Hours

Eyal Shnarch, Alon Halfon, Ariel Gera, Marina Danilevsky, Yannis Katsis, Leshem Choshen, M. Cooper, Dina Epelboim, Zheng Zhang, Dakuo Wang, Lucy Yip, L. Ein-Dor, Lena Dankin, Ilya Shnayderman, R. Aharonov, Yunyao Li, Naftali Liberman, Philip Levin Slesarev, Gwilym Newton, Shila Ofek-Koifman, N. Slonim, Yoav Katz

Conference on Empirical Methods in Natural Language Processing 2022

PreQuEL: Quality Estimation of Machine Translation Outputs in Advance

Shachar Don-Yehiya, Leshem Choshen, Omri Abend

Conference on Empirical Methods in Natural Language Processing 2022

Some Grammatical Errors are Frequent, Others are Important

Leshem Choshen, Ofir Shifman, Omri Abend

arXiv.org 2022

Fusing finetuned models for better pretraining

Leshem Choshen, Elad Venezian, N. Slonim, Yoav Katz

arXiv.org 2022

Cluster & Tune: Boost Cold Start Performance in Text Classification

Eyal Shnarch, Ariel Gera, Alon Halfon, Lena Dankin, Leshem Choshen, R. Aharonov, N. Slonim

Annual Meeting of the Association for Computational Linguistics 2022

Semantics-aware Attention Improves Neural Machine Translation

Aviv Slobodkin, Leshem Choshen, Omri Abend

STARSEM 2021

On Neurons Invariant to Sentence Structural Changes in Neural Machine Translation

Gal Patel, Leshem Choshen, Omri Abend

Conference on Computational Natural Language Learning 2021

The Grammar-Learning Trajectories of Neural Language Models

Leshem Choshen, Guy Hacohen, D. Weinshall, Omri Abend

Annual Meeting of the Association for Computational Linguistics 2021

ComSum: Commit Messages Summarization and Meaning Preservation

Leshem Choshen, Idan Amit

arXiv.org 2021

Part of Speech and Universal Dependency effects on English Arabic Machine Translation

Ofek Rafaeli, Omri Abend, Leshem Choshen, D. Nikolaev

arXiv.org 2021

Q²: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering

Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor, Omri Abend

Conference on Empirical Methods in Natural Language Processing 2021

Mediators in Determining what Processing BERT Performs First

Aviv Slobodkin, Leshem Choshen, Omri Abend

North American Chapter of the Association for Computational Linguistics 2021

GrASP: A Library for Extracting and Exploring Human-Interpretable Textual Patterns

Piyawat Lertvittayakumjorn, Leshem Choshen, Eyal Shnarch, Francesca Toni

International Conference on Language Resources and Evaluation 2021

SERRANT: a syntactic classifier for English Grammatical Error Types

Leshem Choshen, Matanel Oren, Dmitry Nikolaev, Omri Abend

arXiv.org 2021

An autonomous debating system

N. Slonim, Yonatan Bilu, Carlos Alzate, Roy Bar-Haim, Ben Bogin, Francesca Bonin, Leshem Choshen, Edo Cohen-Karlik, Lena Dankin, Lilach Edelstein, L. Ein-Dor, Roni Friedman-Melamed, A. Gavron, Ariel Gera, Martin Gleize, Shai Gretz, Dan Gutfreund, Alon Halfon, Daniel Hershcovich, R. Hoory, Yufang Hou, S. Hummel, Michal Jacovi, Charles Jochim, Yoav Kantor, Yoav Katz, D. Konopnicki, Zvi Kons, Lili Kotlerman, Dalia Krieger, Dan Lahav, Tamar Lavee, Ran Levy, Naftali Liberman, Y. Mass, Amir Menczel, Shachar Mirkin, Guy Moshkowich, Shila Ofek-Koifman, Matan Orbach, Ella Rabinovich, Ruty Rinott, Slava Shechtman, D. Sheinwald, Eyal Shnarch, Ilya Shnayderman, A. Soffer, Artem Spector, B. Sznajder, Assaf Toledo, Orith Toledo-Ronen, Elad Venezian, R. Aharonov

Nature 2021

Transition based Graph Decoder for Neural Machine Translation

Leshem Choshen, Omri Abend

arXiv.org 2021

Enhancing the Transformer Decoder with Transition-based Syntax

Leshem Choshen, Omri Abend

Conference on Computational Natural Language Learning 2021

Active Learning for BERT: An Empirical Study

L. Ein-Dor, Alon Halfon, Ariel Gera, Eyal Shnarch, Lena Dankin, Leshem Choshen, Marina Danilevsky, R. Aharonov, Yoav Katz, N. Slonim

Conference on Empirical Methods in Natural Language Processing 2020

Classifying Syntactic Errors in Learner Language

Leshem Choshen, D. Nikolaev, Yevgeni Berzak, Omri Abend

Conference on Computational Natural Language Learning 2020

Unsupervised Expressive Rules Provide Explainability and Assist Human Experts Grasping New Domains

Eyal Shnarch, Leshem Choshen, Guy Moshkowich, N. Slonim, R. Aharonov

Findings 2020

Corpus Wide Argument Mining - a Working Solution

L. Ein-Dor, Eyal Shnarch, Lena Dankin, Alon Halfon, B. Sznajder, Ariel Gera, Carlos Alzate, Martin Gleize, Leshem Choshen, Yufang Hou, Yonatan Bilu, R. Aharonov, N. Slonim

AAAI Conference on Artificial Intelligence 2019

All Neural Networks are Created Equal

Guy Hacohen, Leshem Choshen, D. Weinshall

arXiv.org 2019

Automatically Extracting Challenge Sets for Non-Local Phenomena in Neural Machine Translation

Leshem Choshen, Omri Abend

Conference on Computational Natural Language Learning 2019

On the Weaknesses of Reinforcement Learning for Neural Machine Translation

Leshem Choshen, Lior Fox, Zohar Aizenbud, Omri Abend

International Conference on Learning Representations 2019

Are You Convinced? Choosing the More Convincing Evidence with a Siamese Network

Martin Gleize, Eyal Shnarch, Leshem Choshen, Lena Dankin, Guy Moshkowich, R. Aharonov, N. Slonim

Annual Meeting of the Association for Computational Linguistics 2019

Learning to combine Grammatical Error Corrections

Yoav Kantor, Yoav Katz, Leshem Choshen, Edo Cohen-Karlik, Naftali Liberman, Assaf Toledo, Amir Menczel, N. Slonim

BEA@ACL 2019

Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets

Guy Hacohen, Leshem Choshen, D. Weinshall

International Conference on Machine Learning 2019

The Language of Legal and Illegal Activity on the Darknet

Leshem Choshen, D. Eldad, Daniel Hershcovich, Elior Sulem, Omri Abend

Annual Meeting of the Association for Computational Linguistics 2019

SemEval-2019 Task 1: Cross-lingual Semantic Parsing with UCCA

Daniel Hershcovich, Zohar Aizenbud, Leshem Choshen, Elior Sulem, A. Rappoport, Omri Abend

International Workshop on Semantic Evaluation 2019

Will it Blend? Blending Weak and Strong Labeled Data in a Neural Network for Argumentation Mining

Eyal Shnarch, Carlos Alzate, Lena Dankin, Martin Gleize, Yufang Hou, Leshem Choshen, R. Aharonov, N. Slonim

Annual Meeting of the Association for Computational Linguistics 2018

SemEval 2019 Shared Task: Cross-lingual Semantic Parsing with UCCA - Call for Participation

Daniel Hershcovich, Leshem Choshen, Elior Sulem, Zohar Aizenbud, A. Rappoport, Omri Abend

arXiv.org 2018

Inherent Biases in Reference-based Evaluation for Grammatical Error Correction

Leshem Choshen, Omri Abend

Annual Meeting of the Association for Computational Linguistics 2018

Automatic Metric Validation for Grammatical Error Correction

Leshem Choshen, Omri Abend

Annual Meeting of the Association for Computational Linguistics 2018

Reference-less Measure of Faithfulness for Grammatical Error Correction

Leshem Choshen, Omri Abend

North American Chapter of the Association for Computational Linguistics 2018

DORA The Explorer: Directed Outreaching Reinforcement Action-Selection

Leshem Choshen, Lior Fox, Y. Loewenstein

International Conference on Learning Representations 2018

Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

Alex Warstadt, Aaron Mueller, Leshem Choshen, E. Wilcox, Chengxu Zhuang, Juan Ciro, Rafael Mosquera, Bhargavi Paranjabe, Adina Williams, Tal Linzen, Ryan Cotterell

Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning 2023

Achieving Human Parity in Content-Grounded Datasets Generation

Asaf Yehudai, Boaz Carmeli, Y. Mass, Ofir Arviv, Nathaniel Mills, Eyal Shnarch, Leshem Choshen

International Conference on Learning Representations 2024

Navigating the Modern Evaluation Landscape: Considerations in Benchmarks and Frameworks for Large Language Models (LLMs)

Leshem Choshen, Ariel Gera, Yotam Perlitz, Michal Shmueli-Scheuer, Gabriel Stanovsky

International Conference on Language Resources and Evaluation 2024