Late Breaking Results: Dynamically Scalable Pruning for Transformer-Based Large Language Models

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

We propose Matryoshka, a novel framework for transformer model pruning that enables dynamic runtime control while maintaining accuracy competitive with modern large language models (LLMs). Matryoshka incrementally constructs submodels of varying complexity, allowing runtime adaptation without maintaining separate models. Our evaluations on LLaMA-7B demonstrate that Matryoshka achieves up to a 34% speedup and outperforms state-of-the-art pruning methods in output quality, providing a flexible solution for deploying LLMs.
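The abstract describes Matryoshka's core idea only at a high level: a single set of weights hosts a family of nested submodels, and runtime adaptation amounts to choosing how much of the network to execute. The sketch below illustrates one plausible reading of that idea, consistent with the "Depth-Pruning" keyword: the first k transformer layers form a self-contained submodel, so runtime "pruning" is just picking k. This is not the authors' implementation; the class, parameter names, and the layer-skipping strategy are assumptions for illustration.

    import torch
    import torch.nn as nn

    class NestedTransformer(nn.Module):
        """Hypothetical nested-submodel transformer: the first `active_depth`
        layers form a usable submodel, so one weight set serves all depths."""

        def __init__(self, num_layers=8, d_model=256, n_heads=4):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                for _ in range(num_layers)
            )
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x, active_depth=None):
            # None selects the full model; smaller values select a shallower,
            # faster nested submodel that shares the same weights.
            depth = active_depth if active_depth is not None else len(self.layers)
            for layer in self.layers[:depth]:
                x = layer(x)
            return self.norm(x)

    model = NestedTransformer()
    x = torch.randn(1, 16, 256)          # (batch, sequence, embedding)
    fast = model(x, active_depth=4)      # shallow submodel: lower latency
    full = model(x)                      # full model: highest quality

Because every depth reuses the same prefix of layers, no separate pruned models need to be stored or swapped in; the runtime can trade latency for quality by varying active_depth per request.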

Original language: English
Title of host publication: 2025 Design, Automation and Test in Europe Conference, DATE 2025 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9783982674100
DOIs
State: Published - 2025
Event: 2025 Design, Automation and Test in Europe Conference, DATE 2025 - Lyon, France
Duration: 31 Mar 2025 → 2 Apr 2025

Publication series

Name: Proceedings - Design, Automation and Test in Europe, DATE
ISSN (Print): 1530-1591

Conference

Conference: 2025 Design, Automation and Test in Europe Conference, DATE 2025
Country/Territory: France
City: Lyon
Period: 31/03/25 → 2/04/25

Bibliographical note

Publisher Copyright:
© 2025 EDAA.

Keywords

  • Depth-Pruning
  • Large Language Model
  • Real-Time Management
