Automating Codebase Translation from SAS to Python with LLMs

Ellie Mamantov Co-Author
Reveal Global Consulting
 
John Lynagh Co-Author
Reveal Global Consulting
 
Cameron Milne First Author
Reveal Global Consulting
 
Cameron Milne Presenting Author
Reveal Global Consulting
 
Monday, Aug 4: 3:35 PM - 3:50 PM
1400 
Contributed Papers 
Music City Center 
Translating code from SAS to Python remains challenging for organizations migrating their codebases. Classical rules-based methods, such as those built on Abstract Syntax Trees, rely on handcrafted rules that are time-consuming to write and inflexible. Unsupervised learning approaches have shown improvements but require massive parallel data for training, which is unavailable for SAS and Python. Large Language Models (LLMs) overcome these barriers through parametric knowledge retrieval and offer more promising results, although their output still suffers from a range of quality issues, including syntax and semantic errors. This presentation explores strategies for automating SAS-to-Python translation on complex codebases. We discuss managing context window limitations, handling nested dependencies, incorporating rules-based approaches, and reducing model laziness on tedious code. We also detail specific challenges in adapting SAS to Python, such as sentinel values, vectorized operations, and macros. This presentation highlights practical approaches for migrating from proprietary software to open-source languages more quickly, reducing the resource burden on organizations while preserving critical business logic.
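
As one illustration of the sentinel-value challenge mentioned above, the sketch below shows how a numeric comparison can silently change meaning in translation. In SAS, a missing numeric value sorts below every number, so a condition such as `income < 10` is true for missing rows, whereas pandas evaluates comparisons against NaN as False. The column name `income` and the helper `sas_less_than` are hypothetical names chosen for this example and are not taken from the presentation; this is a minimal sketch, assuming pandas is the translation target.

```python
import numpy as np
import pandas as pd


def sas_less_than(series: pd.Series, threshold: float) -> pd.Series:
    """Mimic SAS `x < threshold`, where a missing numeric sorts below every number.

    Hypothetical helper for illustration only.
    """
    return series.isna() | (series < threshold)


# Toy data: the middle row is missing.
df = pd.DataFrame({"income": [5.0, np.nan, 20.0]})

# Naive vectorized translation: NaN < 10 evaluates to False in pandas.
naive = df["income"] < 10

# Translation that preserves the SAS treatment of missing values.
sas_like = sas_less_than(df["income"], 10)
```

The two results differ only on the missing row, which is the kind of semantic error that passes a syntax check and surfaces only when translated output is compared against the original SAS results.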

Keywords

Large Language Models (LLMs)

Code Translation

Federal Statistics

Natural Language Processing 

Main Sponsor

Section on Statistical Consulting