CS004 Innovation and Diversity in Data Science

Conference: Symposium on Data Science and Statistics (SDSS) 2024
06/05/2024: 10:30 AM - 12:00 PM EDT
Refereed 
Room: Shenandoah 

Chair

Amira Burns, USDA - ARS - APHIS

Tracks

Practice and Applications
Symposium on Data Science and Statistics (SDSS) 2024

Presentations

Empowering Data Science through Code Modernization: Bridging the Gap between Innovation and Efficiency

Robust data systems are crucial for data science and statistics. As data complexity and volume increase, code modernization is essential for maintaining efficient, scalable, and sustainable data infrastructure, particularly for agencies reliant on data-driven decision-making. This involves updating and optimizing the codebase supporting data operations and processes. Traditional methods may struggle with large datasets, necessitating modern techniques like distributed computing. Additionally, the emergence of new data sources and formats demands adaptable coding practices for efficient processing. Technological innovation requires agile and responsive data infrastructure. Modernization allows for the incorporation of the latest tools, fostering code that is easier to understand and scale. This agility is crucial for organizations to adapt quickly to changing requirements and trends.
Our case study with the Economic Research Service (ERS) within the USDA exemplifies code modernization. ERS's data products, critical for agricultural policies, are built on outdated code. We advocate a modular programming approach, dividing data processing into distinct functions for independent development and maintenance. This approach, complemented by best practices like unit testing and version control, enhances the code's robustness and maintainability. By rewriting the code as an R package and implementing modern practices, we significantly improved the maintainability and adaptability of the data processing code. This simplifies updates and maintenance while also ensuring accurate and current documentation. Code modernization within data infrastructure is imperative for organizations to remain efficient and responsive in a data-driven world. The ERS case demonstrates the benefits of adopting a modular approach and best practices, essential for leveraging the full potential of data assets and meeting evolving demands. 
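The modular approach described above can be sketched roughly as follows. This is a minimal illustration in Python, not the ERS code (which the abstract says was rewritten as an R package). The function names `clean_records`, `aggregate_by_key`, and `run_pipeline`, the `commodity` and `value` fields, and the sample data are all hypothetical. Each processing step is a small function that can be developed and unit tested on its own.

```python
from typing import Dict, List, Optional

# A record is a plain dict; the "commodity" and "value" fields are hypothetical.
Record = Dict[str, object]


def clean_records(records: List[Record]) -> List[Record]:
    """Step 1: drop records whose value is missing or negative."""
    cleaned = []
    for record in records:
        value: Optional[object] = record.get("value")
        if isinstance(value, (int, float)) and value >= 0:
            cleaned.append(record)
    return cleaned


def aggregate_by_key(records: List[Record], key: str) -> Dict[str, float]:
    """Step 2: sum the value field for each distinct value of `key`."""
    totals: Dict[str, float] = {}
    for record in records:
        group = str(record[key])
        totals[group] = totals.get(group, 0.0) + float(record["value"])
    return totals


def run_pipeline(records: List[Record]) -> Dict[str, float]:
    """Compose the independent steps into one processing pipeline."""
    return aggregate_by_key(clean_records(records), "commodity")


def test_clean_records_drops_missing_values() -> None:
    """A unit test of one step in isolation, in the style a test runner would collect."""
    rows = [{"commodity": "corn", "value": None}, {"commodity": "corn", "value": 1.0}]
    assert clean_records(rows) == [{"commodity": "corn", "value": 1.0}]
```

Because each step has a narrow input and output, a change to the cleaning rule only requires updating `clean_records` and its test, while the aggregation step is untouched.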

Presenting Author

Roy McKenzie, Coleridge Initiative

First Author

Roy McKenzie, Coleridge Initiative

CoAuthor(s)

Nathan Barrett
Ahu Yildirmaz

Gender Differences in the Development of R Packages on GitHub

Analyzing gender dynamics in scientific research and its outputs is crucial for ensuring that science policy is inclusive and equitable. Like other research outputs such as publications and patents, open source software (OSS) projects are developed by contributors from universities, government research institutions, and nonprofits, in addition to businesses. Despite the reach and continued rapid growth of OSS, reliable and comprehensive survey data on it do not exist, limiting insights into contributions by gender and policy-makers' ability to assess trends in gender representation. As in scientific research, the inclusion of diverse perspectives in software development enhances creativity and problem-solving. Using GitHub data, researchers have found positive correlations between the gender diversity of an OSS development team and its productivity (Vasilescu et al., 2015; Ortu et al., 2017). Yet there is evidence of gender bias, with women facing higher standards to have their contributions accepted (Terrell et al., 2017; Imtiaz et al., 2019).

This exploratory study aims to quantify gender differences in the development and use (impact) of OSS using publicly available information collected from GitHub. We focus on software packages developed for the R programming language, for which the majority of contributors are from academia. The paper asks: (1) What are the gender differences in the volume of contributions? (2) Has gender representation shifted over time? (3) Is there a correlation between the gender of contributors and the impact of a package? Our dataset includes 1,883,977 commits to 7,016 registered R packages from 2008 to mid-2023 and information about 14,311 unique contributors. Using percentage breakdowns, we show how different gender groups contributed to OSS projects through commits, lines of code, and package ownership.
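A percentage breakdown of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the abstract does not say how gender labels were assigned or how the shares were computed, so the function `percentage_breakdown`, the placeholder labels `group_a` and `group_b`, and the sample numbers are all hypothetical. The same function covers commit counts (weight 1 per commit) and lines of code (weight equal to lines changed).

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Sequence


def percentage_breakdown(
    labels: Sequence[str],
    weights: Optional[Iterable[float]] = None,
) -> Dict[str, float]:
    """Return each group's share of the total, in percent.

    labels:  one group label per contribution (hypothetical placeholder labels).
    weights: optional size of each contribution, e.g. lines of code;
             when omitted, every contribution counts once (a commit count).
    """
    if weights is None:
        weights = [1.0] * len(labels)

    totals: Dict[str, float] = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # nothing to break down
    return {label: 100.0 * amount / grand_total for label, amount in totals.items()}


# Made-up example data, for illustration only.
commit_labels = ["group_a", "group_a", "group_b", "group_a"]
lines_changed = [10, 30, 40, 20]

share_of_commits = percentage_breakdown(commit_labels)
share_of_lines = percentage_breakdown(commit_labels, lines_changed)
```

Comparing `share_of_commits` with `share_of_lines` shows why it helps to report more than one measure: a group's share of commits and its share of lines of code can differ.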

Presenting Author

Carol Moore

First Author

Carol Moore

CoAuthor(s)

Uyen Nguyen, University of Virginia
Gizem Korkmaz, Westat

Embracing the AI Revolution: ChatGPT’s Role in Advancing Data Science Consultation Services

In this innovative exploration, we discuss the application of ChatGPT as a tool for data science consultants to provide robust consulting services. As practicing data science consultants, we outline our experiences employing ChatGPT to assist patrons in various data science endeavors such as data cleaning, manipulation, preprocessing, coding, visualization, algorithm development, statistical modeling, and results interpretation. The discussion is targeted at all levels of data consulting. We will demonstrate ChatGPT in action, walk through its integration into our data consulting process, and discuss how to overcome challenges associated with using this and other generative AI models. ChatGPT can expand the capacity and efficiency of data science consulting, imparting vital skills that help consultants better cater to the evolving data-related needs of their patrons.

Presenting Author

Alp Tezbasaran

First Author

Alp Tezbasaran

CoAuthor

Shannon Ricci, North Carolina State University