Data Science for the Public Good (DSPG)



Each year the University of Virginia’s Biocomplexity Institute holds a summer program that trains graduate and undergraduate students in applied data science. The Data Science for the Public Good (DSPG) Young Scholars program is an immersive summer program for undergraduate and graduate students from across the country. The program engages students in research projects that address state, federal, and local government challenges around critical social issues. DSPG young scholars conduct research at the intersection of statistics, computation, and the social sciences to determine how information generated within every community can be leveraged to improve quality of life and inform public policy. Undergraduate interns and graduate fellows work in collaborative teams with postdoctoral associates and research faculty from the Social and Decision Analytics division, as well as with project stakeholders. From 2019 to 2021, we carried out projects on open source software (OSS) in the DSPG program. Click on the links to the project websites and papers to learn more about what we did during these summer projects.


DSPG Summer Class of 2021

Crystal Zang (Team Lead, University of Pittsburgh, Biostatistics),
Cierra Oliveira (Clemson University, Computing and Applied Sciences), and
Stephanie Zhang (University of Virginia, Mathematics/Probability/Statistics, Sociology)

Zang, C., Oliveira, C., Zhang, S., Kramer, B., and Korkmaz, G. “Defining and Measuring the Universe of Open Source Software Innovation.” (2021) [Summer Project RShiny App]

Since the advent of the internet, software has become an integral part of our lived social realities. From the rise of mobile phones to social media apps, software shapes how we interact with those around us and drives much of the economic growth seen around the world. Federal economic indicators developed by the National Center for Science & Engineering Statistics (NCSES) do not currently do well in measuring the value of goods and services that do not have market transactions (i.e., they are captured neither in surveys nor in economic measures such as the Gross Domestic Product, or GDP). Although the NCSES does track some types of software development, it is challenging to account for software that is developed outside of traditional business contexts. To address this gap, the NCSES is interested in evaluating the economic and social impact of Open Source Software (OSS) through the use of public administrative data. OSS refers to computer software whose source code is shared under a license in which the copyright holder grants the rights to study, change, or distribute the software to anyone and for any purpose. Over the past few years, our team has aimed to measure how much OSS is in use (stock), how much is created (flow), who is developing these tools (based on sectors, institutions, and organizations), and how OSS is shared across these various institutions. In past Data Science for the Public Good projects, we developed procedures to classify users into sectors, but very little work to date has examined how different types of software are used within and across these sectors. In this year’s OSS DSPG project, we will collect data from GitHub OSS repositories, classify these repositories into different OSS types, and evaluate how these different types of software are used within and across economic sectors.
Developing a procedure to classify repositories into categories will allow the NCSES to better determine the effect that variations in software may have on OSS contribution activity, collaboration tendencies in networked ecosystems, or on the overall cost of OSS projects.
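
As a rough illustration of what classifying repositories into types can look like, the sketch below assigns each repository the category whose keywords appear most often in its description. This is only a minimal example under stated assumptions, not the procedure used in the project: the category names, the keyword lists, the sample descriptions, and the function name classify_repo are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical OSS types and keywords, chosen only for illustration.
TYPE_KEYWORDS = {
    "machine learning": {"neural", "learning", "classifier"},
    "web development": {"frontend", "http", "web"},
    "database": {"sql", "database", "storage"},
}


def classify_repo(description: str) -> str:
    """Return the type whose keywords overlap most with the description.

    Ties go to the type listed first in TYPE_KEYWORDS; descriptions that
    match no keyword are labeled "other".
    """
    words = set(re.findall(r"[a-z]+", description.lower()))
    scores = {label: len(words & kws) for label, kws in TYPE_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"


# Made-up repository descriptions (not real project data).
descriptions = [
    "A deep learning classifier library",
    "Lightweight HTTP server for web apps",
    "Personal dotfiles",
]
type_counts = Counter(classify_repo(d) for d in descriptions)
```

A keyword tally is easy to inspect but coarse; a real classification procedure would need validated categories and a check against hand-labeled repositories.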


DSPG Summer Class of 2020

Daniel Bullock (Team Lead, Indiana University, Computational Neuroanatomy),
Morgan Klutzke (Indiana University, Psychology and Cognitive Science), and
Crystal Zang (University of Pittsburgh, Biostatistics)

Daniel Bullock, Morgan Klutzke, Crystal Zang, Brandon Kramer, Gizem Korkmaz, J Bayoán Santiago Calderón, and Aaron Schroeder. (2020) Sectoring Open Source Software: Where Do GitHub Contributions Come From? [Summer Project Website]

Current economic indicators and indicators developed by the National Center for Science & Engineering Statistics (NCSES) do not measure the value of goods and services that fall outside of market transactions. Although NCSES does track some types of software development, it is challenging to account for open source software (OSS) outputs because these products are largely being advanced outside of traditional business contexts. Moreover, while current measures of innovation tend to rely on survey data, patent issues, trademark approvals, intangible asset data, or estimates of total factor productivity growth, these measures are either incomplete or fail to capture innovation that is freely available to the public. In order to address these gaps, our project aims to measure how much OSS is in use (stock), how much is created (flow), who is developing these tools, and how OSS tools are being shared across different sectors, institutions, and organizations. Building on research conducted over the past three years, we examined the production, diffusion, and impact of open-source software in specific sectors, institutions, and geographic areas using data that we scraped from GitHub, the world’s largest code hosting platform. More specifically, we are interested in understanding how GitHub users from different economic sectors, academic institutions, and private organizations share resources within the context of OSS and the potential impact that this has on global innovation. Over the course of the 2020 Data Science for the Public Good Program, we worked to classify OSS contributors into these sectors, count the institutions with which users are affiliated within each sector, and research how users collaborate within and across these various sectors.
In our project, we implemented various methods, including web scraping (to collect the data), computational text analysis (to match and recode user affiliations), and social network analysis (to examine collaboration tendencies).
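
To make the text-analysis step more concrete, the sketch below shows one simple way that free-text affiliations could be matched and recoded into sectors: normalize each string, then look for known institution names. It is a minimal sketch under stated assumptions rather than the project's actual pipeline; the lookup table, the sector labels, the sample profile strings, and the function names normalize and assign_sector are hypothetical.

```python
import re
from collections import Counter

# Hypothetical lookup table: normalized institution name -> sector label.
SECTOR_LOOKUP = {
    "university of virginia": "academic",
    "indiana university": "academic",
    "microsoft": "business",
    "national science foundation": "government",
}


def normalize(affiliation: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", affiliation.lower())
    return re.sub(r"\s+", " ", text).strip()


def assign_sector(affiliation: str) -> str:
    """Return the sector of the first known institution found as whole words."""
    cleaned = f" {normalize(affiliation)} "
    for name, sector in SECTOR_LOOKUP.items():
        if f" {name} " in cleaned:
            return sector
    return "unclassified"


# Made-up affiliation strings of the kind users might enter in a profile.
profiles = [
    "@Microsoft",
    "University of Virginia, Biocomplexity Institute",
    "Indiana  University",
    "freelancer",
]
sector_counts = Counter(assign_sector(p) for p in profiles)
```

Exact matching after normalization misses abbreviations and misspellings, so in practice a step like this would likely be combined with fuzzy matching or manual review of the unclassified entries.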


DSPG Summer Class of 2019

Cong, C., Isch, C., Tobin, E., Korkmaz, G., Santiago Calderón, J.B., Schroeder, A. and Kramer, B. L. (2020) “Open Source: The Future of Software Innovation.” MethodSpace. SAGE Publishing. [Link to Paper]

Open source software (OSS) is a key part of our economy, yet it is currently not measured because it is ‘free’ and therefore not captured in the nation’s Gross Domestic Product. The National Center for Science and Engineering Statistics would like to better understand the contribution and value of OSS in order to provide policy makers with a more comprehensive picture of science and engineering indicators. In this project, various data sources, including online repositories such as GitHub and databases of copyrights, trademarks, and patents, are examined to assess whether they provide useful information about the quantity and value of OSS.