Papers

Current Papers

Robbins, C., Korkmaz, G., Calderon, J.B.S., and Kramer, B.L. “Measuring the Cost of Open Source Software Innovation on GitHub.” Draft Available. [GitHub Repo.]

Open Source Software (OSS), as defined by the Open Source Initiative, is computer software whose source code is shared under a license in which the copyright holder grants the rights to study, change, and distribute the software to anyone and for any purpose. OSS is developed, maintained, and extended both within and outside of the private sector, through the contributions of independent developers as well as people from universities, government research institutions, businesses, and nonprofits. Despite its ubiquity and extensive use, reliable measures of the scope and impact of OSS developed outside of the business sector are scarce. Activities around OSS development, a vital component of science activity, are not well measured in existing federal statistics on innovation. Many OSS projects are developed and maintained in free repositories, such as GitHub, and the information embedded in these repositories, including the code, contributors, and development activity, is publicly available. In this paper, we use data from GitHub, the largest such platform, with 31 million users and developers worldwide, to obtain information about OSS projects. We collect 5.2 million project repositories, containing metadata such as author, license, commits (approved code edits), and lines of code. We adopt methods used in software engineering to estimate the resource cost associated with creating OSS. We use lines of code as the measure of effort to estimate the time spent on software development, and we calculate the monetary value using the average compensation for computer programmers from Bureau of Labor Statistics wage data and other costs based on national accounts methodologies. The preliminary estimates show that the resource cost of developing open source software projects exceeds $928 billion, based on 2017 costs.
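As a rough illustration of how lines of code can be turned into a resource cost, the sketch below applies the basic COCOMO effort equation for "organic" projects (person-months = 2.4 × KLOC^1.05) and multiplies the result by a monthly compensation figure. The abstract does not name the specific model or its parameters, so the choice of basic COCOMO, the coefficient values, the function names, and the wage and overhead figures are all assumptions made for illustration, not the paper's method or data.

```python
def cocomo_person_months(lines_of_code, a=2.4, b=1.05):
    """Basic COCOMO effort in person-months (organic-mode defaults, assumed here)."""
    kloc = lines_of_code / 1000.0  # the model works in thousands of lines
    return a * kloc ** b


def resource_cost(lines_of_code, monthly_compensation, overhead_multiplier=1.0):
    """Effort times labor cost; overhead_multiplier stands in for non-wage costs."""
    effort = cocomo_person_months(lines_of_code)
    return effort * monthly_compensation * overhead_multiplier


if __name__ == "__main__":
    # Hypothetical repository sizes (lines of code) and a made-up monthly wage.
    repos = {"repo_a": 1_000, "repo_b": 25_000}
    for name, loc in repos.items():
        print(name, round(resource_cost(loc, monthly_compensation=8_000), 2))
```

Because the exponent is greater than 1, estimated effort grows slightly faster than code size, which is one reason per-repository estimates are usually computed before summing.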

Kramer, B.L., Korkmaz, G., Calderón, J.B.S., and Robbins, C. “International Collaboration in Open Source Software: A Longitudinal Network Analysis of GitHub.” Draft Available. [GitHub Repo.]

Over the past two decades, international collaboration has more than doubled in academic research. At the same time, the open source software community has burgeoned from a collection of small, dispersed communities to a multi-billion-dollar industry spanning several prominent industrial sectors around the world. To date, few studies have examined the structure of open source software development as a transnational collaboration system. In this paper, we study international collaboration networks in the open source community using data scraped from GitHub, the world’s largest remote-hosting repository platform. After collecting data from roughly 740,000 GitHub users from 241 different countries, we analyze longitudinal trends for both contributor- and country-level network data from 2008 to 2019. Our findings demonstrate that the contributor-level networks have grown exponentially while simultaneously becoming less dense, less centralized, and less transitive over time. In this network, GitHub users from the US have a disproportionately higher impact on collaborative efforts, as indexed by the fraction of contributions from other countries and various centrality measures. This influence carries over to the country-level networks, where most nations around the world are more likely to collaborate with the US than they are to collaborate with any other country, including their own. More generally, we find that the country-level network has become more structurally integrated over time, with some countries, such as China and India, gaining more influence in the open source community. In addition to offering novel insights about the history of open source collaboration tendencies, this paper also raises a number of important questions for future research to address.
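To make the terms "dense" and "transitive" concrete, here is a minimal sketch that computes both measures for undirected collaboration snapshots, one per year. The edge lists, user names, and function names are hypothetical, and the paper's own pipeline is not described in the abstract, so this should be read only as an illustration of the two standard definitions.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets; nodes with no edges are not represented."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """Share of possible node pairs that are actually connected."""
    n = len(adj)
    if n < 2:
        return 0.0
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * m / (n * (n - 1))


def transitivity(adj):
    """Closed connected triples divided by all connected triples."""
    closed = 0
    triples = 0
    for nbrs in adj.values():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                closed += 1  # each triangle is counted once per corner
    return closed / triples if triples else 0.0


if __name__ == "__main__":
    # Hypothetical yearly snapshots of who collaborated with whom.
    snapshots = {
        2008: [("ann", "bo"), ("bo", "cy"), ("ann", "cy")],
        2009: [("ann", "bo"), ("bo", "cy"), ("ann", "cy"), ("cy", "di"), ("di", "ed")],
    }
    for year, edges in sorted(snapshots.items()):
        adj = build_adjacency(edges)
        print(year, round(density(adj), 3), round(transitivity(adj), 3))
```

In this toy example the second snapshot has more users but lower density and transitivity, the same qualitative pattern the abstract reports for the contributor-level networks.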

2022

Kramer, B. L., Calderón, J.B., & Korkmaz, G. “U.S.-International Collaborations in Open-Source Software: 2018.” (2022) Invention, Knowledge Transfer, and Innovation. Science and Engineering Indicators 2022: Official Report Published by the National Center for Science and Engineering Statistics.

Santiago Calderón, J.B., Kramer, B. L., & Korkmaz, G. “Cumulative Contribution to U.S. Federal Departments and Agencies to Open-Source Software on GitHub: 2010-2019.” (2022) Invention, Knowledge Transfer, and Innovation. Science and Engineering Indicators 2022: Official Report Published by the National Center for Science and Engineering Statistics.

These two citations refer to the statistical indicators that are forthcoming in the National Center for Science and Engineering Statistics’ annual report (anticipated early 2022).

2021

Moradi-Jamei, B., Kramer, B.L., Calderón, J.B.S., and Korkmaz, G. “Community Formation and Detection on GitHub Collaboration Networks.” Proceedings of the 2021 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. [Link to Paper on ArXiv]

This paper studies community formation in OSS collaboration networks. While most current work examines the emergence of small-scale OSS projects, our approach draws on a large-scale historical dataset of 1.8 million GitHub users and their repository contributions. OSS collaborations are characterized by small groups of users that work closely together, leading to the presence of communities defined by short cycles in the underlying network structure. To understand the impact of this phenomenon, we apply a pre-processing step that accounts for the cyclic network structure by using Renewal-Nonbacktracking Random Walks (RNBRW) and the strength of pairwise collaborations before implementing the Louvain method to identify communities within the network. Equipping Louvain with RNBRW and the contribution strength provides a more assertive approach for detecting small-scale teams and reveals nontrivial differences in community detection, such as users’ tendencies toward preferential attachment to more established collaboration communities. Using this method, we also identify key factors that affect community formation, including the effect of users’ location and primary programming language, which was determined using a comparative method of contribution activities. Overall, this paper offers several promising methodological insights for both open-source software experts and network scholars interested in studying team formation.
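The cycle-based pre-processing idea can be sketched as follows. This is a simplified reading of RNBRW, not the authors' implementation: each walk starts on a random directed edge, never steps straight back along the edge it just used, and stops as soon as it revisits a node, at which point the edge that closed the cycle is credited. The function name, the handling of dead ends, and the toy edge list are assumptions; the combination with contribution strength and the Louvain step itself are not shown.

```python
import random


def rnbrw_edge_weights(edges, n_walks=1000, seed=0):
    """Fraction of walks in which each undirected edge closed a cycle (sketch)."""
    rng = random.Random(seed)
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops are skipped in this sketch
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    # Sorted lists keep the seeded walks reproducible across runs.
    neighbors = {node: sorted(nbrs) for node, nbrs in neighbors.items()}
    starts = [(u, v) for u in sorted(neighbors) for v in neighbors[u]]
    counts = {tuple(sorted(edge)): 0 for edge in starts}

    for _ in range(n_walks):
        prev, cur = rng.choice(starts)  # renewal: a fresh walk each time
        visited = {prev, cur}
        while True:
            options = [w for w in neighbors[cur] if w != prev]  # no backtracking
            if not options:
                break  # dead end: the walk is dropped without credit
            nxt = rng.choice(options)
            if nxt in visited:
                counts[tuple(sorted((cur, nxt)))] += 1  # this edge closed a cycle
                break
            visited.add(nxt)
            prev, cur = cur, nxt

    total = sum(counts.values())
    return {edge: (c / total if total else 0.0) for edge, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical graph: a tight triangle of collaborators plus one loosely attached user.
    toy = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
    for edge, weight in sorted(rnbrw_edge_weights(toy).items()):
        print(edge, round(weight, 3))
```

Edges that sit on short cycles collect weight, while bridges such as ("c", "d") collect none. Weights of this kind could then be supplied as edge weights to a Louvain implementation so that tightly knit teams are favored.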

2020

Korkmaz, G., Kelling, C., Robbins, C., and Keller, S. (2020) “Modeling the Impact of Python and R Packages Using Dependency and Contributor Networks.” Social Network Analysis and Mining. [Link to Paper]

This paper develops methods to estimate the factors that affect the impact of open-source software (OSS), measured by number of downloads, with a study of Python and R packages. The OSS community is characterized by a high level of collaboration and sharing, which results in interactions between contributors as well as between packages due to reuse. We use data collected from Depsy.org about the development activities of Python and R packages, and generate the dependency and contributor networks. We develop three Quasi-Poisson models for each of the Python and R communities using network characteristics, as well as author and package attributes. We find that the more derivative a package is (the more dependencies it has), the less likely it is to have a high impact. We also show that the centrality of a package in the dependency network, measured by the out-degree, closeness centrality, and PageRank, has a significant effect on its impact. Moreover, the closeness and weighted degree centralities of the developers in the Python and R contributor networks play an important role. We also find that introducing network features to a baseline model using only package features (e.g., number of authors, number of commits) improves the performance of the models.
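For readers unfamiliar with Quasi-Poisson models, the sketch below shows the one ingredient that distinguishes them from an ordinary Poisson regression: a dispersion factor, estimated as the Pearson chi-square statistic divided by the residual degrees of freedom, which rescales the standard errors when download counts vary more than a Poisson model expects. The function names and the toy numbers are assumptions for illustration; the fitted means would come from a Poisson regression that is not shown here, and none of this reproduces the paper's models.

```python
import math


def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square divided by residual degrees of freedom.

    `fitted` holds the model's predicted means, which must all be positive.
    """
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    dof = len(observed) - n_params
    if dof <= 0:
        raise ValueError("need more observations than parameters")
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi2 / dof


def quasi_poisson_se(poisson_se, dispersion):
    """Poisson standard error inflated by the square root of the dispersion."""
    return poisson_se * math.sqrt(dispersion)


if __name__ == "__main__":
    # Made-up download counts and an intercept-only fit (every mean equals 2).
    downloads = [0, 2, 4]
    means = [2.0, 2.0, 2.0]
    phi = pearson_dispersion(downloads, means, n_params=1)
    print("dispersion:", phi)
    print("adjusted SE:", quasi_poisson_se(0.1, phi))
```

A dispersion near 1 means the plain Poisson standard errors are adequate; values well above 1 indicate overdispersion, which is common for heavily skewed counts such as package downloads.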

2018

Keller, S., Korkmaz, G., Robbins, C., and Shipp, S. (2018). “Opportunities to Observe and Measure Intangible Inputs to Innovation: Definitions, Operationalization, and Examples.” Proceedings of the National Academy of Sciences, 115(50), 12638-12645. [Link to Paper]

Measuring the value of intangibles is not easy, because they are critical but usually invisible components of the innovation process. Today, access to nonsurvey data sources, such as administrative data and repositories captured on web pages, opens opportunities to create intangibles based on new sources of information and capture intangible innovations in new ways. Intangibles include ownership of innovative property and human resources that make a company unique but are currently unmeasured. For example, intangibles represent the value of a company’s databases and software, the tacit knowledge of their workers, and the investments in research and development (R&D) and design. Through two case studies, the challenges and processes to both create and measure intangibles are presented using a data science framework that outlines processes to discover, acquire, profile, clean, link, explore the fitness-for-use, and statistically analyze the data. The first case study shows that creating organizational innovation is possible by linking administrative data across business processes in a Fortune 500 company. The motivation for this research is to develop company processes capable of synchronizing their supply chain end to end while capturing dynamics that can alter the inventory, profits, and service balance. The second example shows the feasibility of measurement of innovation related to the characteristics of open source software through data scraped from software repositories that provide this information. The ultimate goal is to develop accurate and repeatable measures to estimate the value of nonbusiness sector open source software to the economy. This early work shows the feasibility of these approaches.

Korkmaz, G., Kelling, C., Robbins, C., and Keller, S. (2018). “Modeling the Impact of R Packages Using Dependency and Contributor Networks.” 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 511-514. [Link to Paper]

This paper aims to identify the factors that affect the impact of Open Source Software (OSS), measured by number of downloads and citations, with a case study of R packages. We generate the dependency and contributor networks of the packages using data collected from Depsy.org, and develop statistical models that use the network characteristics, as well as author and package attributes. We find that there are common network and package attributes that are important in determining both the number of downloads and citations of a package, including degree, closeness and betweenness centralities, as well as package attributes such as number of authors and number of commits.