From 288ad6be9e947f3cdde7666341b0da4c6e3c9738 Mon Sep 17 00:00:00 2001 From: "J.C. Subida" Date: Fri, 2 Aug 2024 10:47:39 -0500 Subject: [PATCH 01/10] Update bofs.md --- pages/program/abstracts/bofs.md | 182 ++++++++++++++++++++++++++++++ 1 file changed, 182 insertions(+) diff --git a/pages/program/abstracts/bofs.md b/pages/program/abstracts/bofs.md index 5bf1af7..286b66e 100644 --- a/pages/program/abstracts/bofs.md +++ b/pages/program/abstracts/bofs.md @@ -12,46 +12,228 @@ set_last_modified: true _Julia Damerow, Jeffrey C. Carver, Jason Yalim_ +Research software engineering is a profession without a clear educational pathway. The +international RSE survey from 2022 (Hettrick et al. 2022) found that about 23% of all +respondents finished their highest level of education in computer science, closely followed by +physics and astronomy (21%), and biological sciences (10%). Respondents from the US had a +very similar educational background with computer science (23%), physics and astronomy +(19%), followed by mathematics (12%). Moreover, 42% of respondents worldwide and 30% of +respondents from the US did not consider themselves a professional software developer. These +statistics raise the question of how we can train and prepare future research software engineers +given that the “traditional” educational pathway of software developers in the private sector does +not apply to research software engineers. Furthermore, there are many researchers and +students who develop software for research who do not aspire to become full-time research +software engineers but rather use programming as a tool in their research. How can we train +them in research software engineering best practices in order to increase code quality, +sustainability, and reusability? + +We propose a Birds of a Feather session to discuss different approaches, projects, and +initiatives that teach the principles and practice of research software engineering. We plan to +start the session with four 7-minute presentations about teaching efforts in different settings, +followed by an audience discussion with planned breakout groups depending on the number of +participants. The presentations are as follows: + +- “Teaching the Fundamentals of Research Software Engineering to Graduate Students” by +Julia Damerow +- “INnovative Training Enabled by a Research Software Engineering Community of Trainers +(INTERSECT)” by Jeff Carver and Ian Cosden +- “The Better Scientific Software (BSSw) Tutorial Series” by David E. Bernholdt and Anshu +Dubey +- “Incorporating research software engineering best practices into NSF CIREN +cyberinfrastructure professional training” by Marisa Brazil and Gil Speyer +To start off the discussion with the audience, we will create a poll and compile a list of +discussion points with questions such as: + - In your experience, what is the biggest challenge when teaching research software +engineering? + - What makes research software engineering unique? + - What’s the typical background of the people you teach? + - At a minimum, what should an aspiring RSE be trained in? + - What’s the best career timing(s) for RSEng training? + --- ## Mapping Open Source Science _Jonathan Starr_ +This BoF will bring the RSE perspective to MOSS – RSEs are perhaps best positioned to +provide guidance on these next steps. We will discuss what has been developed already, +familiarize ourselves with the map by adding to it, and work to more closely align the map with +the goals of RSEs.
We will explore and discuss impact metrics, dependency graphing, +stewardship of open source repositories by organizations, corporations, and institutions, along +with citation graphing for open source research software. We will highlight work that has been +done in this field, including ETO’s ORCA and Map of Science projects, and we will chart the +path forward for the development of an Encyclopédie for the digital knowledge and research +ecosystem. + --- ## Exploring the Potential Impact of Advancements in Artificial Intelligence on the RSE Profession _David Luet_ +I am proposing an exploratory discussion on the many ways AI could impact the +RSE profession. I have discussed this topic with many different RSEs and each +time I received a different answer. I would like to get a sense of the distribution of +feelings towards AI within the community. Once the potential impacts of AI on the +profession have been discussed in the first part of the BoF, a second part would +explore the potential strategies to best position ourselves to take advantage of the +predicted changes. + +AI is already transforming the methodologies and practices in software +development and engineering. For example, GitHub Copilot is a popular coding +assistant that helps improve the programmer’s productivity. Some work is being done +at Meta to have an AI write tests. There are also AI-driven code review tools, +automated documentation generation, and AI-assisted debugging. + +What other aspects of the profession beyond software development could be +impacted? Project management and data analysis are two examples. How will +the wide adoption of AI impact junior developers? Will it free them from the +menial tasks and let them focus on the creative ones, or will it take away tasks +that are stepping stones for becoming a better programmer? AI can be a valuable +teaching tool, but there is a risk of training Copy/Paste Engineers instead. How +about intellectual property? What happens to the code generated by an AI +trained on your open-source code published under a GPL license? Before the conference, the author would like to run a similar survey within the US-RSE community. The main question for this survey would be: "In the next 5 years, how much impact do you think the use of artificial intelligence will have on your job?" The goal of the survey would be to get a sense of where the community stands, from dismissive ("This won’t affect my job at all") to optimistic ("This will make me so much more productive that it will help speed up research") to pessimistic ("There won’t be a need for RSEs; the researchers will just explain their problems to an AI, which will then write the code for them"). + --- ## Brainstorming Strategies for Cultivating Successful and Collaborative RSE Teams _Abbey Roelofs and Kristina Riemer_ +Research software engineering groups face unique challenges in functioning effectively and +smoothly together. Technical skills and development, project management, and collaboration +methods both within the group and with researchers and domain experts all need to be +taken into consideration. RSEs often have diverse backgrounds and skill sets with regard to +these various aspects of the profession, and it can be a challenge to balance these competing +concerns. How do RSE groups address these challenges to work well together as productive, +collaborative, and satisfied teams?
+ +We see this BoF session as an opportunity for everyone involved with RSE work as part of a +group to come together and share experiences and ideas that have helped teams work well +together, and also lessons learned about what has not been effective for teamwork. We +will choose topics together and break out into smaller groups for discussion. We expect these +topics to potentially include (but not be limited to) professional development, project +management, communication tools, retention approaches, goal setting, interdisciplinary work, +and mentorship models; these are topics that repeatedly arise on Slack channels, in Community +Calls, and in various Working Group conversations and are clearly of interest to the community. +Following our breakout discussions, groups will report back to the session as a whole. +This session will facilitate an open discussion of ideas and methods for building and +strengthening RSE teams. Our hope is that participants will be able to take some of these ideas +back to their own teams, follow up with the community via Slack or future conversations on how +these ideas worked, and build our collective knowledge base of tools and practices for +cultivating successful RSE teams. + --- ## Navigating the Remote Landscape: Working Effectively with Stakeholders _Troy Comi_ +The global pandemic has significantly reshaped the landscape of work, prompting a surge in +remote work arrangements. Advances in communication technologies and virtual collaboration +tools have facilitated interactions among geographically dispersed teams [1,2]. Research Software +Engineers (RSEs) have not been immune to this shift, as organizations increasingly recognize +the benefits of virtual collaboration [3]. In this BoF, we explore the impact of remote work on RSEs, +emphasizing both its advantages and challenges. We also delve into strategies for effective +collaboration with stakeholders in a distributed work environment. + +Remote work offers RSEs greater flexibility in managing their schedules. Freed from the +constraints of physical office spaces, they can tailor their work hours to align with their +productivity peaks. The ability to work remotely also contributes to improved work-life balance. +RSEs can better integrate professional responsibilities with personal commitments, leading to +enhanced overall well-being. Finally, eliminating daily commutes translates to less stress, more +time for productive work, and lower carbon emissions [4]. The perceived benefits are best +quantified by Barrero et al., who report that 2-3 days/week of remote work is valued at a 5-10% +raise by employees. + +Of course, remote and WFH arrangements have disadvantages. Family members, pets, and +household chores can disrupt focus and lead to underworking. At the same time, the absence of +clear physical boundaries between work and personal life can lead to overworking [5]. It is +essential to establish a dedicated workspace and routines to mitigate possible interruptions and +challenges. Remote workers may be more likely to be passed over for promotion due to being +physically absent. Finally, an often-cited concern with fully remote working agreements is the +difficulty in maintaining effective communication with stakeholders [3]. RSEs must bridge the gap +between remote work and in-person interactions, ensuring that project goals are met even when they are the +exception in office-first settings.
+ +To address these challenges, this BoF session will feature experienced RSEs who have +successfully navigated fully remote work environments and stakeholder relationships. During the panel +discussion, we will share insights, strategies, and practical tools for fostering collaborations and +personal connections. Our aim is to empower RSEs to thrive in a virtual work setting, whether +their stakeholders are local or globally distributed. + --- ## Better Scientific Software Fellowship Community _Elsa Gonsiorowski, Erik Palmer and Mary Ann Leung_ +Software developers face increasing complexity in computational models, computer +architectures, and emerging workflows. In this environment, Research Software Engineers need +to continually improve software practices and constantly hone their craft. To address this need, +the Better Scientific Software (BSSw) Fellowship Program launched in 2018 to seed a +community of like-minded individuals interested in improving all aspects of the work of software +development. To this aim, the BSSw Fellowship Program fosters and promotes practices, +processes, and tools to improve developer productivity and software sustainability. + +Our community of BSSw Fellowship alums serve as leaders, mentors, and consultants, thereby +increasing the visibility of all those involved in research software production and sustainability in +the pursuit of discovery. This session will present the BSSw Fellowship (BSSwF) Program, +briefly discussing our successes in developing a community around software development +practices. We will invite conference participants to benefit from our resources and explain how +they can join us in this effort. The session will also highlight the work of recent fellowship +awardees and honorable mentions in attendance at US-RSE’24. As many in the BSSwF +community identify as RSEs, and BSSwF projects are of particular relevance, this forum will +serve to amplify the connections between our communities. + --- ## Sharing lessons learned on the challenges of fielding research software proof-of-concepts / prototypes in Department of Defense (DoD) and other Government environments _Daniel Strassler_ +This Birds of a Feather discussion will focus on the challenges of fielding research software +proof-of-concepts / prototypes in Department of Defense (DoD) / Government environments. +These environments vary widely and have both technical and non-technical challenges. +Navigating these challenges can often be extremely taxing. The group of expert +panelists will discuss their experiences and answer moderator questions. They will also field +questions from the audience. The lessons learned and guidance provided will help inform +research software engineers on the challenges they will encounter in DoD / Government +environments and how to mitigate them. + --- ## RSEs in domain-specific ecosystems _Julia Damerow, Rebecca Sutton Koeser, Laure Thompson and Jeri E. Wierenga_ +Research Software Engineers (RSEs) are still relatively new and rare in the humanities and +social sciences, and the term is being adopted to cover a wide range of technical work. As a +result, the role of the RSE in these disciplines is often distinct from that role in the sciences. +Those who write about the RSE career path discuss the value of domain knowledge (Cosden et +al., US-RSE & IEEE), but our experience suggests RSE work differs widely by domain. + +Each research domain has its own existing ecosystem, and how an RSE fits into it varies.
In some cases, this is a clear continuation and professionalization of existing work. In other fields, this is +a partially or wholly new research endeavor. In the humanities, most scholars are new to +collaborative research, to thinking of their research content as data, and to computational +approaches. How do these ecosystems impact the RSE role, and how might RSEs be +intentionally or accidentally different? How might more nascent perspectives expand the +affordances of the RSE role? + +In our experience, the most successful digital humanities (DH) projects involve an RSE as +co-author or co-PI. The RSE must take an active role in operationalizing research questions +because humanities researchers don’t know which methodologies and measurements are possible +or useful to answer their questions; something that is taken for granted in other spaces because +researchers are trained in this kind of work. For instance, to determine whether a corpus of +documents supports a particular hypothesis, specific statistics (such as word frequencies) need to +be generated, which can then be interpreted. Researchers need to understand how these statistics +are created and what they mean, which is typically not part of the standard curriculum. + +We propose a BoF session to share and compare experiences of RSEs in specific research +domains. How do these differences impact the training needs of RSEs? What does it mean for +institutional support and infrastructure? How might the DH RSE experience benefit and inform +RSE roles in other domains? We plan to start the session with a panelist discussion followed by +a conversation with the audience. + --- From 27d85f50f6c3bc8d6e2b51e07d34109a7f3bcfe2 Mon Sep 17 00:00:00 2001 From: "J.C. Subida" Date: Fri, 2 Aug 2024 10:49:36 -0500 Subject: [PATCH 02/10] Update workshops.md --- pages/program/abstracts/workshops.md | 55 ++++++++++++++++++++++++++++ 1 file changed, 55 insertions(+) diff --git a/pages/program/abstracts/workshops.md b/pages/program/abstracts/workshops.md index 256db05..7853701 100644 --- a/pages/program/abstracts/workshops.md +++ b/pages/program/abstracts/workshops.md @@ -11,20 +11,75 @@ set_last_modified: true _[Lee Liming](http://www.uchicago.edu/) and [Steve Turoscy](https://www.globus.org)_ +Learning Topic. This workshop will equip Research Software Engineers (RSEs) to automate +research data processes using Globus Compute and Globus Flows: two parts of the +Globus Platform. Globus Compute enables RSEs to securely execute Python functions on +remote computers, including campus, cloud, and national-scale systems. Globus Flows provides +secure, managed automation of complex workflows. Together, Globus Compute and Flows +enable RSEs to construct data processing pipelines, managed and executed by Globus. + --- ## Establishing RSE Programs - From early stage formalization to mature models _[Ian Cosden](https://researchcomputing.princeton.edu/services/research-software-engineering), [Sandra Gesing](https://www.sdsc.edu/) and [Adam Rubens](https://rsenyc.org/)_ +Discussion Topic. Research Software Engineering has emerged as a critical +discipline at the intersection of research and software engineering. In recent years, as the +demand for Research Software Engineers (RSEs) has increased, so has the emergence of +dedicated RSE groups for such roles within research organizations. 
This trend underscores the +recognition of the unique skill set and expertise that RSEs bring to research projects, bridging +the gap between domain-specific knowledge and software development proficiency. However, +the sustainability of these RSE teams relies heavily on effective business models that balance +financial sustainability with academic and research goals in the research enterprise. This +workshop will discuss the strategies for structuring and supporting RSE teams within research +organizations, emphasizing the importance of aligning business objectives with the overarching +mission of advancing science. By exploring diverse business models tailored to the specific +needs and goals of RSE initiatives, organizations can not only ensure the longevity of their RSE +teams but also maximize their impact on driving innovation and discovery across disciplines. +We will discuss how to identify need, understand existing use cases, catalog assets to convey +feasibility, and determine the best program models to pursue. We aim to facilitate +discussions on different organizational approaches, sustainable funding and growth strategies, +and the evolution towards robust group models. + --- ## Emerging as a Team Leader through Cultural Challenges _[Elaine M. Raybourn](https://www.sandia.gov/-emraybo/), Angela Herring and Ryan Shaw_ +In this interactive workshop, we explore how meeting technical objectives depends critically on +meeting cultural challenges within teams. Many small teams evolve naturally on a positive path +because teams are often formed by friendly collaborators with common interests and goals. +But as teams grow, they face new challenges, including fragmented time commitments, +physically distributed developers, and growing numbers of stakeholders. These challenges +make it significantly harder to maintain a healthy team culture. In addition, it becomes +important to understand how to identify and responsibly use one’s informal influence to +navigate organizations and galvanize team members. Although agile methodologies and +modern software tools can help developers manage these challenges, these processes alone +are insufficient. Thus, a growing number of teams are combining these methodologies with +techniques that support a collaborative culture. We explore lightweight progress tracking, +informal "tea-time", hybrid scrums, human-centered design, empathy building, listening games, +and hands-on exercises with Legos. The session is 180 min of guided small group exercises and +debriefing, which includes 30 min of large group discussion and retrospective. We would like to +allow the workshop to continue working through the break. We recap lessons learned and +explore how culture affects team performance and emerging as a team leader. We focus on +skill building in intercultural and interpersonal communication, and inclusion. + --- ## Special Workshop: Community discussion: teachingRSE project _Jan Philipp Thiele_ + +Discussion Topic. In this workshop we want to involve the wider community in the +teachingRSE project that started out at deRSE23 in Paderborn. +As a first output, a paper on RSE competencies and responsibilities was written, +and we want to invite the wider RSE community to provide feedback to hopefully arrive at a +common understanding of the skills of an RSE.
+Currently, the group is working on how to include teaching of RSE competencies into the wider +academic infrastructure and for this we will break out into different groups centered around more +specific questions, e.g. ‘Strategies for getting RSE topics included’, +‘Lessons learned in teaching at academic institutions’. +Finally, we will discuss the plan going forward, including opportunities to participate regularly in +the project. From 378756fdeedb7bce9e57d9bab3a1a09515a0356e Mon Sep 17 00:00:00 2001 From: "J.C. Subida" Date: Fri, 2 Aug 2024 11:08:58 -0500 Subject: [PATCH 03/10] Add program papers --- _data/menus/program.yml | 2 + _data/navigation.yml | 2 + pages/program/abstracts/papers.md | 113 +++++++++++ pages/program/abstracts/papers.tbd | 293 ----------------------------- pages/program/program.md | 1 + 5 files changed, 118 insertions(+), 293 deletions(-) create mode 100644 pages/program/abstracts/papers.md delete mode 100644 pages/program/abstracts/papers.tbd diff --git a/_data/menus/program.yml b/_data/menus/program.yml index 39e4340..2e741dc 100644 --- a/_data/menus/program.yml +++ b/_data/menus/program.yml @@ -4,6 +4,8 @@ link: program/ - name: Birds of a Feather link: program/bofs/ + - name: Papers + link: program/papers/ - name: Tutorials link: program/tutorials/ - name: Workshops diff --git a/_data/navigation.yml b/_data/navigation.yml index 494b8f8..8ce1edd 100644 --- a/_data/navigation.yml +++ b/_data/navigation.yml @@ -9,6 +9,8 @@ link: program/ - name: Birds of a Feather link: program/bofs/ + - name: Papers + link: program/papers/ - name: Tutorials link: program/tutorials/ - name: Workshops diff --git a/pages/program/abstracts/papers.md b/pages/program/abstracts/papers.md new file mode 100644 index 0000000..2e5f837 --- /dev/null +++ b/pages/program/abstracts/papers.md @@ -0,0 +1,113 @@ +--- +layout: page +title: Papers +description: +menubar: program +permalink: program/papers/ +menubar_toc: true +set_last_modified: true +--- + +
+
+
+
+ +
+
+ +
+
+

The demand for efficient and innovative tools in research environments is ever-increasing in the rapidly + evolving landscape of artificial intelligence (AI) and machine learning (ML). This paper explores the + implementation of retrieval-augmented generation (RAG) to enhance the contextual accuracy and applicability of + large language models (LLMs) to meet the diverse needs of researchers. By integrating RAG, we address various + tasks such as synthesizing extensive questionnaire data, efficiently searching through document collections, + and extracting detailed information from multiple sources. Our implementation leverages open-source libraries, + a centralized repository of pre-trained models, and high-performance computing resources to provide + researchers with robust, private, and scalable solutions.
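For readers who want to see the shape of such a pipeline, the sketch below illustrates the retrieval step with the open-source sentence-transformers package; the model name, example documents, and the final LLM call are illustrative assumptions rather than the implementation described in this paper.

```python
# Minimal retrieval-augmented generation sketch: embed a document collection,
# retrieve the passages most relevant to a query, and assemble an augmented prompt.
# The model, documents, and prompt format are placeholders for illustration only.
from sentence_transformers import SentenceTransformer, util

documents = [
    "Questionnaire summary: respondents described barriers to sharing data.",
    "Grant report: the team maintains three open-source analysis packages.",
    "Interview notes: onboarding to the HPC cluster took several weeks.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, locally hosted embedding model
doc_embeddings = model.encode(documents, convert_to_tensor=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    query_embedding = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, doc_embeddings, top_k=k)[0]
    return [documents[hit["corpus_id"]] for hit in hits]

query = "What slowed researchers down when starting on the cluster?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to a locally hosted LLM
```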

+
+
+
+
+
+
+
+ +
+
+ +
+
+

Lab notebooks are an integral part of science by documenting and tracking research progress in laboratories. + However, existing electronic solutions have not properly leveraged the full extent of capabilities provided by a + digital environment, resulting in most physics laboratory notebooks merely mimicking their physical counterparts + on a computer. To address this situation, we report here preliminary work toward a novel electronic laboratory + notebook, Lab Dragon, designed to empower researchers to create customized notebooks that optimize the benefits + of digital technology.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

In this work we aim to partially answer the question, "Just how many research software projects are out there?” + by searching for open source GitHub projects affiliated with research universities in the United States. We + explore this through keyword searches on GitHub itself and by scraping university websites for links to GitHub + repositories. We then filter these results by using a large language model to classify GitHub repositories as + research software engineering projects or not, finding over 35,000 RSE repositories. We report our results by + university. We then analyze these repositories against metrics of popularity, such as stars and repository + forks, and find just under 14,000 RSE repositories meet our minimum criteria for projects which have a + community. Based on the time since a developer last pushed a change to a RSE repository with a community, we + further posit that 3,300 RSE repositories with communities and a link to a research university are at risk of + dying, and thus may benefit from sustainability support. Finally, across all RSE projects linked to a research + university, we empirically find the top repository languages are Python, C++, and Jupyter Notebook.
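As a rough sketch of the first step only, the snippet below runs a keyword search against the GitHub REST search API and prints the popularity signals mentioned above; the query string is an invented example, and the web-scraping and LLM-classification stages of the full pipeline are not shown.

```python
# Keyword search against the GitHub search API, printing the popularity signals
# (stars, forks) used to judge whether a repository has a community around it.
# The query below is an invented example; authentication uses an optional token.
import os
import requests

TOKEN = os.environ.get("GITHUB_TOKEN", "")  # unauthenticated requests are tightly rate-limited

def search_repositories(query: str, per_page: int = 50) -> list[dict]:
    """Return repository records matching a GitHub repository-search query."""
    headers = {"Accept": "application/vnd.github+json"}
    if TOKEN:
        headers["Authorization"] = f"Bearer {TOKEN}"
    response = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": query, "per_page": per_page, "sort": "stars"},
        headers=headers,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["items"]

for repo in search_repositories('"University of Example" in:description,readme')[:10]:
    print(repo["full_name"], repo["stargazers_count"], repo["forks_count"], repo["pushed_at"])
```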

+
+
+
+ +
+
+
+ +
+
+ +
+
+

In the realm of scientific software development, adherence to best practices is often advocated. However, + implementing these can be challenging due to differing opinions. Certain aspects, such as software licenses and + naming conventions, are typically left to the discretion of the development team. Our team has established a set + of preferred practices, informed by, but not limited to, widely accepted best practices. These preferred + practices are derived from our understanding of the specific contexts and user needs we cater to. To facilitate + the dissemination of these practices among our team and foster standardization with collaborating domain + scientists, we have created a project template for Python projects. This template serves as a platform for + discussing the implementation of various decisions. This paper will succinctly delineate the components that + constitute an effective project template and elucidate the advantages of consolidating preferred practices in + such a manner.

+
+
+
diff --git a/pages/program/abstracts/papers.tbd b/pages/program/abstracts/papers.tbd deleted file mode 100644 index 18d83a4..0000000 --- a/pages/program/abstracts/papers.tbd +++ /dev/null @@ -1,293 +0,0 @@ ---- -layout: page -title: Papers -description: -menubar: program -permalink: program/papers/ -menubar_toc: true -set_last_modified: true ---- - -## Long Papers - -
-
-
-
- -
-
- -
-
-

Research software once was a heroic and lonely activity, particularly in research computing and in HPC. But today, research software is a social activity, in the senses that most software depends on other software, and that most software that is intended to be used by more than one person is written by more than one person. These social factors have led to generally accepted practices for code development and maintenance and for interactions around code. This paper examines how these practices form, become accepted, and later change in different contexts. In addition, given that research software engineering (RSEng) and research software engineers (RSEs) are becoming accepted parts of the research software endeavor, it looks at the role of RSEs in creating, adapting, and infusing these practices. It does so by examining aspects around practices at three levels: in communities, projects, and groups. Because RSEs are often the point where new practices become accepted and then disseminated, this paper suggests that tool and practice developers should be working to get RSE champions to adopt their tools and practices, and that people who seek to understand research software practices should be studying RSEs. It also suggests areas for further research to test this idea.

-
-
-
-
-
-
- -
-
-
-
-

Artificial intelligence (AI) and machine learning (ML) have been shown to be increasingly helpful tools in a growing number of use-cases relevant to scientific research, despite significant software-related obstacles. There exist large technical costs to setting up, using, and maintaining AI/ML models in production. This often prevents researchers from utilizing these models in their work. The growing field of machine learning operations (MLOps) aims to automate much of the AI/ML life cycle while increasing access to these models. This paper presents the initial work in creating a nuclear energy MLOps platform for use by researchers at Idaho National Laboratory (INL) and aims to reduce the barriers of using AI/ML in scientific research. Our goal is to promote the integration of the latest AI/ML technologies into researchers' workflows and create more opportunity for scientific innovation. In this paper we discuss how our MLOps efforts aim to increase usage and the impact of AI/ML models created by researchers. We also present several use-cases that are currently integrated. Finally, we evaluate the maturity of our project as well as our plans for future functionality.

-
-
-
-
-
-
- -
-
-
-
-

Software sustainability is critical for Computational Science and Engineering (CSE) software. Highly complex code makes software more difficult to maintain and less sustainable. Code reviews are a valuable part of the software development lifecycle and can be executed in a way to manage complexity and promote sustainability. To guide the code review process, we have developed a technique that considers cyclomatic complexity levels and changes during code reviews. Using real-world examples, this paper provides analysis of metrics gathered via GitHub Actions for several pull requests and demonstrates the application of this approach in support of software maintainability and sustainability.
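The snippet below is a simplified stand-in for the kind of check described here, computing cyclomatic complexity for changed files with the open-source radon package; the threshold and output format are arbitrary examples, not the GitHub Actions configuration used by the authors.

```python
# Report functions whose cyclomatic complexity exceeds a review threshold.
# Assumes the open-source `radon` package; the threshold value is an example only.
import sys
from radon.complexity import cc_visit, cc_rank

THRESHOLD = 10  # flag anything above this for discussion during code review

def report(path: str) -> bool:
    """Print offending blocks in one file; return True if any exceed the threshold."""
    with open(path, encoding="utf-8") as handle:
        source = handle.read()
    flagged = False
    for block in cc_visit(source):  # one entry per function, method, or class
        if block.complexity > THRESHOLD:
            flagged = True
            print(f"{path}:{block.lineno} {block.name} "
                  f"complexity={block.complexity} rank={cc_rank(block.complexity)}")
    return flagged

if __name__ == "__main__":
    # A CI job could pass the Python files touched by a pull request as arguments.
    results = [report(path) for path in sys.argv[1:]]
    sys.exit(1 if any(results) else 0)
```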

-
-
-
-
-
-
- -
-
- -
-
-

Increasingly, scientific research teams desire to incorporate machine learning into their existing computational workflows. Codebases must be augmented, and datasets must be prepared for domain-specific machine learning processes. Team members involved in software development and data maintenance, particularly research software engineers, can foster the design, implementation, and maintenance of infrastructures that allow for new methodologies in the pursuit of discovery. In this paper, we highlight some of the main challenges and offer assistance in planning and implementing machine learning projects for science.

-
-
-
-
-
-
- -
-
-
-
-

Sandia National Laboratories is a premier United States national security laboratory which develops science-based technologies in areas such as nuclear deterrence, energy production, and climate change. Computing plays a key role in its diverse missions, and within that environment, Research Software Engineers (RSEs) and other scientific software developers utilize testing automation to ensure quality and maintainability of their work. We conducted a Participatory Action Research study to explore the challenges and strategies for testing automation through the lens of academic literature. Through the experiences collected and comparison with open literature, we identify these challenges in testing automation and then present strategies for mitigation grounded in evidence-based practice and experience reports that other, similar institutions can assess for their automation needs.

-
-
-
-
-
-
- -
-
-
-
-

Visual Studio Code (VSCode) has emerged as one of the most popular development tools among professional developers and programmers, offering a versatile and powerful coding environment. However, configuring and setting up VSCode to work effectively within the unique environment of a shared High-Performance Computing (HPC) cluster remains a challenge. This paper discusses the configuration and integration of VSCode with the diverse and demanding environments typically found on HPC clusters. We demonstrate how to configure and set up VSCode to take full advantage of its capabilities while ensuring seamless integration with HPC-specific resources and tools. Our objective is to enable developers to harness the power of VSCode for HPC applications, resulting in improved productivity, better code quality, and accelerated scientific discovery.

-
-
-
-
-
-
- -
-
- -
-
-

We provide an overview of the software engineering efforts and their impact in QMCPACK, a production-level ab-initio Quantum Monte Carlo open-source code targeting high-performance computing (HPC) systems. Aspects included are: (i) strategic expansion of continuous integration (CI) targeting CPUs, using GitHub Actions runners, and NVIDIA and AMD GPUs in pre-exascale systems, using self-hosted hardware; (ii) incremental reduction of memory leaks using sanitizers, (iii) incorporation of Docker containers for CI and reproducibility, and (iv) refactoring efforts to improve maintainability, testing coverage, and memory lifetime management. We quantify the value of these improvements by providing metrics to illustrate the shift towards a predictive, rather than reactive, sustainable maintenance approach. Our goal, in documenting the impact of these efforts on QMCPACK, is to contribute to the body of knowledge on the importance of research software engineering (RSE) for the sustainability of community HPC codes and scientific discovery at scale.

-
-
-
-
-
-
- -
-
-
-
-

Evidence-based practice (EBP) in software engineering aims to improve decision-making in software development by complementing practitioners' professional judgment with high-quality evidence from research. We believe the use of EBP techniques may be helpful for research software engineers (RSEs) in their work to bring software engineering best practices to scientific software development. In this study, we present an experience report on the use of a particular EBP technique, rapid reviews, within an RSE team at Sandia National Laboratories, and present practical recommendations for how to address barriers to EBP adoption within the RSE community.

-
-
-
-
-
-
- -
-
-
-
-

As social media platforms continue to shape modern communication, understanding and harnessing the wealth of information they offer has become increasingly crucial. The nature of the data that these platforms provide makes them an emerging resource for data collection, supporting research that ranges from measuring public sentiment about a particular societal trend to informing major policies drafted by governing agencies. This paper presents PULSE, a powerful tool developed by Decision Theater™ (DT) at Arizona State University in the United States, designed to extract valuable insights from Twitter. PULSE provides researchers and organizations with access to a curated dataset of public opinions and discussions across diverse research areas. Further, the tool uses various machine learning and data analysis algorithms to derive valuable insights on the subject under research. These insights are efficiently displayed in an interactive dashboard that assists researchers in drawing appropriate conclusions. The paper also illustrates the technical functionalities and visualization capabilities of the tool with a case study on Hurricane Laura.

-
-
-
-
- ------- - -## Short Papers - -
-
-
-
- -
-
- -
-
-

This paper introduces CACAO, a research software platform that simplifies the use of cloud computing in scientific research and education. CACAO's cloud automation and continuous analysis features make it easy to deploy research workflows or laboratory experimental sessions in the cloud using templates. The platform has been expanding its support for different cloud service providers and improving scalability, making it more widely applicable for research and education. This paper provides an overview of CACAO's key features and highlights use cases.

-
-
-
-
-
-
- -
-
-
-
-

Science gateways connect researchers to high-performance computing (HPC) resources by providing a graphical interface to manage data and submit jobs. Scientific research is a highly collaborative activity, and gateways can play an important role by providing shared data management tools that reduce the need for advanced expertise in file system administration. We describe a recently implemented architecture for collaborative file management in the Core science gateway architecture developed at the Texas Advanced Computing Center (TACC). Our implementation is built on the Tapis Systems API, which provides endpoints that enable users to securely manage access to their research data.

-
-
-
-
-
-
- -
-
-
-
-

High-fidelity pattern of life (PoL) models require realistic origin points for predictive trip modeling. This paper develops and demonstrates a reproducible method using open data to match synthetic populations generated from census surveys to plausible residential locations (building footprints) based on housing attributes. This approach presents promise over extant methods based on housing density, particularly in small neighborhood areas with heterogeneous land-use.

-
-
-
-
-
-
- -
-
-
-
-

Community resilience assessment is critical for the anticipation, prevention and mitigation of natural and anthropic disaster impacts. In the digital age, this requires reliable and flexible cyberinfrastructure capable of supporting research and decision processes along multiple simultaneous, interconnected concerns. To address this need, the National Center for Supercomputing Applications (NCSA) developed the Interdependent Networked Community Resilience Modeling Environment (IN-CORE) as part of the NIST-funded Center of Excellence for Risk-Based Community Resilience Planning (CoE), headquartered at Colorado State University. The Community App is a web-based application that takes a community through the resilience planning process, using IN-CORE analyses as the measurement science for community resilience. Complex workflows are managed by DataWolf, a scientific workflow management system, running jobs on the IN-CORE platform utilizing the underlying Kubernetes cluster resources. Using the Community App, users can run realistic and complex scenarios and visualize the results to understand their resilience to different hazards and enhance their decision-making capabilities.

-
-
-
-
-
-
- -
-
-
-
-

This paper examines the potential of containerization technology, specifically the CyVerse Discovery Environment (DE), as a solution to the reproducibility crisis in research. The DE is a platform service designed to facilitate data-driven discoveries through reproducible analyses. It offers features like data management, app integration, and app execution. The DE is built on a suite of microservices deployed within a Kubernetes cluster, handling different aspects of the system, from app management to data storage. Reproducibility is ensured by maintaining records of the software, its dependencies, and instructions for its execution. The DE also provides a RESTful web interface, the Terrain API, for creating custom user interfaces. The application of the DE is illustrated through a use case involving the University of Arizona Superfund Research Center, where the DE's data storage capabilities were utilized to manage data and processes. The paper concludes that the DE facilitates efficient and reproducible research, eliminating the need for substantial hardware investment and complex data management, thereby propelling scientific progress.

-
-
-
-
-
-
- -
-
-
-
-

The INTERSECT Software framework project aims to create an open federated library that connects, coordinates, and controls systems in the scientific domain. It features the Adapter, a flexible and extensible interface inspired by the Adapter design pattern in object-oriented programming. By utilizing Adapters, the INTERSECT SDK enables effective communication and coordination within a diverse ecosystem of systems. This adaptability facilitates the execution of complex operations within the framework, promoting collaboration and efficient workflow management in scientific research. Additionally, the generalizability of Adapters and their patterns enhances their utility in other scientific software projects and challenges.
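For readers less familiar with the underlying design pattern, the sketch below shows a generic Adapter in Python; the instrument classes and method names are hypothetical and are not drawn from the INTERSECT SDK itself.

```python
# Generic Adapter-pattern sketch: orchestration code depends only on a uniform
# interface, while adapters translate it into each backend's incompatible API.
# All names here are hypothetical and not part of the INTERSECT SDK.
from abc import ABC, abstractmethod

class InstrumentAdapter(ABC):
    """Uniform interface that coordination logic can rely on."""

    @abstractmethod
    def run(self, command: str) -> str: ...

class LegacySpectrometer:
    """A third-party system with its own, incompatible interface."""

    def execute_raw(self, payload: bytes) -> bytes:
        return b"ok:" + payload

class SpectrometerAdapter(InstrumentAdapter):
    """Wraps the legacy system so it satisfies the uniform interface."""

    def __init__(self, device: LegacySpectrometer) -> None:
        self._device = device

    def run(self, command: str) -> str:
        return self._device.execute_raw(command.encode()).decode()

def orchestrate(adapters: list[InstrumentAdapter]) -> None:
    # Complex operations are composed against the shared interface only.
    for adapter in adapters:
        print(adapter.run("calibrate"))

orchestrate([SpectrometerAdapter(LegacySpectrometer())])
```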

-
-
-
-
-
-
- -
-
-
-
-

Full-stack research software projects typically include several components and have many dependencies. New projects benefit from co-development of these components within a well-structured monolith. While this is preferred, over time this can become a burden to deploy in different contexts and environments. What we would like is to independently deploy components to reduce size and complexity. Maintaining separate packages, however, allows for developmental drift and other problems. So-called 'monorepos' allow for the best of both approaches, but not without their own difficulties. There is, however, almost no formal treatment of this particular dilemma in the literature. The technology industry has started using monorepos to solve similar challenges, but perhaps in the academic context we should be cautious not to simply replicate industry practices. This short paper invites the research software engineering (RSE) community into a discussion of the positives and negatives of structuring projects as monorepos of discrete packages.

-
-
-
-
-
-
- -
-
-
-
-

Single-page applications (SPAs) have become indispensable in modern frontend development, with widespread adoption in scientific applications. The process of creating a single-page web application development environment which accurately reflects the production environment isn’t always straightforward. Most SPA build systems assume configuration at build time, while DevSecOps engineers prefer runtime configuration. This paper suggests a framework-agnostic approach to address issues that encompass both development and deployment, but are difficult to tackle without knowledge in both domains.

-
-
-
-
-
-
- -
-
-
-
-

Documentation is a crucial component of software development that helps users with installation and usage of the software. Documentation also helps onboard new developers to a software project with contributing guidelines and API information. The INTERSECT project is an open federated hardware/software library to facilitate the development of autonomous laboratories. A documentation strategy using Sphinx has been utilized to help developers contribute to source code and to help users understand the INTERSECT Python interface. Docstrings as well as reStructuredText files are used by Sphinx to automatically compile HTML and PDF files which can be hosted online as API documentation and user guides. The resulting documentation website is automatically built and deployed using GitLab runners to create Docker containers with NGINX servers. The approach discussed in this paper to automatically deploy documentation for a Python project can improve the user and developer experience for many scientific projects.
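As a generic illustration of the pieces involved, the excerpt below shows a minimal Sphinx `conf.py` that pulls API documentation from docstrings; the project name, path, and theme are placeholders rather than the INTERSECT settings.

```python
# Minimal Sphinx conf.py excerpt for building API docs from docstrings.
# Project name, path, and theme are placeholders, not the INTERSECT configuration.
import os
import sys

sys.path.insert(0, os.path.abspath(".."))  # make the package importable for autodoc

project = "example-project"
extensions = [
    "sphinx.ext.autodoc",   # pull documentation out of docstrings
    "sphinx.ext.napoleon",  # parse NumPy/Google style docstring sections
    "sphinx.ext.viewcode",  # link rendered pages back to the source code
]
html_theme = "alabaster"
```

A reStructuredText page can then include a directive such as `.. automodule:: example_package.module` with `:members:`, and running `sphinx-build -b html docs docs/_build` produces the HTML that a CI job can deploy.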

-
-
-
-
-
-
- -
-
-
-
-

Collaboration networks for university research communities can be readily rendered through the interrogation of coauthorships and coinvestigator data. Subsequent computation of network metrics such as degree or various centralities offer interpretations on collaborativeness and influence and can also compose distributions which can be used to contrast different cohorts. In prior work, this workflow provided quantitative evidence for ROI of centralized computing resources in contrasting researchers with and without cluster accounts, where significance was found across all metrics. In this work, two similar cohorts, those with RSE-type roles at the university and everyone else, are contrasted in a similar vein. While a significantly higher degree statistic for the RSE cohort suggests its collaborative value, a significantly lower betweenness centrality distribution indicates a target for potential impact through the implementation of a centralized RSE network.
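For readers unfamiliar with the metrics, the toy example below computes degree and betweenness centrality on a small co-authorship graph with networkx and contrasts two cohorts; the graph and cohort labels are invented for illustration and are unrelated to the study's data.

```python
# Toy co-authorship network: contrast degree and betweenness centrality between
# an "RSE" cohort and everyone else. The edges and cohort labels are invented.
import networkx as nx

edges = [
    ("rse_1", "prof_a"), ("rse_1", "prof_b"), ("rse_1", "prof_c"),
    ("prof_a", "prof_b"), ("prof_c", "prof_d"), ("prof_d", "prof_e"),
]
G = nx.Graph(edges)

rse_cohort = {"rse_1"}
others = set(G.nodes()) - rse_cohort

degree = dict(G.degree())
betweenness = nx.betweenness_centrality(G)

def cohort_mean(metric: dict, cohort: set) -> float:
    return sum(metric[n] for n in cohort) / len(cohort)

print("mean degree      (RSE, others):", cohort_mean(degree, rse_cohort), cohort_mean(degree, others))
print("mean betweenness (RSE, others):", cohort_mean(betweenness, rse_cohort), cohort_mean(betweenness, others))
```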

-
-
-
-
- - diff --git a/pages/program/program.md b/pages/program/program.md index c1a285b..ef39358 100644 --- a/pages/program/program.md +++ b/pages/program/program.md @@ -11,6 +11,7 @@ set_last_modified: true - [Birds of a Feather]({{ site.baseurl }}/program/bofs/) - [Tutorials]({{ site.baseurl }}/program/tutorials/) +- [Papers]({{ site.baseurl}}/program/papers/) - [Workshops]({{ site.baseurl }}/program/workshops/) ## Program Timetable From 152fd9c1214fbeb1ccc581d1774f8b1f1c5385be Mon Sep 17 00:00:00 2001 From: "J.C. Subida" Date: Fri, 2 Aug 2024 11:43:27 -0500 Subject: [PATCH 04/10] Add program notebooks --- _data/menus/program.yml | 2 ++ _data/navigation.yml | 2 ++ pages/program/abstracts/notebooks.md | 39 +++++++++++++++++++++++ pages/program/abstracts/notebooks.tbd | 45 --------------------------- pages/program/program.md | 3 +- 5 files changed, 45 insertions(+), 46 deletions(-) create mode 100644 pages/program/abstracts/notebooks.md delete mode 100644 pages/program/abstracts/notebooks.tbd diff --git a/_data/menus/program.yml b/_data/menus/program.yml index 2e741dc..f87badf 100644 --- a/_data/menus/program.yml +++ b/_data/menus/program.yml @@ -4,6 +4,8 @@ link: program/ - name: Birds of a Feather link: program/bofs/ + - name: Notebooks + link: program/notebooks/ - name: Papers link: program/papers/ - name: Tutorials diff --git a/_data/navigation.yml b/_data/navigation.yml index 8ce1edd..693355e 100644 --- a/_data/navigation.yml +++ b/_data/navigation.yml @@ -9,6 +9,8 @@ link: program/ - name: Birds of a Feather link: program/bofs/ + - name: Notebooks + link: program/notebooks/ - name: Papers link: program/papers/ - name: Tutorials diff --git a/pages/program/abstracts/notebooks.md b/pages/program/abstracts/notebooks.md new file mode 100644 index 0000000..02df9d3 --- /dev/null +++ b/pages/program/abstracts/notebooks.md @@ -0,0 +1,39 @@ +--- +layout: page +title: Notebooks +description: +menubar: program +permalink: program/notebooks/ +menubar_toc: true +set_last_modified: true +--- + +## Community Resilience Research Using IN-CORE - Case Study with 2011 Tornado Event at Joplin, MO + +_Chen Wang, Jong Lee, Chris Navarro, Rashmil Panchani, Yong Wook Kim, Y-Lan Yang, Rob Kooper, Vismayak Mohanarajan, Wanting Lisa Wang, Lisa Watkins_ + +Community resilience research is essential for anticipating, preventing, and mitigating the impacts of natural and anthropogenic disasters. To support this research, the Center for Risk-Based Community Resilience Planning, funded by the National Institute of Standards and Technology (NIST), developed the measurement science and metrics that can help communities in planning, adapting and recovering from disasters. This measurement science is implemented on an open-source platform called the Interdependent Networked Community Resilience Modeling Environment (IN-CORE). On IN-CORE, users can run scientific analyses that model the impact of natural hazards and community resilience against these impacts. + +This Jupyter Notebook uses the Joplin, MO community and the historical 2011 EF-5 Tornado event as an example of how to use IN-CORE to analyze community resilience. The city of Joplin, Missouri, USA, was hit by an EF-5 tornado on May 22, 2011 (NIST Report). Note that IN-CORE supports various hazards including earthquake, tornado, tsunami, flood, and hurricane. 
+ +The notebook contains the following analyses: structural damage analysis on buildings, electric power network damage, building functionality, economic impact analysis on the community's economy, population dislocation analysis, housing household recovery analysis, and retrofit analysis on buildings. In addition, the notebook demonstrates the visualization of outputs from these analyses. + +Lastly, the core logic of this notebook is used to power the IN-CORE Community Resilience Playbook, an interactive guide for community resilience planning. It has been used in workshops with city planners and government officials, making it a valuable resource for resilience planning. + +------ + +## Using Radicals to Empower Budding Computational Chemists + +_Jacob States, Isaac Spackman, Shubham Vyas_ + +The integration of computational physical chemistry into undergraduate laboratories presents a unique opportunity for collaboration with the research software engineering field. To promote more efficient computational workflows and foster engagement among budding programmers in computational modeling, we present this notebook investigating small molecules with unpaired electrons (radicals). The CF3 radical has been extensively explored in the chemical literature owing to its importance in ozone depletion from CFCs (chlorofluorocarbons) and its unusual geometric structure, which deviates from the planar structure of the CH3 radical, despite the similar size of the fluorine atom and the hydrogen atom. Exploring trends along chemical groups is commonplace in the chemical literature, and as such we have created a notebook demonstrating the facile preparation and analysis of a simple experiment substituting the F atoms in the CF3 radical for other halogens in the same group (Cl, Br, I) in a combinatorial fashion. From a single Excel sheet, input files for the quantum modeling software ORCA can be reproducibly generated. Upon completion of the requested calculations, the meaningful data is systematically extracted from the produced log files. This method contrasts with traditional practices in undergraduate labs, in which students manually construct input files and scroll through log files to copy/paste data, and demonstrates a more efficient and reproducible alternative. The notebook not only serves as an educational tool but also acquaints future research software engineers with the specialized software developed by computational chemists. + +------ + +## Hawaiʻi Climate Data Portal API Demo + +_Jared McLean, Sean Cleveland_ + +The Hawaiʻi Climate Data Portal (HCDP) (available ) provides access to 30+ years of climatological data collected from sensor stations around the state of Hawaiʻi and gridded data products derived from these values. The HCDP is a publicly available web application and is backed by an API that is accessible to researchers on request. This notebook demonstrates some of the abilities and usage of the HCDP API and the data provided by it. The notebook will show the user how to retrieve and map sensor station metadata and values, retrieve gridded data products, produce timeseries of station and gridded data, and generate data packages for large amounts of data that can be downloaded directly or sent to the user's email.
+ +------ diff --git a/pages/program/abstracts/notebooks.tbd b/pages/program/abstracts/notebooks.tbd deleted file mode 100644 index a726d50..0000000 --- a/pages/program/abstracts/notebooks.tbd +++ /dev/null @@ -1,45 +0,0 @@ ---- -layout: page -title: Notebooks -description: -menubar: program -permalink: program/notebooks/ -menubar_toc: true -set_last_modified: true ---- - -## Extrapolation and Interpolation in Machine Learning Modeling with Fast Food and astartes - -_Jackson Burns, Kevin Spiekermann, and William Green_ - -Machine learning is a groundbreaking tool for tackling high-dimensional datasets -with complex correlations that humans struggle to comprehend. An important nuance -of ML is the difference between using a model for interpolation or extrapolation, -meaning either inference or prediction. This work will demonstrate visually what -interpolation and extrapolation mean in the context of machine learning using -astartes, a Python package that makes it easy to tackle in ML modeling. Many -different sampling approaches are made available with astartes, so using a very -tangible dataset - a fast food menu - we can visualize how different approaches -differ and then train and compare ML models. - ------- - -## Stylo2gg: Visualizing reproducible stylometric analysis - -_James Clawson_ - -For researchers using R to do work in stylometry, the Stylo package in R is -indispensable, but it also has some limitations. The Stylo2gg package addresses -some of these limitations by extending the usefulness of Stylo. Among other -things, Stylo2gg adds logging and replication of analyses, keeping necessary -files and introducing a systematic way to reproduce past work. With visualization -as its initial purpose, Stylo2gg also makes exploring stylometric data easy, -providing options for labeling, highlighting subgroups, and double coding data -for added legibility in black and white or in color. Finally, as hinted by the -name, the conversion of graphics from base R into Ggplot2 changes the style of -the output and introduces more options to extend analyses with many other -packages and addons. The reproducible notebook shown here, `01-stylo2gg.qmd` or -rendered in `01-stylo2gg.html`, walks through much of the package, including many -features that were added in the past year. - ------- diff --git a/pages/program/program.md b/pages/program/program.md index ef39358..663c490 100644 --- a/pages/program/program.md +++ b/pages/program/program.md @@ -10,8 +10,9 @@ set_last_modified: true ## Accepted Submissions - [Birds of a Feather]({{ site.baseurl }}/program/bofs/) +- [Notebooks]({{ site.baseurl }}/program/notebooks/) +- [Papers]({{ site.baseurl }}/program/papers/) - [Tutorials]({{ site.baseurl }}/program/tutorials/) -- [Papers]({{ site.baseurl}}/program/papers/) - [Workshops]({{ site.baseurl }}/program/workshops/) ## Program Timetable From 74f99e79cec235b0bb145ed0abe8101308e597b1 Mon Sep 17 00:00:00 2001 From: "J.C. 
Subida" Date: Fri, 2 Aug 2024 12:24:36 -0500 Subject: [PATCH 05/10] Add program talks --- _data/menus/program.yml | 2 + _data/navigation.yml | 2 + pages/program/abstracts/talks.md | 2043 +++++++++++++++++++++++++++++ pages/program/abstracts/talks.tbd | 414 ------ pages/program/program.md | 1 + 5 files changed, 2048 insertions(+), 414 deletions(-) create mode 100644 pages/program/abstracts/talks.md delete mode 100644 pages/program/abstracts/talks.tbd diff --git a/_data/menus/program.yml b/_data/menus/program.yml index f87badf..2b97f67 100644 --- a/_data/menus/program.yml +++ b/_data/menus/program.yml @@ -8,6 +8,8 @@ link: program/notebooks/ - name: Papers link: program/papers/ + - name: Talks + link: program/talks/ - name: Tutorials link: program/tutorials/ - name: Workshops diff --git a/_data/navigation.yml b/_data/navigation.yml index 693355e..51010e2 100644 --- a/_data/navigation.yml +++ b/_data/navigation.yml @@ -13,6 +13,8 @@ link: program/notebooks/ - name: Papers link: program/papers/ + - name: Talks + link: program/talks/ - name: Tutorials link: program/tutorials/ - name: Workshops diff --git a/pages/program/abstracts/talks.md b/pages/program/abstracts/talks.md new file mode 100644 index 0000000..2a91887 --- /dev/null +++ b/pages/program/abstracts/talks.md @@ -0,0 +1,2043 @@ +--- +layout: page +title: Talks +description: +menubar: program +permalink: program/talks/ +menubar_toc: true +set_last_modified: true +--- + +
+
+
+
+ +
+
+ +
+
+

Historically, US research software has predominantly been utilized within the country by domestic + researchers. However, recent years have seen a surge in international collaboration, with + research software playing a pivotal role. International users can represent a substantial user + base for some research software, and foreign engineers have huge potential to contribute + significantly to the US software community. As a result, it is crucial for RSEs and researchers + to recognize the importance of software internationalization and localization, and to acquire + the methodologies necessary for their effective implementation. This talk will offer guidance on + designing, developing, and testing internationalized research software, ensuring that it meets + the needs of a global audience in the future.

+

This talk will be structured from broad concepts to specific skills (i.e., Internationalization + → Localization → Translation) to present software design principles that can prepare + research software for global impact in the future.

+

The first section (~3 mins) will cover globalization/internationalization, focusing on the + product design perspective. This part will cover the concept of internationalization, some + regulations for product owners to keep in mind, product design principles, and potential costs, + with examples for demonstration. The audience will learn the importance of including + internationalization considerations at the proposal drafting stage, rather than leaving it as a + task during the development stage.

+

The second section (~4 mins) will transition to localization. The presenter will discuss the + meaning of localization and how the lack of localization can hinder the global promotion of US + research software. Examples will illustrate the steps of designing, developing, and testing + software localization. At the end of this section, the presenter will provide a checklist for + RSEs and researchers to refer to in future localization processes.

+

The third section (~4 mins) will focus on translation, a crucial component of localization. The + presenter will introduce software design and development principles for adding translation + capabilities, followed by a discussion of common translation tools that RSEs can use for popular + frontend and backend frameworks. The section will conclude with a focus on utilizing AI tools to + enhance translation quality.

+
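+To make the translation step concrete, here is a minimal sketch using only Python's standard gettext module (the `locale/` directory layout and the Spanish catalog below are hypothetical, and production tooling may differ):
+
+```python
+import gettext
+
+# Load a Spanish catalog from a hypothetical locale/ directory
+# (locale/es/LC_MESSAGES/messages.mo); fall back to the original
+# strings if no compiled catalog is installed.
+translation = gettext.translation(
+    "messages", localedir="locale", languages=["es"], fallback=True
+)
+_ = translation.gettext
+
+# Wrapping user-facing strings in _() keeps the code translatable
+# without changing program logic.
+print(_("Loading dataset..."))
+print(_("Analysis complete."))
+```
+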
+Overall, this talk aims to inspire the research software community to rethink software from an international perspective and to empower its members with the knowledge to promote U.S. research worldwide.
+
+
+
+
+
+
+ +
+
+ +
+
+
+Many academics feel comfortable wrangling and analyzing data in R, but have little to no experience working on the command line and may find job scheduling systems like SLURM intimidating. This can be a significant barrier to using high performance computing, which generally requires creating Bash scripts and submitting jobs via the command line. The {targets} R package provides many benefits to researchers, one of which is running steps of an analysis automatically as job requests on an HPC system, all from the comfort of R.
+
+The {targets} package allows for workflow management of analysis pipelines in R, where dependencies among steps are automatically detected. When a {targets} pipeline is modified and re-run, any steps (called “targets”) that do not need to be rerun are automatically skipped, saving compute time. By default, {targets} pipelines are launched in a “clean” R session, which enforces reproducibility (a blessing to some and a curse to others). It is relatively trivial to parallelize a {targets} workflow so that independent targets are run on parallel workers, either locally as multiple R sessions or using HPC or cloud computing resources. Users can define a controller that runs their pipeline on the HPC using multiple workers running as separate SLURM jobs (or PBS, SGE, etc.). It is also possible to define multiple controllers with different resources for different targets so that tasks with heavier computational needs are run with more CPUs, for example. All of this happens from the comfort of R, without users needing to manually create multiple R scripts and/or multiple SLURM submission scripts for each task.
+
+RSEs and HPC professionals can help enable this powerful combination of technologies in a few ways. At the University of Arizona, our group has created a template GitHub repository for a {targets} project that can run on the UA HPC either through the command line, where targets are run as SLURM requests, or with Open OnDemand, where targets are run in multiple R sessions. It includes code for a controller function that works with the UA HPC and documentation about how to modify the template and get it onto the HPC with `git clone`. We have previously run {targets} workshops that help researchers refactor their analysis scripts into {targets} pipelines. We hope to work with HPC professionals to increase awareness of {targets} as an option for R users to harness the power of cluster computing.
+
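+For readers who have not used a workflow tool, the skip-if-up-to-date idea can be sketched in a few lines of Python; this is a simplified make-style analogue with hypothetical file names, not how {targets} itself is implemented:
+
+```python
+from pathlib import Path
+
+def outdated(output: Path, inputs: list[Path]) -> bool:
+    """Re-run a step only if its output is missing or older than any input."""
+    if not output.exists():
+        return True
+    return any(src.stat().st_mtime > output.stat().st_mtime for src in inputs)
+
+raw = Path("data/raw.csv")          # hypothetical input
+cleaned = Path("data/cleaned.csv")  # hypothetical target
+
+if outdated(cleaned, [raw]):
+    # Imagine an expensive cleaning step here; in {targets} this body
+    # would be one target in the pipeline, and skipping is automatic.
+    cleaned.write_text(raw.read_text())
+else:
+    print("cleaned.csv is up to date; skipping")
+```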
+
+
+ +
+
+
+ +
+
+ +
+
+
+Building on discussions first started at the German RSE conference in 2023 (de-RSE23), a recent pre-print, *Foundational Competencies and Responsibilities of a Research Software Engineer*, identifies a set of core competencies required for RSEng and describes possible pathways of development and specialisation for RSEs. It is the first output of a group with broad interests in the teaching and learning of RSEng skills.
+
+With continuing growth in RSE communities around the world, and sustained global demand for RSEng skills, US-RSE24 presents an opportunity to align international efforts towards:
+
+* training the next generation of RSEs
+* providing high-quality professional development opportunities to those already following the career path
+* empowering RSE Leaders to further advocate for the Research Software Engineering needs and expertise in their teams, institutions, and communities.
+
+Therefore, we want to give an overview of what the group has been working on so far, discuss the aims of our future work, and invite members of the international RSE community to contribute and provide feedback. We particularly encourage members of regional groups focused on RSEng training and skills to attend and share their perspectives.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+The University Corporation for Atmospheric Research (UCAR) Software Engineering Assembly (SEA) was formed in 2005 to provide an informal meeting space, instructional content including tutorial series and seminars, and an evolving compilation of best practices for those staff and collaborators at the organization interested in software engineering. Over time, the SEA membership grew, events were regularly conducted, and in 2012 a yearly conference was established with a focus on scientific software engineering.
+
+Communities of practice like the SEA benefit from motivated members actively cultivating the organization and adding some formal structure and legitimacy. Unfortunately, staff turnover and budget (and thus time) constraints led to a gradual atrophying of SEA activity. While our yearly conference - eventually titled the Improving Scientific Software Conference - remained a robust fixture throughout, other offerings tapered to a nadir during the COVID pandemic. Soon after, the longtime chair of the SEA left the organization, and it appeared that the Assembly might sunset entirely.
+
+As the SEA was shrinking, US-RSE became a growing presence at national labs. When a new committee did eventually take over SEA governance, this presented an opportunity to align our Assembly with the principles and best practices being developed by the research software engineering community.
+
+This talk will describe our Assembly in its current state, the changes that have been made to modernize it thus far, and our goals for the future. Much of the focus has been, and will be, on building a community of practice through events like open discussions on best practices, but some of the more mundane challenges will also be described, such as revitalizing our web presence and ensuring collaboration instead of competition with peer groups within and outside of our organization. We will also give a brief overview of our Improving Scientific Software Conference, our efforts to modernize it (e.g., using Jupyter Notebooks for proceedings), and how we use the conference to drive interest in the SEA and vice versa. Finally, we will discuss some lessons learned about sustaining a long-running interest group, and mention some of the things we wish we had known at the start of this revitalization effort.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+Honeycomb is a template repository that standardizes best practices for building jsPsych-based tasks. It offers continuous deployment for use in research settings, at home, and on the web. The project's main aim is to improve the ability of psychiatry researchers to build, deploy, maintain, reproduce, and share their own psychophysiological tasks (“behavioral experiments”).
+
+Behavioral experiments are a useful tool for studying human behavior driven by mental processes such as cognitive control, reward evaluation, and learning. Neural mechanisms during behavioral tasks are often studied in the lab via simultaneous electrophysiological recordings. Registered participants may also be asked to concurrently complete the task at home, where connecting such specialized equipment is not feasible. Furthermore, online platforms such as Amazon Mechanical Turk (MTurk) and Prolific enable deployment of tasks to large populations simultaneously and at repeated intervals. Online distribution methods enable far more participation than what labs can handle in a reasonable amount of time.
+
+Honeycomb addresses the key challenge of using a single code base to deliver a task in each of these environments. The benefits of Honeycomb were first seen in an ongoing study of deep brain stimulation for obsessive compulsive disorder. Subsequent projects have included research on decision making processes for people with obsessive compulsive disorder as well as gameplay style differences between control, obsessive compulsive disorder, and attention-deficit/hyperactivity disorder patients. The Center for Computation and Visualization (CCV) additionally maintains a curated public library, termed BeeHive, of ready-to-use tasks.
+
+The project is open source and directly supported by the CCV at Brown University. It has been in active development since August of 2019 (currently version 3.4), with version 4 and 5 releases roadmapped. An ultimate goal of the project is to publish it as its own library to the node package manager (npm) registry.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+Reading computer program code and documentation written by others is, we are told, one of the best ways to learn the art of writing readable, intelligible, and maintainable code and documentation. This talk introduces the concept of software resurrection as a tool for learning from program code and documentation that are remote in time (e.g., 20 years old) and space (e.g., unfamiliar algorithms and tools). The software resurrection exercise requires a motivated learner to compile and test a historical release version of a well-maintained and widely adopted open source software package on a modern hardware and software platform. The learner develops fixes for the issues encountered during compilation and testing on the modern platform that could not have been foreseen at the time of the software's release. The exercise concludes by writing a critique, which provides an opportunity to critically reflect on the experience of maintaining the historical software. An illustrative example of this exercise, pursued on a version of the SQLite database engine released 20 years ago, shows that software engineering principles (or programming pearls) emerge during the reflective learning cycle of the software resurrection exercise.
+
+The concept of software resurrection is similar to the "learning by doing" methodology, which is based on experiential learning theory. Engaging with program code and documentation that are remote in time or space helps learners actively explore the experience of software maintenance. These experiences reveal the factors that contribute to the readability, intelligibility, and maintainability of program code and documentation.
+
+Prerequisites: This talk is aimed at students, researchers, and professionals who develop, support, and maintain computer software. The talk includes an illustrative example based on software written in the C programming language, so a basic understanding of C will be useful. Since the concept of software resurrection applies in general to the field of software engineering, attendees will still be able to understand the key ideas even if they do not have a background in C.
+
+Expected Outcomes: Attendees will learn about a novel method for teaching and learning software engineering principles by engaging with existing software code and documentation. The concepts described in this talk will allow attendees to view the impact of existing software development and documentation practices from the perspective of a software maintainer.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+Creating population estimates for the entire globe using machine learning is a challenging task. One challenge is gathering and combining vast amounts of global GIS data at high resolutions. Another challenge is processing the amount of complex GIS data required to make population estimation possible in a reasonable amount of time. Speed is important in research because of the need to iterate and evaluate the data for validity and accuracy. In this work, we present the challenges of taking research code from a Jupyter notebook and creating a cloud-optimized solution using infrastructure as code (IaC) to deploy a cluster in OpenStack. We show the code modifications for speed performance improvements, comparisons of running machine learning on multithreaded CPUs versus GPUs, and the architecture design for running a global dataset on a Kubernetes cluster.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+As both hardware and software become more prevalent in research computing, the user base of these systems has broadened considerably. Novice users from many backgrounds and at many stages of their careers are looking to make effective use of these resources while retaining focus on their domain work.
+
+Formal curricula for students of those domains may not have room for computational training. Non-student researchers face challenges making time to acquire the relevant expertise. Moreover, the parallelism of HPC systems adds complexity to an already demanding software development task. Consequently, large subsets of researchers have access to HPC resources without the technical skills to use them effectively.
+
+These challenges are familiar to the Research Software Engineering community.
+
+HPC Carpentry provides training solutions complementary to the research software engineering role, supporting effective use of novel, shared computing resources.
+
+HPC Carpentry workshops are modeled after those of The Carpentries and take place over one or two days, providing a hands-on mode of instruction where learners type along with instructors to acquire the basic skills necessary to get started on HPC systems. Learners are not expected to come away as experts, but instead with the "muscle memory" of how basic operations work on HPC systems, with a mental model of the shared HPC system and its resources, and with enough vocabulary to make self-directed training more accessible and effective.
+
+This talk will describe the current state of the HPC Carpentry project, our strategic development plan for the workshops, current challenges, and the lesson content that we develop, teach, host, and cite.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+Science gateways have emerged as a popular and powerful interface to computational resources for researchers. Most if not all of these science gateways now rely on container technology to improve portability and scalability while simplifying maintenance. However, this can lead to problems where the container image size grows as more domain-specific packages and libraries are needed for the tools deployed on these containers. This is particularly relevant in the case of JupyterHub-based gateways, where the Python virtual environments or Conda environments underlying the Jupyter kernels can often grow in size and number.
+
+For example, a JupyterHub gateway that I work on as part of the NSF-funded I-GUIDE institute required the installation of a large number of geospatial libraries, leading to Jupyter notebook container images approaching several gigabytes in size. To combat this, our team decided to integrate the CernVM File System (also known as CVMFS) with Kubernetes; CVMFS acts as a software distribution service and can provide software packages to the containers from a separate server.
+
+As a first step in this integration, we had to deploy our own CVMFS server on a separate virtual machine and load it with the packages needed for distribution. CVMFS itself has two main servers: the stratum 0 and the stratum 1. The stratum 0 is the main server for configuration and packages, while the stratum 1 acts as a mirror of the stratum 0. Following the deployment of the stratum 0 server, we were able to install the necessary Conda environments and modules. The I-GUIDE JupyterHub platform is deployed on a Kubernetes cluster using the Zero2JupyterHub recipes. In order to integrate the CVMFS server with this Kubernetes cluster, we installed a CSI (container storage interface) driver provided by the developers of CVMFS to connect the stratum server to the Kubernetes cluster. This enabled us to create the necessary storage class and persistent volumes in Kubernetes, which could then be mounted into the Jupyter notebook containers to serve the necessary Conda environments. Ultimately, this resulted in containers having their sizes reduced significantly, from multiple gigabytes to only half of a single gigabyte.
+
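+As a simplified sketch of the final wiring step (the storage class name, namespace, and size below are hypothetical, and the exact manifest depends on the CVMFS CSI driver in use), a read-only claim against a CVMFS-backed storage class can be created with the Kubernetes Python client and then mounted into the notebook pods:
+
+```python
+from kubernetes import client, config
+
+config.load_kube_config()  # or load_incluster_config() when running in-cluster
+core = client.CoreV1Api()
+
+# Hypothetical read-only claim against a CVMFS-backed storage class;
+# the notebook pods then mount this claim to see the Conda environments.
+pvc_manifest = {
+    "apiVersion": "v1",
+    "kind": "PersistentVolumeClaim",
+    "metadata": {"name": "cvmfs-software"},
+    "spec": {
+        "accessModes": ["ReadOnlyMany"],
+        "storageClassName": "cvmfs",
+        "resources": {"requests": {"storage": "1Gi"}},
+    },
+}
+core.create_namespaced_persistent_volume_claim(namespace="jupyterhub", body=pvc_manifest)
+```
+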
+In conclusion, science gateways are incredibly impactful for researchers, but they can take more effort to maintain than most realize. Containers and virtual machines make this easier, yet can contribute to their own issues by becoming bloated over time. These size issues can be resolved with CVMFS, making the container sizes roughly three to four times smaller than before.
+
+In this talk I will present our deployment design as well as our experience with the deployment process and the lessons learned.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+To tackle the problem of sustainably training and developing a workforce, SDSC has experimented over the past decade with various strategies to shape a seasonal internship program that has met and exceeded its original goal of research software developer workforce training. Using modern agile frameworks, a novel summer training program, and minimal resources, SDSC has supported over 200 interns in the past four years who have learned about and supported research software development. Come hear the internship program founders, Ryan Nakashima and Jenny Nguyen, share both unsuccessful and successful strategies used to build the SDSC software development internship program, and connect with them for follow-up discussions.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+Reproducibility of research that is dependent on software and data is a persistent and ongoing problem. In this talk, I invite the emerging research software engineering community to leverage the half century of specific knowledge offered by cybersecurity professionals. I demonstrate that the practical needs of cybersecurity engineering and research software engineering overlap significantly. In addition to enabling reliable reproducibility of research, I illustrate how well-understood cybersecurity tools enable independent verification of research integrity and increase the public trust in open science.
+
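+One simple, illustrative example of such a tool (a hypothetical sketch, not necessarily an approach taken in the talk): a checksum manifest over a project's artifacts lets anyone later verify that shared data and code are bit-for-bit identical to what was published.
+
+```python
+import hashlib
+import json
+from pathlib import Path
+
+def manifest(root: str) -> dict[str, str]:
+    """Map each file under root to its SHA-256 digest."""
+    digests = {}
+    for path in sorted(Path(root).rglob("*")):
+        if path.is_file():
+            digests[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
+    return digests
+
+# Publish this manifest alongside the artifacts; re-running it later and
+# diffing the output provides independent verification of integrity.
+print(json.dumps(manifest("artifacts"), indent=2))
+```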
+
+
+ +
+
+
+ +
+
+ +
+
+
+Johns Hopkins Applied Physics Laboratory (APL) is the U.S.'s largest university-affiliated research center, home to over 9,000 staff dedicated to making “Critical Contributions to Critical Challenges” for our various federal agencies. APL's Space Exploration Sector alone has designed, built, and operated over 70 spacecraft missions; developed hundreds of specialized instruments for yet more missions; and collectively has visited every planet in the solar system. Within the sector's Space Science Research branch resides one of the largest RSE organizations we are currently aware of: our very own Space Analysis & Applications group, a team of 60+ research software engineers that directly support our missions and the scientific research enabled by them. Our talk will explore the history, functions, and operation of this group as a means to examine a mature RSE organization and to share our insights and experience with those US-RSE colleagues developing and managing their own. Individual topics covered will include organizational structure, team composition, funding sources, work discovery, intake, and some brief visuals or demonstrations of the group's software products and the research we have enabled.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+Student opportunities are important for diversifying RSE and getting students hooked on research software engineering. Engaging with undergraduate and graduate students interested in scientific computing is both rewarding and beneficial to the RSE community, given there is not yet a clear academic curriculum nor career path to becoming a research software engineer. This talk will cover the macro aspects of RSE internships through the lens of SIParCS, a successful, long-running internship program, and deep-dive into the micro aspects of working with student RSEs by sharing experiences at DART, an open source project at the intersection of science and software.
+
+We'll give a lightning overview of the 17 years of SIParCS, the Summer Internships in Parallel Computational Science at the NSF National Center for Atmospheric Research, including background, history, and motivations. SIParCS provides opportunities for undergraduate and graduate students to gain hands-on experience in computational science, particularly focusing on high-performance computing, scientific computing, and data analysis. The program's goal is to develop and diversify the next generation of computational scientists and engineers by offering holistic mentorship, professional development, and the chance to work on cutting-edge projects alongside experienced researchers.
+
+DART, the Data Assimilation Research Testbed, has been fortunate to have various interns through SIParCS as well as part-time student RSE employees working year-round. We'll share our specific experiences, challenges, and triumphs working with student RSEs: what worked, what didn't, and how summer internship RSEs differ from year-round part-time student RSEs. Everyone is different; what motivates and incentivizes people varies from person to person and can change over time. People's time has value, and we want people spending that time on the most interesting and impactful thing they can be working on. Working with students requires a balance between getting quality work from them and the students finding benefit in this work and progressing their careers. We'll conclude with thoughts on future student interactions and possible community collaborations.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+AutoRA (the Automated Research Assistant) is a growing collection of Python packages for running fully automated psychological experiments online. It allows the user to automate the specification of experimental conditions, data collection, and theory derivation, cycling back to specifying new experimental conditions.
+
+One primary goal of the PI was to allow unaffiliated developers to contribute new methods for generating experimental conditions and new regression techniques for theory derivation. But taking the naïve monorepo approach would be too costly: either 1) testing all of the contributions for every change to the code, which would be prohibitively expensive since some contributions train neural networks as part of their execution and require hours to run, or 2) configuring the CI so that only relevant parts of the code are tested for each pull request, which would mean a high maintenance burden. Furthermore, since this work is about applying ML to experiments run on people, it is vital that every submission be ethically vetted before it can be part of the official release.
+
+Thus, one primary goal of our work was to allow for decentralized extensibility, so that contributors unaffiliated with the core team could easily innovate and share new functionality without incurring high centralized maintenance and testing costs. Another was to ensure that contributions could be vetted and included easily.
+
+We'll present how RSEs helped to establish a common interface based around a simple functional paradigm, with namespace packages spread across multiple repositories so that each contributor can own and be responsible for their own work, and how their contributions are integrated into the main package. We'll also look at how we enable contributors to start their work quickly using templates.
+
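+To make the namespace-package approach concrete, here is a simplified, hypothetical sketch (the package layout and function signature below are illustrative, not AutoRA's actual interface): a contributed package installs into a shared namespace, and a contribution is just a function that satisfies an agreed functional contract.
+
+```python
+# Hypothetical layout of a contributed package; note there is no
+# __init__.py at the top "autora_contrib/" level, which is what makes
+# it a PEP 420 namespace package shared across repositories:
+#
+#   autora-contrib-random-sampler/
+#   └── autora_contrib/
+#       └── random_sampler.py
+
+# autora_contrib/random_sampler.py
+import random
+from typing import Sequence
+
+def sample(conditions: Sequence[float], num_samples: int = 10) -> list[float]:
+    """A contribution is a pure function with an agreed signature:
+    take candidate conditions, return the ones to run next."""
+    return random.sample(list(conditions), k=min(num_samples, len(conditions)))
+```
+
+A user then installs the core package plus whichever contributed distributions they need, and every contribution appears under the same import prefix without any central repository having to host or test it.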
+
+
+ +
+
+
+ +
+
+ +
+
+
+Ecological Momentary Assessments (EMAs) are often used in the field of psychology to deliver multiple data collection instruments to study participants at several time points in the day. Because these are often used to study participants' current (at the time of receiving the notification) mood, activity, or immediate company, it is important that participants cannot anticipate notifications arriving at fixed times throughout the day. However, most traditional electronic data capture (EDC) systems require participant notification schedules to be pre-determined, with little to no room for sending random events individualized to a participant's environment (wake-up time, etc.). Given this problem, our team of RSEs has developed a cloud-first random EMA notification system that can serve thousands of participants multiple random EMA push notifications throughout the day. The system is capable of tracking user wake and sleep times, adapting to weekend or weekday modes, and being configured to work with different randomization logic and anchors (points around which to randomize). During development the team prioritized the use of proven architectural building blocks to maximize uptime, reduce cost, and speed up development. More importantly, the system was built to evolve hand in hand with the changing requirements from research stakeholders.
+
+In this talk we will go over how we built this system using Amazon Web Services event schedulers and low-cost serverless components, along with lessons learned from testing across various time zones and from compliance monitoring. We will look at the initial design choices, their limitations, and how they had to be adapted. Finally, we will go over how the solution integrates with existing commercial EDC products such as Care Evolution's MyDataHelps offering. The solution is currently open source and can be adapted by RSE teams for their own stakeholders' studies.
+
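+As a simplified, hypothetical sketch of the scheduling pattern (the Lambda and IAM ARNs below are placeholders, and the production design may differ), each participant can receive a one-time EventBridge schedule at a random offset inside their waking window:
+
+```python
+import json
+import random
+from datetime import datetime, timedelta
+
+import boto3
+
+scheduler = boto3.client("scheduler")  # Amazon EventBridge Scheduler
+
+def schedule_random_ping(participant_id: str, window_start: datetime, window_end: datetime) -> None:
+    """Create a one-time schedule at a random time within the participant's waking window."""
+    span = (window_end - window_start).total_seconds()
+    fire_at = window_start + timedelta(seconds=random.uniform(0, span))
+    scheduler.create_schedule(
+        Name=f"ema-{participant_id}-{fire_at:%Y%m%dT%H%M%S}",
+        ScheduleExpression=f"at({fire_at:%Y-%m-%dT%H:%M:%S})",
+        FlexibleTimeWindow={"Mode": "OFF"},
+        Target={
+            # Hypothetical push-notification Lambda and execution role.
+            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:send-ema-push",
+            "RoleArn": "arn:aws:iam::123456789012:role/ema-scheduler-role",
+            "Input": json.dumps({"participant": participant_id}),
+        },
+    )
+```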
+
+
+
+
+
+ +
+
+ +
+
+
+Fortran still occupies a significant fraction of the workloads at scientific computing centers, and many projects are still under active development by research software engineers (RSEs). In this talk I will describe how the National Energy Research Scientific Computing Center (NERSC) provides a holistic support structure for our users, and especially RSEs, that take advantage of Fortran.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+Computational notebooks provide a dynamic and interactive approach to scientific communication and significantly enhance the reproducibility of scientific findings by seamlessly integrating code, data, images, and narrative text. While notebooks are increasing in popularity among researchers, the traditional academic publishing paradigm often requires authors to extract elements from their notebooks into another format, losing the interactive and integrative benefits of the notebook format.
+
+In response to this evolving landscape, the Software Engineering Assembly (SEA) Improving Scientific Software Conference (ISS) has built a framework for paper submissions that accommodates Jupyter Notebooks and Markdown files. This approach is designed to enhance the transparency and accessibility of research, enabling authors to submit and share their work in a more dynamic and interactive format.
+
+In this presentation, we will talk about this deployed framework and how it can be easily adopted for future conferences and journals. The framework is built on top of open-source tools such as Jupyter Notebooks, JupyterBook, and Binder, and it uses GitHub Workflows for the automated build and deployment of submissions into a cohesive JupyterBook format. The presentation will cover the challenges and solutions encountered in implementing this framework, aiming for its application in future conferences. Additionally, we will share insights and experiences from developing and deploying this ecosystem, emphasizing how it can fundamentally change the way research is published, shared, and assessed within the open science and reproducibility paradigm.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+Research software engineers and research data curators face similar challenges in their efforts to support truly reproducible science. This talk anticipates a future in which the research software engineering and research data curation communities identify ways to align their respective efforts in promoting best practices. We present a project to develop specialized training for curating research software as one such opportunity.
+
+There is great interest on the part of both scientific communities and funding agencies to see that science research is reproducible, and that the products of research (both data and code) are "Findable, Accessible, Interoperable, and Reusable" (FAIR) now and into the future. To enhance reproducibility and FAIRness, funding agencies typically require that grant applicants file so-called Data Management Plans (DMPs); journals increasingly require that authors deposit code and data in a certified repository and link those artifacts to their publications; and community and institutional repositories work to ensure the quality of deposits by employing curators to examine, approve, and sometimes validate and even improve deposited code and data. The curation step is critical in maintaining viable research data lifecycles, and requires that curation workflows be implemented and that curation staff be funded and trained.
+
+The Data Curation Network (DCN) is a membership organization of institutional and non-profit data repositories whose vision is to advance open research by making data more ethical, reusable, and understandable. Its mission is to empower researchers to publish high-quality data in an ethical and FAIR way, to collaboratively advance the art and science of data curation by creating, adopting, and openly sharing best practices, and to support thoughtful, innovative, and inclusive data curation training and professional development opportunities.
+
+Last year the DCN, with project leadership from Duke University, obtained funding from the Institute of Museum and Library Services (IMLS) to create course curricula to train new curators in addressing curation of data types that require specialized knowledge and often warrant specific types of treatment and analysis (IMLS Award no. RE-252343-OLS-22). These specialized data types include geospatial data, scientific images, simulations and models, and, last but not least, code. The presenters are members of the cohort that developed specialized training for curating code. In our talk, we will give an overview of the topics covered in the pilot workshop, which include introductory topics such as dependency management, licensing, and documentation, as well as more advanced topics such as containerization and nondeterminism. We will also describe how these topics apply to research software developers wishing to create more reproducible and sustainable code.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+At the National Center for Supercomputing Applications (NCSA), our team of research software UIX (User Interface and User Experience) designers is dedicated to enhancing academic research applications through innovative design thinking and user-centric methodologies. Since our expansion in 2021, we have successfully collaborated on over 30 applications across diverse scientific domains, underscoring the growing demand for design as part of the research software development process.
+
+Our presentation will delve into the principles of design and design thinking, highlighting the distinction between UI and UX design (and why both are important) and describing our role as user advocates. We will outline our comprehensive design process, which includes discovery, ideation, and implementation phases, and the highly iterative and user-engaged form this process takes. We will give an overview of design workflow tasks such as user research, wireframing, rapid prototyping, high-fidelity design production, and usability testing, all facilitated by tools like Figma for collaboration with stakeholders and streamlined handoff and communication with developers.
+
+We will also showcase an example project to illustrate our project lifecycle, from initial requirements gathering to design audits and continuous process improvement. Our collaborative, cross-functional teams, which integrate designers, developers, and research scientists, are pivotal in producing high-quality, sustainable software. By prioritizing user experience, we ensure that our applications not only meet the technical needs of scientists and researchers but also provide delightful and efficient user interactions, fostering faster onboarding and greater adoption. Join us to explore how thoughtful design and interdisciplinary collaboration can lead to more effective and impactful research software solutions.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+In comparison to traditional software engineers, research software engineers (RSEs) often come to software engineering from scientific domains and may lack formal training. As the field continues to develop, direct educational pathways and formal training are likely to expand. Questions about best practices for training students and early-career RSEs must be answered to ensure new RSEs are able to contribute high-quality code. How should we train students with minimal experience to work on real-world projects? How do we bridge the gap between classroom learning and the expectations of writing reproducible code? What assumptions can be made about what students can (and should) learn themselves, and what do they need to be explicitly taught? The University of Chicago's Data Science Institute has been able to wrestle with these questions over the past three years by engaging with students via its experiential, project-based Data Science Clinic course.
+
+The Data Science Clinic is a useful setting for asking these questions and testing related hypotheses. The clinic works with three cohorts of students each year and typically has more than 50 students per cohort, providing a great environment for iterating on best practices. Preparing students who are interested in research software engineering careers is similar to training early-career RSEs who come from backgrounds with limited computer science education. Students in the clinic come from diverse backgrounds, with both master's and undergraduate students well represented, but most have taken only one university computer science course. This level of formal computer science background is similar to that of many RSEs coming from non-computer-science backgrounds. Additionally, code reproducibility and code quality are often lower priorities on both student projects and RSE projects.
+
+The Data Science Clinic has led to the important conclusion that relying on assumptions about student background knowledge leads to negative outcomes. When using experiential learning or project-based classes, it is easy to have a biased view of student understanding, since only the most confident and engaged students are likely to volunteer to participate. These advanced students can shoulder most of the load of a project and make quite a bit of progress, while allowing the students who would gain the most from direct instruction to coast undetected. In reality, many students lack robust mental models of computer operation, familiarity with essential terms and concepts, and an appreciation for software engineering best practices. Overcoming these challenges requires significant investment from experienced mentors.
+
+The purpose of this talk is to share these lessons and conclusions, discuss why these problems are so difficult, and consider next steps.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+Researchers support reproducibility and rigorous science by sharing and reviewing research artifacts: the documentation and code necessary to replicate a computational study (Hermann et al., 2020; Papadopoulos et al., 2021). Creating and sharing quality research artifacts and conducting reviews for conferences and journals are considered to be time-consuming and poorly rewarded activities (Balenson et al., 2021; Collberg & Proebsting, 2016; Levin & Leonelli, 2017). To simplify these scholarly tasks, we studied the work of artifact evaluation for a recent ACM conference. We analyzed artifact READMEs and reviewers' comments, reviews, and responses to surveys. Through this work, we recognized common issues reviewers faced and the features of high-quality artifacts. To lessen the time and difficulty of artifact creation and evaluation, we suggest ways to improve artifact documentation and identify opportunities for research infrastructure to better meet authors' and reviewers' needs. By applying the knowledge gleaned through our study, we hope to improve the usability of research infrastructure and, consequently, the reproducibility of research artifacts.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+Most research projects today involve the development of some research software. Therefore, it has become more important than ever to make research software reusable to enhance transparency, prevent duplicate efforts, and ultimately increase the pace of discoveries. The Findable, Accessible, Interoperable, and Reusable (FAIR) Principles for Research Software (or FAIR4RS Principles) provide a general framework for achieving that.1 Just like the original FAIR Principles2, the FAIR4RS Principles are designed to be aspirational and do not provide actionable instructions. To make their implementation easy, we established the FAIR Biomedical Research Software (FAIR-BioRS) guidelines, which are minimal, actionable, step-by-step guidelines that biomedical researchers can follow to make their research software compliant with the FAIR4RS Principles.3,4 While they are designed to be easy to follow, we learned that the FAIR-BioRS guidelines can still be time-consuming to implement, especially for researchers without formal software development training. They are also prone to user errors, as they require several actions with each new version release of the software.
+
+To address this challenge, we are developing codefair, a free and open-source GitHub app that acts as a personal assistant for making research software FAIR in line with the FAIR-BioRS guidelines.5,6 The objective of codefair is to minimize developers' time and effort in making their software FAIR so they can focus on the primary goals of their software. To use codefair, developers only need to install it from the GitHub Marketplace. By leveraging tools such as Probot and the GitHub API, codefair monitors activities on the software repository and communicates with the developers via a GitHub issue "dashboard" that lists issues related to FAIR compliance (updated with each new commit). For each issue, there is a link that takes the developer to the codefair user interface (built with Nuxt, Naive UI, and Tailwind) where they can better understand the issue, address it through an intuitive interface, and automatically submit a pull request with the necessary changes. Currently, codefair is in the early stages of development and helps with including essential metadata elements such as a license file, a CITATION.cff metadata file, and a codemeta.json metadata file. Additional features are being added to provide support for complying with language-specific standards and best coding practices, archiving on Zenodo and Software Heritage, registering on bio.tools, and much more to cover all the requirements for making software FAIR.
+
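+As an illustrative sketch of the kind of check involved (a local simplification, not codefair's actual implementation, which runs as a GitHub app and opens pull requests), verifying that a repository carries the essential metadata files can look like this:
+
+```python
+from pathlib import Path
+
+# Metadata files that the FAIR-BioRS guidelines expect at the repository root
+# (license file name may vary, e.g. LICENSE.md or LICENSE.txt).
+EXPECTED = ["LICENSE", "CITATION.cff", "codemeta.json"]
+
+def missing_metadata(repo_root: str) -> list[str]:
+    """Return the expected metadata files that are absent from a local clone."""
+    root = Path(repo_root)
+    return [name for name in EXPECTED if not (root / name).exists()]
+
+if __name__ == "__main__":
+    for name in missing_metadata("."):
+        print(f"Missing {name}: consider adding it to improve FAIR compliance")
+```
+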
+In this talk, we will highlight the current features of codefair, discuss upcoming features, and explain how the community can benefit from it and also contribute to it. We believe codefair is an essential and impactful tool for enabling software curation at scale and turning FAIR software into reality. The application of codefair is not limited to making biomedical research software FAIR, as it can be extended to other fields and can also provide support for software management aspects outside of the FAIR Principles, such as software quality and security. We believe this work is fully aligned with the US-RSE'24 topic of "Software engineering approaches supporting research." Conference participants will benefit greatly from this talk as they will learn about a tool that can enhance their software development practices. We will similarly benefit, as we are looking for community awareness of and contributions to the development of codefair, which is not currently supported through any funding but is the result of the authors' aim to reduce the burden of making software FAIR on fellow developers.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+Small to medium-sized research projects require increasingly sophisticated software stacks as the demand continues to grow for more high performance computing (HPC) resources and Kubernetes clusters for web-based applications. Frequently these smaller projects do not have funding for dedicated DevOps engineers and require their RSEs to take on that role. The effort required to manually provision each layer of this stack, from cluster node operating system configuration to application deployment, especially given the scarcity of RSEs, will become infeasible without force-multiplying innovations. Often these tasks are done early in the project and need to be re-learned for the next project. Additionally, the work that would normally draw on a dedicated DevOps engineer's expertise, such as securing these systems and upgrading them during the project, falls on the RSE, further reducing the often scarce time available to develop the application.
+
+We present the approach developed at NCSA to address this problem: a GitOps-based method of bootstrapping virtual computing resources and Kubernetes clusters for composable deployment of collaborative tools and services. Leveraging industry-standard software solutions, we provide a free and open source foundation upon which open science can flourish, with an emphasis on decentralized applications and protocols where possible. On top of this infrastructure we add a new layer called DecentCI, allowing an RSE to quickly get a complex system up and running, with shared access to data, forums for sharing ideas, private messaging, websites, and more.
+
+Building on the knowledge gained from many projects, we have created a set of recipes allowing a new project to be up and running in under 30 minutes. For example, in the case of Kubernetes, nodes will be created and configured, and clusters will be initialized with ingress controllers, secret management, storage classes, etc. (all configurable on a per-cluster basis). The deployed clusters can easily be upgraded by applying newer centrally managed modules, and new functionality added centrally can be rolled out to the clusters over time.
+
+During this talk we will discuss which tools are centrally managed and which are installed in each cluster. We will describe how an RSE can add their applications to the system and use well-understood Git workflows to deploy new applications and work with other RSEs on the project. The end goal is a system that is decentralized and empowers the RSE to get new applications to scientists faster and securely, to help with their research.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+Domain research, particularly in the life sciences, has become increasingly complex due to the diversity of types and amounts of data, along with the associated analytical methods and software. At the same time, researchers must hold the trustworthiness of the software tools they use in the highest regard. As with any new physical laboratory technique, researchers should test and assess any software they use in the context of their planned research objectives.
+
+For example, bioinformatics software developers and contributors to community platforms that host a variety of domain-specific tools, such as KBase (the DOE Systems Biology Knowledgebase) and Galaxy, should design their tools with consideration for how users can assess and validate the correctness of their applications before opening them up to the community.
+
+More attention should be placed on ensuring that computational tools offer robust platforms for comparing experimental results and data across diverse studies. Many domain tools suffer from inadequate documentation, limited extensibility, and varying degrees of accuracy in data representation. This lack of standardization in biological research, in particular, diminishes the potential for groundbreaking insights and discoveries while also complicating domain scientists' ability to experiment, compare findings, and confidently trust results across different studies.
+
+Through several examples of tools in the biology domain, we demonstrate the issues that can arise in these types of community-built, domain-specific applications. Despite their open-source nature, we note issues related to transparency and accessibility resulting in unexpected behaviors that required direct engagement with developers to resolve. This experience underscores the importance of deeper openness and clarity in scientific software to ensure robustness and reliability in computational analyses.
+
+Finally, we share several lessons learned that extend to research software in general and discuss suggestions for the community.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+Citing software in research is crucially important for many reasons, from reproducibility, to bolstering the careers of the research software engineers who worked on the code, to understanding the provenance of ideas. With the inclusion of DOIs on Zenodo and CRAN, and through integrations with GitHub, it is easier than ever to cite software as a first-order research object. However, there are no standards on what software should be cited in a paper, and authors often fail to cite software, or cite only well-known, charismatic, user-facing packages. There are few attempts at citing dependencies in particular.
+
+Here, we took citing software as an ethical research goal to its logical but unfeasible conclusion, citing all dependencies for software used in a research package, not only the top-level package itself. We present the open source tool we used for finding DOIs and CITATION.cff files within dependencies, and we discuss the implications of large numbers of citations within a paper that uses research software. In particular, we encourage the adoption of software bills of materials (SBOMs) for citing software, especially research software.
+
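+As a minimal, hypothetical sketch of such a dependency scan (not necessarily the tool presented in the talk), one can walk the installed distributions in a Python environment and report which ones ship a CITATION.cff file:
+
+```python
+from importlib.metadata import distributions
+
+def citable_distributions() -> dict[str, bool]:
+    """Map each installed distribution to whether it ships a CITATION.cff file."""
+    result = {}
+    for dist in distributions():
+        files = dist.files or []
+        has_citation = any(path.name == "CITATION.cff" for path in files)
+        result[dist.metadata["Name"]] = has_citation
+    return result
+
+if __name__ == "__main__":
+    for name, citable in sorted(citable_distributions().items()):
+        marker = "has CITATION.cff" if citable else "no citation metadata found"
+        print(f"{name}: {marker}")
+```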
+
+
+ +
+
+
+ +
+
+ +
+
+
+The Department of Energy Systems Biology Knowledgebase (KBase) is a community-driven research platform designed for discovery, knowledge creation, and sharing. KBase is a free, open access platform that integrates data and analysis tools into reproducible Narratives, provides access to scalable DOE computing infrastructure, and enables collaborative research and publishing of findable, accessible, interoperable, and reusable (FAIR) analyses.
+
+The KBase Narrative, the primary user interface for the KBase platform, is built on top of the Jupyter Notebook application. This interface enables platform users to access an array of wrapped tools (apps) that are used throughout the computational biology community, many of which produce their own data visualizations, which are also made available in Narratives. Within a Narrative, users can perform analyses, display interactive results, and record interpretations. In contrast to analysis workflows commonly used in bioinformatics, where researchers run individual tools (potentially hosted in different locations) sequentially, KBase allows users to build custom analytical pathways with notebooks, where all the tools and data are contained in a single place to enable reproducibility of analysis and provide data provenance.
+
+The platform is built around creating a welcoming user experience for users with a broad range of biology, bioinformatics, and computational expertise. To accomplish this, the KBase Narrative uses a GUI to generate code that runs analysis tools on DOE computing infrastructure. The output of these apps can be made more comprehensible for users in the form of interactive data visualizations, reports, or a simple list of data objects created by a tool. At the same time, the Jupyter Notebook allows more advanced users to supplement their app runs with custom Python code.
+
+Reproducibility is one of the major concerns of the system: Narratives and apps are versioned, and all data in the system receives a persistent unique ID, so analyses can be rerun to ensure that the same results are achieved. There is also an emphasis on tracking data through KBase and recording the transformations it undergoes through the provenance system; every data object has an immutable record of how it was produced, and this provenance chain can be followed forwards or backwards to view the original inputs or see the eventual output of a set of analyses.
+
+KBase also strives to provide FAIR data access. Recent work has focused on assigning credit to users who do analyses and publish data generated through KBase. As an early step toward creating a publishable Narrative, these documents can be exported to a "static" format providing a frozen snapshot detailing the analysis steps and data. Furthermore, DOIs (digital object identifiers) can be assigned to the static Narrative. These features can be used for reproducibility in a publication and to ensure credit for the work. Markdown cells provide a mechanism for users to extend the documentation automatically created by data provenance and add additional context to explain the background of the data imported into KBase.
+
+Together, these features make the KBase Narrative an application where analyses, results, and interpretations can be viewed and shared in a single interactive document. There are still many challenges ahead for KBase as it steps toward making publishable Narratives. These include updating the user experience as the platform expands and caters to a growing user base, and adapting to recent advances in large language models and their utility in biological data science. We welcome and encourage community feedback and discussion.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+The Research Engineering Group at The Alan Turing Institute started our RSE journey 8 years ago as a new team at a new institute. Founded in 2016, and without the usual constraints and advantages of RSE groups based in universities, the team has had to find its own path in a rapidly evolving Institute and field.
+
+Over this time the Turing has grown and evolved considerably as a national institute, adding AI to its initial data science remit and refining its science and innovation agenda from a broad programme-based approach to a more focussed challenge-led one. As the Institute has grown and evolved over the years, so has the Research Engineering Group, growing from 4 to 45 and going through several iterations of how we structure ourselves and operate as a team in order to support the Institute's research.
+
+As the team has evolved, we've expanded our range of research engineering roles to include those more focussed on data and computing, and we've built a sustainable career pathway for these roles within the Institute. Over the years we have refined our approach to recruitment, professional development and career progression to attract a diverse range of talent and support them in their professional journey, with a clear pathway from our Junior level training role to our Principal level team leadership role.
+
+This journey has been guided by our principles: transparency in leadership and decisions; diversity of talent, people and experience; supporting individuals in their career journey; and focus on our role as expert collaborators. As we've progressed along this journey ourselves, we've also looked to support others in doing so - both within the Turing as it has established other teams of related research infrastructure professionals, and across the wider RSE community as other organisations have looked to establish or scale their own similar teams.
+
+In this talk, we will share our journey and the lessons we've learned along the way. We hope that our story will be of interest both to those in leadership roles looking to establish or grow RSE teams at their own organisations and to team members within existing teams thinking about how they organise themselves, their work and their culture as they grow and evolve as a team.
+
+
+
+ +
+
+
+ +
+
+ +
+
+
+We present RainFlow as a case study in how collaborations between bench scientists and software developers can deliver impactful solutions to the reproducibility crisis in scientific research.
+
+RainFlow is a macOS desktop application developed for Reproducible Analysis and Integration of Flow-cytometry data. Flow cytometers are routinely used to collect rich biological data in clinical settings and research laboratories alike. Modern flow cytometers are extremely sensitive instruments that can measure the expression of 25-40 different proteins in millions of single cells in a matter of minutes. However, deriving actionable insights from this high-dimensional, high-volume data is hindered by the lack of reproducible analysis techniques.
+
+Lack of reproducibility affects two aspects of the research data pipeline. First, technical noise in the sensitive instrumentation can confound accurate protein signal measurement during data collection. Second, variations in the analytical choices made by individual researchers can confound reproducibility during data analysis.
+
+We developed RainFlow in an effort to automate the process of data cleaning and analysis for flow cytometry experiments. First, to decouple technical noise from true biological variation, we developed custom machine learning pipelines which reproducibly correct technical noise in the signal, as well as produce a quality score for each sample. The quality score can then be used to select for high-quality samples before integrating several batches of data together for downstream analysis. Second, to aid researchers in making good analytical choices and recording every analytical choice, we packaged the algorithms into a user-friendly macOS desktop application called RainFlow.
+
+RainFlow was specifically designed to be accessible to researchers with little to no coding ability, i.e., researchers who have “bench skills” for experimental data collection but not necessarily computational data analysis skills. RainFlow takes the researcher step by step through transforming raw flow-cytometry data into cleaned, batch-normalized, quality-controlled data ready for integration. In addition, it automatically records every data decision taken by the researcher during the analysis process and exports the parameters for easy sharing. At every step, the researcher is able to visualize the effect of the machine learning algorithmic corrections on the data distribution. Helpful informational guides are provided to explain what each individual step or algorithm does, and how best to select the required analytical parameters. Additionally, we sought to automate parameter selection as much as possible, so that fewer total decisions relied on manual expertise.
+
+This talk will focus on the lessons learnt during the development of RainFlow, which we hope will be more broadly applicable for the research software engineering community. RainFlow was released in the Apple App Store in May 2024 and is available for free download.
+
+
+
+ +
+
+
+ +
+
+ +
+
+

Research Software Engineering (RSE) covers a wide spectrum of people who fall somewhere in between + domain + research science and software engineering. While this makes the community highly inclusive, it can + be + difficult for some to know if they qualify as an RSE or not and hesitate to engage. In this talk I + will + share my personal journey from research in software engineering (SER) to RSE.

+

As someone who was never formally a software engineer in the classic sense but a researcher using + software engineering methods in domain science, I never felt like I had any particular identity. + Upon + first hearing the term “RSE”, I immediately identified. However, over the next two years of slowly + engaging with the community - including attending US-RSE’23 - I was still hesitant to see myself as + one + as my journey and position looked different than most of who I was seeing. It wasn’t until + attendance in + a recent Dagstuhl seminar that brought together SERs and RSEs that I was able to debate my + insecurities + first-hand and settle into my identity.

+

Throughout my experiences I have met a wide array of different types of RSEs. Each coming from their + own + backgrounds, skill sets, job titles, daily practices, team composition, career priorities, and + challenges. Many of these types which I have yet to see well-represented or understood in the + community. + In my talk I will not only share my personal experiences, but also highlight several examples of + diverse + types of people who identify as RSEs in order to provide a broader representation to the community + and + encourage anyone on the edges as I once needed that they do belong. +

+
+
+
+ +
+
+
+ +
+
+ +
+
+

In the spirit of this year's theme, we will present the past, present, and possible future of RSEs + at the National Center for Supercomputing Applications (NCSA), which was founded in 1986 as + one of the original five centers in NSF’s Supercomputing Centers program. While High + Performance Computing (HPC) was the center's initial emphasis, software was also a key part + of NCSA's work from the start, ebbing and flowing over time with a number of broad reaching + applications, early insights into areas such as applied AI, and the need to support UIX within + research software This led to the growth of RSEs at NCSA to a body of 50 or so RSEs today + supporting scores of projects across every scientific domain, identifying common needs, and + through that building larger more sustainable software frameworks.

+

During the early years of the Center, the Software Development Group was formed and it + quickly began to produce a number of globally impactful software packages for the community + such as NCSA Telnet, Iperf, HDF (Hierarchical Data Format), Habanero and other tools. This + work continued and in 1993, NCSA released NCSA Mosaic, the first wide-spread graphical web + browser that directly led to Netscape, Internet Explorer, and Spyglass, and NCSA httpd, which + led to Apache httpd that in turn drove 90% of web servers at its peak. Though all were built + around enabling the use of supercomputers during the growth of the internet, they all also had + an enormous broader impact with the general public. During this period, NSF funded NCSA + (and likewise our sibling centers as part of the Supercomputing Centers Program) through a + "block grant" model that supported the majority of activity at NCSA; funding was ~$35M per + year. The funding model was a key to success since it allowed NCSA staff to more freely + explore ideas and thus we saw the significant contributions NCSA made. In 1997, that changed + as NSF shifted from the block grant model to funding efforts through a set of independently + awarded grants for specific work. This resulted in software developers being scattered across + smaller groups that supported less traditional users who did not need HPC, leaving the Center, + or supporting others software on HPC resources after the Center took a much more HPC + support and hardware focus across a chain of large NSF efforts such as TeraGrid, XSEDE, and + Blue Waters.

+

The subsequent evolution of RSEs at NCSA had a very grassroots beginnings when a handful + of these small groups developing software decided to join forces: rather than competing with + each other in terms of collaborations, grants, and staff, they instead worked together, jointly + pursued funding, shared resources, and added greater security to all by having a larger portfolio + of collaborators and projects. The initial coalition was founded on a charter that prioritized trust + for improved efficiency. It emphasized respecting the PI’s role on projects and refraining from + interfering in another group’s project unless invited. The coalition also committed to supporting + each other if one group experienced a shortfall in projects. Over time other groups also joined + and through that, software had a larger voice enabling it to push for changes such as more + efficient hiring practices, support for green cards, standing up more flexible on-prem cloud + based resources to support interactive web services and data sharing, adoption of the RSE title + as an official campus title, and a recognized career path. Today software exists as a top + level directorate within the NCSA organization. This talk will walk through the key changes + during the evolution of NCSA's RSE role in a manner that can be leveraged by other RSEs + starting new groups. +

+
+
+
+ +
+
+
+ +
+
+ +
+
+

Have you ever asked, "did this output use the right version of the inputs and code?”, "what software + does + this program require to execute again?”, "how can I convert this pile-of-scripts to a containerized + workflow?”, "how does the ancestry of these two outputs differ?", "who is using my software + library?”, + or a similar question. All these are example questions that computational provenance can help + answer. + Computational provenance is the process by which a certain computational artifact was generated, + including its inputs (e.g., data, libraries, executables) and the computational provenance of those + inputs, for example, the figure on the right.

+

How to collect computational provenance? We could ask the application developers to emit this kind + of + data. That approach requires a herculean effort to get all applications to comply. We could require + the + user to use workflow systems that explicitly declare the inputs and outputs of every node. This + approach + shifts the compliance burden onto the user. If the user mis-specifies the workflow, it may still + execute + but the provenance would be wrong. The "holy-grail” would be to collect provenance data at the + system-level without modifying any application code and not needing superuser privileges or harming + performance.

+

Almost all prior attempts at unprivileged system-level provenance collection used ptrace syscall, + which + asks the Linux kernel to switch to the tracing process every time the tracee process executes a + system + call (like how strace binary works). Ptrace-based tracers meet most of the technical requirements + but + are prohibitively slow. Our recently accepted work (Grayson et al. 2024) observed a geometric mean + of + traced runtime 1.5x for CARE (Janin et al. 2014), 2.5x for RR (O'Callahan et al. 2017), and 3x for + ReproZip (Chirigati et al. 2016) over the untraced runtime. Ptrace involves a context-switch from + the + tracee process to the tracer process and back every system call, of which there could be thousands + per + second. Each context switch causes scheduler overhead and clears caches (especially the translation + lookaside buffer).

+

We propose PROBE (Provenance for Replay OBservation Engine), a tool that collects system-level + provenance using library interpositioning (Curry 1994), also known as the LD_PRELOAD trick. Library + interpositioning happens all within the same virtual address space, which does not involve any + additional context switching. We have a working research prototype operating this way.

+

Another possible reason system-level provenance hasn't caught on is the lack of downstream tooling. + We + are developing several consumers of provenance including a graphical viewer, an automatic + containerization tool, an environment "diff”, and a software citation generator. Furthermore, we + export + our provenance to Process Run RO Crate format (Leo et al. 2023), so that it can be interoperable + with + other provenance consumers.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

Life sciences research is increasingly requiring researchers to do more difficult tasks, as datasets + are + becoming larger and more complex and new statistical methods are being advanced. Researchers are + consequently needing to create and use more research software tools to manage data and analyses. The + majority of researchers in life sciences are lagging behind on the computational skills that they + need + to stay on the cutting edge of modern research. Most are self-taught, as computational skills are + mostly + still not being taught in the curriculum. As these are challenging skills to teach oneself, the + process + is difficult and leads to gaps or inaccuracies in knowledge. Additionally, most researchers do not + have + the time to become software engineering experts while doing all of their other necessary tasks.

+

Our group at the University of Arizona is addressing this problem by helping life sciences + researchers + increase their computational skills through training and collaboration. We are a small group of data + scientists and research software engineers embedded in a division that includes departments for + agriculture, plant and animal sciences, and environmental science. We devote a substantial amount of + our + time teaching researchers in the division through a variety of programming. We develop curriculum + and + hold workshops, workshop series, learning groups, and lab-level trainings on a variety of + intermediate + topics on good software practices, programming libraries, and version control. We also teach a lot + of + people one-on-one. We approach teaching in an inclusive and accessible way, and hold the philosophy + that + almost everyone is capable of learning these skills. We build community among our research + community, + connecting folks who are isolated in their labs or departments. We have also discussed with many + people + what paths there are to move into research software engineering as a career.

+

The second part of our group's approach is to have devoted practitioners collaborate with life + sciences + researchers. We have advanced skillsets that researchers cannot have themselves because we devote + our + focus on learning skills and new tools to develop software, advance data management, and improve + reproducibility. By teaching researchers when possible and doing the necessary when they cannot, we + enable research to be done that could not be otherwise. We are able to do this because we only help + with + the research of others and do not have our own research program. Everyone in our group also has + domain + expertise in life sciences fields, and so are more familiar with those fields' challenges, data + types, + and language. We also have strong communication skills that are needed for excellent collaborative + work. + Lastly, our collaboration success comes from being a small and flexible group embedded in the domain + unit.

+

There are some challenges to how our group is helping improve scientific software use and creation + in + life sciences. Our approach is slow and not very scalable because we are working with individuals or + small groups. It can also be difficult for us to track our impact, even with gathering a diverse set + of + data and information about how we are helping others. We are making a substantial difference in the + research that our institution's life sciences researchers are able to do.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

In 2022 the Princeton Research Software Engineering group, in collaboration with Human Resources, + established a multi-level career path job family for Research Software Engineers (RSEs) at Princeton + University. Expanding on the existing "Research Software Engineer” and "Senior Research Software + Engineer” roles, the new job family creates a structured career ladder that includes roles for six + individual contributors (Associate RSE, RSE I, RSE II, Senior RSE, Lead RSE, Principal RSE), two + working + managers (Lead RSE, Principal RSE), and three leadership positions (Associate Director, Director, + and + Senior Director). This formally establishes guidelines for defining and differentiating between RSE + roles, enabling equitable hiring of RSEs at substantially different experience levels, and + establishing + promotional pathways for RSEs employed at Princeton University.

+

In this talk we will describe how the vision for the career path originated, the process behind + defining + the roles and grades within the career ladder, and how we bridged gaps in technical understanding + with + administrative partners who were unversed in the role of Research Software Engineers. By minimizing + technical jargon to ‘standardize' job descriptions, roles were able to be defined with essential + requirements that allowed for proper compensation review and enabled the model to be effective for + broader use across campus departments. Finally, we will discuss important lessons learned from the + model + creation process through its implementation and use as we have successfully hired, reviewed, + promoted, + and retained RSEs at Princeton.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

This talk will look at some containers that are actively used in research computing. It will try and + examine how easy they are to dissect and understand as software engineering artifacts. The talk will + aim + to provoke everyone to think about what good practices and guidance the RSE community might put + forward + around the use and role of containers.

+

In many ways containers provide an elegant solution to ensuring reproducibility and portability of + codes. Each layer in a container has a unique hash that ensures the full stack of a container is + defined + unambiguously. Containers can carry with them a deep set of software dependencies that help simplify + the + challenge of making a portable code. Container repositories and publication services make containers + findable and easily cloned for shared use. These are all unambiguously valuable features. However, + it is + not uncommon to come across containers in active use that contain incredibly expansive layers, so + that + they in effect encompass entire operating system distributions. Often in such containers it is not + clear + what is key to an application and what is more of an expedient, included to enhance short-term + productivity.

+

In this talk I will dissect a few large containers and examine what structure is or isn't present + and + how their formulation sits with regard to traditional software engineering practices. In particular + the + talk will look through a lens that channels Edsger Dijkstra's motivation for promoting structured + programming. Djikstra argued persuasively that programs should not only be functional but they + should + also strive to be comprehensible and digestible to a "a slow-witted human being with a very small + head". + It is interesting to look at some containerized software through this lens. For the RSE community + especially some very effective and expedient practices in container publication and software + distribution can appear to be in tension with other software engineering design ideas around + modularity + and composability.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

The sustainability of scientific software is crucial for advancing research. In the complex world of + scientific software development, it is essential to understand the diverse factors that influence + sustainability. From the health of the software community to the robustness of engineering + practices, each element plays a pivotal role in the long-term viability of a project. This talk, + presented by the Center for Open-Source Research Software Stewardship and Advancement + (CORSA), focuses on the diverse definitions of sustainability within the scientific software + community, its attributes, and the metrics used to measure and enhance it.

+

The Center for Open-Source Research Software Stewardship and Advancement (CORSA), a new community of + practice, aims to address the long-term sustainability of scientific and + research software by fostering collaboration among stakeholders, facilitating partnerships with + open-source foundations, and educating the community regarding approaches to the + stewardship and advancement of open-source software. CORSA is part of a larger initiative + funded by the U.S. Department of Energy (DOE) called the Next Generation Scientific Software + Technologies (NGSST) project, which includes stakeholders from a broad cross-section of the + scientific computing and research software community.

+

In this talk, we will provide a brief history of the NGSST project and the objectives of DOE to + create sustainability pathways for open-source scientific software. We will then discuss the key + issues that CORSA plans to address to facilitate scientific software's long-term stewardship. + These include the development of metrics and metric models that help projects assess and + understand their position in the landscape of sustainability efforts. The talk will draw on + information gathered from previous CORSA workshops and existing literature and research into + this topic, including the types of sustainability metrics identified as crucial by the community. In + particular, we will explore:

+

● Definitions of Sustainability: Understand the various ways the community defines + sustainability in the context of scientific software. + ● Attributes of Sustainability: Identify the key attributes that the community values, such + as community health, engineering practices, and funding stability. + ● Metrics for Measuring Sustainability: Discuss the different metrics and models that + help projects assess their sustainability, including how these metrics are developed and + applied. + ● Capturing and Using Metrics: Explore methods for capturing these metrics and + practical strategies for using them to improve sustainability.

+

Our goal is to create a community of practice where we can collaborate to curate, share, and + disseminate information and guidance that will strengthen and sustain the research and + scientific software community in the long term.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

Combining prose, code, and outputs in a single artifact, computational notebooks are + exceptionally valuable instruments in any context where 'research' and 'software' intersect. + However, the same features that make notebooks such effective tools also result in unique + issues that need to be addressed to ensure they can fulfill their full potential for the wider + community of software and research practitioners. One of the biggest challenges with + computational notebooks is ensuring that a notebook can be run by people other than its + author(s), on computation environments, and/or at different times in the future after its creation, + an ability often known as computational reproducibility. While this is a general problem affecting + any context where notebooks (or indeed, any computational artifact) are used, these concerns + also represented a concrete issue for the computational notebooks submission track at the + US-RSE conference, affecting both authors and reviewers alike.

+

If reviewers are not able to run notebooks for the submissions they're reviewing, they'll likely be + unable to evaluate the submission based on its full intended functionality; or, they might try to + fix + the issues preventing the notebook from being run (missing dependencies, incompatible + versions, etc), which results in extra work, frustration, and/or less consistency across multiple + reviewers. Even when authors try their best to provide resources for reproducing a valid + computational environment in which their submission can be run (such as documentation, + packaging/environment metadata, etc), the lack of an automated way to test and a documented + standard for the computational environment that will be used limits their ability to validate their + resources (and, therefore, estimating how likely it is that their notebooks will run as expected + during review) before finalizing their submission. As the program subcommittee responsible for + notebooks at US-RSE’24, a vital part of our role has been to streamline the submission and + review process to enable both authors and reviewers to concentrate on their respective duties. + Additionally, given the added technical complications unique to notebooks, any solution that + required unsustainable amounts of extra work on our side would also not be feasible to adopt. + This talk will provide an overview of the workflow we developed for US-RSE’24 to help dealing + with these issues, as well as lessons learned on what worked well and what didn’t. Built using + open-source and/or freely available tools such as repo2docker, GitHub Actions, and + Binder, the infrastructure provides a set of automated checks that authors can enable to + test the repository before submission, based on the same standardized tools, specifications, + and computational environment available to reviewers.

+

Beyond the specific context of this year’s conference, we structured this talk to be relevant and + appealing for a broad audience of RSEs, especially (but not limited to) those interested in + computational notebooks, Continuous Integration and Development (CI/CD), and the challenges + and tradeoffs associated with designing workflows to be usable at all levels of prior experience. +

+
+
+
+ +
+
+
+ +
+
+ +
+
+

Leading a collaborative data science or research software engineering (RSE) team in an + academic environment can have many challenges including institutional infrastructure, funding, + and technical expertise. Even in the most challenging environment, however, leading such a + team with inclusive practices can be rewarding for the leader, the team members, and + collaborators. We describe nine leadership and management practices that are especially + relevant to the dynamics of such teams and an academic environment: ensuring people get + credit, making tacit knowledge explicit, establishing clear performance review processes, + championing career development, empowering team members to work autonomously, learning + from diverse experiences, supporting team members in navigating power dynamics, having + difficult conversations, and developing foundational management skills. Active engagement in + these areas will help those who lead data science or RSE groups – whether faculty or staff, + regardless of title – create and support inclusive teams. +

+
+
+
+ +
+
+
+ +
+
+ +
+
+

Research software is critical for scientific advancement and, like all software, is susceptible to + being targeted by malicious actors and misuse alike, meaning that security is an important + quality of research software. Implementing and evaluating security is a complex and + ever-evolving process. However, poor research software security could result in the sabotage of + data, hardware, or research findings. Proper security implementation requires security + knowledge and expertise that many research software stakeholders do not have, resulting in + more burden placed on the limited bandwidth of security resources and personnel. Therefore, it + is important to identify methods of improving methods of research software security without + increasing demand for limited security resources.

+

To improve the security of research software, we propose introducing security concepts, such as + threat modeling, to RSEs so they can be involved in ongoing security efforts. At its root, threat + modeling is the process of creating a model of a system or piece of software that is used to + theorize both potential attacks and countermeasures to prevent them. Threat modeling is a + low-cost, effective way to supplement security efforts, improve security posture, and create + cleaner software architecture. While difficult to automate, threat modeling has a host of tools + available to make it easier to perform with less required security expertise compared to other + security activities. RSEs are prime candidates for threat modeling because of their expertise in + both the research domain and in software engineering.

+

To establish a baseline for how RSEs view security, we replicated a security culture survey + originally focused on open-source software. This survey contains questions along six + dimensions: Attitude, Behavior, Competency, Governance, Subjective Norms, and + Communication. In aggregate, these six dimensions describe the security culture of the RSE + community. In addition to measuring the current security culture, we exposed participants to + three vignettes depicting security events. In the summary of these vignettes, we explained how + threat modeling could have been used to prevent or diagnose malicious or accidental damages + before they occurred.

+

We recruited 96 US and German RSEs for the survey. Our initial results show a generally + positive security culture in the RSE community. Respondents perceived all cultural dimensions + positively, except for Governance, which represents security expertise, policies, and + implementation. The respondents also responded positively to threat modeling. They saw the + value of threat modeling and thought it would fit nicely into their existing development + processes. Respondents also indicated they would need additional training to effectively threat + model and were interested in receiving this training.

+

We are using the data from the survey and vignettes to create resources that educate RSEs on + threat modeling practices that can be incorporated into their existing development processes. + We will use this talk to 1) present our findings to the US-RSE community, 2) gather feedback on + the security resources we are developing for RSEs, and 3) promote dialogue about involving + RSEs in security efforts.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

Since the term research software engineer (RSE) was coined over a decade ago, the field has enjoyed + rapid + growth with the establishment of RSEs groups at labs and universities, professional societies, and + conferences and workshops. Today, RSEs worldwide make impactful contributions to science and + engineering + through excellence in software, but we believe the best is yet to come. RSEs represent an emerging + profession, one that continues to develop its identity, values, and practices (Sims 2022). There is + a + growing body of literature around who RSEs are and the future of the field, with many works written + by + RSEs themselves. Concurrently, it is also important to consider how RSEs relate to other professions + within research organizations. RSEs regularly interact with staff from a diverse range of + backgrounds, + including domain researchers and engineers, computing facility and IT professionals, data + scientists, + technical librarians, and managers and HR specialists. When we examine this organizational context, + we + are led to ask many important questions. How non-RSE allies can best support RSEs? How can we create + a + supportive ecosystem in which RSEs will thrive? How do we integrate RSEng with allied professions to + achieve mutual success? In this talk, we consider the case of RSEs and software engineering + researchers + (SERs). Both SE academics and practitioners have a common interest in improving the quality of + software + and its production (Stol and Fitzgerald 2018). While CSE software development has historically + received + little attention from mainstream software engineering, RSEs have been successful in building bridges + between the two worlds. We believe the SE research community should work more closely with RSEs and + serve their needs. Based on our experiences, which include three of the authors participating in a + recent Dagstuhl workshop on this topic, we discuss (1) SE-related needs that RSEs report having, (2) + what SE researchers can do to address those needs, and (3) how to foster productive relationships + between RSEs and SERs.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

At Sandia National Laboratories, computational modeling and simulation is ubiquitous + across the labs’ diverse missions. Computational models—that is, digital + representations of physical systems and/or phenomena and their behaviors— are + regularly developed and provide empirical justification to critical mission decisions; this + spans workflows, scripts, and notebooks that drive simulations as well as the complex + software stacks underneath them. As the number and variety of models continues to + grow, however, our limited ability to maintain and govern them becomes a bottleneck to + further productivity improvements. They are created in a highly manual process, may be + created in duplicate, lost because of personnel changes, or deteriorate over time due + ever-changing computing environments.

+

Researchers and engineering analysts often lack the time, resources, and/or skills to + build sustainable models and to make them discoverable. RSEs and allied professionals + can play an important role in encouraging the adoption of better practices, but to affect + enduring change, we must go even further: to realize a culture of sharing, collaboration, + and reusability around modeling, we need software and organizational infrastructures + that can support that culture.

+

For these reasons, we are building the Engineering Common Model Framework + (ECMF), a platform for computational model sustainment at Sandia. ECMF will enable + the automated evaluation of models over time and ensure that models created at + Sandia are discoverable and ready to be revisited, extended, and reused. We have + demonstrated the capability of virtually air-gapped automated execution in a + containerized environment and have a prototype, user-friendly frontend where users + can submit models, schedule model executions, and monitor model status. In our talk, + we will discuss our current and planned capabilities, review our lessons learned, and + discuss the role of RSEs in the present and future of software and data stewardship. +

+
+
+
+ +
+
+
+ +
+
+ +
+
+

In the field of research software engineering, Large Language Models (LLMs) have emerged as + powerful tools for enhancing coding practices. This presentation, "Leveraging LLMs for Effective + Coding," delves into the practical applications of LLMs in automating and improving various + aspects of the development process. By providing some empirical evidence from our firsthand + experience using LLMs for software development, we explore how LLMs can significantly + augment a developer's toolkit, making what are often time-consuming tasks more efficient and + Reliable.

+

Automated test generation, for instance, not only speeds up the testing process but also + ensures a more comprehensive coverage, leading to robust software products. Similarly, + leveraging LLMs for code review can preemptively identify potential issues, optimizing code + quality before it reaches human reviewers. Furthermore, the ability of LLMs to generate and + update documentation in tandem with code changes addresses one of the most common + challenges in software development, maintaining accurate and helpful documentation. + Beyond these key areas, the presentation also touches upon additional use cases where LLMs + can make a significant impact, including debugging assistance, code implementation, and code + refactoring, among others. We discuss effective ways for integrating LLMs into the development + workflow, emphasizing the importance of clear communication, context provision, and iterative + refinement. Ethical considerations, particularly in addressing potential biases and ensuring + responsible use, are also explored.

+

Join us as we navigate the practicalities of incorporating LLMs into coding practices, aiming to + inspire developers to harness these AI tools for more efficient, high-quality software + development. +

+
+
+
+ +
+
+
+ +
+
+ +
+
+

Science gateways provide an easy-to-use computational platform for research and educational purposes, + abstracting underlying infrastructure complexities while promoting an intuitive interface. In the + last + 15 years, quite a few mature science gateway frameworks and Application Programming Interfaces + (APIs) + have been developed fostering distinct communities and strengths that meet a diverse set of needs. + Examples such as HUBzero, Tapis, Galaxy, and OneSciencePlace are well-sustained science gateway + frameworks that create production quality gateways that facilitate collaborative workspaces. These + gateways enhance the research process by democratizing access to computational resources and + supporting + users in their exploration of research. Researchers benefit from streamlined access to various + resources, such as high-performance computing (HPC) systems, data repositories, and specialized + software + tools. The shared workspaces enable collaborative projects, facilitating communication and + cooperation + across different disciplines. Interdisciplinary collaboration is crucial to addressing many grand + scientific challenges such as climate modeling, genomics, or materials sciences. The standardized + environments these gateways provide promote data sharing and set the stage for the reproducibility + of + computational experiments, a cornerstone in science. For research software engineers, engagement + with + science gateways offers numerous advantages. These frameworks provide standardized interfaces and + mechanisms to interact with software libraries and tools, streamlining the development process and + ensuring compatibility. This reduces development time and complexity, allowing engineers to focus on + each community's unique requirements without dealing with low-level technical details. Automated + deployment features supported by many gateways further ease the process. Beyond the technical + benefits + above, engaging within a science gateway framework also means engaging with a larger community of + developers and users. This collaborative environment leads to shared knowledge, rapid issue + resolution, + and the opportunity to participate in joint development efforts. Continuous user feedback from + researchers using the tools allows continuous improvement, ensuring the software evolves to meet + evolving user needs. From a professional development perspective, active participation in science + gateway frameworks exposes engineers to cutting-edge computational methodologies, cloud computing + principles, and big data techniques. This both enhances their skills and keeps them up-to-date with + the + latest technological advancements. Furthermore, experience with science gateways and the relevant + tech + stacks being used, can open up career opportunities in academia and industry, given the growing + demand + for expertise in these areas. In summary, science gateway frameworks play a pivotal role in + accelerating + scientific research, providing enhanced accessibility and collaboration. For research software + engineers, these frameworks offer a rich environment for skill development, collaboration, and + career + advancement. As scientific research increasingly relies on collaborative, data-intensive approaches, + the + role of science gateways will continue to expand in the research ecosystem.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

In the rapidly evolving field of Artificial Intelligence and Machine Learning (AI/ML), the journey + from innovative research to scalable, robust deployment is fraught with challenges. This + presentation delves into the critical lessons learned from our experiences in navigating this + complex transition, offering insights that are vital for researchers and practitioners alike. + The initial phase of any AI/ML project is marked by excitement and potential. However, we + quickly learned the importance of grounding this enthusiasm with practical considerations, + particularly the early and thorough definition of metrics and benchmarks. This foundational step, + often overlooked, became the cornerstone of our project's success, enabling us to evaluate research + solutions effectively and pivot our strategies as needed.

+

Another significant hurdle we encountered was the transition from the exploratory and often + chaotic environment of Jupyter notebooks to the structured realm of development-ready code. + The ability of our research engineers to write modular and reproducible Python code was + instrumental in bridging the gap between research findings and development, highlighting the + necessity of coding best practices in the research phase.

+

Deployment presented its own set of challenges, notably the infamous "It works on my + computer" syndrome. Our solution was a strategic embrace of Docker images, which not only + streamlined our deployment process but also ensured consistency and reliability across different + environments. This approach, coupled with a focus on cache optimization, significantly reduced + deployment headaches.

+

Perhaps the most profound lesson was the value of rapid prototyping. Moving swiftly from + concept to an end-to-end solution, even if imperfect, provided multiple benefits. It accelerated + our learning about the problem space, facilitated iterations based on real-world feedback, and + improved stakeholder engagement by providing a tangible product for demonstration. This + approach also forced us to make critical decisions about technology investments and workflow + design, laying a foundation for continuous improvement.

+

This talk aims to share these insights and more, exploring strategies that can help bridge the + often daunting gap between AI/ML research and impactful deployment. Join us to learn how to + navigate this transition effectively, ensuring that your projects are not only innovative but also + ready for the real world. +

+
+
+
+
diff --git a/pages/program/abstracts/talks.tbd b/pages/program/abstracts/talks.tbd deleted file mode 100644 index d8e3680..0000000 --- a/pages/program/abstracts/talks.tbd +++ /dev/null @@ -1,414 +0,0 @@ ---- -layout: page -title: Talks -description: -menubar: program -permalink: program/talks/ -menubar_toc: true -set_last_modified: true ---- - -
-
-
-
- -
-
- -
-
-

The growing recognition of research software as a fundamental component of the scientific process has led to the establishment of both Open Source Program Office (OSPO) as a Research Software Engineering (RSE) groups. These groups aim to enhance software engineering practices within research projects, enabling robust and sustainable software solutions. The integration of an OSPO into an RSE group within a university environment provides an intriguing fusion of open source principles and research software engineering expertise. The utilization of students as developers within such a program highlights their unique contributions along with the benefits and challenges involved.

-

Engaging students as developers in an OSPO-RSE group brings numerous advantages. It provides students with valuable experience in real-world software development, enabling them to bridge the gap between academia and industry. By actively participating in open source projects, students can refine their technical skills, learn industry best practices, and gain exposure to collaborative software development workflows. Involving students in open source projects enhances their educational experience. They have the opportunity to work on meaningful research software projects alongside experienced professionals, tackling real-world challenges and making tangible contributions to the scientific community. This exposure to open source principles and practices fosters a culture of innovation, collaboration, and knowledge sharing.

-

This approach also raises questions. How can the objectives and metrics of success for an academic OSPO-RSE group be defined and evaluated? What governance models and collaboration mechanisms are required to balance the academic freedom of researchers with the community-driven nature of open source? How can the potential conflicts between traditional academic practices and the open source ethos be effectively addressed? How can teams balance academic commitments with project timelines? These questions highlight the need for careful consideration and exploration of the organizational, cultural, and ethical aspects associated with an OSPO acting as an RSE group within a university.

-

Leveraging student developers in an OSPO-RSE group also presents challenges that need careful consideration. Students may have limited experience in software engineering practices, requiring mentoring and guidance to ensure the quality and sustainability of the research software they contribute to. Balancing academic commitments with project timelines and expectations can also be a challenge, necessitating effective project management strategies and clear communication channels. Furthermore, the ethical considerations of involving students as developers in open source projects must be addressed, ensuring the protection of intellectual property, respecting licensing requirements, and maintaining data privacy.

-

The involvement of students as developers within an OSPO-RSE group offers valuable benefits. The effective integration of students in this context requires thoughtful planning, mentorship, and attention to ethical considerations. This talk will examine the experience of the Open Source with SLU program to explore the dynamic role of student developers in an OSPO-RSE program and engage in discussions on best practices, challenges, and the future potential of this distinctive approach to research software engineering within academia.

-
-
-
-
-
-
- -
-
-
-
-

Conda is a multi-platform and language agnostic packaging and runtime environment management ecosystem. This talk will briefly introduce the conda ecosystem and the needs it meets, and then focus on work and enhancements from the past 2 years. This includes speed improvements in creating packages and in deploying them; work to establish and document standards that broaden the ecosystem; new support for plugins to make conda widely extensible; and a new governance model that incorporates the broader conda community.

-

The conda ecosystem enables software developers to create and publish easy to install versions of their software. Conda software packages incorporate all software dependencies (written in any language) and enable users to install software and all dependencies with a single command. Conda is used by over 30 million users worldwide. Conda packages are available in both general and domain specific repositories. conda-forge is the largest conda-compatible package repository with over 20,000 available packages for Linux, Windows, and macOS. Bioconda is the largest domain specific repository with over 8,000 packages for the life sciences.

-

Conda also enables users to set up multiple parallel runtime environments, each running different combinations and versions of software, including different language versions. Conda is used to support multiple projects that require conflicting versions of the same software package.

-
-
-
-
-
-
- -
-
-
-
-

Alchemical free energy methods, which can estimate relative potency of potential drugs by computationally transmuting one molecule into another, have become key components of drug discovery pipelines. Despite many ongoing innovations in this area, with several methodological improvements published every year, it remains challenging to consistently run free energy campaigns using state-of-the-art tools and best practices. In many cases, doing so requires expert knowledge, and/or the use of expensive closed source software. The Open Free Energy project (https://openfree.energy/) was created to address these issues. The consortium, a joint effort between several academic and industry partners, aims to create and maintain reproducible and extensible open source tools for running large scale free energy campaigns.

-

This contribution will present the Open Free Energy project and its progress building an open source ecosystem for alchemical free energy calculations. It will describe the kind of scientific discovery enabled by the Open Free Energy ecosystem, approaches in the core packages to facilitate reproducibility, efforts to enhance scalability, and our work toward more community engagement, including interactions with both industry and with academic drug discovery work. Finally, it will discuss the long-term sustainability of the project as a hosted project of the Open Molecular Software Foundation, a 501(c)(3) nonprofit for the development of chemical research software.

-
-
-
-
-
-
- -
-
- -
-
-

This talk presents research on the prevalence of research software as academic research output within international institutional repositories (IRs). This work expands on previous research, which examined 182 academic IRs from 157 universities in the UK. Very low quantities of records of research software were found in IRs and that the majority of IRs could not contain software as independent records of research output due to the underlying Research Information System (RIS) platform. This has implications for the quantities of software returned as part of the UK's Research Excellence Framework (REF), which seeks to assess the quality of research in UK universities and specifically recognises software as legitimate research output. The levels of research software submitted as research output have declined sharply over the last 12 years and this differs substantially to the records of software contained in other UK research output metadata (e.g. https://gtr.ukri.org). Expanding on this work, source data from OpenDOAR, a directory of global Open Access Repositories, were used to apply similar analyses to international IRs in what we believe is the first such census of its kind. 4,970 repositories from 125 countries were examined for the presence of software, along with repository-based metadata for potentially correlating factors. It appears that much more could be done to provide trivial technical updates to RIS platforms to recognise software as distinct and recordable research output in its own right. We will discuss the implications of these findings with a focus on the apparent lack of recognition of software as discrete output in the research process.

-
-
-
-
-
-
- -
-
-
-
-

Numerical research data is often saved in file formats such as CSV for simplicity and getting started quickly, but challenges emerge as the amount of data grows. Here we describe the motivation and process for how we moved from initially saving data across many disparate files, to instead utilizing a centralized PostgreSQL relational database. We discuss our explorations into the TimescaleDB extension, and our eventual decision to use native PostgreSQL with a table-partitioning schema, to best support our data access patterns. Our approach has allowed for flexibility with various forms of timestamped data while scaling to billions of data points and hundreds of experiments. We also describe the benefits of using a relational database, such as the ability to use an open-source observability tool (Grafana) for live monitoring of experiments.

-
-
-
-
-
-
- -
-
-
-
-

Accurately capturing the huge span of dynamical scales in astrophysical systems often requires vast computing resources such as those provided by exascale supercomputers. The majority of computational throughput for the first exascale supercomputers is expected to come from hardware accelerators such as GPUs. These accelerators, however, will likely come from a variety of manufacturers. Each vendors uses its low-level programming interface (such as CUDA and HIP) which can require moderate to significate code development. While performance portability frameworks such as Kokkos allow research software engineers to target multiple architectures with one code base, adoption at the application level lags behind. To address this issue in computational fluid dynamics, we developed Parthenon, a performance portable framework for block-structured adaptive mesh refinement (AMR), a staple feature of many fluid dynamics methods. Parthenon drives a number of astrophysical and terrestrial plasma evolution codes including AthenaPK, a performance portable magnetohydroynamics (MHD) astrophysics code based on the widely used Athena++ code. Running AthenaPK on Frontier, the world’s first exascale supercomputer, we explore simulations of galaxy clusters with feedback from a central supermassive black hole. These simulations further understanding of the thermalization of feedback in galaxy clusters. In this talk we present our efforts developing performance portable astrophysics codes in the Parthenon collaboration and our experience running astrophysics simulations on the first exascale supercomputer. LA-UR-23-26938

-
-
-
-
-
-
- -
-
- -
-
-

To address the lack of software development and engineering training for intermediate and advanced developers of research software, we present INnovative Training Enabled by a Research Software Engineering Community of Trainers (INTERSECT). INTERSECT, funded by NSF, provides expert-led training courses designed to build a pipeline of researchers trained in best practices for research software development. This training will enable researchers to produce better software and, for some, assume the role of Research Software Engineer (RSE).

-

INTERSECT sponsors an annual RSE-trainer workshop, focused on the curation and communal generation of research software engineering training material. These workshops connect RSE practitioner-instructors from across the country to leverage knowledge from multiple institutions and to strengthen the RSE community. The first workshop, held in 2022 laid the foundation for the format, structure, and content for the INTERSECT bootcamp, a multi-day, hands-on, research software engineering training event.

-

INTERSECT brings together RSE instructors from U.S. institutions to exploit the expertise of a growing RSE community. In July 2023, INTERSECT will sponsor a week-long bootcamp for intermediate and advanced participants from U.S. institutions. We will use an existing open-source platform to make the INTERSECT-curated material from the bootcamp available. The public availability of the material allows the RSE community to continually engage with the training material and helps coordinate the effort across the RSE-trainer community.

-

In this talk we will introduce the INTERSECT project. We will discuss outcomes of the 2023 bootcamp, including curriculum specifics, lessons learned, and long term objectives. We will also describe how people can get involved as contributors or participants.

-
-
-
-
-
-
- -
-
-
-
-

Machine learning models, specifically neural networks, have garnered extensive recognition due to their remarkable performance across various domains. Nevertheless, concerns pertaining to their robustness and interpretability have necessitated the immediate requirement for comprehensive methodologies and tools. This scholarly article introduces the "Adversarial Observation" framework, which integrates adversarial and explainable techniques into the software development cycle to address these crucial aspects.

-

Industry practitioners have voiced an urgent need for tools and guidance to fortify their machine learning systems. Research studies have underscored the fact that a substantial number of organizations lack the necessary tools to tackle adversarial machine learning and ensure system security. Furthermore, the absence of consensus on interpretability in machine learning presents a significant challenge, with minimal agreement on evaluation benchmarks. These concerns highlight the pivotal role played by the Adversarial Observation framework.

-

The Adversarial Observation framework provides model-agnostic algorithms for adversarial attacks and interpretable techniques. Two notable methods, namely the Fast Gradient Sign Method (FGSM) and the Adversarial Particle Swarm Optimization (APSO) technique, have been implemented. These methods reliably generate adversarial noise, enabling the evaluation of model resilience against attacks and the training of less vulnerable models.

-

In terms of explainable AI (XAI), the framework incorporates activation mapping to visually depict and analyze significant input regions driving model predictions. Additionally, a modified APSO algorithm fulfills a dual purpose by determining global feature importance and facilitating local interpretation. This systematic assessment of feature significance reveals underlying decision rules, enhancing transparency and comprehension of machine learning models.

-

By incorporating the Adversarial Observation framework, organizations can assess the resilience of their models, address biases, and make well-informed decisions. The framework plays a pivotal role in the software development cycle, ensuring the highest standards of transparency and reliability. It empowers a deeper understanding of models through visualization, feature analysis, and interpretation, thereby fostering trust and facilitating the responsible development and deployment of AI technologies.

-

In conclusion, the Adversarial Observation framework represents a crucial milestone in the development of trustworthy and dependable AI systems. Its integration of robustness, interpretability, and fairness into the software development cycle enhances transparency and reliability. By addressing vulnerabilities and biases, organizations can make well-informed decisions, improve fairness, and establish trust with stakeholders. The significance of the framework is further underscored by the pressing need for tools expressed by industry practitioners and the lack of consensus on interpretability in machine learning. Ultimately, the Adversarial Observation framework contributes to the responsible development and deployment of AI technologies, fostering public trust and promoting the adoption of AI systems in critical domains.

-
-
-
-
-
-
- -
-
-
-
-

Research software plays a crucial role in advancing research across many domains. However, the complexity of research software often makes it challenging for developers to conduct comprehensive testing, which leads to reduced confidence in the accuracy of the results produced. To address this concern, developers have employed peer code review, a well-established software engineering practice,to improve the reliability of software. However, peer code review is less prevalent in research software than in open-source or traditional software domains. This presentation addresses this topic by describing a previous investigation of peer code review in research software. Then it concludes with a description of our current work and ways for interested people to get involved.

-

In our previous study, we interviewed and surveyed 84 developers of research software.The results show research software teams do perform code reviews, albeit without a formalized process, proper organization, or adequate human resources dedicated to conducting reviews effectively. In the talk, we will describe the results in more detail. The application of peer code review holds promise for improving the quality of research software, thereby increasing the reliability of research outcomes. Additionally, adopting peer code review practices enables research software developers to produce code that is more readable, understandable, and maintainable.

-

This talk will then briefly outline our current work to engage interested participants. Our current work focuses on peer code review processes as performed specifically by Research Software Engineers (RSEs). The research questions we aim to address in this ongoing study are: RQ1: What processes do RSEs follow when conducting peer code review?; RQ2: What challenges do RSEs encounter during peer code review?; and RQ3: What improvements are required to enhance the peer code review process for RSEs?

-

To answer these questions, we plan to conduct the following phases of the project:Phase 1: Surveying RSEs to Gain Insights into Peer Code Review Practices; Phase 2: Conducting Interviews and Focus Groups with RSEs to Explore Peer Code Review Experiences; Phase 3: Observational Study of RSE Peer Code Review

-

There are numerous places for members of the US-RSE community to get involved in our research. We will highlight these opportunities in the talk.

---

Princeton University’s central Research Software Engineering (RSE) Group is a team of research software engineers who work directly with campus research groups to create the most efficient, scalable, and sustainable research codes possible in order to enable new scientific and scholarly advances. As the Group has grown, it has evolved in numerous ways, with new partners across academic units, new partnership models and operational procedures, and a reshuffled internal organization. In the summer of 2023, the RSE Group further evolved by incorporating, for the first time, two formal programs for RSE interns and fellows.

We present an experience report for the inaugural RSE summer internship and fellowship programs at Princeton University. These two programs, separate but concurrently held during the summer of 2023, represented our first formal attempt to introduce currently enrolled students to the RSE discipline in a structured setting with assigned mentors and well-defined projects. The projects varied widely in nature, spanning academic units in mathematics, social sciences, machine learning, molecular biology, and high energy physics. The interns and fellows were exposed to a diverse range of RSE programming languages, software packages, and software engineering best practices.


The two programs, with eight total student participants, further spanned in-person and remote work, undergraduate and graduate students, and multiple continents. We report on the experience of the interns, fellows, and mentors, including lessons learned and recommendations for improving future programs.

---

In codes used to simulate multi-physics hydrodynamics, it is common for variables to reside on different parts of a mesh, or on different, but related, meshes. For example, in some codes all variables may reside on cell centers, while in others, scalars may reside on cell centers, vectors on cell faces and tensors on cell corners, etc. Further, different methods may be used for the calculation of derivatives or divergences of different variables, or for the interpolation or mapping of variables from one mesh to another. This poses a challenge for libraries of 3D physics models, where the physical processes have dependencies on space, and therefore, mesh dependency. For such libraries to be able to support a general set of multi-physics hydrodynamics host codes, they must be able to represent the physics in a way that is independent of mesh-related details. To solve this problem, we present a Multi-Mesh Operations (MMOPS) library for the mesh-agnostic representation of 3D physics.


MMOPS is a light-weight C++ abstraction providing an interface for the development of general purpose calculations between variables of different user-defined types, while deferring the specification of the details of these types to be provided by the host code. As an example, consider three variables, a, b, c representing a vector and two scalar quantities residing on the cell corners, cell centers and cell corners of a mesh, respectively. MMOPS provides a `class mapped_variable` to represent these variables, for which the host code provides an arbitrary compile-time tag indicating where the variable data resides on the mesh, and a mapping function method used to indicate how to map the variable from one part of the mesh to another, using tag dispatching under the hood. This way, we can perform operations using `a`, `b`, `c`, namely `mapped_variable` instantiations representing a, b, c, such as `c(i) = a(c.tag(), i, dir) + b(c.tag(), i)` where, `i` is an index representing the ith cell on `c`’s mesh, and `dir` is an index representing the desired directional component of vector `a`. In general, if either `a` or `b` have a different mesh representation than `c`, then they get mapped to `c`’s mesh using the mapping functions provided by the host code when constructing `a` and `b`, which can be different. In the above example, since `a` is on the same mesh as `c`, it doesn’t get mapped but instead is directly accessed, and since `b` lives on a different mesh than `c`, it gets mapped to `c`’s mesh, i.e. to cell corners.
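To make the interface concrete, here is a small Python analogue of the tag-plus-mapping idea described above (MMOPS itself is a C++ library; the class and mapping functions below are illustrative only, not the library's API, and the vector's directional index is omitted for brevity):

```python
# Python analogue of the mapped-variable idea: values are accessed through a target tag,
# and are mapped between mesh locations only when the tags differ.
class MappedVariable:
    def __init__(self, data, tag, mappers):
        self.data = data          # per-cell values on this variable's own mesh location
        self.tag_ = tag           # e.g. "corner" or "center"
        self.mappers = mappers    # {target_tag: function(data, i) -> value on target mesh}

    def tag(self):
        return self.tag_

    def __call__(self, target_tag, i):
        if target_tag == self.tag_:
            return self.data[i]                        # same mesh location: direct access
        return self.mappers[target_tag](self.data, i)  # otherwise map, host-code supplied


# Host-code-supplied mappings for a 1D mesh with 4 corners and 3 centers (boundaries clamped).
center_from_corner = {"center": lambda d, i: 0.5 * (d[i] + d[i + 1])}
corner_from_center = {"corner": lambda d, i: 0.5 * (d[max(i - 1, 0)] + d[min(i, len(d) - 1)])}

a = MappedVariable([1.0, 2.0, 3.0, 4.0], "corner", center_from_corner)
b = MappedVariable([10.0, 20.0, 30.0], "center", corner_from_center)
c = MappedVariable([0.0, 0.0, 0.0, 0.0], "corner", center_from_corner)

i = 1
c.data[i] = a(c.tag(), i) + b(c.tag(), i)   # a is accessed directly, b is mapped to corners
print(c.data[i])
```

In the real library, the same role is played by compile-time tags and tag dispatching, so the mapping choice carries no runtime branching cost.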


We demonstrate that MMOPS provides a zero to nearly zero-cost abstraction of the described functionality, in the sense that it incurs little to no performance penalty (depending on compiler) compared to a manual benchmark implementation of the same functionality. A description of the library, usage examples, and benchmarking tests will also be presented.

---

Background: The NSAPH (National Studies on Air Pollution and Health) lab focuses on analyzing the impact of air pollution on public health. Typically, studies in Environmental Health require merging diverse datasets coming from multiple domains such as health, exposures, and population demographics. Exposure data is obtained from various sources and presented in different formats, posing challenges in data management and integration. The manual process of extracting and aggregating data is repetitive, inefficient, and prone to errors. Harmonizing formats and standards is difficult due to the mix of public and proprietary data sources. Reproducibility is crucial in scientific work, but the heavy computations involved in processing exposure data are sensitive to the computational environment, so different versions of components can affect the results. Additionally, while some exposure data are public datasets that are shareable for reproducible research, there are exceptions, such as proprietary ESRI shapefiles, that cannot be publicly shared, further complicating reproducibility efforts in NSAPH.

Aim: Our main objective is to leverage our expertise in bioinformatics to create a robust data platform tailored for aggregating exposure and spatial data. We are building a data platform that is focused on exposure data, such as pollution, temperature, humidity, and smoke. This platform incorporates a deployment system, package management capabilities, and a configuration toolkit to ensure compatibility and generalizability across various spatial aggregations, such as zip code, ZCTA (zip code tabulation areas), and counties. Through the development of this data platform, our aim is to streamline exposure data processing, enable efficient transformation of exposure data, and facilitate reproducibility in working with exposure data.


Methods: The methodology employed in this study utilizes the Common Workflow Language (CWL) for the data pipeline. Docker is utilized for deployment purposes, while PostgreSQL serves as the data warehouse. Apache Superset is employed for data exploration and visualization. The study incorporates various types of data transformations, including isomorphic transformations for reversible conversions and union transformations for combining data from different sources. Additionally, rollups are performed to extract specific data elements, and approximations are used for imprecise conversions. In the case of tabular data, simple aggregations are conducted using SQL functions. For spatial operations, the methodology includes adjustable rasterization methods such as downscaling and resampling (nearest, majority). Furthermore, a data loader is developed to handle diverse spatial file types, including NetCDF, Parquet, CSV, and FST.
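As a flavor of the grid-to-zone aggregation step described above, here is a minimal xarray/pandas sketch (file names, variable names, and the ZCTA label grid are hypothetical; DORIEH's actual pipeline is CWL-orchestrated and far more general):

```python
# Sketch of aggregating a gridded exposure variable to zones, in the spirit of the
# gridMET-to-ZCTA rollup described above; inputs here are placeholders, not DORIEH data.
import xarray as xr
import pandas as pd

ds = xr.open_dataset("gridmet_tmax_2020.nc")            # hypothetical gridMET-style file
tmax = ds["tmax"].mean(dim="day")                        # annual mean on the native grid

# zcta_id: an integer label per grid cell (same lat/lon shape), produced by rasterizing ZCTA polygons.
zcta_id = xr.open_dataset("zcta_labels.nc")["zcta"]

df = pd.DataFrame({
    "zcta": zcta_id.values.ravel(),
    "tmax": tmax.values.ravel(),
}).dropna()

zonal_means = df.groupby("zcta")["tmax"].mean()          # simple zonal aggregation
print(zonal_means.head())
```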

Results: We developed DORIEH (data for observational research in environmental health), a comprehensive system designed for the transformation and storage of exposure data. Within this system, we implemented various utilities, notably the gridMET (Grid-based Meteorological Ensemble Tool) data pipeline, which encompasses multiple crucial steps. These steps include retrieving shapefiles from an API, importing exposure spatial data, aggregating exposure data from grids into selected spatial aggregations (e.g ZCTA, county, zip codes) utilizing adaptable rasterization methods, and ultimately storing the transformed data into our dedicated database. Furthermore, we constructed a versatile data loader capable of handling diverse file types, while also incorporating parallelization techniques to enhance processing efficiency. Additionally, DORIEH provides basic visualization capabilities that serve as quality checks for the data.


Conclusion: Our study addresses the challenges faced by the NSAPH group in analyzing the impact of environmental exposures on health. By developing the DORIEH data platform and implementing various utilities, including the flexible and configurable spatial aggregation data pipeline and a flexible data loader, we have made significant progress in overcoming the limitations of manual data extraction and aggregation. The platform enables streamlined exposure data processing, efficient transformation, and storage, while ensuring compatibility and reproducibility in working with exposure and spatial data.

---

The High Throughput Discovery Lab at the Rosalind Franklin Institute aims to iteratively expand the reaction toolkit used in drug discovery to enable new regions of biologically relevant chemical space to be explored. The current reaction toolkit, which underpins traditional drug discovery workflows, has to date been dominated by a small number (<10) of reaction classes that have remained largely unchanged for 30 years. It has been argued that this has contributed to attrition and stagnation in drug discovery. We are working to create a semi-automated approach to explore large regions of chemical space to identify novel bioactive molecules. The approach involves creating arrays of hundreds of reactions, in which different pairs of substrates are combined. The design of subsequent reaction arrays is informed by the biological activity of the products that are obtained. However, the multi-step laboratory process to create, purify, and test the reaction products introduces high requirements for data linkage, making data management incredibly important.

This talk focuses on how we built the data infrastructure using a mix of open-source and licensed technology. We will discuss how the Franklin aims to increase automation in data processing and how the technology we have implemented will make this possible.

---

The National Energy Research Scientific Computing Center, NERSC, is a U.S. Department of Energy high-performance computing facility used by over 9000 scientists and researchers for novel scientific research. NERSC staff support and engage with users to improve the use of this resource. We have now begun working towards creating a community of practice consisting of NERSC users, staff, and affiliates to pool resources and knowledge and build networks across scientific fields. Our highly interdisciplinary users have expertise in research computing in their respective science fields, and as such, access to collective knowledge in the form of a community allows resource sharing, peer-mentorship, peer teaching, and collaboration opportunities. Thus, we believe the benefits of such a community could lead to improved scientific output due to better peer support for technical and research-related issues.

In order to prepare a community creation strategy, gain insight into user needs, and understand how a community of practice could support those needs, NERSC staff conducted focus groups with users in the Spring of 2023. The findings from these focus groups provided significant insight into the challenges users face in interacting with one another and even with NERSC staff, and will inform our next steps and ongoing strategy. This presentation will outline the current state of the NERSC user community, the methodology for running user focus groups, qualitative and quantitative findings, and plans for building the NERSC user community of practice based on these findings.

---

The emergence of the Research Software Engineer (RSE) as a role correlates with the growing complexity of scientific challenges and the diversity of software team skills. At the same time, it is still a challenge for research funding agencies and institutions to directly fund activities that are explicitly engineering focused.


In this presentation, we describe research software science (RSS), an idea related to RSE, that is particularly suited to research software teams. RSS focuses on using the scientific method to understand and improve how software is developed and used in research. RSS promotes the use of scientific methodologies to explore and establish broadly applicable knowledge. Specifically, RSS incorporates scientific approaches from cognitive and social sciences in addition to existing scientific knowledge already present in software teams. By leveraging cognitive and social science methodologies and tools, research software teams can gain better insight into how software is developed and used for research, and share that insight by virtue of the scientific approaches used to gain it.

Using RSS, we can pursue sustainable, repeatable, and reproducible software improvements that positively impact research software toward improved scientific discovery. Also, by introducing an explicit scientific focus to the enterprise of software development and use, we can more easily justify direct support and funding from agencies and institutions whose charter is to sponsor scientific research. Direct funding of RSS activities is within these charters and RSE activities are needed, more easily justified, and improved by RSS investments.

---

Assembly and analysis of metagenomics datasets along with protein sequence analysis are among the most computationally demanding tasks in bioinformatics. The ExaBiome project is developing GPU-accelerated solutions for exascale-era machines to tackle these problems at unprecedented scale. The algorithms involved in these software pipelines do not fit the typical portfolio of algorithms that are amenable to GPU porting; instead, they are irregular and sparse in nature, which makes GPU porting a significant challenge. Moreover, it is a challenge to integrate complex GPU kernels within a CPU-optimized software infrastructure that depends largely on dynamic data structures. This talk will give an overview of the development of sequence alignment and local assembly GPU kernels that have been successfully ported and optimized for GPU-based systems, and the integration of these kernels within the ExaBiome software stack to demonstrate unprecedented capability for solving scientific problems in bioinformatics.

---

LINCC (LSST Interdisciplinary Network for Collaboration and Computing) is an ambitious effort to support the astronomy community by developing cloud-based analysis frameworks for science expected from the new Legacy Survey of Space and Time. The goal is to enable the delivery of critical computational infrastructure and code for petabyte-scale analyses, mechanisms to search for one-in-a-million events in continuous streams of data, and community organizations and communication channels that enable researchers to develop and share their algorithms and software.

We are particularly interested in supporting early science and commissioning efforts. The team develops software packages in different areas of interest to LSST, such as RAIL, a package for estimating redshift from photometry, and KBMOD, a package for detecting slowly moving asteroids. I will concentrate on our effort with the LSDB and TAPE packages, which focus on cross-matching and time-domain analysis. We are currently developing capabilities to: i) efficiently retrieve and cross-match large catalogs; ii) facilitate easy color-correction and recalibration on time-domain data from different surveys to enable analysis on long lightcurves; iii) provide custom functions that can be efficiently executed on large amounts of data (such as structure function and periodogram calculations); iv) enable large-scale calculation with custom, user-provided functions, e.g., continuous auto-regressive moving average models implemented in JAX.
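For readers unfamiliar with catalog cross-matching, a generic astropy sketch of the operation (this is plain astropy on made-up coordinates, not the LSDB API, which scales the same idea to very large catalogs):

```python
# Generic positional cross-match of two small catalogs with astropy.
import astropy.units as u
from astropy.coordinates import SkyCoord

cat_a = SkyCoord(ra=[10.001, 45.300] * u.deg, dec=[-5.002, 12.100] * u.deg)      # hypothetical sources
cat_b = SkyCoord(ra=[10.000, 45.299, 80.0] * u.deg, dec=[-5.000, 12.101, 1.0] * u.deg)

idx, sep2d, _ = cat_a.match_to_catalog_sky(cat_b)   # nearest neighbour in cat_b for each cat_a source
matched = sep2d < 1.0 * u.arcsec                    # keep matches within 1 arcsecond

for i, (j, sep, ok) in enumerate(zip(idx, sep2d.to(u.arcsec), matched)):
    print(f"A[{i}] -> B[{j}]  sep={sep:.2f}  matched={ok}")
```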

LINCC is already supporting the community through an incubator program and a series of technical talks. I will present these efforts and show our current status, results, and code, and discuss the lessons learned from collaboration between scientists and software engineers!

---

This presentation describes training activities carried out under the auspices of the U.S. Department of Energy; in particular, those facilitated by the Exascale Computing Project (ECP). While some of these activities are specific to members of ECP, others are beneficial to the HPC community at large. We report on training opportunities and resources that the broad Computer Science and Engineering (CS&E) community can access, the relevance of coordinated training efforts, and how we envision such efforts beyond ECP’s scope and duration.

---

Scientific software plays a crucial and ever-growing role in various fields by facilitating complex modeling, simulation, exploration, and data analysis. However, ensuring the correctness and reliability of these software systems presents significant challenges due to their computational complexity, their explorative nature, and the lack of explicit specifications or even documentation. Traditional testing methods fall short in validating scientific software comprehensively; in particular, explorative software and simulation tools suffer from the Oracle Problem. In fact, Segura et al. show that scientific and explorative software systems are inherently difficult to test. In this context, metamorphic testing is a promising approach that addresses these challenges effectively. By exploiting the inherent properties within scientific problems, metamorphic testing provides a systematic means to validate the accuracy and robustness of scientific software while avoiding the challenges posed by the Oracle Problem. The proposed talk will highlight the importance of metamorphic testing in scientific software, emphasizing its ability to uncover subtle bugs, enhance result consistency, and show approaches for a more rigorous and systematic software development process in the scientific domain.
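As a minimal illustration of a metamorphic relation, consider a routine whose exact output is unknown in advance (the Oracle Problem) but whose output must be invariant under a known input transformation. The sketch below checks such a relation for a toy least-squares fit; the relation and routine are illustrative, not taken from the talk:

```python
# Metamorphic test sketch: without knowing the "true" answer, we check that shuffling the
# observations (a metamorphic relation) leaves a least-squares fit unchanged.
import numpy as np

def fit_slope(x, y):
    # Toy "scientific" routine under test: ordinary least-squares slope.
    A = np.vstack([x, np.ones_like(x)]).T
    slope, _ = np.linalg.lstsq(A, y, rcond=None)[0]
    return slope

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + rng.normal(scale=0.1, size=50)

perm = rng.permutation(len(x))
original = fit_slope(x, y)
permuted = fit_slope(x[perm], y[perm])   # metamorphic follow-up input

assert np.isclose(original, permuted), "permutation invariance violated"
print(f"slope is stable under permutation: {original:.4f}")
```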

---

Introduction: We are pleased to submit a proposal to the US-RSE to include Anotemos, a media annotation software developed by GRIP (Grasping the Rationality of Instructional Practice), an education research lab at the University of Michigan. Anotemos is designed to enhance research analysis by providing an efficient media annotation solution. This proposal highlights the key features and benefits of Anotemos, showcasing its relevance to the research community.

Background and Objectives: Anotemos addresses the challenges faced by researchers in analyzing multimedia data, enabling them to extract valuable insights efficiently. Anotemos aims to streamline and automate the annotation workflow, offering researchers a user-friendly interface and advanced features for seamless annotation.


Key Features and Benefits: Using Anotemos, researchers can create Commentaries that are centered around a Multimedia Item. Anotemos offers a range of features that set it apart: (i) Comprehensive Annotation: Researchers can annotate various media types, including images, videos, and audio, with the support of diverse customizable annotation types such as text, icons, drawings, bounding boxes, and audio recordings both on-screen and off-screen; (ii) Real-time Collaboration: Anotemos enables multiple researchers to collaborate in real-time simultaneously, fostering knowledge exchange, reducing redundancy, and improving productivity; (iii) Share & Publish: Anotemos offers both private and public sharing, enabling the secure sharing of Commentaries with select individuals or the wider research community; (iv) Customizable Workflows: Anotemos supports the creation of customized workflows by making it easier to create identical Commentaries, and manage different sets of collaborators using Commentary sections, enabling researchers to tailor the platform to their projects; (v) Analysis & Reports: Using Anotemos, users can perform an in-depth analysis of annotated data and generate comprehensive reports, providing valuable insights and facilitating data-driven decision-making in research projects; (vi) Integrations: Anotemos offers seamless integration with Learning Tools Interoperability (LTI), allowing users to easily embed and access Anotemos Commentaries within LTI-compliant learning management systems. Furthermore, Anotemos supports LaTeX code, empowering users to annotate and display mathematical equations and formulas with precision. Additionally, Anotemos can be embedded into Qualtrics Surveys enabling the researchers to collect and analyze the survey data along with Anotemos annotations.

Conclusion: We believe that Anotemos has the potential to significantly enhance research analysis by providing an efficient, user-friendly, and customizable media annotation solution. It is developed using the Angular-Meteor framework, leveraging its robustness, scalability, and real-time capabilities. The software is currently in beta testing, with positive feedback from researchers in diverse fields. Its advanced features make it an ideal tool for researchers across various domains. We request the consideration of Anotemos for presentation at the US-RSE Conference, where research software engineers can gain insights into its features, benefits, and potential impact on research projects.

---

Managing massive volumes of data and effectively making it accessible to researchers poses significant challenges and is a barrier to scientific discovery. In many cases, critical data is locked up in unwieldy file formats or one-off databases and is too large to effectively process on a single machine. This talk explores the role of Kubernetes, an open-source container orchestration platform, in addressing research data management challenges. I will discuss how we are using a set of publicly available open-source and home-grown tools in the National Renewable Energy Lab (NREL) Data, Analysis, and Visualization (DAV) group to help researchers overcome data-related bottlenecks.


The talk will begin by providing an overview of the data challenges faced in research data management, including data storage, processing, and analysis. I will highlight Kubernetes' ability to handle large-scale data by leveraging containerization and distributed computing, including distributed storage. Kubernetes allows researchers to encapsulate data processing infrastructure and workflows into portable containers, enabling reproducibility and ease of deployment. Kubernetes can then schedule and manage the resource allocation of these containers to enable efficient utilization of limited computing resources, leading to more efficient data processing and analysis.
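As a concrete example of the pattern described above, here is a minimal sketch that submits a containerized processing step as a Kubernetes Job using the official Python client (the image name, namespace, job name, and command are hypothetical, and this is not NREL's deployment code):

```python
# Submitting a containerized data-processing step as a Kubernetes Job via the official Python client.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="sensor-ingest"),             # hypothetical job name
    spec=client.V1JobSpec(
        backoff_limit=2,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="ingest",
                        image="registry.example.org/ingest:latest",  # hypothetical image
                        command=["python", "ingest.py", "--date", "2024-08-01"],
                        resources=client.V1ResourceRequirements(
                            requests={"cpu": "2", "memory": "4Gi"},
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="research-data", body=job)
```

Encapsulating each step this way is what lets the scheduler pack many teams' workloads onto the same limited hardware.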


I will discuss some limitations of traditional, siloed approaches to dealing with data and emphasize the need for solutions which foster collaboration. I will highlight how we are using Kubernetes at NREL to facilitate data sharing and cooperation among research teams. Kubernetes' flexible architecture enables the deployment of shared computing environments, such as Apache Superset, where researchers can seamlessly access and analyze shared datasets. Providing the ability to have one research team easily consume data generated by another, utilizing Kubernetes as a central data platform, is one of the major wins we’ve encountered by adopting the platform.

Finally, I will showcase real-world use cases from NREL where we have used Kubernetes to solve some persistent data challenges involving large volumes of sensor and monitoring data. I will discuss the challenges we encountered when creating our cluster and making it available as a production-ready resource. I will also discuss the specific suite of tools we have deployed in our infrastructure, including Postgres and Apache Druid for columnar and timeseries data and Redpanda (Kafka-compatible) for streaming data, and the process that went into the selection of these tools.

Attendees of this talk will gain insights into how Kubernetes can address data challenges in research data management. This talk aims to provide researchers with a framework and a set of building blocks which have worked well for us, in order for them to unlock the full potential of their data in the era of software-enabled discovery.

---

The Globus platform enables research applications developed by research teams to leverage data and compute services across many tiers of service—from personal computers and local storage to national supercomputing centers—with minimal deployment and maintenance burden. Globus is operated by the University of Chicago and is used by nearly all R1 universities, national labs, and supercomputing centers in the United States, as well as many smaller institutions.


In this talk, we’ll introduce the Globus Platform-as-a-Service, including how to register an application and how to access Globus APIs using our Python SDK. We will present examples of how the various Globus services, interfaces, and tools may be used to develop research applications. We will demonstrate authentication and access control with Globus’s Auth and Groups APIs; making data findable and accessible using Globus guest collections, data transfer API, and indexed Search API; and automating research with Globus Flows and Compute APIs.
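To give a feel for the SDK usage the talk will demonstrate, here is a minimal sketch of an authenticated transfer with the Globus Python SDK (the client ID, collection UUIDs, and paths are placeholders; consult the Globus documentation for the full auth flow):

```python
# Minimal Globus SDK sketch: authenticate, then submit a transfer between two collections.
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"          # placeholder
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow(requested_scopes=globus_sdk.TransferClient.scopes.all)
print("Log in at:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Paste auth code: "))

transfer_tokens = tokens.by_resource_server["transfer.api.globus.org"]
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_tokens["access_token"])
)

task = globus_sdk.TransferData(tc, "SRC-COLLECTION-UUID", "DST-COLLECTION-UUID")  # placeholders
task.add_item("/project/results/run42/", "/shared/run42/", recursive=True)
print("Submitted task:", tc.submit_transfer(task)["task_id"])
```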

---

The expected long-term outcome of our research is to determine the ways in which a focus on scientific model code quality can improve both scientific reliability and model performance. In this work, we focus on climate models which are complex software implementations of the scientific and mathematical abstractions of systems that govern the climate.

Climate models integrate the physical understanding of climate and produce simulations of climate patterns. The model code is often hundreds of thousands of lines of highly sophisticated software, written in high-performance languages such as Fortran. These models are developed with teams of experts, including climate scientists and high-performance computing developers. The sheer complexity of the software behind the climate models (which have evolved over many years) makes them vulnerable to performance, usability, and maintainability bugs that hinder performance and weaken scientific confidence. The social structures behind the software and how the models function have been examined, but the complex interactions of domain experts and software experts remain poorly understood.

The expected short-term outcomes of our research are to a) develop a set of software quality indicators for the climate models, providing model maintainers and users with specific areas to improve the model code; and b) develop new techniques for analyzing large bodies of unstructured text to explain how users perceive a research software project’s capabilities and failings, aligned with international efforts to improve the quality of scientific software.

We follow two main approaches. (1) *Analytics*: analysis of climate models and their artifacts. These artifacts include the software code itself, including the way the model is deployed and run with end users; the bug reports and other feedback on the software quality; and the simulation testing used to validate the model outputs. Our analysis will be incorporated into a Fortran analysis tool to identify Fortran quality issues automatically. This analysis is incomplete, however, without a clear understanding of the social context in which it was produced. (2) *Social Context*: we examine the socio-technical artifacts created around climate models. These include a) outputs from workshops, interviews, and surveys with stakeholders, including climate scientists and model software developers; b) explicit issue and bug reports recorded on the model, such as the fact that the model is failing/crashing at a particular place; c) implicitly discussed software problems and feature requests from related discussion forums. We hypothesize that both approaches will help to identify technical debt in the climate models.

Manually eliciting software requirements from these diverse sources of feedback can be time-consuming and expensive. Hence, we will use state-of-the-art Natural Language Processing (NLP) approaches that require less human effort. The web artifacts, interviews, and scientific texts of climate models provide reflections on the Climate Community of Practice (CoP). We will use an unsupervised topic model, Latent Dirichlet Allocation (LDA), to study how different narratives (such as specific modules of climate models, research interests, and needs of the community members) of a climate CoP have evolved over a period.


LDA assumes that documents are mixtures of topics, and a topic is a probability distribution over words. The gist of a topic can be inferred from its most probable words. We focus on the discussion forum of the Community Earth System Model (CESM, https://www.cesm.ucar.edu/). CESM is a collaboratively developed fully coupled global climate model focused on computer simulations of Earth's climate states.
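To make the modeling step concrete, here is a small gensim sketch of fitting an LDA model to forum posts and inspecting the top words per topic (the study itself used MALLET; the tiny corpus below is a placeholder for the ~7000 posts):

```python
# Illustrative LDA fit with gensim (the study used MALLET); `posts` stands in for the forum corpus.
from gensim import corpora, models

posts = [
    "error compiling cesm on cluster with intel compiler",
    "ocean model output missing variables in history files",
    "how to set up cmip6 ssp scenarios for a coupled run",
]  # hypothetical, pre-cleaned posts
texts = [p.split() for p in posts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=10, random_state=1)
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)

# Per-document topic proportions: the quantity tracked over time in the trend analysis below.
print(lda.get_document_topics(corpus[0]))
```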

We infer 15 topics on around 7000 posts from 2003 to 2023 on the discussion forum using the MALLET topic modeling library (https://mimno.github.io/Mallet/). We ask a domain expert to assign a label to a topic and use the labels to analyze the posts. We plot the proportions of words in a document assigned to a topic over a period. We observe certain trends over around 20 years. The topic of *version management of CESM models* gained attention from 2018 onwards until 2021. There is a steady increase in discussion of the topic of *Coupled Model Intercomparison Project (CMIP) and Shared Socio-economic Pathways (SSPs)* (https://en.wikipedia.org/wiki/Shared_Socioeconomic_Pathways). Discussions about issues related to *installing and setting up CESM models* and *parallel computing* have declined since 2016 and 2020, respectively. Topics of *source code related changes*, *ocean modeling*, and *errors encountered while running CESM models* are discussed throughout the period. We also observe that questions related to *parallel computing* received fewer responses, while questions related to *compiling CESM models* received more responses. These trends and insights can be used by the software engineering group of CESM to prioritize its actions to facilitate the use of the models.

We believe that this qualitative study scaffolded using topic models will complement climate science research by facilitating the generation of software requirements, onboarding of new users of the models, responding to new problems using solutions to similar problems, and preventing reinvention.

---

Academic research collaborations involving members from multiple independent, often international institutions are inherently decentralized, and they encounter the same general challenges in sharing data and collaborating online as does the internet at large when there are no viable or desirable centralized solutions. At NCSA we have developed a free and open source full-stack cyberinfrastructure (CI) for research collaborations based on OpenStack and Kubernetes that is reproducible, flexible, portable, and sustainable. We are embracing the lessons and technology of the (re)decentralized web to curate a suite of open source tools and web services that prioritize data ownership to give researchers as much control as possible over their data, communications, and access controls. I will present the architecture of our framework as well as example use cases from existing projects ranging from astronomy to nuclear physics, showcasing the flexible deployment system and some collaborative tools and services for identity and access management, messaging, data sharing, code development, documentation, high-performance computing and more. By the end of the talk I hope you will have some appreciation for the value decentralized tech brings to the research enterprise and the potential for innovation that lies in the creative integration of existing federated and peer-to-peer applications and protocols.

---

Airtable is an increasingly popular format for entering and storing research data, especially in the digital humanities. It combines the simplicity of spreadsheet formats like CSV with a relational database’s ability to model relationships; enterers or viewers of the data do not need to have programming knowledge. The Center for Digital Research in the Humanities at the University of Nebraska uses Airtable data for two projects on which I work as a developer. African Poetics has data focusing on African poets and newspaper coverage of them, and Petitioning for Freedom has data on habeas corpus petitions and involved parties. At the CDRH, our software can take data in many formats, including CSV, and ingest it for an API based on Elasticsearch. This data is then available for search and discovery through web applications built on Ruby on Rails.

The first step in ingesting the Airtable data into our system is to download it. I will cover the command line tools that can do this, the formats that can be downloaded (JSON turned out to be the most convenient), and the requirements for authentication. Python (aided by Pandas) can transform this data into the CSV format that our back-end software requires. I will discuss how to rename and delete columns, change data back into JSON (which is sometimes more parsable by our scripts), and clean troublesome values like blanks and NaNs.

One advantage of Airtable over CSV is join tables that have similar functionality to SQL databases. Incorporating them into other projects has particular challenges. When downloaded directly from Airtable, the join data is in a format that cannot be interpreted by humans or programs other than Airtable. But it can be converted (with the help of some processing within Airtable) into formats that can be parsed by external scripts so that it can be human-readable. With these transformations, our software can use the table to populate API fields and parse it into arrays and hashes to replicate relationships within the data.

Finally, I will discuss the advantages and disadvantages of Airtable for managing data, from the perspective of a developer who uses the data on the front and back end of web applications.
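A minimal sketch of the download-and-flatten step described above, using the public Airtable REST API and pandas (the base ID, table name, and field names are placeholders, and a personal access token is assumed in the environment):

```python
# Download one Airtable table via the REST API and flatten it to CSV with pandas.
import os
import requests
import pandas as pd

BASE_ID = "appXXXXXXXXXXXXXX"         # placeholder Airtable base ID
TABLE = "Petitions"                    # placeholder table name
url = f"https://api.airtable.com/v0/{BASE_ID}/{TABLE}"
headers = {"Authorization": f"Bearer {os.environ['AIRTABLE_TOKEN']}"}

records, params = [], {}
while True:                            # the API pages results and returns an `offset` cursor
    resp = requests.get(url, headers=headers, params=params)
    resp.raise_for_status()
    payload = resp.json()
    records.extend(payload["records"])
    if "offset" not in payload:
        break
    params["offset"] = payload["offset"]

df = pd.json_normalize(records)        # flattens the nested "fields" object into fields.* columns
df = df.rename(columns={"fields.Petitioner": "petitioner"}).fillna("")  # placeholder field name
df.to_csv("petitions.csv", index=False)
```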

---
diff --git a/pages/program/program.md b/pages/program/program.md index 663c490..acc2c3d 100644 --- a/pages/program/program.md +++ b/pages/program/program.md @@ -12,6 +12,7 @@ set_last_modified: true - [Birds of a Feather]({{ site.baseurl }}/program/bofs/) - [Notebooks]({{ site.baseurl }}/program/notebooks/) - [Papers]({{ site.baseurl }}/program/papers/) +- [Talks]({{ site.baseurl }}/program/talks/) - [Tutorials]({{ site.baseurl }}/program/tutorials/) - [Workshops]({{ site.baseurl }}/program/workshops/) From ead6dad2198063861e6ada647bcb3cbcd920b393 Mon Sep 17 00:00:00 2001 From: "J.C. Subida" Date: Fri, 2 Aug 2024 12:50:46 -0500 Subject: [PATCH 06/10] Add program posters --- _data/menus/program.yml | 2 + _data/navigation.yml | 2 + pages/program/abstracts/posters.md | 1605 +++++++++++++++++++++++++++ pages/program/abstracts/posters.tbd | 324 ------ 4 files changed, 1609 insertions(+), 324 deletions(-) create mode 100644 pages/program/abstracts/posters.md delete mode 100644 pages/program/abstracts/posters.tbd diff --git a/_data/menus/program.yml b/_data/menus/program.yml index 2b97f67..a3b84f2 100644 --- a/_data/menus/program.yml +++ b/_data/menus/program.yml @@ -8,6 +8,8 @@ link: program/notebooks/ - name: Papers link: program/papers/ + - name: Posters + link: program/posters/ - name: Talks link: program/talks/ - name: Tutorials diff --git a/_data/navigation.yml b/_data/navigation.yml index 51010e2..bf8ea70 100644 --- a/_data/navigation.yml +++ b/_data/navigation.yml @@ -13,6 +13,8 @@ link: program/notebooks/ - name: Papers link: program/papers/ + - name: Posters + link: program/posters/ - name: Talks link: program/talks/ - name: Tutorials diff --git a/pages/program/abstracts/posters.md b/pages/program/abstracts/posters.md new file mode 100644 index 0000000..de3d5f5 --- /dev/null +++ b/pages/program/abstracts/posters.md @@ -0,0 +1,1605 @@ +--- +layout: page +title: Posters +description: +menubar: program +permalink: program/posters/ +menubar_toc: true +set_last_modified: true +--- + +

NCSA is piloting a new role, Research Solutions Architect. Common in industry settings, Solutions Architecture is a novel concept in research computing. We start with a portfolio of standardized hosted services that address common research computing and data problems. We then apply Human Centered Design techniques to understand the problems facing researchers and assemble complete solutions based on products from the portfolio. The result is a low-cost process for meeting the needs of many researchers across campus.

In this poster we will show the process we go through to design these solutions and explain some of the products in NCSA’s current portfolio.

---

Social Media Intelligence and Learning Environment (SMILE), a key application of the Social Media Macroscope (SMM) project, is an open-source platform tailored for social media research. SMILE addresses the limitations of existing social media analytics tools, such as high costs and restricted access to data acquisition and analysis methodologies. It offers a comprehensive suite of tools via an easy-to-use web interface, facilitating free academic research. By maintaining open-source code, SMILE ensures transparency and reproducibility of the research results. Additionally, it uses the user's credentials for data collection, adhering to social media platform protocols and guidelines to ensure proper data collection.

One notable feature of SMILE is its sentiment analysis capability, which includes the VADER (Valence Aware Dictionary and sEntiment Reasoner) algorithm, the SentiWordNet algorithm, and a machine learning-trained sentiment model with debiased word embeddings. This makes SMILE an invaluable tool for researchers aiming to understand and interpret the emotional tone of social media content. Other analytics offered by SMILE include natural language processing, network analysis, machine learning classification, named entity recognition, and topic modeling, each incorporating multiple algorithms with appropriate academic citations.

SMILE's microservice architecture ensures portability and scalability. CILogon provides secure and lightweight identity management. The backend, powered by Node.js/Express and GraphQL, handles user requests and social media data collection. Analytics modules run in separate containers, with RabbitMQ managing communication between the backend and analytics modules. MinIO is employed to provide secure data storage, and an integration with NCSA Clowder facilitates community data sharing.

SMILE leverages Docker containers managed by Kubernetes and Helm charts for continuous integration and deployment. All images and deployment charts are accessible on GitHub and Docker Hub. Automation scripts, leveraging GitHub Actions, are in place to simplify build, publication, and deployment tasks, making migration, upgrade, and deployment processes efficient and user-friendly.
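For readers unfamiliar with VADER, a minimal sketch of the kind of rule-based scoring described above, using the standalone vaderSentiment package rather than SMILE's own web interface (the example posts are made up):

```python
# Minimal VADER example; SMILE wraps this kind of analysis behind its web UI.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
posts = [
    "Loving the new open-source release!",      # hypothetical posts
    "This outage is really frustrating.",
]
for post in posts:
    scores = analyzer.polarity_scores(post)     # neg/neu/pos plus a compound score in [-1, 1]
    print(f"{scores['compound']:+.3f}  {post}")
```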
---

This poster describes the research software engineering and user experience approaches, as well as lessons learned from initial deployments, for a new open-source web UI tool, OGRRE, which enables rapid human review of Google Doc AI-digitized oil and gas regulatory documents.

The US has hundreds of thousands of orphaned oil and gas wells that no longer have a responsible owner. These wells pose an environmental and human health risk, as they can leak methane and other pollutants into the air and groundwater if they become compromised. In 2021, the Bipartisan Infrastructure Law was passed, which allocated $4.7 billion for orphaned well plugging in the US. One major barrier to the Herculean task of plugging these wells is providing the stakeholders doing this work with accurate information about their location, construction, and status. This information is often documented in regulatory paper records created during the well permitting process. While many regulatory agencies have scanned their paper records, they often lack the resources necessary to digitize their content. Extracting this information and organizing it in computer databases will help regulatory agencies as they seek to better characterize the orphaned wells under their jurisdiction and prioritize their plugging.

Modern AI/ML optical character recognition (OCR) models can aid the efficient digitization of historic well records. However, these documents often date back over 100 years and contain handwritten fields and many other anomalies. This results in a significant challenge when training AI/ML algorithms to accurately identify fields, and results are often imperfect. Human review of the documents is needed to find and correct problems. With hundreds of thousands of documents to process and limited resources, an efficient melding of the human-in-the-loop tasks with the AI/ML models becomes essential.

This poster will describe the Oil and Gas Regulatory Record digitizEr (OGRRE) — a custom user interface we developed to facilitate rapid human review of digitized oil and gas regulatory documents. OGRRE uses retrained AI/ML models developed by Google Document AI. The tool is designed to enable the interactive review and correction of AI/ML extracted data. The poster will describe the combination of Google Doc AI modeling, software engineering, and user experience approaches used to create the OGRRE tool and software infrastructure, as well as lessons learned from our pilot deployment of the tool.

OGRRE development is supported by the United States Department of Energy’s Undocumented Orphaned Well Program under the CATALOG project.
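As background for how a Document AI extraction step can feed a review tool like OGRRE, here is a generic sketch using Google's Python client (the project, processor, and file names are placeholders, and this is not OGRRE's actual code or its retrained processors):

```python
# Generic Google Document AI extraction sketch; the processor and document are hypothetical.
from google.cloud import documentai

client = documentai.DocumentProcessorServiceClient()
name = client.processor_path("my-project", "us", "my-well-record-processor")  # hypothetical IDs

with open("scanned_well_record.pdf", "rb") as f:
    raw = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw)
)

# Extracted fields with model confidence; low-confidence values are what a human reviewer checks.
for entity in result.document.entities:
    print(f"{entity.type_:<20} {entity.mention_text!r:<30} confidence={entity.confidence:.2f}")
```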
---

Building on discussions first started at the German RSE conference in 2023 (de-RSE23), a recent pre-print, Foundational Competencies and Responsibilities of a Research Software Engineer, identifies a set of core competencies required for RSEng and describes possible pathways of development and specialisation for RSEs. It is the first output of a group with broad interests in the teaching and learning of RSEng skills.

With continuing growth in RSE communities around the world, and sustained global demand for RSEng skills, US-RSE24 presents an opportunity to align international efforts towards:

- training the next generation of RSEs
- providing high-quality professional development opportunities to those already following the career path
- empowering RSE Leaders to further advocate for the Research Software Engineering needs and expertise in their teams, institutions, and communities.

Therefore, we want to give an overview of what the group has been working on so far, discuss the aims of our future work, and invite members of the international RSE community to contribute and provide feedback. We particularly encourage members of regional groups focused on RSEng training and skills to attend and share their perspectives.

---

Molecular software is beginning to use more GPU-accelerated compute, and traditional CI solutions can often be expensive when running longer GPU tests. In order to understand the existing solutions, we reviewed 6 potential self-hosted GitHub Actions Runner candidates from a curated list. This poster explains how the Open Molecular Software Foundation (OMSF) seeks to provide a new solution [8] for cloud-based GitHub Actions Runners to make scaling of GPU compute more effective for molecular software. We highlight the long-term community decisions discussed, such as language selection, ability to receive feedback, cloud-agnostic architecture, and ease of maintainability, that led to developing a new solution rather than updating existing tooling. In addition, we discuss common pitfalls in working with cloud-based infrastructure and how we implemented a testing strategy to mitigate the potential costs of untested infrastructure code.

---

The Carpentries is a community building global capacity in essential data and computational skills for conducting efficient, open, and reproducible research. In addition to certified Instructors teaching Data Carpentry, Library Carpentry, and Software Carpentry workshops around the world, the community also includes many people contributing to and maintaining Open Source lessons.

Recent years have seen enormous growth in the number and diversity of lessons the community is creating, including many teaching skills and concepts essential to Research Software Engineering: software packaging and publication, environment and workflow management, containerised computing, etc. As this curriculum community has developed, demand has been growing for training opportunities teaching how to design and develop Open Source curriculum effectively and in collaboration with others.

A new program launched in 2023, The Carpentries Collaborative Lesson Development Training, teaches good practices in lesson design and development, and open source collaboration skills, using The Carpentries Workbench, an Open Source infrastructure for building accessible lesson websites.

As the discipline of Research Software Engineering continues to develop and mature, there is an increasing need for high-quality, Open Source and community-maintained training, and for the expertise to develop those resources. This poster will provide an overview of the training, explain how it meets this need, and describe how it fits into The Carpentries ecosystem for lesson development. It will also explain how RSEs can enroll in the training, and give examples of lesson projects that have benefited from it already.

---

Achieving performance portability while ensuring code sustainability is often challenging, particularly with research software applications. In complex codes comprised of many distinct physics modules coupled together to form a larger system, such as climate and weather models, the challenges are compounded. The individual physics components are often developed independently by domain scientists from different backgrounds, and designing software for performance and maintainability is sometimes a secondary consideration. Utilizing software frameworks can help reduce the complexity of these large coupled application codes while also providing mechanisms to help obtain performance portability and accessibility to domain scientists. AMReX is a software framework for multiphysics applications targeting high performance computing and is designed with multiple levels of abstractions to ease software design and development. In this work, we focus on extending the Energy Research and Forecasting (ERF) model, built on the AMReX framework, to couple with land surface models. We present a new C++ implementation of the Simplified Land Model (SLM) in ERF based on the Fortran version in the System for Atmospheric Modeling (SAM) code. Performance and validation comparisons between the original and new SLM implementations will be shown. Additionally, we will discuss the process of re-implementing SLM in ERF and share our experiences and lessons learned for the broader research software community.

---

When conducting scientific experiments, vast collections of data and subsequently scientific metadata are generated. Maintaining the set of FAIR (findability, accessibility, interoperability, and reusability) principles is often challenging given large datasets and various metadata standards. MARS (Metadata Aggregator for Reproducible Science) aims to address this challenge as an open-source application by providing a structured interface on top of database storage for the management and querying of scientific metadata, particularly in the biological sciences.

MARS introduces a nomenclature for a metadata abstraction that is intended to be customizable and scalable across scientific contexts. A core architectural feature of MARS is the ability to granularize units of metadata into single “entities”. Entities may represent a physical or digital object, ranging from whole animals to brain slice images from that same animal. Raw data are not stored in MARS; rather, external links to the relevant data storage platforms are stored within entities. Graph-like connections can be used to establish relationships between entities, and a “project” grouping system within MARS allows entities to be grouped with each other to represent contexts such as experimental output, physical storage contents, or data presented in a publication.

To specify metadata associated with an entity, MARS uses key-value pairings known as “attributes”. Attributes can be used to express metadata on a per-entity basis, or attribute templates can be created and reused across multiple entities. Attributes are named uniquely, and the values are restricted to types including “date”, “string”, “number”, or a set of fixed options defined in a drop-down menu.

MARS also provides several features useful for scientists, such as an ORCID ID authentication requirement for all users, metadata import from CSV, Excel, and JSON files, multiple metadata export options, and the ability to construct and execute queries across all stored metadata. MARS uses a React frontend written in TypeScript and utilizes components from the Chakra UI library. The Node.js backend couples Express.js with GraphQL to exchange data with a MongoDB database, and is currently hosted on Microsoft Azure.

MARS has been deployed successfully within a neuroscience laboratory, managing hundreds of unique entities and supporting 6 active users. Intuitive training for new users is incorporated within the software for each step required to manage entities and attributes across the platform. Potential future deployment targets for MARS include other laboratories and core facilities at Washington University in St. Louis.

GitHub: https://github.com/Brain-Development-and-Disorders-Lab/mars

---

Electroencephalography (EEG) preprocessing is the initial and critical step for accurate brain function and behavior analysis. There is a pressing demand for an automatic pipeline that can process huge volumes of data, yet existing commercial software solutions are expensive, have no batch processing capabilities, and offer limited automation. While open-source libraries in Python and R can be used to implement a solution, doing so demands extensive programming expertise from researchers, who often lack knowledge of programming and batch processing. This project introduces an accessible, scalable EEG processing pipeline framework tailored for high-performance computing (HPC). It features a graphical user interface (GUI) with a node-editor-based drag-and-drop pipeline creator with reusable modules, a Python-based pipeline runner to run the serialized interpretation of the pipeline leveraging the MNE library, and SLURM integration for batch processing. The framework provides modules for commonly used preprocessing steps such as filtering, epoching, artifact rejection, and baseline correction, and advanced techniques like Independent Component Analysis (ICA) and power spectral density analysis. The framework has been supporting diverse EEG research applications including resting state, letter flanker, and emotional regulation analysis at the BRAINS Lab using the ACCRE computing cluster at Vanderbilt University, proving its flexibility and adaptability for real-world use cases. By significantly advancing the accessibility and increasing the automation of EEG preprocessing, this system holds promise for reducing barriers and accelerating research in neuroscience and related fields.
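To illustrate the kind of steps the pipeline runner chains together, here is a minimal MNE-Python sketch of filtering, epoching, ICA, and spectral analysis (the file name and event IDs are hypothetical, and this is not the framework's own code):

```python
# Minimal MNE-Python preprocessing sketch of the steps the pipeline automates.
import mne

raw = mne.io.read_raw_fif("sub-01_task-flanker_raw.fif", preload=True)  # hypothetical recording
raw.filter(l_freq=1.0, h_freq=40.0)                  # band-pass filter

events = mne.find_events(raw)                        # assumes a stimulus/trigger channel is present
epochs = mne.Epochs(raw, events, event_id={"target": 1}, tmin=-0.2, tmax=0.8,
                    baseline=(None, 0), preload=True)

ica = mne.preprocessing.ICA(n_components=20, random_state=97)
ica.fit(epochs)                                      # ICA for artifact identification
psd = epochs.compute_psd(fmax=40.0)                  # power spectral density analysis
```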

---

EPMT (Experiment Performance Management Tool) is an open-source tool for aggregating metadata/metrics from batch jobs of interest at GFDL's post-processing and analysis cluster (PPAN). EPMT features ultra-lightweight metadata annotation capabilities, minimally impacting batch job execution. EPMT scrapes a wide array of metrics, including software metadata (high-level) and process-level data (low-level). Data scraping occurs inside of the application and is recorded based on how the application sees itself. The goal of EPMT is to be able to easily cut through noisy data and identify whether issues are software or hardware related. Sometimes the same job, run on the same file, can have a different result. An example explored in this poster is when we used EPMT to generate histograms of all metrics, which led to the identification of a spike in read bytes at a specific value, for which we were able to conduct root cause analysis. EPMT can help identify these kinds of issues, detect anomalies, and help predict where future hardware changes will need to be made to keep a system running.
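A small pandas/matplotlib sketch of the histogram-style screening described above (the metrics file and column names are hypothetical, not EPMT's schema):

```python
# Histogram screening of per-job metrics, similar in spirit to the EPMT analysis described above.
import pandas as pd
import matplotlib.pyplot as plt

jobs = pd.read_csv("job_metrics.csv")           # hypothetical export: one row per batch job
metric = "read_bytes"                            # hypothetical column name

ax = jobs[metric].plot.hist(bins=100, log=True)
ax.set_xlabel(metric)
ax.set_ylabel("job count")
plt.savefig(f"{metric}_hist.png")

# Flag jobs near an unexpected spike for root-cause follow-up.
spike = jobs[metric].mode().iloc[0]
suspects = jobs[abs(jobs[metric] - spike) < 0.01 * spike]
print(suspects[["job_id", metric]].head())       # assumes a job_id column
```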

---

The Department of Energy Systems Biology Knowledgebase (KBase) is a community-driven research platform designed for discovery, knowledge creation, and sharing. The open-access platform integrates data and analysis tools into reproducible notebooks called Narratives. KBase provides access to DOE computing infrastructure for running applications and enables collaborative research and publishing of findable, accessible, interoperable, and reusable (FAIR) analyses. KBase’s 40,000 users come from a broad range of experience levels and research interests; thus, characterizing the user base and targeting engagements is crucial to grow and support the scientific communities using the platform for research. In this poster we share our models for enabling users through training and documentation development, social platforms for community communication, and education materials development.

The team works to support users in developing skills and knowledge of computational biology through user development and support activities in order to advance their research. User development activities include hosting workshops and webinars to demonstrate the features and analytical capabilities of the platform. Workshops are designed to communicate scientific content knowledge, support institutional research needs, and facilitate collaboration between research groups. Webinars are broadcast to a global audience and showcase powerful workflows, new tools and features, and speakers ranging from community developers to subject matter experts.

User support is provided through a help board that allows users to report bugs or ask questions to be addressed by project members. User help tickets are public so that questions and answers are findable by future users. Users may also submit new feature requests to inform platform development to better serve their needs. This board is supported by dedicated staff, dedicated subject matter experts, and rotating developers.

Additionally, the KBase Educators Program supports professors and teachers by developing curriculum for computational biology. This program includes biology and data science instructors teaching students at community colleges, primarily undergraduate institutions, and doctoral research institutions that represent diverse populations, including Minority-Serving and Emerging Research Institutions. The program uses accessible and reproducible modules to encourage students without programming skills or access to additional computational resources.

Through each of these approaches, KBase provides a platform that ensures research software reaches the end-user biologists and data scientists, enabling them to fully utilize these sophisticated tools and share knowledge in a FAIR manner across the scientific community.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

There is a large variety of types of research software at different stages of evolution. Due to the nature of research and its software, existing models from software engineering often do not cover the unique needs of RSE projects. This lack of clear models can confuse potential software users, developers, funders, and other stakeholders who need to understand the state of a particular software project, such as when deciding to use it, contribute to it, or fund it. We present work performed by a group consisting of both software engineering researchers (SERs) and research software engineers (RSEs), who met at a Dagstuhl seminar to collaborate on these ideas.

Through our collaboration, we found that our terminologies and definitions often vary; for example, one person may consider a software project to be early-stage or in maintenance mode, whilst another person might consider the same software to be inactive or failed. Because of this, we explored concepts such as software maturity, intended audience, and intended future use. In this poster, we will present a working categorization of research software types, as well as an abstract software lifecycle that can be applied and customized to suit a wide variety of research software types. Such a model can be used to make decisions and guide development standards that may vary by stage and by team. We are also seeking community input on improving these two artifacts for future iterations.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

As part of the Oak Ridge Leadership Computing Facility's (OLCF) motivation to meet modern scientific demands, the OLCF is implementing a unified set of APIs that enable programmatic access to the OLCF's computational and data resources. The OLCF Facility API supports classical HPC workloads involving the analysis and/or reduction of static datasets, as well as workloads falling into the two broad classes of IRI Science Patterns outlined in the DOE's IRI ABA Report that are applicable to this project:

- Time-sensitive workloads requiring highly reliable, real-time or near-real-time performance for active decision making, experiment steering, and/or experiment calibration.
- Data integration-intensive workloads requiring the combination and analysis of data from a mix of experiments, sensors, and/or other computational runs, possibly in real time.

Within the Facility API, we aim to develop a range of interfaces that can be consumed together in code to realize all common configurations of leadership-scale scientific workloads. In this poster, we present our work-in-progress experiences in building Compute, Status, Environment, and Stream Management APIs across core OLCF resources, and discuss the architecture, security, and usability challenges of each.
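
As a rough sketch of how such interfaces might be consumed together from code, the example below submits a job through a hypothetical Compute endpoint and polls a hypothetical Status endpoint. All URLs, payload fields, and the bearer token are assumptions made for illustration; they are not the actual OLCF Facility API.

```python
# Illustrative only: hypothetical Compute and Status interfaces of a facility
# API consumed together from a workflow script.
import time
import requests

BASE = "https://facility.example.gov/api/v1"      # placeholder base URL
HEADERS = {"Authorization": "Bearer <token>"}     # placeholder credential

# Compute interface: request a batch allocation for an analysis workload.
job = requests.post(f"{BASE}/compute/jobs", headers=HEADERS, json={
    "account": "PROJECT123",
    "nodes": 4,
    "walltime": "01:00:00",
    "command": "python analyze.py /data/run42",
}).json()

# Status interface: poll until the job reaches a terminal state.
while True:
    state = requests.get(f"{BASE}/status/jobs/{job['id']}", headers=HEADERS).json()
    if state["state"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(30)
print(job["id"], state["state"])
```
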

+
+
+
+ +
+
+
+ +
+
+ +
+
+

Most research projects today involve the development of some research software. Therefore, it has become more important than ever to make research software reusable to enhance transparency, prevent duplicate efforts, and ultimately increase the pace of discoveries. The Findable, Accessible, Interoperable, and Reusable (FAIR) Principles for Research Software (or FAIR4RS Principles) provide a general framework for achieving that. Just like the original FAIR Principles, the FAIR4RS Principles are designed to be aspirational and do not provide actionable instructions. To make their implementation easy, we established the FAIR Biomedical Research Software (FAIR-BioRS) guidelines, which are minimal, actionable, step-by-step guidelines that biomedical researchers can follow to make their research software compliant with the FAIR4RS Principles. While they are designed to be easy to follow, we learned that the FAIR-BioRS guidelines can still be time-consuming to implement, especially for researchers without formal software development training. They are also prone to user errors, as they require several actions with each new version release of a software package.

To address this challenge, we are developing codefair, a free and open-source GitHub app that acts as a personal assistant for making research software FAIR in line with the FAIR-BioRS guidelines. The objective of codefair is to minimize developers' time and effort in making their software FAIR so they can focus on the primary goals of their software. To use codefair, developers only need to install it from the GitHub marketplace. By leveraging tools such as Probot and the GitHub API, codefair monitors activities on the software repository and communicates with the developers via a GitHub issue "dashboard" that lists issues related to FAIR compliance (updated with each new commit). For each issue, there is a link that takes the developer to the codefair user interface (built with Nuxt, Naive UI, and Tailwind) where they can better understand the issue, address it through an intuitive interface, and automatically submit a pull request with the changes necessary to address the related issue. Currently, codefair is in the early stages of development and helps with including essential metadata elements such as a license file, a CITATION.cff metadata file, and a codemeta.json metadata file. Additional features are being added to provide support for complying with language-specific standards and best coding practices, archiving on Zenodo and Software Heritage, registering on bio.tools, and much more to cover all the requirements for making software FAIR.
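
The kind of repository metadata check that codefair automates can be sketched against the GitHub REST API. This is only an illustration of the idea (probing for a license, CITATION.cff, and codemeta.json), not codefair's actual implementation, which runs as a Probot-based GitHub App.

```python
# Probe a repository for essential FAIR metadata files via the GitHub
# contents API; a 404 means the file is missing. Real checks would be more
# lenient (e.g., LICENSE.md vs LICENSE).
import requests

def missing_fair_files(owner, repo, token=None):
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    missing = []
    for path in ("LICENSE", "CITATION.cff", "codemeta.json"):
        url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
        if requests.get(url, headers=headers).status_code == 404:
            missing.append(path)
    return missing

print(missing_fair_files("octocat", "Hello-World"))
```
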

+

+ In this poster, we will highlight the current features of codefair, discuss upcoming features, + explain + how the community can benefit from it, and also contribute to it. We believe codefair is an + essential + and impactful tool for enabling software curation at scale and turning FAIR software into reality. + The + application of codefair is not limited to just making biomedical research software FAIR as it can be + extended to other fields and also provide support for software management aspects outside of the + FAIR + Principles, such as software quality and security. We believe this work is fully aligned with the + US-RSE’24 topic of “Software engineering approaches supporting research”. The conference + participants + will benefit greatly from this poster as they will learn about a tool that can enhance their + software + development practices. We will similarly benefit as we are looking for community awareness and + contribution in the development of codefair, which is not currently supported through any funding + but is + the result of the authors aim to reduce the burden of making software FAIR on fellow developers.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

Nowadays, a lot of decision-making occurs through email communication and online meetings. A project plan might change over time, and the final deliverable might look different from what was planned during the starting phase of the project. These changes might be the result of multiple events that lead to different decisions and plans. Post-project retrospectives often require understanding the guiding factors behind these decisions, the timing of each decision, and the events that triggered them. This involves a time-based analysis of internal documents, email communications, meeting notes, and other data points pertaining to the project.

We explore a time-based retrieval approach for RAG (Retrieval Augmented Generation) to detect events from extensive textual knowledge bases. RAG enhances the accuracy and reliability of LLM (Large Language Model)-generated responses by fetching facts from external sources. Unlike traditional LLMs, which rely solely on pre-existing knowledge, RAG incorporates an information retrieval component that pulls relevant data from external sources based on the user query. This new information, combined with the LLM's training data, results in more accurate responses. A simple retrieval approach retrieves data that are most semantically similar to a given user query. Our approach retrieves data that are semantically similar and relevant to a specific time frame, performing time filtering before semantic similarity matching.

We used Qdrant for our knowledge store and the LangChain framework to implement document ingestion, prompt management, and RAG. The knowledge store included emails, internal documents, meeting transcripts, and in-depth interviews of team members. After some data cleaning, the email data was divided into weeks ("week 1" to "week 100") to track weekly changes. To track the decision changes and events, we used a chained RAG approach, wherein RAG was first applied to "week 1" and the responses were used as context for RAG with a filter for "week 2". To classify the events and to extract the entities from the events, we use Marvin. In this poster, we will present the architecture used to track events from email data by leveraging time-based retrieval for RAG.
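
A minimal sketch of the time-filtered retrieval step, using the Qdrant client directly: a payload filter restricts candidates to a single week before similarity ranking. The collection name, the `week` payload field, and the externally computed embedding are illustrative assumptions, not the project's exact configuration.

```python
# Time filtering happens in the vector store query itself, before semantic
# similarity ranking.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")

def retrieve_for_week(query_vector, week, k=5):
    """Return the k most similar chunks restricted to a single week."""
    return client.search(
        collection_name="project_emails",        # hypothetical collection
        query_vector=query_vector,               # embedding computed elsewhere
        query_filter=Filter(must=[
            FieldCondition(key="week", match=MatchValue(value=week)),
        ]),
        limit=k,
    )

# Chained use: answers derived from week 1 become context for the week 2 query, and so on.
```
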

+
+
+
+ +
+
+
+ +
+
+ +
+
+

Software developers face increasing complexity in computational models, computer architectures, and emerging workflows. In this environment, Research Software Engineers need to continually improve software practices and constantly hone their craft. To address this need, the Better Scientific Software (BSSw) Fellowship Program launched in 2018 to seed a community of like-minded individuals interested in improving all aspects of the work of software development. To this end, the BSSw Fellowship Program fosters and promotes practices, processes, and tools to improve developer productivity and software sustainability.

Our community of BSSw Fellowship alums serves as leaders, mentors, and consultants, thereby increasing the visibility of all those involved in research software production and sustainability in the pursuit of discovery. This poster will present the BSSw Fellowship (BSSwF) Program, briefly discussing our successes in developing a community around software development practices. We will highlight the work of recent fellowship awardees and honorable mentions, particularly those in attendance at US-RSE'24. As many in the BSSwF community identify as RSEs, and BSSwF projects are of particular relevance, this forum will serve to amplify the connections between our communities.

The BSSw Fellowship Program has seen much success through various projects, such as tutorials, webinars, and education materials. Fellows typically focus on creating content for scientific or minority communities that they are already a part of. Topic areas include, but are not limited to:

- Essential Collaborative Skills for Contributing to Open-Source Software
- Maintaining Multi-project Continuous Integration/Development (CI/CD)
- Guidelines for Getting Better MPI Performance
- Collaborative Learning in Scientific Software Development
- Reproducibility in Scientific Software
- A Software Gardening Almanac: Applied Guidance and Tools for Sustainable Software Development and Maintenance

Projects completed by Fellows are collected through blog articles, links, and other content curated at the BSSw.io website. In this way, their work forms a growing compendium of information easily accessible to the wider computing community.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

The Department of Energy Systems Biology Knowledgebase (KBase) explores novel machine learning and natural language processing (ML/NLP) use cases in the biological domain. Due to the scale, complexity, and non-uniformity of the data within the KBase platform, existing ML/NLP pipelines must be adjusted to meet these challenges. This work will detail several of these concerns and solutions in the context of biological research in DOE-centric scientific focus areas; these concerns are likely shared by many domains outside of biology.

While mining training data from numerous and diverse sources is a common first step in ML projects, mixed data sources can complicate the interpretation of the results. In particular, KBase has developed a model for classifying annotated metagenomes by the environment from which they were extracted using gradient-boosted decision trees (CatBoost [1]). This model was trained on the MGnify dataset, a complex data source comprising many different sources of annotations, including taxonomic labels, protein domains, and full-length functional annotations using Gene Ontology [2,3] and InterPro [4]. Deriving a meaningful interpretation of the data relies on downstream analysis after model training and evaluation, such as permutation and feature importance analysis.
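
The train-then-interpret workflow mentioned above can be sketched as follows. The random feature matrix and labels below stand in for the annotated-metagenome data; this is not the actual KBase/MGnify pipeline.

```python
# Train a gradient-boosted classifier and rank features by permutation
# importance. Data here is synthetic, for illustration only.
import numpy as np
from catboost import CatBoostClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X = np.random.rand(500, 20)            # stand-in annotation features
y = np.random.randint(0, 3, size=500)  # stand-in environment labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = CatBoostClassifier(iterations=200, verbose=False).fit(X_tr, y_tr)

result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
top = np.argsort(result.importances_mean)[::-1][:5]
print("Most informative features:", top)
```
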

+

+ Results from NLP classification tasks, including those of biological relevance, are often improved + by + augmenting input with domain-specific context (RAG5, etc.). This requirement can preclude the use of + models with small context windows. The size of the model also directs a lab’s ability to use it, as + very + large models may require prohibitively expensive pre-training or fine-tuning. In our work, we + explore + the results of using different models with varying context window sizes and their ability to improve + classification metrics for NLP models trained for genetic tool development and biomanufacturing + tasks. +

+

+ Sampling bias may occur as a result of slight variations in experimental protocols and data sources. + These issues impact a model’s ability to generalize to data on which it was not trained. In + biological + applications, like phenotype classification, these variations can impact the predictive performance + for + out-of-clade predictions by machine learning classifiers6. Addressing these effects calls for + innovative + sampling techniques that extend beyond basic random and stratified splitting. We are developing a + sampling technique based on the similarity of the most important and predictive features that use + phenotypic similarity as a proxy. +

+

+ While issues in developing a useful ML pipeline or NLP model are shared across many domains, the + exact + means of mitigating is field-dependent. KBase’s research, focusing on biological applications, is + also + subject to these issues, often requiring insight into the biological domain to guide model result + improvement.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

We propose to develop methods for certifying Machine Learning (ML) models by optimizing multiple trust and decision objectives. By combining measures of trustworthiness with domain-specific decision objectives, we aim to establish the criteria necessary to meet exacting standards for high-consequence applications of ML in support of national security priorities. In accordance with Executive Order 14110, our objective is to promote the safe, secure, and trustworthy development and use of Artificial Intelligence (AI) by delivering a generalizable process that can readily inform ML certification.

Current credibility assessments for ML are developer-focused, application-specific, and limited to uncertainty quantification (UQ) and validation, which are necessary but insufficient for certification. Whereas ML developers are primarily concerned with various measurements of model accuracy, non-ML-expert stakeholders are concerned with real-world quantities such as risk or safety; this suggests that a more holistic technical basis is needed. With multiple objectives, decisions may only be Pareto-optimal: no objective can be improved without making another worse. Designing certification processes with multi-objective design optimization allows for the balancing of requirements for specific applications.
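
To make the Pareto-optimality point concrete, the toy example below marks which of a few candidate models are non-dominated when scored on two higher-is-better objectives; the scores are invented for illustration.

```python
# Identify Pareto-optimal candidates among models scored on two objectives
# (e.g., accuracy and a robustness score), both to be maximized.
import numpy as np

scores = np.array([
    [0.95, 0.60],
    [0.90, 0.80],
    [0.85, 0.85],
    [0.80, 0.70],   # dominated by the model above it
])

def pareto_mask(points):
    """True for points not dominated by any other point (maximization)."""
    mask = np.ones(len(points), dtype=bool)
    for i, p in enumerate(points):
        others = np.delete(points, i, axis=0)
        dominated = np.any(np.all(others >= p, axis=1) & np.any(others > p, axis=1))
        mask[i] = not dominated
    return mask

print(pareto_mask(scores))  # [ True  True  True False]
```

No single candidate wins on both axes, so certification criteria must decide how to balance the objectives for a given application.
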

+

+ The absence of evidence-based, generalizable approaches to certification restricts our ability to + develop and deploy credible, mature ML models. To address this gap, we will operationalize AI + risk-frameworks, such as the National Institute of Standards and Technology’s Risk Management + Framework, + by synthesizing important Trustworthy AI capabilities such as robustness testing and anomaly + detection. + To validate and refine our approach, we will engage in collaborative case studies applying our tools + to + real-world datasets and scenarios while gathering feedback to ensure their effectiveness and + usability. +

+

+ Our primary research goal is to develop a process for designing certifications for trustworthiness + of + ML, particularly in high-consequence applications. Drawing inspiration from the multi-tiered + approach of + software testing, which encompasses unit to integration tests, our strategy involves assessing + trustworthiness throughout the ML development lifecycle and conducting system-level evaluations of + non-functional properties such as safety, fairness, and privacy. +

+

+ If successful, our research will position ML to fulfill the requirements of high-consequence + domains, as + evidenced by a measurable improvement in the reliability properties of our exemplar models.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

The Ecosystem Demography Biosphere Model (ED2) is an open-source, comprehensive terrestrial biosphere model that integrates hydrology, land-surface biophysics, vegetation dynamics, and carbon biogeochemistry. Since its inception in 2001, researchers have utilized this model to examine a variety of tropical and temperate ecosystems over multi-year periods. ED2 presents challenges to new users due to its implementation in Fortran 90 with required HDF5 libraries, necessitating specific prior knowledge. Additionally, running the model over extended periods demands substantial computational resources, complicating deployment across various computing platforms. High Performance Computing (HPC) cluster usage requires prior knowledge about clusters and a learning curve for job submission. To lower these barriers for the broader ecological community, we implemented a Singularity-based containerization of ED2 and developed a Jupyter Notebook to simplify job submissions to HPC clusters.

The containerized version of ED2 can be easily deployed on various HPC platforms. This container includes the ED2 binary and all necessary libraries, removing the need for additional installations and allowing the capability to "run everywhere." Additionally, the Jupyter notebooks enable users to seamlessly modify model configurations and run the model on both local machines and different HPC clusters. This approach simplifies the process of communicating with the HPC cluster and managing Slurm jobs. The notebook also includes features for periodically checking the status of ongoing jobs on the HPC cluster and transferring the output back to the local machine. We have created basic demo visualizations of the output, allowing users to further visualize their results using Python or R as they prefer.
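
A simplified sketch of the kind of job handling the notebooks perform: write a batch script that runs ED2 inside the Singularity container, submit it with sbatch, and poll the queue until the job leaves it. The image name, executable, namelist, and Slurm options are placeholders rather than the project's actual configuration.

```python
# Submit a containerized model run to Slurm from Python and wait for it.
import subprocess
import textwrap
import time

script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=ed2-run
    #SBATCH --time=24:00:00
    #SBATCH --ntasks=1
    singularity exec ed2.sif ed2 -f ED2IN   # hypothetical image and namelist
""")
with open("run_ed2.sbatch", "w") as f:
    f.write(script)

job_id = subprocess.run(["sbatch", "--parsable", "run_ed2.sbatch"],
                        capture_output=True, text=True, check=True).stdout.strip()

# Poll the queue until the job disappears, then outputs can be copied back.
while subprocess.run(["squeue", "-h", "-j", job_id],
                     capture_output=True, text=True).stdout.strip():
    time.sleep(60)
print(f"Job {job_id} finished")
```
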

+

+ We ran the containerized ED2 model running on the University of Illinois Campus Cluster, TEXAS + Stampede + clusters and NCSA Delta, analyzing 20 years of model simulations over multiple sites in tropical and + temperate ecosystems. We are working towards testing and adding the configuration to run on + additional + clusters (such as the NASA HEC system). We have held a workshop at the ED2 community meeting at + Harvard + in 2024 where the ED2 community showed significant interest in the combination of notebooks and + containers. +

+
+
+
+ +
+
+
+ +
+
+ +
+
+

Domain research, particularly in the life sciences, has become increasingly complex due to the diversity of types and amounts of data, concomitant with the associated analytical methods and software. Simultaneously, researchers must consider the trustworthiness of the software tools they use with the highest regard. As with any new physical laboratory technique, researchers should test and assess any software they use in the context of their planned research objectives.

As examples, bioinformatics software developers and contributors to community platforms that host a variety of domain-specific tools, such as KBase (the DOE Systems Biology Knowledgebase) and Galaxy, should design their tools with consideration for how users can assess and validate the correctness of their applications before opening them up to the community.

More attention should be placed on ensuring that computational tools offer robust platforms for comparing experimental results and data across diverse studies. Many domain tools suffer from inadequate documentation, limited extensibility, and varying degrees of accuracy in data representation. This lack of standardization in biological research, in particular, diminishes the potential for groundbreaking insights and discoveries while also complicating domain scientists' ability to experiment, compare findings, and confidently trust results across different studies.

Through several examples of tools in the biology domain, we demonstrate the issues that can arise in these types of community-built, domain-specific applications. Despite their open-source nature, we note issues related to transparency and accessibility that resulted in unexpected behaviors requiring direct engagement with developers to resolve. This experience underscores the importance of deeper openness and clarity in scientific software to ensure robustness and reliability in computational analyses.

Finally, we share several lessons learned that extend to research software in general and discuss suggestions for the community.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

Waggle/Sage cyberinfrastructure empowers researchers to deploy software-defined instrumentation at the edge, enabling real-time computation and AI inferencing for a wide range of applications. These applications include studies of wildlife populations, urban flooding and traffic flow, the impacts of climate change, wildfire prediction, and more. In this poster, we focus on some of the web-based (TypeScript) data visualization and analysis tools available for scientists and developers. These visualization components can be used for monitoring jobs, performance, and resource utilization across nodes; for the inspection and validation of data and inferences; and, finally, for ensuring the reliability and stability of hardware and software across the platform.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

Many articles and discussions regarding the growth of Research Software Engineering (RSE) have focused on the importance of the field for supporting large-scale efforts with long-term support needs. That is rightly so, and yet it is important not to overlook building up RSE in our institutions as a vital resource for smaller projects with short-term needs as well. As a cyberinfrastructure (CI) facilitator in the Cross-Institutional Research Engagement Network (CIREN; https://ciren.asu.edu), a program for training and professional development of CI facilitators, including RSEs, I worked on a facilitation and RSE project with an ASU investigator using publicly available data from TCGA (The Cancer Genome Atlas). This project required some data wrangling (to make the data readily usable) and a pilot analysis to assess the utility of using TCGA datasets to answer the particular research question at hand. While the investigator and some lab members had the requisite skills to do this work, they did not have sufficient capacity. On the other hand, there were undergraduate students interested in pursuing the research question, but they did not yet have a sufficient level of skill, having just completed an introductory undergraduate course in bioinformatic analysis in R. Thus, I, as an RSE, was able to bridge this gap and develop a pipeline that the undergraduate students could use to pursue the research question, facilitating their participation in research by lowering the barrier to entry. Additionally, and importantly for the goals of CIREN, this project also provided me, as the CI facilitator, an opportunity to build professional skills with the assistance of a mentor, such as negotiating a workplan with the primary researcher to set out clear shared expectations. This example helps to illustrate the advantages of engaging with smaller projects for developing the professional skills of RSEs, and CI professionals more broadly, as well as to highlight the value of CI professionals for supporting abundant smaller-scale research software needs on university campuses.

This work was supported by the Cross-Institutional Research Engagement Network for cyberinfrastructure (CI) facilitators (CIREN), NSF Award #2230108. I would also like to acknowledge and thank ASU Professor Melissa Wilson for giving me the opportunity to work on this project in support of her research.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

Software testing for open-source projects is a problem that is prevalent throughout the entire software development industry. This problem has been tackled by many different programs such as Jenkins, Travis CI, and Semaphore. However, these solutions are not adequate when the software tests must be performed on hardware that is attractive to attackers, sits behind restrictive networks, and is shared by numerous projects and users.

Our project, AutoTester2, aims to tackle these issues and provide a way for projects to automatically test software changes from unknown developers while ensuring the security and reliability of the targeted hardware. This is done using ephemeral GitHub self-hosted runners, custom environment containers, and Linux systemd services.
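
One building block of that approach, registering an ephemeral self-hosted runner that deregisters itself after a single job, can be sketched as below. The repository, personal access token, and runner directory are placeholders, and AutoTester2's actual systemd services and environment containers are not shown.

```python
# Fetch a short-lived registration token from the GitHub API, then configure
# and start a runner that exits and deregisters after one job.
import subprocess
import requests

OWNER, REPO = "example-org", "example-project"   # placeholder repository
PAT = "ghp_..."                                   # placeholder token with repo scope

resp = requests.post(
    f"https://api.github.com/repos/{OWNER}/{REPO}/actions/runners/registration-token",
    headers={"Authorization": f"Bearer {PAT}",
             "Accept": "application/vnd.github+json"},
)
reg_token = resp.json()["token"]

# Run inside the unpacked actions-runner directory:
subprocess.run(["./config.sh", "--url", f"https://github.com/{OWNER}/{REPO}",
                "--token", reg_token, "--ephemeral", "--unattended"], check=True)
subprocess.run(["./run.sh"], check=True)   # exits after a single job
```
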

+
+
+
+ +
+
+
+ +
+
+ +
+
+

The Journal of Open Source Software (JOSS) is an academic journal (ISSN 2475-9066) that publishes short articles describing open source software with a research application. The review process includes checking that the software itself meets some modern standards, including having documentation, tests (preferably automated), and community guidelines. This way, JOSS aims to give software creators a citable artefact through which their research contribution can be recognised, and to encourage them to use good software practice.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

To address the lack of software development and engineering training for intermediate and advanced developers of research software, we present the NSF-sponsored INnovative Training Enabled by a Research Software Engineering Community of Trainers (INTERSECT) project, which delivers such training. INTERSECT has three main goals:

1. Develop an open-source modular training framework conducive to community contribution
2. Deliver RSE-led research software engineering training targeting research software developers
3. Grow and deepen the connections within the national community of Research Software Engineers

The majority of INTERSECT's funded focus is on activities surrounding the development and delivery of higher-level, specialized research software engineering training.

In July 2023, we held our first INTERSECT-sponsored Research Software Engineering Bootcamp (https://intersect-training.org/bootcamp23/) at Princeton University. The bootcamp included 35 participants from a broad range of US-based institutions representing a diverse set of research domains. The 4.5-day bootcamp consisted of a series of stand-alone, hands-on training modules. We designed the modules to be related, but not to rely on successful completion or understanding of previous modules. The primary goal of this design was to allow others to use the modules as needed (either as instructors or as self-guided learners) without having to participate in the entire bootcamp.

The topics covered in the bootcamp modules were: Software Design, Packaging and Distribution, Working Collaboratively, Collaborative Git, Issue Tracking, Making Good Pull Requests, Documentation, Project Management, Licensing, Code Review & Pair Programming, Software Testing, and Continuous Integration/Continuous Deployment.

We are organizing a second INTERSECT bootcamp in July 2024. We expect to again have approximately 35 attendees from a wide range of institutions covering a diverse set of research areas. Because the format and content of the first bootcamp were well received, we plan to follow a very similar format for the second workshop.

In this poster we will provide an overview of the INTERSECT project and more details on the content of the bootcamp. We will discuss outcomes of both editions of the bootcamp, including curriculum specifics, lessons learned, participant survey results, and long-term objectives. We will also describe how people can get involved as contributors or participants.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

Research software today plays an integral role in the practice of science, and as the complexity and wide-ranging impact of that software continue to grow, so too do concerns about software security. In industry, much attention has been paid to the concept of "shifting security left," that is, considering security earlier in the software development life cycle (SDLC) rather than at the end. Developers, RSEs included, are non-experts in security, and a major aim of shifting security left is to introduce security-enhancing practices that fit into ordinary developers' workflows. Threat modeling, for example, is a security activity that identifies weaknesses and vulnerabilities during software design and serves as the foundation for the security life cycle. Compared to more complex security activities, threat modeling can be performed by non-security experts if they are given the proper guidance and support.

In the RSE space, security resources are limited and typically focused on more complex security issues. With the proper resources, we propose that RSEs can perform threat modeling with minimal extra effort, thus alleviating pressure on existing security resources and increasing the overall security posture of the team and organization. This poster presents the findings of a rapid literature review on the applicability of threat modeling techniques during the software development lifecycle.

We performed a review of the relevant evidence, combining gray and peer-reviewed sources and focusing on 22 high-quality works. Given that RSEs are underrepresented in the software engineering literature, we analyzed available evidence on threat modeling in conventional software development and made a preliminary assessment of the transferability of those findings to RSE contexts. In our review, we answered the following research questions:
1. How is threat modeling incorporated into the development process?
2. How is the security impact of threat modeling during the software development process measured?
3. What is the definition of threat modeling when performed during the software development process?
4. What challenges do software developers face when threat modeling during the software development process?
5. How effective is threat modeling when performed during software development by non-experts in security?

The findings of this review will be used to develop a quick reference guide and educational workshop for RSEs to support their threat modeling efforts. This poster allows RSEs to learn more about threat modeling and our work. We invite participants to provide feedback to guide our future work.

    +
+
+
+ +
+
+
+ +
+
+ +
+
+

The Scientific Software Engineering Center (SSEC) at the University of Washington's eScience Institute works with researchers across various disciplines to build robust software that bolsters inquiry and builds community. The resulting tools are open source, maintainable, and reusable, designed to be sustainable and to lead to scientific breakthroughs.

The poster we propose to present would highlight not just our model of engaging with the research community as software engineers, but also several of our successfully graduated projects, spanning diverse research domains from neurobiology to seismology to conservation efforts and beyond.

By demonstrating prior successes and discussing how we achieve them, we hope to spark conversations about our approach to supporting researchers and show an interesting approach to research software engineering.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

The U.S. Department of Energy Office of Advanced Scientific Computing Research (ASCR) has launched a new initiative called the Next-Generation Scientific Software Technologies (NGSST) program for scientific software stewardship, which is the first of its kind for the office. Under this initiative, several software stewardship organizations (SSOs) have been funded, and a couple are waiting in the wings. These include COLABS, CORSA, PESO, S4PST, STEP, and SWAS, together with the FASTMath and RAPIDS SciDAC Institutes. The SSOs are in the process of standing up a "Consortium for the Advancement of Scientific Software" (CASS) with the mission of stewarding and advancing the current and future ecosystem of scientific computing software, including products developed or enhanced as part of the Exascale Computing Project (ECP) Software Technologies.

One of the SSOs in CASS, COLABS, is tasked with activities of direct interest to US-RSE, such as building a community of practice for RSEs engaged in software-related activities at the national laboratories. Additional activities of common interest include training and workforce development. These activities include determining the skills needed to create and maintain high-quality software that is well tested and carefully stewarded throughout its lifecycle; curating available training resources and creating new resources where none exist; and understanding and formulating pathways for training that not only prepare the workforce for the task at hand but also expand the recruitment pool by enabling training modules to fill skill gaps where needed.

In this poster, we will explain the motivation and vision for NGSST and CASS, and also the role that CASS would play in the broader scientific software community. We will also describe our vision for how CASS in general, and COLABS in particular, will conduct activities related to their mission, and how they will engage with US-RSE in these activities.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

Research software requires flexible and modular architectures to accommodate rapid evolution and interdisciplinary collaboration. Software architectures are fundamental to the development of technically sustainable software systems, as they are the primary carrier of architecturally significant requirements, such as extensibility and maintainability, and influence how developers are able to understand, analyse, and test a software system. Hence, research software engineers should focus on architectural metrics to evaluate and improve their code. Architectural metrics in research software ensure the software's scalability, performance, maintainability, and overall quality, facilitating reproducible and reliable research outcomes. We already have many metrics and tools to measure and improve the quality of software architecture. However, we hypothesize that research software is often built using limited resources and without a long-term vision for maintainability. This leads to a range of rot symptoms, including software rigidity, fragility, immobility, and viscosity, which in turn result in high maintenance and evolution costs, the foundation of software decay and death in any software investment.

In a recent Dagstuhl seminar, we, the authors (consisting of software engineers, software engineering researchers, and research software engineers), discussed key software metrics such as code smells, duplication, test coverage, and cyclomatic complexity, and how these can be used to improve research software. We explored our hypothesis by applying existing static analysis tools to a few open-source research software repositories. We discovered high cyclomatic complexity, large God classes, cyclic dependencies, improper inheritance, modularity violations, error-prone and change-prone propagation issues, and low test coverage, which confirmed non-trivial maintenance and evolution costs. We concluded that we should explore this idea further and that there may be an opportunity to build better tools and techniques to help research software engineers improve the architecture of their software, which in turn can improve its quality.
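
As a small illustration of this kind of measurement, the sketch below ranks the functions of a Python codebase by cyclomatic complexity using the radon library. The repository path is a placeholder, and radon is used here only as one example of a static analysis tool, not necessarily one used in the seminar.

```python
# Rank functions/methods in a repository by cyclomatic complexity.
from pathlib import Path

from radon.complexity import cc_visit

repo = Path("path/to/research-software")    # placeholder checkout

results = []
for source in repo.rglob("*.py"):
    try:
        blocks = cc_visit(source.read_text(encoding="utf-8"))
    except SyntaxError:
        continue                            # skip files that do not parse
    for block in blocks:
        results.append((block.complexity, f"{source}:{block.name}"))

# The most complex blocks are natural refactoring candidates.
for complexity, name in sorted(results, reverse=True)[:10]:
    print(f"{complexity:3d}  {name}")
```
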

+

+ In this poster presentation, we will demonstrate the critical role of software architecture in + ensuring software quality. We will explore its importance for research software and how it can + significantly enhance the quality of these systems. Research software can become more + maintainable, scalable, and reliable by adopting software architecture principles. We will + illustrate the tangible benefits and practical applications of adopting robust software architecture + principles through detailed example analyses of real research software projects. Attendees will + gain insights into best practices and strategies for integrating architectural frameworks into their + research software development processes. +

+

+ Our goal is to spread appropriate software architecture knowledge among research software + engineers, highlighting its transformative impact on research outcomes and the sustainability of + software projects. By adopting these principles, research teams can ensure their software is + better equipped to handle future challenges, fostering innovation and collaboration in the + research software community.

+
+
+
+ +
+
+
+ +
+
+ +
+
+

For many people who identify themselves as one, the term "Research Software Engineer" (RSE) often feels, almost paradoxically, both immediately recognizable and hard to define at the same time. Most if not all of the common definitions share a sense of RSEs as existing "somewhere in the middle" between research and software engineering. If we imagine a one-dimensional axis with "pure researcher" on one end and "pure software engineer" on the other, classifying any of the countless RSE flavors [*] would only be a matter of finding the right set of points along this axis. However, while this model is generally successful in representing the core aspect of the RSE role, its limitations become apparent when examined more closely. How should such "RSE coordinates" be chosen? Can a quantitative, or at least clearly defined, set of categories be defined at all? Are "researcher" and "software engineer" really so distinct and unambiguously mutually exclusive that they can be used as opposite ends of a spectrum? Can the complexity of the RSE space be captured to a useful degree by a single dimension? When does one cross over from one polar end of the spectrum to being an RSE? Can one exist in multiple dimensions?

We believe that, beyond the abstract intellectual curiosity driving the desire for more effective RSE taxonomies, these limitations also lead to concrete challenges for the inclusiveness of our community, as people who don't see themselves as fitting neatly within these restrictive boundaries might interpret the existing definitions (consciously or subconsciously) as a sign that they don't belong in it. Our poster will be centered around exploring the limitations of the "research vs software engineering" spectrum described above, and what possible alternatives might be used instead, building on a multi-year ongoing interest in this question, including participation in a recent Dagstuhl seminar bringing together RSEs and software engineering researchers (SERs) [1], as well as personal experiences from ourselves, collaborators, and professional networks.

[*] RSE flavors: There exist many flavors of interest tangent to RSEs that are either actively involved in the community or should be, such as: software engineering, research in software engineering, RSE research, training & education, funding, community building, and user experience. Simultaneously, there are many flavors within RSEs, such as team size, team dynamics (single RSE to many RSEs), job titles, phase of software developed (prototype to production), and research domain (fixed or variable).

+
+
+
+
diff --git a/pages/program/abstracts/posters.tbd b/pages/program/abstracts/posters.tbd deleted file mode 100644 index 516bdb8..0000000 --- a/pages/program/abstracts/posters.tbd +++ /dev/null @@ -1,324 +0,0 @@ ---- -layout: page -title: Posters -description: -menubar: program -permalink: program/posters/ -menubar_toc: true -set_last_modified: true ---- - -
The aim of this project is to modify the Python library snnTorch to account for the adjacency of nearby neurons when firing. Once adjacency is determined, the weights of neurons are adjusted accordingly. Our main objective was to test the theory that taking neuron adjacency into account will have a positive effect on overfitting, a common problem when designing neural networks. In order to complete this project, we did research into Spiking Neural Networks and how we could utilize the concepts of fractal growth and Cayley trees to simulate natural neuron growth. We utilized common Python libraries such as NumPy, Matplotlib, and snnTorch, along with the standard math, itertools, and sys modules, and we used the CIFAR-10 and MNIST datasets to train our neural network. With these factors in place, tests were performed using the newly modified code to examine the accuracy and level of overfitting present in the new program. The results showed that our changes did not hurt accuracy and yielded a minor improvement with regard to overfitting, although these results will require further testing in the future. This project presents a way to create neural networks that has a positive effect on overfitting without sacrificing accuracy during training.
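
A heavily simplified sketch of the idea, scaling a layer's weights by a neuron-adjacency mask in snnTorch: the index-distance adjacency used here is only a stand-in for the Cayley-tree / fractal-growth structure explored in the project.

```python
# Adjust hidden-to-hidden weights of a small spiking network by an
# adjacency mask. The mask below is a toy stand-in, not the project's
# Cayley-tree construction.
import torch
import torch.nn as nn
import snntorch as snn

n_in, n_hidden = 784, 100

# Toy adjacency: connections between "distant" hidden neurons are damped.
idx = torch.arange(n_hidden).unsqueeze(1).float()
adjacency = torch.exp(-torch.abs(idx - idx.T) / 10.0)   # (n_hidden, n_hidden)

fc1, fc2 = nn.Linear(n_in, n_hidden), nn.Linear(n_hidden, n_hidden)
lif1, lif2 = snn.Leaky(beta=0.9), snn.Leaky(beta=0.9)

with torch.no_grad():
    fc2.weight *= adjacency          # scale weights by neuron adjacency

x = torch.rand(32, n_in)             # stand-in for an MNIST batch
mem1, mem2 = lif1.init_leaky(), lif2.init_leaky()
spk1, mem1 = lif1(fc1(x), mem1)
spk2, mem2 = lif2(fc2(spk1), mem2)
print(spk2.shape)                    # torch.Size([32, 100])
```
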

-
-
-
-
-
-
- -
-
-
-
-

Geostreams is a comprehensive platform developed by the National Center for Supercomputing Applications, designed to facilitate open-source data management and visualizations for geospatial data. The framework enables the conversion of heterogeneous data, particularly time-series geospatial data, into a flexible schema with an online dashboard for interactive data visualizations, allowing end users to effectively engage with the data.

The Geostreams framework comprises three primary components. First, the underlying database is built on PostgreSQL. Leveraging the capabilities of PostGIS and the PostgreSQL JSON binary (JSONB) data type, this component efficiently handles the storage and management of geospatial and temporal data by structuring data sources, parameters, and time-series data streams into JSON documents that can be queried directly or in aggregate (such as weekly totals). The second component, the Geostreams API, is a RESTful service written in Scala that supports automated ingest and aggregation of time-series geospatial data. For wider reach, Pygeostreams, a Python wrapper library, is also available to interface with the API. The final component is the Geodashboard, a rich-client web platform developed in React.js and leveraging D3.js and Vega.js, dedicated to visualizing and downloading parsed data, allowing users to gain insights and facilitate further analysis. The dashboard is composed of many modular components that can be enabled and configured depending on the type and scale of data being displayed.
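
A sketch of the kind of JSONB-backed aggregation this design enables, computing weekly averages of one parameter from datapoints stored as JSON documents. The table, columns, parameter name, and connection details are illustrative assumptions, not the actual Geostreams schema.

```python
# Weekly aggregation over time-series values stored in a JSONB column.
import psycopg2

conn = psycopg2.connect("dbname=geostreams user=geostreams")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT date_trunc('week', start_time) AS week,
               avg((data ->> 'nitrate')::float) AS nitrate_avg
        FROM datapoints
        WHERE stream_id = %s
        GROUP BY week
        ORDER BY week;
        """,
        (42,),
    )
    for week, nitrate_avg in cur.fetchall():
        print(week, nitrate_avg)
```
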

-

To enhance the framework’s capabilities, Geostreams seamlessly integrates with Clowder, a customizable and scalable data management framework designed for long-tail data. This allows Geostreams to provide long-term archiving and efficient raw file management.To ensure portability and ease of deployment, each component is containerized using Docker and can be quickly built with Docker Compose.

-

The effectiveness and versatility of the Geostreams framework have been validated through its successful implementation in projects such as the Great Lakes to Gulf, Great Lakes Monitoring, ARPA-E Smartfarm, and NSF Critical Interface Network. The framework provided a solid foundation that could be easily customized to meet the specific needs of each project, reducing manual effort and saving valuable time. Future plans for Geostreams, include migrating the API services to Django for seamless integration with advanced extensions like timescaleDB, enabling real-time data aggregation. Additionally, improvements to the data model are envisioned to enhance efficiency.

-

The Geostreams framework serves as an excellent example of a well-structured, modular, and extensible system that can be applied to other projects and domains apart from geospatial. The patterns and principles of containerization, scalability, and seamless integration with other frameworks foster an efficient and adaptable approach to software development enabling RSEs to minimize effort without compromising on quality.

-
-
-
-
-
-
- -
-
-
-
-

Over the last 40 years, network analysis has emerged as a prominent approach to data-intensive research. Despite this steady growth and investment, network analysis remains something of a niche specialty that can exclude novice users, who usually only receive standard statistical training. Moreover, since much of the growth in network science has tended to be field-specific, tools and formats have developed independently across disciplines. The multiplicity of formats and the sensitivity of social data make existing records difficult to share across scholars in the field, limiting the opportunity for new findings on the already accumulated body of network data. IDEANet (Integrating Data Exchange and Analysis for Networks) aims to maximize scientific discovery in human network science by significantly lowering the analytic and access barriers to entry for researchers. IDEANet is supported by the National Science Foundation as part of the Human Networks and Data Science Infrastructure program (BCS-2024271 and BCS-2140024).

IDEANet features three key components: (1) a suite of analysis tools developed in R that automatically generate standardized network analytic measures; (2) a GUI (graphical user interface) that gives access to the aforementioned measurements through an easy-to-use, menu-based program; and (3) a secure data repository that routinizes the capacity for archiving and accessing network data, including sensitive data.

The analysis tools are distributed as a package and built with real-world data constraints in mind to allow novice users the ability to gain substantive results as efficiently (but still accurately) as possible. Core metrics comprise 17 node-level measurements (e.g., degree, centralities, reachability) in addition to 27 system-level metrics (e.g., network size, dyad census, transitivity). Additional modules include multiple regression QAP, multi-relational blockmodeling, and a community-detection routine that partitions the network based on 10 commonly used methods and evaluates their concordance using CHAMP. Further modules are in development, including meta-population disease simulation and dynamic network diffusion simulation.

The secure data repository is hosted on Dataverse in collaboration with Duke University Library. Researchers are often interested in sharing their data but can be limited by strict institutional requirements. Our repository facilitates this transition by offering three levels of data security: open access, secure non-restricted, and secure use-restricted. Secure non-restricted data include some level of confidentiality such that investigators require IRB approval for access. Secure use-restricted data require both IRB approval for use and further substantive limitations required by the data owner. To accommodate the diversity in requirements, IDEANet makes use of the imPACT architecture, a "notary service" that seamlessly matches user certification and data access requirements.

The difficult learning curve involved in learning network tools means that researchers with substantive interests in network processes, but who are not specially trained in network methods, must either invest heavily in training or risk serious analytic errors. The goal of IDEANet is to provide an integrated network data analysis framework within R that capitalizes on the best of current tools while building robust safeguards against common data and analytic errors.

-
-
-
-
-
-
- -
-
- -
-
-

Recently, the US-DOE Office of Advanced Scientific Computing Research (ASCR) funded several seedling projects to conceptualize software sustainability organizations (SSOs). One of them, the Collaboration for Better Software (for Science), or COLABS, has research software engineering as its centerpiece. It aims to provide a wide range of services to client software projects and the broader community in partnership with ASCR's user facilities. These services include essential and advanced software engineering services, and place a strong emphasis on workforce development and retention by providing long-term stability, training, and support to enable and encourage RSEs and other staff to build their careers and excel in this role. With this poster, we will engage with the RSE community to get their input, refine our objectives for the SSO, and determine how these objectives can be met.

COLABS is envisioned as a multi-institutional distributed organization that will initially be anchored by three DOE national laboratories (ANL, LBNL, and ORNL), but can expand to include a wide variety of institutions, including universities, industry, and non-profit foundations. In addition to providing services directly, COLABS RSEs will also become ambassadors for changing the perception of RSE roles in scientific computing.

-
-
-
-
-
-
- -
-
-
-
-

Randomized controlled trials (RCTs) are considered the highest level of evidence for establishing causal associations in clinical research. However, problems with the design, execution, or reporting of the trial process can lead to unreliable findings, excessive costs, and, potentially, harm to patients. Clinical trials often suffer from poor methodological and reporting quality (also known as rigor and transparency, respectively). Two reporting guidelines, CONSORT (Consolidated Standards of Reporting Trials) and SPIRIT (Standard Protocol Items: Recommendations for Interventional Trials), have been designed to promote complete and clear reporting in RCT publications (results publications and protocols, respectively). Using these guidelines, the validity and applicability of RCT findings can be better assessed. Although endorsed by many high-impact medical journals, adherence to these guidelines remains suboptimal, possibly because journals lack methods for enforcement and verification, which involve a substantial amount of journal staff or editorial time.

RCTCheck uses natural language processing techniques and data management software (Clowder) to analyze RCT manuscripts and identify information related to rigor and transparency as defined in these guidelines. RCTCheck analyzes user-uploaded manuscripts, identifies sections and sentences, and, using a Transformer-based deep learning model (PubMedBERT), classifies sentences into individual items in the CONSORT and SPIRIT checklists and generates a report on the transparency of the manuscript. This report can assist authors in checking the completeness of their reporting, journals in maintaining high reporting standards, and other stakeholders of clinical research in critically assessing the quality of clinical trials, synthesizing evidence, and promoting open science practices, leading to better clinical care and treatments.
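
The sentence-level checklist classification step can be sketched with the Hugging Face transformers pipeline. The model checkpoint name below is hypothetical, standing in for a PubMedBERT model fine-tuned on CONSORT/SPIRIT item labels; it is not RCTCheck's published model.

```python
# Classify manuscript sentences into reporting-checklist items with a
# fine-tuned sequence classifier (checkpoint name is a placeholder).
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="example-org/pubmedbert-consort-items",   # hypothetical checkpoint
)

sentences = [
    "Participants were randomly assigned (1:1) using a computer-generated sequence.",
    "The primary outcome was change in systolic blood pressure at 12 weeks.",
]
for sentence, prediction in zip(sentences, classifier(sentences)):
    print(f"{prediction['label']:>12}  {prediction['score']:.2f}  {sentence}")
```
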

- -

The user management functionality in Clowder provides different types of access for researchers looking to use the labeled and processed clinical manuscript data and authors looking to publish their work by ensuring that the manuscripts align with the guidelines. - An author will be able to upload a manuscript to RCTCheck and download a report of its critical appraisal. - A researcher is able to process a larger number of published reports and manuscripts and conduct large-scale analyses on reporting quality. - They will also be able to provide feedback on model predictions, which can potentially improve the deep learning models.

- -

The system leverages the Clowder data management framework to provide a backend for the execution of PubMedBERT and pre- and post- processing steps, including PDF extraction and visualizations of results. - A special purpose web client provides a specialized user interface for authors and publishers. - We present the overall architecture of the system and the current implementation with emphasis on leveraging existing generic cyberinfrastructure and extending it for specific use cases.

-
-
-
-
-
-
- -
-
-
-
-

Significant efforts have been placed on creating big data management architectures, but there is still a large middle ground between these projects of the largest scale and a small data project that requires little data management. This middle ground, depending upon the hardware utilized, may be somewhere in the range of millions to tens of millions of records in scale. This scale is just big enough to create headaches if that data isn’t handled and managed efficiently. Additionally, projects of this scale may not have the resources to support the tools and people required to manage a system intended for data at a larger scale. The data lakehouse architecture presented here aims to fill this niche by delivering a solution that is performant at this scale and is built with open source technology. As a case study of this architecture, we showcase what we developed to manage data for MOSSAIC (Modeling Outcomes using Surveillance data and Scalable AI for Cancer), a joint natural language processing and deep learning project between the Department of Energy (DOE) and the National Cancer Institute (NCI). The research element of the project is focused on (1) developing large-scale, state-of-the-art foundation models for clinical information extraction, (2) building new capabilities for biomarker and recurrence identification and detection, (3) pushing novel research in uncertainty quantification, explainability, and safe deployment, so that production AI models can be effectively and reliably deployed in real time at the population level, (4) expanding the implementation of these tools in real-world cancer registry settings as well as other clinical settings such as health care facilities and clinical laboratories, and (5) enabling large-scale foundation model training on DOE Leadership Computing Facility supercomputers. Project data is sourced from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Data Management System (SEER*DMS), which acts as a central registry of cancer data consolidating information from individual state cancer registries. A research endeavor of this nature creates many requirements and constraints. For example, we are working with highly sensitive, personally identifiable cancer patient data that, according to our data use agreements (DUAs), required us to build a system that operates in a network-disconnected, air-gapped environment. Additionally, we needed to provide the capability to limit access to segments of some data when needed. Finally, the solution needed to be easily shareable with external healthcare institutions where there may be budget, hardware, or data management expertise constraints. The solution presented is a data lake design that can be set up on any file system, as well as any object storage system that utilizes the S3 API, to store data from any type of source, and is loosely coupled with an in-process, serverless analytical database management system that can both catalog the files in the data lake and query them, either through a no-code GUI SQL editor utilizing a JDBC driver or through code with a number of language-specific client APIs, including Python. The open-source nature of the project offers solutions for the science community, and shared practices for processing and storing data may lead to easier data harmonization and increased reproducibility. This architecture ultimately gives mid-sized data projects a direction for a system that will scale, offers excellent query performance, is resource efficient, and is flexible for future technology changes.
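A minimal sketch of the access pattern described above follows, assuming DuckDB as the in-process, serverless analytical engine and Parquet files in an S3-compatible object store; neither tool nor the bucket layout is named in the abstract, and credential configuration is omitted.

```python
# Assumed stack: DuckDB over Parquet files in an S3-compatible object store.
import duckdb

con = duckdb.connect()              # in-process, no database server to manage
con.execute("INSTALL httpfs;")      # one-time: adds s3:// path support
con.execute("LOAD httpfs;")

# Catalog part of the lake as a view, then query it with plain SQL.
con.execute("""
    CREATE VIEW reports AS
    SELECT * FROM read_parquet('s3://example-bucket/registry/reports/*.parquet')
""")
df = con.execute("""
    SELECT registry_id, COUNT(*) AS n_reports
    FROM reports
    GROUP BY registry_id
    ORDER BY n_reports DESC
""").fetchdf()                      # pandas DataFrame for downstream analysis
print(df.head())
```

The same queries run unchanged against a directory on a local or network file system, which is one reason this style of lake is easy to hand to institutions without object storage or dedicated data management staff.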

---

In the realm of advanced research, diverse data sources and complex computational workflows pose significant challenges to achieving centralized data management, efficient computation, and reproducibility. This poster proposes a pipeline that brings solutions to these problems together in one centralized workflow.


The pipeline comprises two components, an input preparation pipeline and a computation pipeline, both designed to address the obstacles above. The input preparation pipeline harmonizes data collection from a multitude of platforms, including XNAT, OneDrive, and other cloud storage services. The computation pipeline runs on SLURM-managed HPC clusters and uses Singularity containers, providing not only performance and scalability but also reproducibility.


Furthermore, the poster uses BRILLIANCE (Brain tailoRed stImulation protocoL for acceLerated medIcal performance) as a use case to establish the pipeline's utility, relevance, and significance in advanced research. The use case integrates BIP (BRILLIANCE Input Preparer) for comprehensive data preparation and uses Singularity containers and the computational resources of ACCRE (the Advanced Computing Center for Research and Education at Vanderbilt University) to obtain scalability, improved performance, and reproducibility.


By adopting this pipeline, researchers and practitioners can integrate their data and compute seamlessly. Furthermore, because the computational units are containerized, reproducibility is straightforward to ensure. This poster aims to empower researchers to realize their full potential and to facilitate advances in their respective fields.
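As a hypothetical illustration of how a computation pipeline of this kind might dispatch one containerized step, the sketch below writes a batch script and submits it to SLURM; the image name, paths, and resource requests are placeholders rather than details of the BRILLIANCE setup.

```python
# Hypothetical dispatch step; image name, script path, and resources are placeholders.
import subprocess
from pathlib import Path

def submit_containerized_step(subject_id: str, image: str = "pipeline.sif") -> str:
    """Write a SLURM batch script that runs one step inside a Singularity
    container and submit it with sbatch, returning sbatch's output."""
    script = Path(f"run_{subject_id}.sh")
    script.write_text(
        "#!/bin/bash\n"
        "#SBATCH --job-name=pipeline_step\n"
        "#SBATCH --time=02:00:00\n"
        "#SBATCH --mem=16G\n"
        f"singularity exec {image} python /opt/pipeline/process.py --subject {subject_id}\n"
    )
    result = subprocess.run(["sbatch", str(script)],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print(submit_containerized_step("sub-001"))  # e.g. "Submitted batch job 123456"
```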

---

This poster presents the findings of research on the prevalence of research software as academic research output within international institutional repositories (IRs), often termed Research Information Systems (RIS). These platforms contain permanent metadata records of research output from the university. While these exist mainly to meet funders' open access requirements, they also serve to replace the old homepage of individual academics and to provide metadata on their contents for services that aggregate harvested content, thus increasing the FAIRness of the artifacts. Expanding on work conducted on UK-only repositories by using source data from OpenDOAR, a directory of global Open Access Repositories, similar analyses were applied to international IRs in what we believe is the first such census of its kind. 4,970 repositories from 125 countries were examined for the presence of software, along with repository-based metadata for potentially correlating factors. It appears that much more could be done to provide trivial technical updates to RIS platforms to recognise software as distinct and recordable research output in its own right. This poster will present the main results and the software approach used to examine such a large quantity of IRs, allowing future work to pivot on the datasets found.
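The poster does not describe its harvesting method in detail; one plausible way to probe a single repository for software records is through OAI-PMH, an interface most IR platforms expose. The endpoint below and the dc:type heuristic are assumptions for illustration.

```python
# One plausible probe via OAI-PMH; the endpoint and dc:type heuristic are assumptions.
import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def count_software_records(base_url: str, max_pages: int = 1) -> int:
    """Count records whose dc:type mentions 'software' in the first page(s)
    of an OAI-PMH ListRecords response."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    hits = 0
    for _ in range(max_pages):
        resp = requests.get(base_url, params=params, timeout=30)
        resp.raise_for_status()
        root = ET.fromstring(resp.content)
        hits += sum(
            1 for t in root.iter(DC + "type")
            if t.text and "software" in t.text.lower()
        )
        token = root.find(f".//{OAI}resumptionToken")
        if token is None or not token.text:
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text}
    return hits

print(count_software_records("https://repository.example.edu/oai/request"))
```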

---

Human cognition is richly dynamic. Examining this quantitatively requires tasks that pose evolving and context-dependent problems to participants. As most behavioral testing is moving online, it is necessary to develop measurement tools that permit interactive computations. We present a new methodology which allows for flexible and sophisticated forms of dynamic task evolution and offers an attractive experience for participants and developers alike. Utilizing strengths from both visual, interactive languages (JavaScript) and sophisticated analytic languages (R, Python), we implement a client-server architecture. Here, all computationally intense operations for an online neurocognitive task run in a cloud-based server rather than in a browser. The server receives all data from a participant’s behavior in the task, allowing later stages of the task to be updated dynamically to pose appropriate problems to the participant. The Application Programming Interface (API) to interact with the server uses a customizable R script to process data received from the participant, allowing any specified computations to be performed before returning data to the browser-based JavaScript task, facilitating tight control over the state of the task. This methodology is intended to minimize the limitations of visual programming while retaining interactive and aesthetically pleasing task presentation. We suggest that this offers a unique solution to cognitive testing online and in the lab. A containerized implementation of this methodology is open-source and available on GitHub to minimize all effort in setting up the task: https://github.com/Brain-Development-and-Disorders-Lab/task_template_dynamic
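The system described here uses a customizable R script behind its API; purely to illustrate the request/response shape, the Python sketch below accepts a participant's trial history and returns parameters for the next trial. The endpoint name and the staircase rule are invented for this example.

```python
# Illustration only: the actual server-side logic is an R script; this mimics the shape.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/update", methods=["POST"])
def update_task():
    """Receive the participant's trial history and return next-trial parameters."""
    trials = (request.get_json(silent=True) or {}).get("trials", [])
    accuracy = sum(t.get("correct", False) for t in trials) / len(trials) if trials else 0.5
    # Toy staircase: make the next trial harder when recent accuracy is high.
    next_difficulty = min(1.0, max(0.0, 0.5 + (accuracy - 0.5)))
    return jsonify({"difficulty": next_difficulty, "n_trials_seen": len(trials)})

if __name__ == "__main__":
    app.run(port=8080)  # the browser-based JavaScript task would POST to /update
```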

---

Jupyter Notebooks are open-source tools researchers commonly use to develop workflows and other software. Researchers and RSEs alike are most likely familiar with the Classic Notebook interface, the original web application for creating and sharing notebooks, but there are several other coding environments to choose from. An Integrated Development Environment (IDE) is a software application that provides helpful features beyond traditional source code editors, such as debuggers, for developing software. However, IDEs such as VSCode can present a barrier to entry for researchers familiar with other tools. JupyterLab, an alternative developed by Project Jupyter, is an extensible development environment for notebooks that comes with many IDE-like features, including a debugger and tab completion. Additionally, the community maintains many other helpful extensions that do not ship with the default environment. Our JupyterIDE project collects and curates useful extensions and provides notebook-based tutorials for how to use them. These tutorials include notebooks on Vim keybindings, which make cell manipulation faster and easier, and on Language Server Protocol support, which provides code auto-completion and linting features. Tools like these can make JupyterLab an ideal environment for developing research workflows: seasoned RSEs who are accustomed to IDE features can use it in collaboration with researchers who may not want to invest time in learning a new tool. JupyterIDE makes these tools more accessible for users and promotes software engineering best practices in a research environment.

---

Software plays a key role in the scientific discovery process. The user experience and sustainability of scientific software are critical to ensure the production of knowledge and scientific progress. Today, scientific software programs and projects often do not have the methods, processes, or best practices that are necessary to ensure high-quality, usable software. Knowledge from commercial software cannot be directly applied to scientific software due to differences in resource allocation, organizational structures, target audience, and scientific goals. To understand and bridge these gaps, our project, Scientific software Research for User experience, Design, Engagement and Learning (STRUDEL), is developing a typology of scientific software work and a design framework to understand and support the scientific software process, including user interface design and development. We believe that this typology and design framework are necessary for research software engineering (RSE) practice to develop usable and sustainable software.


A sociologically informed typology helps break down diverse scientific projects for analysis and comparison by stakeholders. The STRUDEL typology is designed to guide scientific users who have questions about what types of user experience and software sustainability work to invest in, as well as when to do so. Our aim is for this typology to be a strategic thinking partner to guide project leaders, funding officers, domain experts, software developers, and so on when thinking about their product’s needs for UX, software sustainability, and overall strategy. It unpacks connections between organizational (project structures & funding), social (roles of people & structure of teams), and technical issues (technology stacks, target users, etc.) that shape scientific software to help users answer key strategic questions.


The STRUDEL design framework provides fundamentals and guidelines along with standard components and generalized UI flows for accomplishing specific tasks in scientific user interfaces that can be reused and customized. This science design framework will enable science teams to design and implement more usable and effective interfaces to address these unique challenges.


Overall, the STRUDEL project aims to bolster scientific software development efforts by improving the user experience, software quality, and software sustainability. In this poster, we will discuss our work and its broader applicability to RSE practice.

---

Heterogeneous data is all around us; it comes with many variations and types, incomplete details, and sometimes inaccurate information. This data can take different forms in large academic communities like universities, including information about courses, health and wellness activities, research, events, user groups, transportation, food and dining, buildings and rooms, and facilities. Accessing and processing heterogeneous data comes with challenges, including finding data and relevant metadata, varied data formats, systems, and communication protocols for accessing data, authentication and authorization methods, and incomplete and sometimes incorrect or obsolete information. Understanding and processing heterogeneous data is challenging, but it presents numerous opportunities for gaining deeper insights about the community and enabling data-driven decision-making, potentially leading to a better experience for all university participants (students, faculty, staff, alums, and other community members).

Rokwire is an open-source platform for developing mobile applications intended to empower smart, healthy communities. It envisions integrating and processing a wide range of information and providing access through a mobile application that is personalized and privacy-aware. The Rokwire platform includes core functionalities, or Building Blocks, that communicate with different systems to process raw data and make it available in a format multiple client applications can consume and deliver to their users. We briefly discuss the Rokwire platform and its capabilities around heterogeneous data processing within a large academic community, focusing on two of its functionalities: events data processing and managing software contributions to the platform. For each of these functionalities, we discuss its key characteristics, how it aids in reducing the barrier to heterogeneous data processing and increasing data access within academic communities, the current implementation, and future work.

Events data come from different sources, with similarities and variations in the data format and content. We architected and developed an Events Manager web application that lets users create and manage events and process event data from different sources. The backend uses an Events Building Block web service module that standardizes event data and stores it in a database. When working with heterogeneous data, one might encounter data that cannot be directly integrated or shared for different reasons, like legacy file formats or data sharing restrictions. For data that cannot be directly shared with the platform, we have developed a Contributions Catalog web application supported by a Contributions Building Block web service module. External collaborators willing to contribute and integrate software modules that process such data with the Rokwire platform can use this application to share details of those modules, including purpose, data needs, and protocols for data use and removal. After a thorough and successful review, such third-party applications can be integrated with the Rokwire platform and made available to users.

In future work on event data processing, we will continue adding new capabilities, such as enhancing event data (e.g., finding more accurate event locations when the data is incomplete) and improving its usage across the platform. In managing software contributions to the platform, we plan to provide enhanced review and publishing capabilities, including support for deployment. We also discuss our ongoing work of migrating these modules to a newer technology stack. We conclude by briefly discussing the collaboration of Research Software Engineers (RSEs) in architecting and developing the Rokwire platform and the value added by the platform for the university community.
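A hypothetical sketch of the kind of normalization an events pipeline performs is shown below; the source formats, field names, and Event schema are invented for illustration and are not the Rokwire Events Building Block API.

```python
# Hypothetical normalization of two event feeds into one schema (names invented).
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Event:
    title: str
    start: datetime
    location: str
    source: str

def from_campus_calendar(item: dict) -> Event:
    """Normalize one record from a hypothetical campus calendar feed."""
    return Event(title=item["eventTitle"],
                 start=datetime.fromisoformat(item["startDate"]),
                 location=item.get("venue", "TBD"),
                 source="campus_calendar")

def from_student_org_feed(item: dict) -> Event:
    """Normalize one record from a hypothetical student-organization feed."""
    return Event(title=item["name"],
                 start=datetime.strptime(item["when"], "%m/%d/%Y %H:%M"),
                 location=item.get("where", "TBD"),
                 source="student_orgs")

events = [
    from_campus_calendar({"eventTitle": "Quantum Seminar",
                          "startDate": "2024-10-01T15:00:00", "venue": "Room 141"}),
    from_student_org_feed({"name": "RSE Meetup", "when": "10/02/2024 17:30"}),
]
print(sorted(events, key=lambda e: e.start))
```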

---

Insights into the complex interactions and consequences of turbulent and stratified shear flows are critical for deciphering heat transfer in the ocean and its role in the global climate system. This work is part of an effort to classify the regime of a stratified inclined duct (SID) flow in real time, enabling researchers to prioritize relevant experimental data and minimize storage costs. Here, the authors have performed curvelet analysis to extract key features and textures from shared SID shadowgraphs (see [1] fig. 1 for the experiment’s setup). The Fast Discrete Curvelet Transform, provided by the CurveLab Matlab toolbox, was proposed by Candès and Donoho [2] and is effective in its windowing approach, which separates a given signal into both different scales and orientations (angles) before performing ridgelet transforms. The spatial-domain representations of the curvelet transform’s coarsest and finest scales, computed for a turbulent flow, are shown below; note the different textures that are extracted from the original image.

---

Researchers are increasingly using web applications to promote their work in an accessible and engaging format. By leveraging interactive visualizations and intuitive interfaces, researchers can effectively share their data and code within the scientific community. RSEs may be interested in working with researchers to build web applications that have the potential to improve code and data reuse. Despite the value of these communication tools, maintaining them eventually falls to the researchers, who are not incentivized to learn new tools and technologies. We present a network analysis visualization tool that demonstrates how an existing research workflow in a Jupyter Notebook can be transformed into a complex web application without leaving the JupyterLab development environment. This application uses Jupyter widgets (ipywidgets) to add interactive components, such as sliders and dropdown menus, and a network visualization widget (ipycytoscape) to visually explore and analyze a large citation network. Voila strips away code cells, leaving behind only interactive browser components, resulting in a fully fledged user interface. By adapting existing workflows, researchers working with RSEs can benefit from the familiarity of the codebase and the development environment. This helps them maintain the application beyond the period of collaboration. Based on our experience, we recommend that researchers and RSEs consider adopting Jupyter Notebooks and Jupyter widgets to transform existing workflows into intuitive, interactive, and aesthetic web applications that can effectively communicate their research findings.
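A minimal sketch of this notebook-to-app pattern follows; the toy random graph stands in for the real citation network, and the degree-filtering logic is invented for illustration.

```python
# Minimal notebook cell: slider + network view; the graph is a toy stand-in.
import networkx as nx
import ipywidgets as widgets
from ipycytoscape import CytoscapeWidget

citations = nx.gnp_random_graph(60, 0.05, seed=1, directed=True)

degree_slider = widgets.IntSlider(value=1, min=0, max=5, description="Min degree")
graph_view = CytoscapeWidget()
graph_view.graph.add_graph_from_networkx(citations)

def redraw(change):
    """Rebuild the view, keeping only nodes at or above the chosen degree."""
    keep = [n for n, d in citations.degree() if d >= degree_slider.value]
    new_view = CytoscapeWidget()
    new_view.graph.add_graph_from_networkx(citations.subgraph(keep))
    app.children = (degree_slider, new_view)

degree_slider.observe(redraw, names="value")
app = widgets.VBox([degree_slider, graph_view])
app  # displayed by JupyterLab; `voila notebook.ipynb` serves the same cell as a web app
```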

---

Releasing software is an important part of the work of many RSEs. Modern software stacks can often be complex, even within the scope of a single project or set of projects (a "software ecosystem"). One of the pivotal decisions in how to approach managing this complexity is the organization of the software into "repositories". The two ends of the spectrum are to consolidate everything into a single repository -- the "monorepo" approach -- or to use different repositories for each possible identifiable module -- the "multirepo" approach. We have been personally involved in managing both kinds of approaches, and have learned some of the pros and cons as they apply to the software release managers and various project developer roles.


Almost all software projects begin with a single repository to coordinate the work of the initial developer team (perhaps one person). The path to a monorepo or multirepo starts later as more modules and use-cases are added to the core functionality. Adding more repositories tends to add complexity for developers who must work in multiple ones, and for software release managers who must look across repositories for issues and milestones. If repositories span different project namespaces, e.g. GitHub "organizations", then additional steps are added to daily tasks of linking issues or pull-requests, authorizing users, and different types of communication. On the other hand, many tools associated with distributing software are more naturally scoped to a repository rather than a part of a directory tree, including most importantly packaging systems, but also cloud-based documentation and containerization. The packaging and release modularization of the multirepo approach greatly increases flexibility in dealing with the perennial problems of dependency management, which scale super-linearly with the number of different project components and developer sub-teams (in any organization scheme).


We have found that, largely due to dependency management and its effect on code reliability and reproducibility, the advantages of the monorepo approach at smaller scales of developer teams and code complexity start to be outweighed by disadvantages, in somewhat the same way that horses pulling stagecoaches work for a single family but not for a group of 50 -- there is no longer enough flexibility to adapt to changes in the landscape and to the interactions among the people being conveyed.


The poster will summarize our key findings, along with a description of the technical and organizational context on which our experience is based, as we found this context to be of primary importance in helping the audience evaluate how directly our conclusions apply to their own use case.

---

Over the last two decades, large-scale supercomputers and data centers have evolved to have a more heterogeneous set of general and special-purpose processing units, e.g., multi-core CPUs, GPUs, FPGAs, DPUs, and, more generally, xPUs, on their nodes. Heterogeneous parallel programming models, e.g., OpenMP, CUDA/HIP/SYCL, and OpenACC, provided building blocks – via compiler and runtime systems – to harness the power and capabilities of this hardware and have demonstrated large speedups in research studies. Yet, when used in production-level scientific software, these programming models not only can inhibit performance portability, but they are also error-prone because of their complex behavior and interaction with the base language’s semantics. Following a natural path to maturity, heterogeneous parallel computing has recently shifted toward programming productivity, with software technology for heterogeneous parallel programming abstractions eventually being integrated into a base language, e.g., C++, Python, or Fortran. A set of such libraries uses heterogeneous programming models in their backends. Kokkos-core and the complementary kernel library Kokkos-kernels, both part of the Kokkos project, are a prevalent example; Kokkos is currently developed by the U.S. Department of Energy (DoE) and used widely across DoE applications, and its capabilities are already being considered as part of the C++26 standard’s C++ parallel STL. Kokkos training and tutorials give users an intuitive understanding of how to develop Kokkos programs, regardless of Kokkos backend, and training is offered through different channels, e.g., example programs and tutorial videos.

The Klokkos X-Stack project (Klokkos is a combination of KLEE and Kokkos) in the U.S. Department of Energy aims to provide automated testing and analysis that lets Kokkos users run Kokkos programs and understand common API usage mistakes without needing actual hardware or a platform, i.e., without GPUs or other accelerators and kernel libraries like BLAS. Algorithmic/automated testing of parallel programs is computationally intractable due to the sheer number of different paths a program can take.

In the context of research software engineering, we ask: can a tool for automated testing of parallel programs using concolic analysis, coupled with a complementary set of parallel programming examples classified as correct or incorrect (thereby offering ground truth), help reduce the computational intractability of automated testing and thereby reduce the human burden of debugging parallel programs? We will answer this in two steps. First, we will showcase an automated analysis tool for Kokkos parallel programs that uses guided symbolic execution through an LLVM-based KLEE plugin; this automated analysis can be considered a first pass over the compiled program for bug detection and a step before expensive dynamic analysis. Second, we will identify the ways in which a set of community-gathered parallel programming examples classified as correct or incorrect can further improve the tractability of concolic analysis of parallel programs.

From 7999757b2cccee6b6c7eebae7787ced4128a6455 Mon Sep 17 00:00:00 2001
From: "J.C. Subida"
Date: Fri, 2 Aug 2024 13:10:51 -0500
Subject: [PATCH 07/10] Fix typo

---
 pages/program/abstracts/notebooks.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pages/program/abstracts/notebooks.md b/pages/program/abstracts/notebooks.md
index 02df9d3..ddaaf2f 100644
--- a/pages/program/abstracts/notebooks.md
+++ b/pages/program/abstracts/notebooks.md
@@ -26,7 +26,7 @@ Lastly, the core logic of this notebook is used to power the IN-CORE Community R

 _Jacob States, Isaac Spackman, Shubham Vyas_

-The integration of computational physical chemistry into undergraduate laboratories presents a unique opportunity for collaboration with the research software engineering field. To promote more efficient computational workflows and foster engagement among budding programmers in computational modeling, we present this notebook investigating small molecules with unpaired electrons (radicals). The CF3 radical has been extensively explored in the chemical literature owing to its importance in ozone depletion from CFCs (chlorofluorocarbons) and its unusual geometric structure which deviates from the planar structure of the CH3 radical, despite the similar size of the fluorine atom and the hydrogen atom. Exploring trends along chemical groups is commonplace in the chemical literature, and as such we have created a notebook demonstrating the facile preparation and analysis of a simple experiment substituting the F atoms in the CF3 radical for other halogens in the same group (Cl, Br, I) in a combinatorial fashion. From a single excel sheet, input files for the quantum modeling software ORCA can be reproducibly generated. Upon completion of the requested calculations, the meaningful data is systematically extracted from the produced log files. This method contrasts with traditional practices in undergraduate labs in which students manually construct input files and scroll through log files to copy/paste data and demonstrates a more efficient and reproductible alternative. The notebook not only serves as an educational tool but also acquaints future research software engineers with the specialized software developed by computational chemists.
+The integration of computational physical chemistry into undergraduate laboratories presents a unique opportunity for collaboration with the research software engineering field. To promote more efficient computational workflows and foster engagement among budding programmers in computational modeling, we present this notebook investigating small molecules with unpaired electrons (radicals). The CF3 radical has been extensively explored in the chemical literature owing to its importance in ozone depletion from CFCs (chlorofluorocarbons) and its unusual geometric structure which deviates from the planar structure of the CH3 radical, despite the similar size of the fluorine atom and the hydrogen atom. Exploring trends along chemical groups is commonplace in the chemical literature, and as such we have created a notebook demonstrating the facile preparation and analysis of a simple experiment substituting the F atoms in the CF3 radical for other halogens in the same group (Cl, Br, I) in a combinatorial fashion. From a single excel sheet, input files for the quantum modeling software ORCA can be reproducibly generated. Upon completion of the requested calculations, the meaningful data is systematically extracted from the produced log files.
This method contrasts with traditional practices in undergraduate labs in which students manually construct input files and scroll through log files to copy/paste data and demonstrates a more efficient and reproducible alternative. The notebook not only serves as an educational tool but also acquaints future research software engineers with the specialized software developed by computational chemists. ------ From 16e908c63b0d4bb2a341159bdb69fa956be0b497 Mon Sep 17 00:00:00 2001 From: Miranda Mundt <55767766+mrmundt@users.noreply.github.com> Date: Mon, 5 Aug 2024 07:56:29 -0600 Subject: [PATCH 08/10] Ignore SER and Mey --- .github/workflows/typo_config.toml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.github/workflows/typo_config.toml b/.github/workflows/typo_config.toml index cf92329..a41a473 100644 --- a/.github/workflows/typo_config.toml +++ b/.github/workflows/typo_config.toml @@ -9,4 +9,6 @@ ignore-hidden = false extend-ignore-re = [ "Brain tailoRed stImulation protocoL", "Vas Vasiliadis", + "SER", + "Mey", ] From 761bf2908d32f54aee42a315ae178d8e97884c7f Mon Sep 17 00:00:00 2001 From: Miranda Mundt <55767766+mrmundt@users.noreply.github.com> Date: Mon, 5 Aug 2024 07:57:16 -0600 Subject: [PATCH 09/10] Fix typo --- pages/program/abstracts/talks.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pages/program/abstracts/talks.md b/pages/program/abstracts/talks.md index 2a91887..a7a22e4 100644 --- a/pages/program/abstracts/talks.md +++ b/pages/program/abstracts/talks.md @@ -1403,7 +1403,7 @@ set_last_modified: true the user to use workflow systems that explicitly declare the inputs and outputs of every node. This approach - shifts the compliance burden onto the user. If the user mis-specifies the workflow, it may still + shifts the compliance burden onto the user. If the user misspecifies the workflow, it may still execute but the provenance would be wrong. The "holy-grail” would be to collect provenance data at the system-level without modifying any application code and not needing superuser privileges or harming From e3ed37bb636c5e5fb8cd93453175b26e03589a56 Mon Sep 17 00:00:00 2001 From: "J.C. Subida" Date: Mon, 5 Aug 2024 09:21:40 -0500 Subject: [PATCH 10/10] Change program papers format --- pages/program/abstracts/papers.md | 165 +++++++++++------------------- 1 file changed, 62 insertions(+), 103 deletions(-) diff --git a/pages/program/abstracts/papers.md b/pages/program/abstracts/papers.md index 2e5f837..cf64092 100644 --- a/pages/program/abstracts/papers.md +++ b/pages/program/abstracts/papers.md @@ -8,106 +8,65 @@ menubar_toc: true set_last_modified: true --- -
+## Enhancing the application of large language models with retrieval-augmented generation for a research community + +_Juan José García Mesa, Gil Speyer_ + +The demand for efficient and innovative tools in research environments is ever-increasing in the rapidly +evolving landscape of artificial intelligence (AI) and machine learning (ML). This paper explores the +implementation of retrieval-augmented generation (RAG) to enhance the contextual accuracy and applicability of +large language models (LLMs) to meet the diverse needs of researchers. By integrating RAG, we address various +tasks such as synthesizing extensive questionnaire data, efficiently searching through document collections, +and extracting detailed information from multiple sources. Our implementation leverages open-source libraries, +a centralized repository of pre-trained models, and high-performance computing resources to provide +researchers with robust, private, and scalable solutions. + +--- + +## Lab Dragon: An electronic Laboratory Notebook to Support Human Practices in Experimental Science + +_Marcos Frenkel, Wolfgang Pfaff, Santiago Núñez-Corrales, Rob Kooper_ + +Lab notebooks are an integral part of science by documenting and tracking research progress in laboratories. +However, existing electronic solutions have not properly leveraged the full extent of capabilities provided by a +digital environment, resulting in most physics laboratory notebooks merely mimicking their physical counterparts +on a computer. To address this situation, we report here preliminary work toward a novel electronic laboratory +notebook, Lab Dragon, designed to empower researchers to create customized notebooks that optimize the benefits +of digital technology. + +--- + +## An Empirical Survey of GitHub Repositories at Research Universities + +_Samuel D. Schwartz, Boyana Norris, Stephen F. Fickas_ + +In this work we aim to partially answer the question, "Just how many research software projects are out there?” +by searching for open source GitHub projects affiliated with research universities in the United States. We +explore this through keyword searches on GitHub itself and by scraping university websites for links to GitHub +repositories. We then filter these results by using a large language model to classify GitHub repositories as +research software engineering projects or not, finding over 35,000 RSE repositories. We report our results by +university. We then analyze these repositories against metrics of popularity, such as stars and repository +forks, and find just under 14,000 RSE repositories meet our minimum criteria for projects which have a +community. Based on the time since a developer last pushed a change to a RSE repository with a community, we +further posit that 3,300 RSE repositories with communities and a link to a research university are at risk of +dying, and thus may benefit from sustainability support. Finally, across all RSE projects linked to a research +university, we empirically find the top repository languages are Python, C++, and Jupyter Notebook. + +--- + +## Preferred Practices Through a Project Template + +_Peter F. Peterson, Chen Zhang, Jose M. Borreguero-Calvo, Kevin A. Tactac_ + +In the realm of scientific software development, adherence to best practices is often advocated. However, +implementing these can be challenging due to differing opinions. Certain aspects, such as software licenses and +naming conventions, are typically left to the discretion of the development team. 
Our team has established a set +of preferred practices, informed by, but not limited to, widely accepted best practices. These preferred +practices are derived from our understanding of the specific contexts and user needs we cater to. To facilitate +the dissemination of these practices among our team and foster standardization with collaborating domain +scientists, we have created a project template for Python projects. This template serves as a platform for +discussing the implementation of various decisions. This paper will succinctly delineate the components that +constitute an effective project template and elucidate the advantages of consolidating preferred practices in +such a manner. + +---
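Relating back to the retrieval-augmented generation paper above, the sketch below shows only the retrieval half of such a pipeline; the embedding model, toy corpus, and prompt format are illustrative assumptions rather than the authors' stack, and the call to a locally hosted LLM is left out.

```python
# Retrieval-only sketch; model choice, corpus, and prompt wording are assumptions.
from sentence_transformers import SentenceTransformer
import numpy as np

corpus = [
    "GPU nodes require the cuda module to be loaded before job submission.",
    "Questionnaire responses are stored as one CSV file per cohort.",
    "The scratch filesystem is purged of files older than 30 days.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed local embedding model
doc_vecs = embedder.encode(corpus, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the question."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                            # cosine similarity on unit vectors
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

question = "How long do files last on scratch?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # this prompt would then be passed to a locally hosted LLM
```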