---
title: "Why We Need to Improve Software Engineering in Biostatistics"
subtitle: "A Call to Action"
author: "Daniel Sabanés Bové (Roche)"
institute: "Keynote at R/Pharma 2023"
date: "2023-10-26"
date-format: long
license: 'CC BY-SA'
format:
  revealjs:
    fontsize: 2rem
    slide-number: c/t
    toc: true
    toc-depth: 1
    footer: 'Why We Need to Improve Software Engineering in Biostatistics | [License](http://creativecommons.org/licenses/by-sa/4.0/ "License: CC BY-SA 4.0")'
    preload-iframes: true
title-slide-attributes:
  data-background-image: research-progress.png
  data-background-size: 30%
  data-background-opacity: "1"
  data-background-position: 95% 85%
---
# Preamble {background-iframe="https://cicdguy.github.io/colored-particles/"}
## Disclaimer
<br><br><br>
*Any opinions expressed in this presentation are solely my own and not necessarily those of my employer (Hoffmann-La Roche Ltd).*
## Acknowledgements
Partly based on the manuscript "Improving Software Engineering in Biostatistics: Challenges and Opportunities" ([arXiv](https://arxiv.org/abs/2301.11791)) which is joint work with:
::: columns
::: {.column width="50%"}
- Heidi Seibold
- Anne-Laure Boulesteix
- Juliane Manitz
- Alessandro Gasparini
- Burak K. Günhan
- Oliver Boix
:::
::: {.column width="50%"}
- Armin Schüler
- Sven Fillinger
- Sven Nahnsen
- Anna E. Jacob
- Thomas Jaki
:::
:::
Its content started from a panel discussion on "Research Software Engineering for Clinical Biostatistics" which took place at the 43rd Annual Conference of the International Society for Clinical Biostatistics (ISCB) in Newcastle in August 2022.
## Objective of this keynote
::: {.incremental}
- I would like to kick off a good discussion (hopefully also beyond this talk) so that we can learn from each other
- I would like you to reflect on how biostatistics is being carried out on a daily basis in your organizations
- Why biostatistics? Because we are at R/Pharma, and also because that is what I know most about
- I would like you to strive to become savvy in good software engineering practices and apply them in your daily job
- I would like you to influence your biostatistics colleagues and departments in your organizations
:::
## What can happen due to bad coding?
![[https://www.swissinfo.ch/eng/swiss-government-corrects-election-figures-on-political-party-strength/48924206](https://www.swissinfo.ch/eng/swiss-government-corrects-election-figures-on-political-party-strength/48924206)](news-part1.png)
## What can happen due to bad coding? (cont'd)
![[https://www.swissinfo.ch/eng/swiss-government-corrects-election-figures-on-political-party-strength/48924206](https://www.swissinfo.ch/eng/swiss-government-corrects-election-figures-on-political-party-strength/48924206)](news-part2.png)
## Another example, now from biostatistics
![[https://jamanetwork.com/journals/jama/fullarticle/2752474](https://jamanetwork.com/journals/jama/fullarticle/2752474)](retract-part1.png)
## Another example, now from biostatistics (cont'd)
![[https://jamanetwork.com/journals/jama/fullarticle/2752474](https://jamanetwork.com/journals/jama/fullarticle/2752474)](retract-part2.png)
## Another example, now from biostatistics (cont'd)
![[https://jamanetwork.com/journals/jama/fullarticle/2752474](https://jamanetwork.com/journals/jama/fullarticle/2752474)](retract-part3.png)
# Where I am coming from with this topic {background-iframe="https://cicdguy.github.io/colored-particles/"}
## My path to biostatistics in Pharma
::: {.incremental}
- I was good at math and I really liked stochastics
("What is the chance of winning the lottery?")
- But I always had fun programming
(Remember those days of
[Basic](https://en.wikipedia.org/wiki/BASIC) and
[Turbo Pascal](https://en.wikipedia.org/wiki/Turbo_Pascal)?)
- I was also interested in medicine ...
... but I could faint when watching somebody draw blood from my arm
- So I chose statistics because I could combine math and programming,
and work with medical data!
- I was formally trained as a statistician (B.Sc., M.Sc., Ph.D.)
- I started working in the Pharma industry as a biostatistician
:::
## What I did *not* do during those first years
::: {.incremental}
- I did not store my programs in version control systems
- I did not worry too much about making my code readable by others
- I did not try to make calculations reproducible
- I did not get any code review of my study design simulations
- I did not write any unit tests for my packages
:::
## Some tricks I learned in Tech
::: {.incremental}
- After about 5 years I had the opportunity to switch gears and work at Google
- I quickly learned a lot about:
    - how to write unit tests
    - how to use a consistent code style
    - how to write design documents
    - how to name functions
    - how to review code and respond to review feedback
    - how to check code into the (internal) repository
- Also, but not that important:
    - how to write SQL queries, macros, unit tests for those
    - got to know memes and how to create new ones
:::
## But Pharma is more fun for me
::: {.incremental}
- I really liked that I did a lot of programming in a structured way at Google
- But I missed the scientific aspects from Pharma ...
... where we deal not just with numbers, but also biology, medicine, patients, medical doctors, etc.
- And Roche had just started venturing more into R territory
(keyword: NEST project)
- So I switched back to Roche, because it is a great place to work
- And I see lots of potential to apply Tech skills in Pharma!
⇨ This is the topic of this presentation
⇨ How can we improve the way we work with code in Pharma?
:::
# Programming is ubiquitous, but (still) neglected {background-iframe="https://cicdguy.github.io/colored-particles/"}
## Note
- I am trying to zoom in here on the "pain points"
- This is based on anecdotal stories from different organizations
- This is *not* based on surveys or quantitative evidence collected in a structured way
- In the discussion (here and later) we can learn more from each other
- If your experience is similar, feel free to share the pain points
- If your experience is different, then that is great and I would love to learn about it
## (Almost) All biostatisticians code
::: {.incremental}
- We don't use calculators anymore, right?
- There are a lot of graphical tools that we can use for simple tasks ...
... but they won't be sufficient for most biostatistics use cases
- We learn programming, maybe already in high school, or at the latest in undergraduate courses
- We wrangle, analyze, visualize, predict, ... biomedical data with code every day
:::
## But many don't take it seriously enough
::: {.incremental}
- Most of the coding we do alone without any discussion or review
- We often copy and paste code from each other and adapt it every time
- We often distinguish ourselves from "Statistical Programmers" and give them instructions
(and feel we are better than them)
- We often shy away from sharing our code because we think it is too ugly
- We often don't take enough time to write clean code because we are too busy
- We often just develop code locally on our laptops
- We rarely use version control systems
:::
# Why is this a problem? {background-iframe="https://cicdguy.github.io/colored-particles/"}
## Cannot hand over to other statisticians
::: {.incremental}
- Statisticians (too) take vacations, go on longer leave, switch companies, etc.
- Then other statisticians (peers, managers) need to back up or take over the project
- They might need to revisit the design or analyses and thus need to modify the code
- They might not find the code at all (because it is not in a version-controlled repository)
- They might not be able to understand the code (because it is unreadable by others)
:::
## Cannot maintain code
::: {.incremental}
- Things can become difficult not just when switching between statisticians, but also over time
- When I read code I wrote one or two years ago, it is almost like code from another person
- If I look back at my quickly written, ugly code, which is not documented, I have a very hard time
- If you make a change to copied and pasted code, it won't help anybody else
- If you discover bugs, those cannot be fixed across projects in one go
- If you add functionality, it will not be inherited by the other projects
:::
## Cannot prevent bugs
::: {.incremental}
- If we change the code and don't have tests, we don't know if it still works correctly
- Typically we then just run the whole script again, and if it does not fail with an error, we think it is fine
- But it might still produce wrong results due to new bugs
- And if we fix a bug, and don't add a test for it at the same time, it might come back later
- This is even more important in software packages that we create for ourselves and others
:::
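
A minimal sketch of what such tests could look like, using base R only. The function `n_per_group()`, its arguments, and its defaults are hypothetical and only serve as an illustration (normal approximation for comparing two means); in a package, checks like these would typically live in `testthat` test files instead.

```r
# Hypothetical function: per-group sample size for comparing two means,
# using the normal approximation with a two-sided significance level.
n_per_group <- function(delta, sd, alpha = 0.05, power = 0.8) {
  stopifnot(delta != 0, sd > 0, alpha > 0, alpha < 1, power > 0, power < 1)
  z_alpha <- qnorm(1 - alpha / 2)
  z_beta <- qnorm(power)
  ceiling(2 * (z_alpha + z_beta)^2 * sd^2 / delta^2)
}

# Tests: they run quickly and fail loudly if a later change breaks the function.
# Known value: delta = 0.5 and sd = 1 need 63 patients per group.
stopifnot(n_per_group(delta = 0.5, sd = 1) == 63)
# A larger effect must never require more patients.
stopifnot(n_per_group(delta = 1, sd = 1) <= n_per_group(delta = 0.5, sd = 1))
# The sign of the difference must not matter.
stopifnot(n_per_group(delta = -0.5, sd = 1) == n_per_group(delta = 0.5, sd = 1))
```

When a bug is fixed, adding one more such check for the failing case helps to keep it from coming back unnoticed.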
## Cannot reproduce results
::: {.incremental}
- When we don't prepare for it, new versions of R or R packages can lead to different results
⇨ This can be a real problem e.g. during inspections
- Manual steps are very hard to reproduce:
    - e.g. moving data around, running scripts, creating documents, etc.
    - these would need to be thoroughly documented to have even a small chance of being reproduced, but this is usually omitted
- But Statistics is a key component of the scientific value chain ...
... thus has a responsibility to ensure reproducibility
:::
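
As a small illustration of "preparing for it", here is a sketch in base R, under stated assumptions: the helper `run_simulation()` and the file name are hypothetical. It fixes the random seed and stores the R and package versions next to the results. Dedicated tools such as `renv` go further by pinning package versions in a lockfile.

```r
# Hypothetical helper: a small simulation whose result depends on random numbers.
run_simulation <- function(seed, n_sim = 1000, n = 30) {
  # Fixing the seed makes the random draws, and thus the result, repeatable.
  set.seed(seed)
  sim_means <- replicate(n_sim, mean(rnorm(n)))
  quantile(sim_means, probs = c(0.025, 0.975))
}

# Two runs with the same seed in the same environment give identical results.
first_run <- run_simulation(seed = 2023)
second_run <- run_simulation(seed = 2023)
stopifnot(identical(first_run, second_run))

# Store the R version and attached package versions next to the results,
# so that a later rerun can be compared against the original environment.
session_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(print(sessionInfo())), session_file)
```

The seed alone does not protect against changes in R or package versions, which is why the recorded session information (or a lockfile) matters.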
## Cannot submit to external stakeholders
::: {.incremental}
- If we cannot share code reliably within the company, then even less so with external stakeholders
- So far, analyses that need to be shared with regulators are often rewritten by statistical programmers
- This duplicates the original coding work and adds the risk of introducing bugs
- But this is usually not done for study protocol contents (sample size, simulations)
- Often reporting code was translated to proprietary software because that was considered the only "validated" way
- Any software can still have bugs, and so can the actual analysis scripts and macros
:::
# What can we do about it? {background-iframe="https://cicdguy.github.io/colored-particles/"}
## Become aware of the issues
::: {.incremental}
- Most statisticians actually have examples where lax programming led to such problems
- Let's realize that a lot is at stake:
we are not building airplanes that could fall from the sky ...
... but we are impacting the lives of patients!
- It matters for patients that we
    - calculate the right sample size,
    - more generally determine the right clinical trial design,
    - help find the right molecules based on preclinical experiments,
    - etc.
:::
## Become aware of the issues (cont'd)
![Significance, 18 (3): 6-7. [https://doi.org/10.1111/1740-9713.01522](https://doi.org/10.1111/1740-9713.01522)](schwab-held-significance.png)
## Take software engineering seriously
::: {.incremental}
- We know what we should do, let's just take the time and energy to do it
- We can learn a lot from the statistical programmers:
    - organize ourselves with standards (starts with naming conventions for files and folders)
    - automate repetitive tasks
    - consider double programming of key results
- Processes should give code the same priority as documents (statistical analysis plans etc.)
    - that implies code review, good and consistent code style, business continuity, ...
- Strive to be on par with Tech companies with regard to coding quality
:::
## Take software engineering seriously (cont'd)
![Significance, 18 (4): 42-44. [https://doi.org/10.1111/1740-9713.01554](https://doi.org/10.1111/1740-9713.01554)](seibold-charlton-significance.png)
## Educate, educate, educate
::: {.incremental}
- Starts with secondary schools:
    - computer science should become a standard subject, not just be an extra
    - same importance as math, geography, physics, etc.
- Continues with undergraduate and graduate programs:
    - computer science must become a key component of statistics and data science programs
    - good software engineering practices warrant dedicated courses
- Continues with post-graduate education during working life:
online courses, software engineering workshops, conference sessions, etc.
:::
## Establish dedicated teams
::: {.incremental}
- Research Software Engineering teams are now established in academia, including data science institutions
- Similarly, pharma companies are establishing software engineering focused teams
- The direct focus is often the development of reusable software
- We should not stop there but also strive to improve the day-to-day work:
biostatistics, medical writing, preclinical experiments, etc.
- Software engineering work is complex, so we need top talent for this
:::
## Provide attractive career paths
::: {.incremental}
- We need to be able to attract and retain top talent
- The job role can be defined from different perspectives
    - We can think about an evolution of "statistical programmers"
    - Or a convergence of "statisticians" and "statistical programmers"
    - Or a new definition of "statistical software engineers"
- It is important to make these careers attractive:
    - same compensation as for biostatisticians or methodology experts
    - same respect in the hierarchy of disciplines
    - same career levels possible including leadership roles
:::
## Collaborate across companies
::: {.incremental}
- More natural now that we use R and other open source software
- Has been made so much easier in the last decade
(e.g. video conferencing, cloud based documents, code sharing platforms)
- Stakeholders ask for companies to work together towards standards,
rather than receiving different solutions from each company
- Loosely coupled software modules allow us to "plug and play"
    - combining different modules to make them fit company standards
    - connecting to internals via company specific extensions
- A key opportunity compared with the previous proprietary-software-based, company-siloed stack
:::
# Just A Few Examples {background-iframe="https://cicdguy.github.io/colored-particles/"}
## R Package [`crmPack`](https://roche.github.io/crmPack)
::: {.incremental}
- `crm` stands for continual reassessment method (Bayesian design for dose escalation)
- We started in 2013 with custom R scripts running `JAGS` code to run Markov Chain Monte Carlo (MCMC)
- Realized in 2014 that we needed an R package to avoid copying and pasting these scripts
- Published on CRAN in 2015
- Described in a paper in 2019, and other companies started to use it
- Joined forces around the end of 2021 to develop the package further together
:::
## Working group [`openstatsware`](https://rconsortium.github.io/asa-biop-swe-wg/)
::: {.incremental}
- Was founded in August 2022 as a working group in the ASA Biopharmaceutical Section
- ~40 members from over 25 pharma companies, CROs, etc.
- Open for new members
- Focuses on
    1. building packages together (already on CRAN are `mmrm` and `brms.mmrm`)
    2. disseminating good software engineering (SWE) practices (workshops, videos)
- More details were in the talk by Ya Wang on Tuesday
:::
## Tools for reproducible research
Many great tools exist to help with reproducibility of statistical analyses
(see [CRAN task view](https://cran.r-project.org/web/views/ReproducibleResearch.html)), e.g.:
::: {.incremental}
- `Sweave` (2002)
- `knitr` and `Rmarkdown` (2014)
- `packrat` (2014)
- `officer` (2017)
- `usethis` (2017)
- `renv` (2019)
- `targets` (2020)
- `quarto` (2021)
:::
## Tools for package developers
Similarly lots of tools have been provided for package developers:
::: {.incremental}
- `testthat` (2009)
- `devtools` (2011)
- `checkmate` (2014)
- `lintr` (2014)
- `spelling` (2017)
- `styler` (2017)
- `pkgdown` (2018)
- `precommit` (2020)
:::
## Templates to start from use cases
Templates can be a great start to centralize code for recurring use cases, e.g.:
::: {.incremental}
- [TLG Catalog](https://insightsengineering.github.io/tlg-catalog/)
- [Biomarker Analysis Catalog](https://insightsengineering.github.io/biomarker-catalog/)
- [RPACT Vignettes](https://www.rpact.com/vignettes) (Clinical trial design)
- [Mediana case studies](https://gpaux.github.io/Mediana/CaseStudies.html) (Simulations of trials)
- [Falcon](https://pharmaverse.github.io/falcon) (FDA Safety Tables and Figures)
:::
## Guidelines for good practices
::: {.incremental}
- [`rOpenSci`](https://ropensci.org/)
    - Non-profit initiative founded in 2011
    - Staff, process, guidelines for Statistical Software Peer Review of R packages
- [The Turing Way](https://the-turing-way.netlify.app/)
    - Started in 2019 as a guide for reproducibility (covering Version Control, Code Testing, and Continuous Integration)
    - Now encompasses guides for Reproducible Research, Project Design, Communication, Collaboration, Ethical Research
:::
## Guidelines for good practices (cont'd)
::: {.incremental}
- ["Best practices in statistical computing"](https://doi.org/10.1002/sim.9169) (Sanchez et al., 2021)
    - 5 steps for implementing a code quality assurance (QA) process
        1. adherence to clean and consistent code style
        2. clear written documentation (code, workflow, decisions)
        3. careful version control
        4. good data management
        5. regular testing and review
:::
## Workshops teaching good practices
::: {.incremental}
- ["Good Software Engineering Practices for R Packages"](https://rconsortium.github.io/asa-biop-swe-wg/blog/2023-gswe4rp-workshops-wrapup/)
    - `openstatsware` have run five workshops in 2023
- ["The Carpentries"](https://carpentries.org/)
    - Teaches foundational coding and data science skills to researchers worldwide
    - Teaching material from Software, Data, and Library Carpentry is available
- ["Making Data Science Work for Clinical Reporting"](https://www.coursera.org/learn/making-data-science-work-for-clinical-reporting)
    - Coursera course on how principles and methods from data science can be applied in clinical reporting
:::
# Call to action {background-iframe="https://cicdguy.github.io/colored-particles/"}
## Call to action
::: {.incremental}
- Let's be honest about how we are doing biostatistics in our companies
- Let's realize that a lot depends on doing the software engineering well
- Let's get savvy with the tools and good practices ourselves to improve our habits
- Let's influence our colleagues and managers in our companies to improve structures and processes
- Let's realize the benefits from improved software engineering in biostatistics!
:::
## Thank you!
These slides are at
[danielinteractive.github.io/rpharma-2023-keynote](https://danielinteractive.github.io/rpharma-2023-keynote/)
Feel free to connect at
[linkedin.com/in/danielsabanesbove](https://www.linkedin.com/in/danielsabanesbove/)