This repository has been archived by the owner on Oct 6, 2023. It is now read-only.
---
title: "RAP Guidance for Statistics Producers"
---
```{r include=FALSE}
require(knitr)
```
<p class="text-muted">Guidance for how to implement the principles of Reproducible Analytical Pipelines (RAP) into our production processes</p>
---
# What is RAP?
---
RAP (Reproducible Analytical Pipelines) means doing well-documented, reproducible, quality analysis using the best tools available to us as analysts.
Cam ran an introduction to RAP session for DISD in December 2020. The slides can be found on [GitHub](https://github.com/cjrace/introduction-to-rap){target="_blank" rel="noopener noreferrer"}, and the recording is embedded below:
<div align="center">
<iframe width="640" height="360" src="https://web.microsoftstream.com/embed/video/f4fb4ae9-3149-4d7c-8968-ff4c11dc8ac3?autoplay=false&showinfo=false" allowfullscreen style="border:none;"></iframe>
</div>
[Reproducible Analytical Pipelines](https://dataingovernment.blog.gov.uk/2017/03/27/reproducible-analytical-pipeline/){target="_blank" rel="noopener noreferrer"} is usually shortened to RAP, but even the full name hides its meaning behind buzzwords and jargon. What it actually means is using automation to our advantage when analysing data. This can be as simple as writing code, such as a SQL query, that we can run at the click of a button to do the job for us.
The cross-government group of RAP champions has laid out a [minimum level of RAP](https://github.com/best-practice-and-impact/rap_mvp_maturity_guidance/blob/master/Reproducible-Analytical-Pipelines-MVP.md){target="_blank" rel="noopener noreferrer"} to aim for (note that we have our own levels of practice below, which are building blocks to reach this end goal).
We already have 'analytical pipelines' and have done for many years. The aim of RAP is to automate the parts of these pipelines that can be automated, to increase efficiency and accuracy, while creating a clear audit trail to allow analyses to easily be re-run if needed. This will free us up to focus on the parts of our work where our human input can really add value. RAP is something we can use to reduce the burden on us by getting rid of some of the boring stuff. What's not to like?
Cam and Sarah ran a session introducing how to get started with automated QA in relation to RAP. Slides are available on [GitHub](https://sarahmwong.github.io/intro-to-automating-QA/#1){target="_blank" rel="noopener noreferrer"} and the recording of the session is below:
<div align="center">
<iframe width="640" height="360" src="https://web.microsoftstream.com/embed/video/57eab29d-afeb-4651-a9f5-687098b84d13?autoplay=false&showinfo=false" allowfullscreen style="border:none;"></iframe>
</div>
---
## Our scope
---
We want to focus on the parts of the production process that we have ownership and control over, so we are focussing on the process from data sources to publishable data files. This is the part of the process where RAP can currently add the most value - automating the production and quality assurance of our outputs currently takes up a huge amount of analytical resource, which could be better spent providing insight and other value-adding activity.
`r knitr::include_graphics("images/RAP-Scope-Arrow.png")`
In Official Statistics production we are using RAP as a framework for best practice when producing our published data files, as these are the foundations of our publications moving forward. Following this framework will help us to improve and standardise our current production processes and provide a clear 'pipeline' for analysts to follow. This will have the added benefit of setting a clear and defined world of tools and skills required, making learning and development that much clearer and easier. To get started with RAP, we first need to be able to understand what it actually means in practice, and be able to [assess our own work against the principles of RAP](#what-is-expected).
Implementing RAP for us will involve combining the use of SQL, R, and clear, consistent version control to increase efficiency and accuracy in our work. For more information on what these tools are, why we are using them, and resources to help up-skill in those areas, see our [learning resources](l+d.html) page.
The collection and routine checking of data as it comes into the department is also an area where RAP can be applied. We have kept this out of scope at the moment as the levels of control in this area vary wildly from team to team. If you would like advice and help to automate any particular processes, feel free to [contact us](mailto:[email protected]).
---
# Core principles
---
**Data sources for a publication are stored in the same database** _- [Preparing data](#preparing-data)_
**Underlying data files are produced using code, with no manual steps** _- [Writing code](#writing-code)_
**Files and scripts should be appropriately version controlled** _- [Version control](#version-control)_
---
# RAP in practice
---
The diagram below highlights what RAP means for us, and the varying levels at which it can be applied. The expectation is that all publications will meet the department's baseline implementation of RAP in the self-assessment tool. It's worth acknowledging that some teams are already working at around great and best practice levels. We appreciate that every team's situation is unique, so our guidance is designed to be applicable across all official statistics publications by DfE.
<html>
<head>
<title>hex-diagram</title>
<meta charset="utf-8"/>
</head>
<body><div class="mxgraph" style="max-width:100%;" data-mxgraph="{"highlight":"#0000ff","lightbox":false,"nav":true,"edit":"_blank","xml":"<mxfile host=\"app.diagrams.net\" modified=\"2023-01-25T13:50:00.816Z\" agent=\"5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36\" etag=\"fFHFq3WaFkuAxXLjLQzX\" version=\"20.8.10\" type=\"google\"><diagram id=\"_ltWYIdbVKAG_Xyvna0Y\" name=\"Page-1\">7Zzdc5s4EMD/Gk+f1DFfxnmM46TXm17rNnPXm3vJyCDbSjCiQsRO//rTSnxDiHOX1JAkL0YrrRDST8vuAhlZZ9v9B46jzR/MJ8HIHPv7kTUfmaZhG1P5A5I7LZlMXS1Yc+pr0bgQXNKfJNXMpAn1SZzKtEgwFggaVYUeC0PiiYoMc8521WYrFvgVQYTXpDIMEFx6OCCNZt+pLzZaOjXdQv4boetNdmZjcqJrtjhrnHYcb7DPdiWRdT6yzjhjQh9t92ckgMmrzsvFPbX5wDgJRYvCnzHhX5bXMCfmOMBLuS6q0SUJY7qUl6cng3CYqNCHIlXSWPDEEwkn+gQBDW+05si04lQZaVUkFRGooZpSNkhhfll//oFPPu8WZ9cXV5d31zcfkZNffH5RsbjLJlxOUwSHG7LHaxaOrFlEON0SIUeaSxeZSHY0222oIJcR9kBtJzGUso3YwvUa8nBF9yQDC8qcCSwo9Dw/GctyrKssOPYplxOmKwmOBfTEOP3JQoGhP2izkoUzFjCuxmtdqD/oR3B2Q7KakIVEnTwISo1t2zXPbGgsh0vD9SeygrVzC8m3lCUQbdktXqp5gfNyIkdaLsN1lMpy55Fymfi0XEw3SEkCF5LNC4ypCVTK2C3hguxLonTxPhAmF4HfySZpbQZ/utktc6LLu2LrGNl+2JS3TSbE6XZd510XRMuDFJisWGK8k3kph76jiLOIUyxIakXiFsRLrVCpTSfRk5dJ9P+Cl7Mk9ImfniI3fqpwQ4S3SQtL7N2sVeMviZArkdG55tinksR56QJCxgGje7ba0+OMHKeXPC8480gcyxkHMmO4vcIcwLjksOBm6LdZ7yhXQzRGoIJAARXNOyl3XyblT2+3n4FEt8KhneF1ZA5nOKYe9J4IJv0dAl7E19MW8pbQEOXN0A98AG/TN96Oxpthmb0k7lRjwBLugb3zpQeksGBcwUdDOHNaIaFrs4I4CJDuAEErpJURDRFGVbVOOk/e6DwancjIwsqe4XkhYyG4HW8JEWU8ZZyEud9wNxP/vVpi09qw3ZVgV96GeDdXeI1pGIsrsSFXNd1OJI3xG5PHY7LmKzqy4z4gOWdespWXpVemaQ79Zn03ZMYbZEeDrBZf94WxvwiPFT2QhpPzGAREJ5RCHOThyDj2OI1EW8x9q/VRoY2UropMUEWvm01z4GxWcXPHp3Pb7Yp3W0E+EptWxljPguVvxLvzAqxTnimKKwZpz1WiM5bjpNVR5Lmi5lAqIa2CkoNcRMN6w/F4EcxJPXfTD1s5lz5hrLzDiDM/0WvQZRx9rYCK5o+xiPYbgkdD0HQmvYxSFskyoB7OyIuIR1eH5XGiQhNleo9N6xhDf/4zZCStqdFLJE9L5MXJdos5Ja1PZ3LWaq26kRt66nrIyNVvxH0JWhZEPfvm5JaSnTxgq8I/hMciKpcoCN62uoaghNhKe4a6OSoad9M49MT2kGm0apnDvtAo50iWGJcrc0sCGLBP5AGLCiiT9IHfmooWJL1qByhV14AqVZQrduM59Mx2Fc/zuTu3jKHgOXHtWtTSjyD6
3VlAcPiukdRpYgjtStmbA4Azh562HjJwjtPPrM3InARCLcBtBYfJjwRemJtVjtbpr9KASWpVgQqkF1fGNmPDjvatHX3fMJUriiphUnu0rk8pr0uftToSKVbDz6SNzbKDM6FyUPWfgnxz6Cn5o20gHNA1jMyTGwhm40lMuNHPR+f6JTgWEQVw9gSdk4jFVLB74q0kJghUsifmzebdZL6shPywTLtr9fOtoYPST11JgAeyUI/JDJgvK0U/LEDrvkdfYrGuzMCKsy2IEhFTJRAbcnCeAJRRqoqk4qEZA/NlZfGHRWk9JPvVlDa+/GhHJE2y3eIgIVUX+lCH2DKlQ1zyhiXy2SoUQvUOhGc3veYPjPnKTcaSBnAt7nGNzSq7chFEFb6GTwZLJc19cJpWbKnvg3rji4iI0VCouXZmI2cOfcmbQlxwfB9aT0AJsqp5pYljNijJP2oqU2I9HpIDiZj8IiLcMfZtt4UITrB4xUiY48nDSLQZjudDwv1FSBDf9aWFbyAxI/FrJsJxzYeJGD8TEZdTe/z378509tW7WnxdJB//WV0jowWI2uRn3oWX8OBuJlfuhoiHXYji8x8oKedqAZGjdhQCdRPPF+1TrTpfvGyZuZ6K+1e5fN+32jwW5WlkX3CaDd+juLEVXx7V/JInWH9jMn1ffRjWkm2z3eb6O9nd5ckBMI/sNsxwTNRXX+aY7CP4QFHHha/LMFhOnQujCYal3yerkfFclsF+swzZSqd3s+e0DBOn+mavYTsHGQbzxHym9X8wnjjwxv94A6L3+ul2SV+jKainj9ssgWE/jSWQxeL/D6i60n9xsM7/BQ==</diagram></mxfile>"}"></div>
<script type="text/javascript" src="https://viewer.diagrams.net/js/viewer-static.min.js"></script>
</body>
</html>
---
# What is expected
---
<div class="alert alert-dismissible alert-warning">
It is expected that **all teams' processes meet all elements of good and great practice** as a baseline.
Teams are expected to review their own processes using the [publication self-assessment tool](https://rsconnect/rsc/publication-self-assessment){target="_blank" rel="noopener noreferrer"} and use the guidance on this site to start making improvements towards meeting the core principles if they aren't already. If you would like additional help to review your processes, please contact the [Statistics Development Team](mailto:[email protected]).
</div>
Teams will start from different places and implement changes at different rates, and in different ways. We do not expect that every team will follow the same path, or even end at the same point. Don't worry if this seems overwhelming at first. Use the guidance here to identify areas for improvement and then tackle them with confidence.
While working to reach our baseline expectation of good and great practice, you can track your progress in the [publication self-assessment tool](https://rsconnect/rsc/publication-self-assessment){target="_blank" rel="noopener noreferrer"} and contact the [Statistics Development Team](mailto:[email protected]) for help and support.
---
## How to assess your publication
---
The checklist provided in the [publication self-assessment tool](https://rsconnect/rsc/publication-self-assessment){target="_blank" rel="noopener noreferrer"} is designed to make reviewing our processes against our RAP levels easier, giving a straightforward list of questions to check your work against. This will flag potential areas of improvement, and you can then use the links to go to the specific section with more detail and guidance on how to develop your current processes in line with best practice.
Some teams will already be looking at best practice, while others will still have work to do to achieve the department's baseline of good and great practice. We know that all teams are starting this from different points, and are here to support all teams from their respective starting positions.
---
## Where we need to focus
---
Most teams have already made progress with their production of tidy data files, and the release of the automated screener has now tied up that end point of the pipeline that we are all currently working towards. The standard pipeline for all teams will roughly resemble this:
`r knitr::include_graphics("images/RAP-Process-Overview.png")`
The key now is for us to build on the work so far and focus on how we improve the quality and efficiency of our production processes up to that point. To do this, we need to make a concerted effort to standardise how we store and access our data, before then automating what we can to reduce the burden of getting the numbers ready and see the benefits of RAP. The exact meaning of this will vary within teams.
---
## How to get started
---
Measure your publication against the RAP levels using our [self-assessment tool](https://rsconnect/rsc/publication-self-assessment/){target="_blank" rel="noopener noreferrer"}. This will give you a good starting point and a list of initial areas to work on to progress to the next level of RAP.
Once you've assessed your publication, have a look through our guidance below to narrow down how you can get started with improving those parts of your process.
The Statistics Development Team invites teams to take part in our partnership programme to develop their skills and apply RAP principles to a relevant project. Visit our page on [getting started with the partnership programme](l+d.html#Support_available) for more details.
---
# Preparing data
---
The first place to start for your team's RAP is to store the raw data you use to create underlying data in a Microsoft SQL Server database. This is similar to a SharePoint area or a shared folder, but it's a dedicated data storage area which allows multiple users to use the same data at once, and lets you run code against the data in one place.
---
## All source data stored in a database
---
`r knitr::include_graphics("images/good.svg")`
**What does this mean?**
When we refer to 'source data', we take this to mean the data you use at the start of the process to create the underlying data files. Any cleaning at the end of a collection will happen before this.
In order for us to be able to have an end-to-end data pipeline where we can replicate our analysis across the department, we should store all of the raw data needed to create aggregate statistics in a managed Microsoft SQL Server. This includes any lookup tables and all administrative data from collections prior to any manual processing. This allows us to then match and join the data together in an end-to-end process using SQL queries.
To meet the requirement of having all source data in a database, database systems other than Microsoft SQL Server may be acceptable, though we can't support them in the same way.
**Why do it?**
The principle is that this source data will remain stable and is the point you can go back to and re-run the processes from if necessary. If for any reason the source data needs to change, your processes will be set up in a way that you can easily re-run them to get updated outputs based on the amended source data with minimal effort.
SQL is a fantastic language for large scale data joining and manipulation; it allows us to replicate end-to-end from raw data to final aggregate statistics output. Having all the data in one place and processing it in one place makes our lives easier, and also helps us when auditing our work and ensuring reproducibility of results.
**How to get started**
<!-- needs example -->
For a collection of relevant resources to use when learning SQL, see our [learning resources](l+d.html) page, and for guidance on best practice when writing SQL queries, see the [writing code](#writing-code) and [documentation](#documentation) sections on this page, as well as the guides immediately below on how to setup and use a SQL database.
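As a minimal sketch of what matching and joining source data in the database can look like from R, the code below builds a SQL query that joins a source table to a lookup table, and wraps the call that would run it. The schema, table and column names (`collection`, `pupil_returns`, `la_lookup`, `la_code`, `region_name`) are hypothetical placeholders rather than real DfE objects, and `con` is assumed to be a connection created with `DBI::dbConnect()`.

```r
# Build a query that joins a source table to a lookup table.
# All schema, table and column names used here are hypothetical placeholders.
build_join_query <- function(schema, source_table, lookup_table, key) {
  sprintf(
    paste(
      "SELECT s.*, l.region_name",
      "FROM [%s].[%s] AS s",
      "LEFT JOIN [%s].[%s] AS l",
      "ON s.[%s] = l.[%s]",
      sep = "\n"
    ),
    schema, source_table, schema, lookup_table, key, key
  )
}

# Run a query against an existing DBI connection and return a data frame.
# `con` is assumed to have been created with DBI::dbConnect().
read_joined_source <- function(con, query) {
  DBI::dbGetQuery(con, query)
}

query <- build_join_query("collection", "pupil_returns", "la_lookup", "la_code")
cat(query)
```

Keeping the query text in code like this means the join is documented and can be re-run unchanged whenever the source data is amended.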
---
### How to set up a SQL working area
---
There are a few different options, depending on where you want your new area to exist. Visit our [SQL learning page](sql.html#Setting_up_a_SQL_area) for details.
---
### Moving data to different areas
---
If your data is already in SQL, you can use this snippet of R code to move tables from one area (e.g. the iStore) to another (e.g. your team's modelling area) to ensure all data are stored in a database.
```
library(odbc)
library(dplyr)
library(dbplyr)
library(DBI)

# Step 1.1: Connect to source server
con_source <- dbConnect(odbc(),
  Driver = "SQL Server Native Client 11.0",
  Server = "Name_of_source_server",
  Database = "Source_database",
  Trusted_Connection = "yes"
)

# Step 1.2: Connect to target server
con_target <- dbConnect(odbc(),
  Driver = "SQL Server Native Client 11.0",
  Server = "Name_of_target_server",
  Database = "Your_target_database",
  Trusted_Connection = "yes"
)

# Step 2.1: Pull the table from the source database
table_for_transfer <- tbl(con_source, in_schema("schema_name", "table_name")) %>% collect()

# Step 2.2: Copy table into target database
dbWriteTable(con_target, "whatever_you_want_to_call_new_table", table_for_transfer)
```
---
### Importing data to SQL Server
---
There's lots of guidance online on how to import flat files from shared areas into Microsoft SQL Server, including [this guide](https://docs.microsoft.com/en-us/sql/relational-databases/import-export/import-flat-file-wizard?view=sql-server-2017){target="_blank" rel="noopener noreferrer"}.
Remember that it is important to import them with consistent, thought-through [naming conventions](#naming-conventions). You will thank yourself later.
---
### How to grant access to your area
---
Much like setting up a SQL area, there are different ways to do this depending on the server your database is in. Visit our [SQL learning page](sql.html#Givinggetting_access) for details.
---
# Writing code
---
The key thing to remember is that **we should be automating everything we can**, and the key to automation is writing code. Using code is as simple as telling your computer what to do. Code is just a list of instructions in a language that your computer can understand.
---
## Processing is done with code
---
`r knitr::include_graphics("images/good.svg")`
**What does this mean?**
All extraction, and processing of data should be done using code, avoiding any manual steps and moving away from a reliance on Excel, SPSS, and other manual processing. In order to carry out our jobs to the best of our ability it is imperative that we use the [appropriate tools](#appropriate-tools) for the work that we do.
Even steps such as copying and pasting data, or pointing and clicking, are fraught with danger, and these risks should be minimised by using code to document and execute these processes instead.
**Why do it?**
Using code brings numerous benefits. Computers are far quicker, more accurate, and far more reliable than humans in many of the tasks that we do. Writing out these instructions saves us significant amounts of time, particularly when they can be reused in future years, or even next week when one specific number in the source file suddenly changes. It also provides us with editable documentation for our production processes, saving the need to write down information in extra documents.
Reliability is a huge benefit of the automation that RAP brings - when one of the lines of data has to be amended a week before publication, it's a life saver to know that you can re-run your process in minutes, and reassuring to know that it will give you the result you want. You can run the same code 100 times, and be confident that it will follow the same steps in the same order every single time.
**How to get started**
See our [learning resources](l+d.html) for a wealth of resources on SQL and R to learn the skills required to translate your process into code.
There are also two sections below with examples of tidying data in SQL and R to get you started.
Ensure that any last-minute fixes to the process are written in the code and not done with manual changes.
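One way to keep a last-minute fix in code is to record it as a small corrections table and apply it in a script, so the same fix is repeated every time the process is re-run. The sketch below uses made-up data and hypothetical column names (`time_period`, `region_name`, `pupil_count`); it is an illustration of the idea rather than a prescribed approach.

```r
# Hypothetical underlying data with one value that needs a late correction
underlying_data <- data.frame(
  time_period = c(202122, 202122, 202122),
  region_name = c("North East", "North West", "London"),
  pupil_count = c(1200, 3400, 5600)
)

# Record the correction as data: which row to change and the corrected value
corrections <- data.frame(
  time_period = 202122,
  region_name = "North West",
  pupil_count_corrected = 3450
)

# Apply the corrections in code, matching rows on time period and region
apply_corrections <- function(data, corrections) {
  data_key <- paste(data$time_period, data$region_name)
  fix_key <- paste(corrections$time_period, corrections$region_name)
  hit <- match(data_key, fix_key)
  has_fix <- !is.na(hit)
  data$pupil_count[has_fix] <- corrections$pupil_count_corrected[hit[has_fix]]
  data
}

corrected_data <- apply_corrections(underlying_data, corrections)
print(corrected_data)
```

Because the correction lives in the script, it is visible in version control and cannot be silently lost the next time the file is regenerated.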
---
### Producing tidy underlying data in SQL
---
To get started, here is a [SQL query](https://github.com/TomFranklin/sql-applied-data-tidying/blob/master/data_tidying_l_and_d.sql){target="_blank" rel="noopener noreferrer"} that you can run on your own machine. It walks you through the basics of tidying a simple example dataset in SQL.
---
### Tidying and processing data in R
---
<!-- could have better examples -->
[Here is a video](https://vimeo.com/33727555){target="_blank" rel="noopener noreferrer"} of Hadley Wickham talking about how to tidy your data to these principles in R. This covers useful functions and how to complete common data tidying tasks in R. Also worth taking a look at [applied data tidying in R, by RStudio](https://www.youtube.com/watch?v=1ELALQlO-yM){target="_blank" rel="noopener noreferrer"}.
Using the `%>%` pipe in R can be incredibly powerful, and can make your code much easier to follow, as well as more efficient. If you aren't yet familiar with this, have a look at [this article](https://seananderson.ca/2014/09/13/dplyr-intro/){target="_blank" rel="noopener noreferrer"} that provides a useful beginner's guide to piping and the kinds of functions you can use it for. The possibilities stretch about as far as your imagination, and if you have a function or task you want to do within a pipe, googling 'how do I do X in dplyr r' will usually start to point you in the right direction. Alternatively, you can [contact us](mailto:[email protected]), and we'll be happy to help you figure out how to do what you need.
The quick example below shows how powerful this is. In it, my_data is processed to create a new column, rename columns, tidy the column names using the [janitor](https://garthtarr.github.io/meatR/janitor.html){target="_blank" rel="noopener noreferrer"} package, remove blank rows and columns, filter the data to only include a specific geographic level, and arrange the rows in order, all in a few lines of easy to follow code:
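```{r example, eval=FALSE}
library(dplyr)
library(janitor)

processed_regional_data <- my_data %>%
  # Create a new percentage column from two existing columns
  mutate(newPercentageColumn = (numberColumn / totalPopulationColumn) * 100) %>%
  # rename() takes new_name = old_name
  rename(
    percentageRate = newPercentageColumn,
    number = numberColumn,
    population = totalPopulationColumn
  ) %>%
  clean_names() %>% # tidy column names into snake_case
  remove_empty(which = c("rows", "cols")) %>% # drop blank rows and columns
  filter(geographic_level == "Regional") %>%
  arrange(time_period, region_name)
```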
[Helpful new functions](https://towardsdatascience.com/five-tidyverse-tricks-you-may-not-know-about-c5026d5a19da){target="_blank" rel="noopener noreferrer"} in the tidyverse packages can help you to easily transform data from wide to long format (see tip 2 in the linked article for this, as it is often required for tidy data), as well as providing you with tools that allow you to quickly and efficiently change the structure of your variables.
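A minimal sketch of a wide-to-long reshape is shown below, using `pivot_longer()` from the tidyr package (this assumes tidyr 1.0.0 or later is installed). The data and column names are made up for illustration.

```r
library(tidyr)

# Hypothetical wide data: one column per year
wide_data <- data.frame(
  region_name = c("North East", "London"),
  `2021` = c(10, 30),
  `2022` = c(12, 33),
  check.names = FALSE
)

# Gather every column except region_name into name/value pairs,
# giving one row per region and time period
long_data <- pivot_longer(
  wide_data,
  cols = -region_name,
  names_to = "time_period",
  values_to = "pupil_count"
)

print(long_data)
```

The long version has one observation per row, which is usually the shape needed for tidy underlying data files.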
For further resources on learning R so that you're able to apply it to your everyday work, have a look at the [learning resources](l+d.html) page.
---
## Appropriate tools
---
`r knitr::include_graphics("images/good.svg")`
**What does this mean?**
Using the recommended tools on our [learning](l+d.html) page ([SQL](sql.html), [R](r.html) and [Git](git.html)), or other suitable alternatives that allow you to meet the [core principles](#core-principles). Ideally any tools used would be open source. Python is a good example of a tool that would also be well suited, though it is less widely used in DfE and has a steeper learning curve than R.
Open-source refers to something people can modify and share because its design is publicly accessible. For more information, take a look at this [explanation of open-source](https://opensource.com/resources/what-open-source){target="_blank" rel="noopener noreferrer"}, as well as this guide to [working in an open-source way](https://opensource.com/open-source-way){target="_blank" rel="noopener noreferrer"}. In practical terms, this means moving away from the likes of SPSS, SAS and Excel VBA, and utilising the likes of R or Python, version controlled with Git, and hosted in a publicly accessible repository.
**Why do it?**
There are many reasons why we have recommended these tools. The recommended tools are:
- already in use at the department and easy for us to access
- easy and **free** to learn
- designed for the work that we do
- used widely across data science in both the public and private sector
- suitable for meeting best practice when applying RAP to our processes
**How to get started**
Go to our [learning](l+d.html) page to read more about the recommended tools for the jobs we do, as well as looking at the resources available there for how to [build capability](l+d.html#general_resources) in them. Always feel free to contact us if you have any specific questions or would like help in understanding how to use those tools in your work.
By following [our guidance](#Version_controlled_final_code_scripts) on saving versions of code in an Azure DevOps repository, we will then be able to mirror those repositories in a publicly available GitHub area.
---
## Using 'run' scripts
---
Utilising a single 'run' script to execute processes written in other scripts brings a number of benefits. It isn't just about removing the need to manually trigger different code scripts to get the outputs. It also means the entire process, from start to finish, is fully documented in one place. This is particularly helpful for enabling new team members to pick up existing work quickly, without wasting time struggling to understand what has been done in the past.
---
### Connecting R to SQL
---
In order to create a single script to run all processes from, it is likely that you will need to use R to run SQL queries. If you are unsure of how to do this, take a look at the materials from Cathy Atkinson's coffee and coding session on [connecting R to SQL using DBI and odbc](https://educationgovuk.sharepoint.com/sites/sarpi/g/WorkplaceDocuments/Forms/AllItems.aspx?FolderCTID=0x012000C61C1076C17C5547A6D6D8C2A27B5D97&View=%7B2B35083D%2D7626%2D48E2%2D9615%2D451544742692%7D&id=%2Fsites%2Fsarpi%2Fg%2FWorkplaceDocuments%2FInducation%20learning%20and%20career%20development%2FCoffee%20and%20Coding%2F180718%5Fcathy%5FR%5FSQL%2FConnecting%5Fto%5FSQL%5FRevA%2Ehtml&parent=%2Fsites%2Fsarpi%2Fg%2FWorkplaceDocuments%2FInducation%20learning%20and%20career%20development%2FCoffee%20and%20Coding%2F180718%5Fcathy%5FR%5FSQL).
Chris Mason-Thom did another coffee and coding session on this, which you can watch below:
<center>
<iframe width="640" height="360" src="https://web.microsoftstream.com/embed/video/c9b7fd97-c854-4a1a-9074-cd80d2ea285e?autoplay=false&showinfo=false" allowfullscreen style="border:none;"></iframe>
</center>
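As a rough sketch of how a 'run' script might execute an existing SQL script from R, the helper functions below read a `.sql` file into a string and send it to the database with `DBI::dbGetQuery()`. The function names `read_sql_file()` and `run_sql_file()` and the example query are hypothetical, and `con` is assumed to be a connection created with `DBI::dbConnect(odbc::odbc(), ...)`. Only the file-reading step is demonstrated here, as running the query needs a live connection.

```r
# Read a .sql file into a single string that can be sent to the database
read_sql_file <- function(path) {
  paste(readLines(path, warn = FALSE), collapse = "\n")
}

# Run the query held in a .sql file against an existing DBI connection.
# `con` is assumed to come from DBI::dbConnect(odbc::odbc(), ...).
run_sql_file <- function(con, path) {
  DBI::dbGetQuery(con, read_sql_file(path))
}

# Demonstrate the file-reading step with a temporary, hypothetical query file
sql_path <- tempfile(fileext = ".sql")
writeLines(
  c("SELECT la_code, pupil_count", "FROM collection.pupil_returns"),
  sql_path
)
query_text <- read_sql_file(sql_path)
cat(query_text)
```

This sketch assumes each file holds a single query. Scripts containing `GO` batch separators would need extra handling, because `GO` is a command for client tools such as SQL Server Management Studio rather than part of SQL itself.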
---
### Dataset production scripts
---
`r knitr::include_graphics("images/great.svg")`
**What does this mean?**
Each dataset can be created by running a single script, which may 'source' multiple scripts within it. This **does not** mean that all of the code to create a file must be written in a single script, but instead that there is a single 'create file' or 'run' script that sources every step in the correct order such that every step from beginning to end will be executed if you run that single 'run' script.
This 'run' script should take the source data right through to final output at the push of a button, including any manipulation, aggregation, suppression etc.
**Why do it?**
Having a script that documents the whole process for this saves time when needing to rerun processes, and provides a clear documentation of how a file is produced.
**How to get started**
Review your current process - how many scripts does it take to get from source data to final output, why are they separated, and what order should they be run in? Do you still have manual steps that could introduce human error (for example, manually moving column orders around in Excel)?
You should automate any manual steps such as the example above. If it makes sense to, you could combine certain scripts to reduce the number. You can then write code in [R](r.html) to execute your scripts in order, so you are still only running one script to get the final output.
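A minimal sketch of such a 'run' script is shown below. It assumes the processing scripts are numbered so that sorting their file names gives the correct order; the `run_pipeline()` function and the two temporary scripts are hypothetical stand-ins for a team's real scripts.

```r
# Source each processing script in order, sharing one environment so that
# objects created by earlier scripts are available to later ones
run_pipeline <- function(script_paths, envir = new.env()) {
  for (path in script_paths) {
    message("Running ", basename(path))
    source(path, local = envir)
  }
  invisible(envir)
}

# Demonstration with two temporary scripts standing in for real ones
script_dir <- tempfile("pipeline_")
dir.create(script_dir)
writeLines(
  "raw <- data.frame(count = c(1, 2, 3))",
  file.path(script_dir, "01_extract.R")
)
writeLines(
  "total <- sum(raw$count)",
  file.path(script_dir, "02_aggregate.R")
)

# Numbered file names mean that sorting gives the intended run order
scripts <- sort(list.files(script_dir, pattern = "\\.R$", full.names = TRUE))
results <- run_pipeline(scripts)
print(results$total)
```

The message printed before each script gives a simple log of what ran and in what order, which helps when working out where a failure happened.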
---
### Whole publication production scripts
---
`r knitr::include_graphics("images/best.svg")`
**What does this mean?**
The ultimate aim is to use a single script to document and run everything for a publication: the data files, any QA, and any summary reports. This script should also allow you to run individual outputs by themselves, so make sure that each data file can be run in isolation by running single lines of this script.
All quality assurance for a file is also included in the single script that can be used to create a file from source data (see the [dataset production scripts section](#dataset-production-scripts)).
**Why do it?**
This carries all of the same benefits as having a single 'run' script for a file, but at a wider publication level, effectively documenting the entire publication process in one place. This makes it easier for new analysts to pick up the process, and makes it quicker and easier to rerun, as all reports relating to a file are immediately available if you ever make changes to that file.
**How to get started**
The Education, Health and Care Plans production cycle is a good example of a single publication 'run' script. They have kept their actual data processing in SQL, but all the running and manipulation of the data happens in R.
The cycle originally consisted of multiple SQL scripts, manual QA and generation of final files.
`r knitr::include_graphics("images/Old process.jpg")`
The team now have their end-to-end process fully documented, which can be run off of one single R script. The 'run' script points at the SQL scripts to run them all in one go, and also creates a QA report and corresponding metadata files that pass the data screener. Each data file can still be run in isolation from this script.
`r knitr::include_graphics("images/New process.jpg")`
---
## Recyclable code for future use
---
`r knitr::include_graphics("images/great.svg")`
**What does this mean?**
We'd expect that any recyclable code would take less than 30 minutes of editing before being able to run again in a future iteration of the publication.
**Why do it?**
One huge benefit of using code in our processes is that we can pick it up in future years and reuse it with minimal effort, saving us huge amounts of resource. To be able to do this, we need to be conscious of how we write our code, and write it in a way that makes it easy to reuse in future releases of the publication.
**How to get started**
Review your code and consider the following:
- What steps might need re-editing or could become irrelevant?
- Can you move all variables that require manual input (e.g. table names, years) to be assigned at the top of the code, so it's easy to edit in one place with each iteration?
- Are there any fixed variables that are prone to changing, such as geographic boundaries, that you could prepare for now by making the code easy to adapt in future?
For example, if you refer to the year of publication in your code a lot, consider replacing every instance with a named variable, which you only need to change once at the start of your code. In the example below, the year is set at the top of the code, and is used to define "prev_year", both of which are used further down the code to filter the data based on year.
``` {r this_last_year, eval=FALSE}
library(dplyr)

# Set the year once here; everything below picks it up automatically
this_year <- 2020
prev_year <- this_year - 1

data_filtered <- data %>%
  filter(year == this_year)

data_filtered_last_year <- data %>%
  filter(year == prev_year)
```
---
## Standards for coding
---
Code can often be written in many different ways, and in languages such as R, there are often many different functions and routes that you can take to get to the same end result. On top of that, there are even more possibilities for how you can format the code. This section will take you through some widely used standards for coding to help bring standardisation to this area and make it easier to both write and use our code.
---
### Clean final code
---
`r knitr::include_graphics("images/best.svg")`
**What does this mean?**
- This code should meet the best practice standards below (for SQL and R). If you are using a different language, such as Python, then contact us for advice on the best standards to use when writing code.
- There should be no redundant or duplicated code, even if this has been commented out. It should be removed from the files to prevent confusion further down the line.
- The only comments left in the code should be those describing the decisions you have made to help other analysts (and future you) to understand your code. More guidance on [commenting in code](#commenting-in-code) can be found later on this page.
**Why do it?**
Clean code is efficient, easy to write, easy to review, and easy to amend for future use. Below are some recommended standards to follow when writing code in SQL and R.
**How to get started**
Watch the coffee and coding session introducing good code practice below:
<center>
<iframe width="640" height="360" src="https://web.microsoftstream.com/embed/video/624e3442-aa66-44e7-bb4f-717a6508b056?autoplay=false&showinfo=false" allowfullscreen style="border:none;"></iframe>
</center>
Then you should also watch the follow up intermediate session:
<center>
<iframe width="640" height="360" src="https://web.microsoftstream.com/embed/video/92d4ba05-d304-4009-b543-e5f4563f1d30?autoplay=false&showinfo=false" allowfullscreen style="border:none;"></iframe>
</center>
Clean code should include comments. Comment on why you've made decisions rather than on what the code is doing, unless it is particularly complex, as the code itself describes what it is doing. If in doubt, though, too many comments are better than too few. Ideally any specific comments or documentation should sit alongside the code itself, rather than in separate documents.
---
#### SQL
---
For best practice on writing SQL code, here is a particularly useful [Word document](resources/TSQL_Coding_Standards.docx){target="_blank" rel="noopener noreferrer"} produced by our [Data Hub](https://educationgovuk.sharepoint.com/sites/DataHubProgramme2/Shared%20Documents/Content%20Management/Data%20Hub%20one%20pager.pdf){target="_blank" rel="noopener noreferrer"}. This outlines a variety of best practices, ranging from naming conventions to formatting your SQL code so that it is easy to follow visually.
---
#### R
---
When using R, it is generally best practice to use [R projects](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects){target="_blank" rel="noopener noreferrer"} as directories for your work.
The recommended standard for styling your code in R is the [tidyverse styling](https://style.tidyverse.org/){target="_blank" rel="noopener noreferrer"}, which is fast becoming the global standard. Even better, you can automate this using the [styler](https://styler.r-lib.org/){target="_blank" rel="noopener noreferrer"} package, which will style your code for you at the click of a button, and is well worth a look.
`r knitr::include_graphics("images/styler.gif")`
There is also plenty of guidance around the internet for [best practice](https://waterdata.usgs.gov/blog/intro-best-practices/){target="_blank" rel="noopener noreferrer"} when writing [efficient R code](https://waterdata.usgs.gov/blog/intro-best-practices/){target="_blank" rel="noopener noreferrer"}.
---
#### HTML
---
If you ever find yourself writing HTML, or creating it through R Markdown, you can check your HTML using [W3C's validator](https://validator.w3.org/){target="_blank" rel="noopener noreferrer"}.
---
## Peer reviewing code
---
Peer review is an important element of quality assuring our work. We often do it without realising, by bouncing ideas off one another and by getting others to 'idiot check' our work. When writing code, getting our work formally peer reviewed is particularly important for ensuring its quality and value.
Prior to receiving code for peer review, the author should ensure that all code files are clean, commented appropriately and for larger projects should be held in a repo with an appropriate [README](#writing-a-readme-file) file.
When peer reviewing code, you should consider the following questions:
* Does the code do what the author intended?
* If you’re able to run the code, does it run without errors? If warnings are displayed, are they explained?
* If the project has unit/integration tests, do they pass?
* Are there any tests / checks that could be added into the code that would help to give greater confidence that it is doing what it is intended to?
* Are there comments explaining why any decisions have been made?
* Is the code written and structured sensibly?
* Are there any ways to make the code more efficient (either in number of lines or raw speed)?
* Does the code follow best practice for styling and structure?
* Are there any other teams/bits of code you're aware of that do similar things and would be useful to point the authors towards?
* At the end of the review, was there any information you needed to ask about that should be made more apparent in the code or documentation?
Depending on your access you may or may not be able to run the code yourself, but there should be enough information within the code and documentation to be able to respond to these questions.
---
### Review of code within team
---
`r knitr::include_graphics("images/great.svg")`
**What does this mean?**
- Is someone else in the team able to generate the same outputs?
- Has someone else in the team reviewed the code and given feedback?
- Have you taken on their feedback and improved the code?
**Why do it?**
There are many benefits to this, for example:
- Ensuring consistency across the team
- Minimizing mistakes and their impact
- Ensuring the requirements are met
- Improving code performance
- Sharing of techniques and knowledge
**How to get started**
If you can't answer yes, then:
- Get a member of the team to run the code using only your documentation
- Use their feedback to improve documentation/in-line comments in code
- Other tips for getting started with peer review can be found in the [Duck Book](https://best-practice-and-impact.github.io/qa-of-code-guidance/peer_review.html){target="_blank" rel="noopener noreferrer"}
- The Duck Book also contains some helpful [code QA checklists](https://best-practice-and-impact.github.io/qa-of-code-guidance/checklists.html){target="_blank" rel="noopener noreferrer"} to help get you thinking about what to check
---
#### Improving code performance
---
Peer reviewing code and not sure where to start? Improving code performance can be a great quick win for many production teams. There will be cases where the code you are reviewing does things slightly differently to how you would: benchmarking the R code with the microbenchmark package is a way to objectively work out which method is more efficient.
In the example below, we compare `case_when()`, `if_else()` and `ifelse()`.
```
library(dplyr) # provides case_when() and if_else()

microbenchmark::microbenchmark(
  case_when(1:1000 < 3 ~ "low", TRUE ~ "high"),
  if_else(1:1000 < 3, "low", "high"),
  ifelse(1:1000 < 3, "low", "high")
)
```
Running the code outputs a table in the R console, giving timing statistics for each expression. Here, it is clear that, on average, `if_else()` is the fastest function for the job.
```
Unit: microseconds
                                         expr     min       lq     mean   median       uq      max neval
 case_when(1:1000 < 3 ~ "low", TRUE ~ "high") 167.901 206.2510 372.7321 300.2515 420.1005 4187.001   100
           if_else(1:1000 < 3, "low", "high")  55.301  74.0010 125.8741 103.7015 138.3010  538.201   100
            ifelse(1:1000 < 3, "low", "high") 266.200 339.4505 466.7650 399.7010 637.6010  851.502   100
```
---
### Review of code from outside the team
---
`r knitr::include_graphics("images/best.svg")`
**What does this mean?**
- Has someone from outside of the team and publication area reviewed the code and given feedback?
- Have you taken on their feedback and improved the code?
**Why do it?**
All of the benefits you get from peer reviewing within your own team, multiple times over. Having someone external offers new perspectives, holds you to account by breaking down assumptions, and offers far greater opportunity for building capability through knowledge sharing.
**How to get started**
While peer reviewing code within the team is often practical, having external analysts peer review your code can bring a fresh perspective. If you're interested in this, please contact us, and we can help you to arrange for someone external to your team to review your processes. For this to work smoothly, we recommend that your code is easily accessible to other analysts, for example hosted in an Azure DevOps repo and mirrored to GitHub.
---
## Automated quality assurance
---
Any data files that have been created will need to be quality assured. These checks should be automated where possible, so that the computer does the hard work, saving us time and making the checks more reliable.
Some teams are already making great progress with automated QA and realising the benefits of it. The Statistics Development Team are working with these teams to provide generalised code that others can use as a starting point for automated QA. The intention is that teams can run this as a minimum, then add more area-specific checks to the script and/or continue with current checking processes in tandem. If your team already uses, or is working towards using, automated QA then get in touch as we'd be keen to see what you have.
It is assumed that when using R, automated scripts will output .html reports that the team can read through to understand their data and identify any issues, and save as a part of their process documentation.
For more information on general quality assurance best practice in DfE, see the [How to QA guide](https://dfe-analytical-services.github.io/how-to-qa/index.html).
---
### Basic automated QA
---
`r knitr::include_graphics("images/good.svg")`
**What does this mean?**
The list of basic automated QA checks, with code examples can be found below and in our [GitHub repository](https://github.com/dfe-analytical-services/automated-data-qa){target="_blank" rel="noopener noreferrer"}:
* Checking for [minimum](https://github.com/dfe-analytical-services/automated-data-qa/blob/main/R/check_minimum_values.R){target="_blank" rel="noopener noreferrer"}, [maximum](https://github.com/dfe-analytical-services/automated-data-qa/blob/main/R/check_maximum_values.R){target="_blank" rel="noopener noreferrer"}, and [average](https://github.com/dfe-analytical-services/automated-data-qa/blob/main/R/check_average_values.R){target="_blank" rel="noopener noreferrer"} values across your data
* Checking for [extreme values and outliers](https://github.com/dfe-analytical-services/automated-data-qa/blob/main/R/check_extreme_values.R){target="_blank" rel="noopener noreferrer"}
* Ensuring there are no [duplicate rows](https://github.com/dfe-analytical-services/automated-data-qa/blob/main/R/check_duplicate_rows.R){target="_blank" rel="noopener noreferrer"} or [duplicate columns](https://github.com/dfe-analytical-services/automated-data-qa/blob/main/R/check_duplicate_columns.R){target="_blank" rel="noopener noreferrer"}
* Checking that where appropriate, [geographical subtotals add up to totals](https://github.com/dfe-analytical-services/automated-data-qa/blob/main/R/check_LA_subtotals_vs_region.R){target="_blank" rel="noopener noreferrer"} (e.g. all the numeric values for LAs in Yorkshire and The Humber add up to the regional total)
* Basic [trend analysis using scatter plots](https://github.com/dfe-analytical-services/automated-data-qa/blob/main/R/create_scatter_plot.R){target="_blank" rel="noopener noreferrer"}, to help you spot outliers and help tell the story of your data.
The Statistics Development Team have developed [the QA app](https://rsconnect/rsc/dfe-published-data-qa/){target="_blank" rel="noopener noreferrer"} to include some of these basic QA outputs.
**Why do it?**
Quality is one of the three pillars that our [code of practice](https://code.statisticsauthority.gov.uk/the-code/quality/){target="_blank" rel="noopener noreferrer"} is built upon. These basic level checks allow us to have confidence that we are accurately processing the data.
Automating these checks ensures their accuracy and reliability, as well as being dramatically quicker than doing these manually.
**How to get started**
Try using our [template code snippets](https://github.com/dfe-analytical-services/automated-data-qa) to get an idea of how you could automate QA of your own publication files. A recording of our introduction to automated QA is also available at the top of the page.
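To give a flavour of what such checks can look like, here is a minimal base R sketch of two of the basic checks listed above: minimum, maximum and mean values, and duplicate rows. It is not the code from the repository, and the toy data and column names are made up for illustration.

```r
# Toy data standing in for a publication data file; the column names are
# hypothetical and chosen only for illustration.
pupil_data <- data.frame(
  time_period = c(2019, 2019, 2020, 2020, 2020),
  la_name = c("Leeds", "York", "Leeds", "York", "York"),
  pupil_count = c(120, 80, 130, 85, 85)
)

# Minimum, maximum and mean of every numeric column
numeric_cols <- names(pupil_data)[sapply(pupil_data, is.numeric)]
summary_stats <- data.frame(
  column = numeric_cols,
  min = sapply(pupil_data[numeric_cols], min, na.rm = TRUE),
  max = sapply(pupil_data[numeric_cols], max, na.rm = TRUE),
  mean = sapply(pupil_data[numeric_cols], mean, na.rm = TRUE),
  row.names = NULL
)
print(summary_stats)

# Rows that exactly repeat an earlier row
duplicate_rows <- pupil_data[duplicated(pupil_data), ]
print(duplicate_rows)
```

In a real process these outputs would typically be written into an .html QA report rather than printed to the console.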
---
### Publication specific automated QA
---
`r knitr::include_graphics("images/great.svg")`
**What does this mean?**
Many teams will have aspects of their data and processes that require quality assurance beyond the generalisable basic checks above. Teams are therefore expected to develop their own automated QA checks for the specifics of their publications that the basic checks do not cover.
**Why do it?**
Quality is one of the three pillars that our [code of practice](https://code.statisticsauthority.gov.uk/the-code/quality/){target="_blank" rel="noopener noreferrer"} is built upon. By building upon the basic checks to develop bespoke QA for our publications, we can increase our confidence in the quality of the processes and outputs that they produce.
**How to get started**
We expect that the basic level of automated QA will cover most needs that publication teams have. However, we also expect that each publication will have its own quirks that require a more bespoke approach. An example of a publication with its own bespoke QA checks will appear in this space shortly. For the time being, try to consider what things you'd usually check as flags that something hasn't gone right with your data. What are the unique aspects of your publication's data, and how can you automate checks against them to give you confidence in its accuracy and reliability?
For those who are interested in starting writing their own QA scripts, it's worth looking at packages in R such as [testthat](https://testthat.r-lib.org/){target="_blank" rel="noopener noreferrer"}, including the [coffee and coding talk](https://educationgovuk.sharepoint.com/sites/sarpi/g/WorkplaceDocuments/Forms/AllItems.aspx?RootFolder=/sites/sarpi/g/WorkplaceDocuments/Inducation%20learning%20and%20career%20development/Coffee%20and%20Coding/190306_peter_autotesting&FolderCTID=0x012000C61C1076C17C5547A6D6D8C2A27B5D97&View=%7b2B35083D-7626-48E2-9615-451544742692%7d){target="_blank" rel="noopener noreferrer"} on it by Peter Curtis, as well as this [guide on testing](http://r-pkgs.had.co.nz/tests.html){target="_blank" rel="noopener noreferrer"} by Hadley Wickham.
The [janitor](https://garthtarr.github.io/meatR/janitor.html){target="_blank" rel="noopener noreferrer"} package in R also has some particularly useful functions, such as `clean_names()` to automatically clean up your variable names, `remove_empty()` to remove any completely empty rows and columns, and `get_dupes()` which retrieves any duplicate rows in your data - this last one is particularly powerful as you can feed it specific columns and see if there's any duplicate instances of values across those columns.
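As a sketch of what a bespoke check might look like, the base R example below flags rows where a percentage falls outside 0 to 100. The function name `check_percentage_range()`, the toy data and the column names are all hypothetical; a real check would reflect the rules of your own publication.

```r
# Hypothetical bespoke check: values in a percentage column should lie
# between 0 and 100. Returns the rows that fail, ignoring missing values.
check_percentage_range <- function(data, column) {
  values <- data[[column]]
  out_of_range <- !is.na(values) & (values < 0 | values > 100)
  data[out_of_range, , drop = FALSE]
}

# Toy data with made-up values for illustration
absence_data <- data.frame(
  la_name = c("Leeds", "York", "Hull"),
  absence_rate = c(4.7, 104.2, NA)
)

failures <- check_percentage_range(absence_data, "absence_rate")

# Surface the problem without stopping the rest of the QA script
if (nrow(failures) > 0) {
  warning(nrow(failures), " row(s) have absence_rate outside 0 to 100")
}
```

Returning the failing rows, rather than only TRUE or FALSE, makes it easier to include them in a QA report.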
---
## Automating summary statistics
---
As a part of automating QA, we should also be looking to automate the production of summary statistics alongside the tidy underlying data files. This provides us with instant insight into the stories underneath the numbers.
---
### Automated summaries
---
`r knitr::include_graphics("images/great.svg")`
**What does this mean?**
Summary outputs are automated and used to explore the stories of the data.
The Statistics Development Team have developed [the QA app](https://rsconnect/rsc/dfe-published-data-qa/){target="_blank" rel="noopener noreferrer"} to include some of these automated summaries, including minimum, maximum and average summaries for each indicator.
At a basic level we want teams to make use of the QA app to explore their data:
- Have you used the outputs of the automated QA from the screener to understand the data?
- Have you run the automated QA and ensured that all interesting outputs/trends are reflected in the accompanying text?
**Why do it?**
Value is one of the three pillars of our [code of practice](https://code.statisticsauthority.gov.uk/the-code/value/){target="_blank" rel="noopener noreferrer"}. Even more specifically it states that __'Statistics and data should be presented clearly, explained meaningfully and provide authoritative insights that serve the public good.'__.
As a result, we should be developing automated summaries to help us to better understand the story of the data and be authoritative and rigorous in our telling of it.
**How to get started**
Consider:
- Using the additional tabs available after a data file passes the [data screener](https://rsconnect/rsc/dfe-published-data-qa/){target="_blank" rel="noopener noreferrer"} as a starting point to explore trends across breakdowns and years.
- Running your publication-specific automated QA, ensuring that all interesting outputs/trends are reflected in the accompanying text.
---
### Publication specific automated summaries
---
`r knitr::include_graphics("images/best.svg")`
**What does this mean?**
- Have you gone beyond the outputs of the QA app to consider automating further insights for your publication specifically? E.g. year on year changes for specific measures, comparisons of different characteristics that are of interest to the general public
- Are you using these outputs to write your commentary?
**Why do it?**
All publications are different, and therefore it is important that for each publication, teams go beyond the basics and produce automated summaries specific to their area.
**How to get started**
Consider:
- Integrating extra publication-specific QA into the production process
- Producing outputs specific to your publication that would help you to write commentary/draw out interesting analysis
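One common publication-specific summary is the year-on-year change in a headline measure. The base R sketch below shows one way this could be calculated; the data and column names are invented for illustration.

```r
# Toy headline figures; the values and column names are hypothetical
headline <- data.frame(
  time_period = c(2018, 2019, 2020),
  pupil_count = c(1000, 1100, 1045)
)

# Make sure the years are in order before comparing neighbouring rows
headline <- headline[order(headline$time_period), ]

# Previous year's value, with NA for the first year in the series
previous <- c(NA, head(headline$pupil_count, -1))

# Absolute and percentage change on the previous year
headline$change <- headline$pupil_count - previous
headline$pct_change <- round(100 * headline$change / previous, 1)

print(headline)
```

Outputs like this can feed straight into commentary, for example by highlighting the measures with the largest changes.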
---
# Version control
---
_When you assume you make an 'ass' out of 'u' and 'me'_. Everyone knows this saying, yet few of us heed its warning.
The aim should be to leave your work in a state that others (including future you!) can pick it up and immediately find what they need, understanding the processes that have happened previously. Changes to files should be documented, and published versions should be clearly named and stored in their own folder.
As we work with code to process our data more and more, we can begin to utilise version control software to make this process much easier, allowing simultaneous collaboration on files.
---
## Sensible folder and file structure
---
`r knitr::include_graphics("images/good.svg")`
**What does this mean?**
As a minimum you should have a folder that includes all of the final versions of documents produced and published, per release, within a folder for the wider publication. Ask yourself if it would be easy for someone who isn't in the team to find specific files, and if not, is there a better way that you could name and structure your folders to make them more intuitive to navigate?
**Why do it?**
How you organize and name your files will have a big impact on your ability to find those files later and to understand what they contain. You should be consistent and descriptive in naming and organizing files so that it is obvious where to find specific data and what the files contain.
**How to get started**
Some questions to help you consider whether your folder structure is sensible are:
- Are all documentation, code and outputs for the publication saved in one folder area?
- Is simple version control clearly applied (e.g. having all final files in a folder named "final")?
- Are there sub-folders like 'code', 'documentation', 'outputs' and 'final' to save the relevant working files in?
- Are you keeping a version log up to date with any changes made to files in this final folder?
---
### Naming conventions
---
Having a **clear** and **consistent** naming convention for your files is critical. Remember that file names should:
**Be machine readable**
- Avoid spaces.
- Avoid special characters such as: ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ‘ “.
- Be as short as practicable; overly long names do not work well with all types of software.
**Be human readable**
- Be easy to understand the contents from the name.
**Play well with default ordering**
- Often (though not always!) you should have numbers first, particularly if your file names include dates.
- Follow the ISO 8601 date standard (YYYYMMDD) to ensure that all of your files stay in chronological order.
- Use leading zeros to left pad numbers and ensure files sort properly, avoiding 1,10,2,3.
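As a small illustration of the date and zero-padding points above, the base R sketch below builds file names that sort correctly by default. The release date and the `absence_table` stem are made-up examples.

```r
# Hypothetical release date and file stem, for illustration only
release_date <- as.Date("2020-03-05")

# ISO 8601 style date stamp with no separators (YYYYMMDD)
date_stamp <- format(release_date, "%Y%m%d")

# Left pad table numbers with zeros so that 10 sorts after 2
table_numbers <- sprintf("%02d", c(1, 2, 10))

file_names <- paste0(date_stamp, "_absence_table_", table_numbers, ".csv")
print(file_names)
```

Because the date comes first and the numbers are padded, sorting these names alphabetically also sorts them chronologically and numerically.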
If in doubt, take a look at this [presentation](https://speakerdeck.com/jennybc/how-to-name-files){target="_blank" rel="noopener noreferrer"}, or this [naming convention guide by Stanford](https://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming){target="_blank" rel="noopener noreferrer"}, for examples reinforcing the above.
---
## Documentation
---
`r knitr::include_graphics("images/good.svg")`
**What does this mean?**
- You should be annotating as you go, ensuring that every process and decision made is written down. Processes are ideally written with code, and decisions in comments.
- There should be a [README](#writing-a-readme-file) notes file that clearly details the steps in the process, any dependencies (such as places where you need to request access) and how to carry out the process.
- Any specialist terms should also be defined if required (e.g. The NFTYPE lookup can be found in xxxxx. "NFTYPE" means school type).
**Why do it?**
When documenting your processes you should leave nothing to chance. We have all wasted time in the past trying to work out what it was that we had done before, and that time increases even more when we are picking up someone else's work. Thorough documentation saves us time, and provides a clear audit trail of what we do. This is key for the 'Reproducible' part of RAP: our processes must be easily reproducible, and clear documentation is fundamental to that.
**How to get started**
Take a look at your processes and be critical - could another analyst pick them up without you there to help them? If the answer is no (don't feel ashamed, it will be for many teams) then go through and note down areas that require improvement, so that you can revise them with your team.
Take a look at the sections below for further guidance on improving your documentation.
---
### Commenting in code
---
When writing code, whether that is SQL, R, or something else, make sure you're commenting as you go. Start off every file by outlining the date, author, purpose, and if applicable, the structure of the file, like this:
```
----------------------------------------------------------------------------------------------
-- Script Name: Section 251 Table A 2019 - s251_tA_2019.sql
-- Description: Extraction of data from IStore and production of underlying data file
-- Author: Cam Race
-- Creation Date: 15/11/2019
----------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------
--// Process
-- 1. Extract the data for each available year
-- 2. Match in extra geographical information
-- 3. Create aggregations - both categorical and geographical totals
-- 4. Tidy up and output results
-- 5. Metadata creation
----------------------------------------------------------------------------------------------
```
Commented lines should begin with -- (SQL) or # (R), followed by one space and your comment. Remember that **comments should explain the why, not the what.**
In SQL you can also use `/*` and `*/` to bookend comments over multiple lines.
In rmarkdown documents you can bookend comments by using `<!--` and `-->`.
Use commented lines of - to break up your files into scannable chunks based upon the structure and subheadings, like the R example below:
```
# Importing the data -----------------------------------------------------------------------------------
```
Doing this can visually break up your code into sections that are easy to navigate around. It will also add that section to your outline, which can be used in RStudio using Ctrl-Shift-O. More details on the possibilities for this can be found in the [RStudio guidance on folding and sectioning code](https://support.rstudio.com/hc/en-us/articles/200484568-Code-Folding-and-Sections){target="_blank" rel="noopener noreferrer"}.
You might be thinking that it would be nice if there was software that could help you with documentation. If so, read on, as Git is an incredibly powerful tool that can help us easily and thoroughly document versions of our files. If you're at the stage where you are developing your own functions and packages in R, then take a look at [roxygen2](https://roxygen2.r-lib.org/){target="_blank" rel="noopener noreferrer"} as well.
---
### Writing a README file
---
**What does this mean?**
A README is a text file (for example .txt or .md) that introduces and explains a project. It contains information that is required to understand what the project is about and how to use it.
**Why do it?**
It's an easy way to answer questions that your audience will likely have regarding how to install and use your project and also how to collaborate with you.
**How to get started**
As a starting point, you should aim to have as many of the following sections as are applicable to your project:
- Introduction
- Requirements (access, software, skills/knowledge)
- How to use
- How to contribute
- Contact details
The [Self-assessment tool](https://github.com/dfe-analytical-services/publication-self-assessment-copy){target="_blank" rel="noopener noreferrer"} and the [QA app](https://github.com/dfe-analytical-services/dfe-published-data-qa){target="_blank" rel="noopener noreferrer"} give two examples of readme files structured like this.
---
## Version control with git
---
If you do not already have Git downloaded, you can [download the latest version from their website](https://git-scm.com/downloads).
For now, take a look at the [resources for learning Git](l+d.html#git) on the learning resources page.
---
### Version controlled final code scripts
---