-
Notifications
You must be signed in to change notification settings - Fork 2
/
databases.html
2389 lines (1741 loc) · 147 KB
/
databases.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>Working with large datasets in SQL, R, and Python</title>
<script type="text/javascript">
window.onload = function() {
var imgs = document.getElementsByTagName('img'), i, img;
for (i = 0; i < imgs.length; i++) {
img = imgs[i];
// center an image if it is the only element of its parent
if (img.parentElement.childElementCount === 1)
img.parentElement.style.textAlign = 'center';
}
};
</script>
<!-- Styles for R syntax highlighter -->
<style type="text/css">
pre .operator,
pre .paren {
color: rgb(104, 118, 135)
}
pre .literal {
color: #990073
}
pre .number {
color: #099;
}
pre .comment {
color: #998;
font-style: italic
}
pre .keyword {
color: #900;
font-weight: bold
}
pre .identifier {
color: rgb(0, 0, 0);
}
pre .string {
color: #d14;
}
</style>
<!-- R syntax highlighter -->
<script type="text/javascript">
var hljs=new function(){function m(p){return p.replace(/&/gm,"&").replace(/</gm,"<")}function f(r,q,p){return RegExp(q,"m"+(r.cI?"i":"")+(p?"g":""))}function b(r){for(var p=0;p<r.childNodes.length;p++){var q=r.childNodes[p];if(q.nodeName=="CODE"){return q}if(!(q.nodeType==3&&q.nodeValue.match(/\s+/))){break}}}function h(t,s){var p="";for(var r=0;r<t.childNodes.length;r++){if(t.childNodes[r].nodeType==3){var q=t.childNodes[r].nodeValue;if(s){q=q.replace(/\n/g,"")}p+=q}else{if(t.childNodes[r].nodeName=="BR"){p+="\n"}else{p+=h(t.childNodes[r])}}}if(/MSIE [678]/.test(navigator.userAgent)){p=p.replace(/\r/g,"\n")}return p}function a(s){var r=s.className.split(/\s+/);r=r.concat(s.parentNode.className.split(/\s+/));for(var q=0;q<r.length;q++){var p=r[q].replace(/^language-/,"");if(e[p]){return p}}}function c(q){var p=[];(function(s,t){for(var r=0;r<s.childNodes.length;r++){if(s.childNodes[r].nodeType==3){t+=s.childNodes[r].nodeValue.length}else{if(s.childNodes[r].nodeName=="BR"){t+=1}else{if(s.childNodes[r].nodeType==1){p.push({event:"start",offset:t,node:s.childNodes[r]});t=arguments.callee(s.childNodes[r],t);p.push({event:"stop",offset:t,node:s.childNodes[r]})}}}}return t})(q,0);return p}function k(y,w,x){var q=0;var z="";var s=[];function u(){if(y.length&&w.length){if(y[0].offset!=w[0].offset){return(y[0].offset<w[0].offset)?y:w}else{return w[0].event=="start"?y:w}}else{return y.length?y:w}}function t(D){var A="<"+D.nodeName.toLowerCase();for(var B=0;B<D.attributes.length;B++){var C=D.attributes[B];A+=" "+C.nodeName.toLowerCase();if(C.value!==undefined&&C.value!==false&&C.value!==null){A+='="'+m(C.value)+'"'}}return A+">"}while(y.length||w.length){var v=u().splice(0,1)[0];z+=m(x.substr(q,v.offset-q));q=v.offset;if(v.event=="start"){z+=t(v.node);s.push(v.node)}else{if(v.event=="stop"){var p,r=s.length;do{r--;p=s[r];z+=("</"+p.nodeName.toLowerCase()+">")}while(p!=v.node);s.splice(r,1);while(r<s.length){z+=t(s[r]);r++}}}}return z+m(x.substr(q))}function j(){function q(x,y,v){if(x.compiled){return}var u;var s=[];if(x.k){x.lR=f(y,x.l||hljs.IR,true);for(var w in x.k){if(!x.k.hasOwnProperty(w)){continue}if(x.k[w] instanceof Object){u=x.k[w]}else{u=x.k;w="keyword"}for(var r in u){if(!u.hasOwnProperty(r)){continue}x.k[r]=[w,u[r]];s.push(r)}}}if(!v){if(x.bWK){x.b="\\b("+s.join("|")+")\\s"}x.bR=f(y,x.b?x.b:"\\B|\\b");if(!x.e&&!x.eW){x.e="\\B|\\b"}if(x.e){x.eR=f(y,x.e)}}if(x.i){x.iR=f(y,x.i)}if(x.r===undefined){x.r=1}if(!x.c){x.c=[]}x.compiled=true;for(var t=0;t<x.c.length;t++){if(x.c[t]=="self"){x.c[t]=x}q(x.c[t],y,false)}if(x.starts){q(x.starts,y,false)}}for(var p in e){if(!e.hasOwnProperty(p)){continue}q(e[p].dM,e[p],true)}}function d(B,C){if(!j.called){j();j.called=true}function q(r,M){for(var L=0;L<M.c.length;L++){if((M.c[L].bR.exec(r)||[null])[0]==r){return M.c[L]}}}function v(L,r){if(D[L].e&&D[L].eR.test(r)){return 1}if(D[L].eW){var M=v(L-1,r);return M?M+1:0}return 0}function w(r,L){return L.i&&L.iR.test(r)}function K(N,O){var M=[];for(var L=0;L<N.c.length;L++){M.push(N.c[L].b)}var r=D.length-1;do{if(D[r].e){M.push(D[r].e)}r--}while(D[r+1].eW);if(N.i){M.push(N.i)}return f(O,M.join("|"),true)}function p(M,L){var N=D[D.length-1];if(!N.t){N.t=K(N,E)}N.t.lastIndex=L;var r=N.t.exec(M);return r?[M.substr(L,r.index-L),r[0],false]:[M.substr(L),"",true]}function z(N,r){var L=E.cI?r[0].toLowerCase():r[0];var M=N.k[L];if(M&&M instanceof Array){return M}return false}function F(L,P){L=m(L);if(!P.k){return L}var r="";var O=0;P.lR.lastIndex=0;var M=P.lR.exec(L);while(M){r+=L.substr(O,M.index-O);var N=z(P,M);if(N){x+=N[1];r+='<span class="'+N[0]+'">'+M[0]+"</span>"}else{r+=M[0]}O=P.lR.lastIndex;M=P.lR.exec(L)}return r+L.substr(O,L.length-O)}function J(L,M){if(M.sL&&e[M.sL]){var r=d(M.sL,L);x+=r.keyword_count;return r.value}else{return F(L,M)}}function I(M,r){var L=M.cN?'<span class="'+M.cN+'">':"";if(M.rB){y+=L;M.buffer=""}else{if(M.eB){y+=m(r)+L;M.buffer=""}else{y+=L;M.buffer=r}}D.push(M);A+=M.r}function G(N,M,Q){var R=D[D.length-1];if(Q){y+=J(R.buffer+N,R);return false}var P=q(M,R);if(P){y+=J(R.buffer+N,R);I(P,M);return P.rB}var L=v(D.length-1,M);if(L){var O=R.cN?"</span>":"";if(R.rE){y+=J(R.buffer+N,R)+O}else{if(R.eE){y+=J(R.buffer+N,R)+O+m(M)}else{y+=J(R.buffer+N+M,R)+O}}while(L>1){O=D[D.length-2].cN?"</span>":"";y+=O;L--;D.length--}var r=D[D.length-1];D.length--;D[D.length-1].buffer="";if(r.starts){I(r.starts,"")}return R.rE}if(w(M,R)){throw"Illegal"}}var E=e[B];var D=[E.dM];var A=0;var x=0;var y="";try{var s,u=0;E.dM.buffer="";do{s=p(C,u);var t=G(s[0],s[1],s[2]);u+=s[0].length;if(!t){u+=s[1].length}}while(!s[2]);if(D.length>1){throw"Illegal"}return{r:A,keyword_count:x,value:y}}catch(H){if(H=="Illegal"){return{r:0,keyword_count:0,value:m(C)}}else{throw H}}}function g(t){var p={keyword_count:0,r:0,value:m(t)};var r=p;for(var q in e){if(!e.hasOwnProperty(q)){continue}var s=d(q,t);s.language=q;if(s.keyword_count+s.r>r.keyword_count+r.r){r=s}if(s.keyword_count+s.r>p.keyword_count+p.r){r=p;p=s}}if(r.language){p.second_best=r}return p}function i(r,q,p){if(q){r=r.replace(/^((<[^>]+>|\t)+)/gm,function(t,w,v,u){return w.replace(/\t/g,q)})}if(p){r=r.replace(/\n/g,"<br>")}return r}function n(t,w,r){var x=h(t,r);var v=a(t);var y,s;if(v){y=d(v,x)}else{return}var q=c(t);if(q.length){s=document.createElement("pre");s.innerHTML=y.value;y.value=k(q,c(s),x)}y.value=i(y.value,w,r);var u=t.className;if(!u.match("(\\s|^)(language-)?"+v+"(\\s|$)")){u=u?(u+" "+v):v}if(/MSIE [678]/.test(navigator.userAgent)&&t.tagName=="CODE"&&t.parentNode.tagName=="PRE"){s=t.parentNode;var p=document.createElement("div");p.innerHTML="<pre><code>"+y.value+"</code></pre>";t=p.firstChild.firstChild;p.firstChild.cN=s.cN;s.parentNode.replaceChild(p.firstChild,s)}else{t.innerHTML=y.value}t.className=u;t.result={language:v,kw:y.keyword_count,re:y.r};if(y.second_best){t.second_best={language:y.second_best.language,kw:y.second_best.keyword_count,re:y.second_best.r}}}function o(){if(o.called){return}o.called=true;var r=document.getElementsByTagName("pre");for(var p=0;p<r.length;p++){var q=b(r[p]);if(q){n(q,hljs.tabReplace)}}}function l(){if(window.addEventListener){window.addEventListener("DOMContentLoaded",o,false);window.addEventListener("load",o,false)}else{if(window.attachEvent){window.attachEvent("onload",o)}else{window.onload=o}}}var e={};this.LANGUAGES=e;this.highlight=d;this.highlightAuto=g;this.fixMarkup=i;this.highlightBlock=n;this.initHighlighting=o;this.initHighlightingOnLoad=l;this.IR="[a-zA-Z][a-zA-Z0-9_]*";this.UIR="[a-zA-Z_][a-zA-Z0-9_]*";this.NR="\\b\\d+(\\.\\d+)?";this.CNR="\\b(0[xX][a-fA-F0-9]+|(\\d+(\\.\\d*)?|\\.\\d+)([eE][-+]?\\d+)?)";this.BNR="\\b(0b[01]+)";this.RSR="!|!=|!==|%|%=|&|&&|&=|\\*|\\*=|\\+|\\+=|,|\\.|-|-=|/|/=|:|;|<|<<|<<=|<=|=|==|===|>|>=|>>|>>=|>>>|>>>=|\\?|\\[|\\{|\\(|\\^|\\^=|\\||\\|=|\\|\\||~";this.ER="(?![\\s\\S])";this.BE={b:"\\\\.",r:0};this.ASM={cN:"string",b:"'",e:"'",i:"\\n",c:[this.BE],r:0};this.QSM={cN:"string",b:'"',e:'"',i:"\\n",c:[this.BE],r:0};this.CLCM={cN:"comment",b:"//",e:"$"};this.CBLCLM={cN:"comment",b:"/\\*",e:"\\*/"};this.HCM={cN:"comment",b:"#",e:"$"};this.NM={cN:"number",b:this.NR,r:0};this.CNM={cN:"number",b:this.CNR,r:0};this.BNM={cN:"number",b:this.BNR,r:0};this.inherit=function(r,s){var p={};for(var q in r){p[q]=r[q]}if(s){for(var q in s){p[q]=s[q]}}return p}}();hljs.LANGUAGES.cpp=function(){var a={keyword:{"false":1,"int":1,"float":1,"while":1,"private":1,"char":1,"catch":1,"export":1,virtual:1,operator:2,sizeof:2,dynamic_cast:2,typedef:2,const_cast:2,"const":1,struct:1,"for":1,static_cast:2,union:1,namespace:1,unsigned:1,"long":1,"throw":1,"volatile":2,"static":1,"protected":1,bool:1,template:1,mutable:1,"if":1,"public":1,friend:2,"do":1,"return":1,"goto":1,auto:1,"void":2,"enum":1,"else":1,"break":1,"new":1,extern:1,using:1,"true":1,"class":1,asm:1,"case":1,typeid:1,"short":1,reinterpret_cast:2,"default":1,"double":1,register:1,explicit:1,signed:1,typename:1,"try":1,"this":1,"switch":1,"continue":1,wchar_t:1,inline:1,"delete":1,alignof:1,char16_t:1,char32_t:1,constexpr:1,decltype:1,noexcept:1,nullptr:1,static_assert:1,thread_local:1,restrict:1,_Bool:1,complex:1},built_in:{std:1,string:1,cin:1,cout:1,cerr:1,clog:1,stringstream:1,istringstream:1,ostringstream:1,auto_ptr:1,deque:1,list:1,queue:1,stack:1,vector:1,map:1,set:1,bitset:1,multiset:1,multimap:1,unordered_set:1,unordered_map:1,unordered_multiset:1,unordered_multimap:1,array:1,shared_ptr:1}};return{dM:{k:a,i:"</",c:[hljs.CLCM,hljs.CBLCLM,hljs.QSM,{cN:"string",b:"'\\\\?.",e:"'",i:"."},{cN:"number",b:"\\b(\\d+(\\.\\d*)?|\\.\\d+)(u|U|l|L|ul|UL|f|F)"},hljs.CNM,{cN:"preprocessor",b:"#",e:"$"},{cN:"stl_container",b:"\\b(deque|list|queue|stack|vector|map|set|bitset|multiset|multimap|unordered_map|unordered_set|unordered_multiset|unordered_multimap|array)\\s*<",e:">",k:a,r:10,c:["self"]}]}}}();hljs.LANGUAGES.r={dM:{c:[hljs.HCM,{cN:"number",b:"\\b0[xX][0-9a-fA-F]+[Li]?\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\b\\d+(?:[eE][+\\-]?\\d*)?L\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\b\\d+\\.(?!\\d)(?:i\\b)?",e:hljs.IMMEDIATE_RE,r:1},{cN:"number",b:"\\b\\d+(?:\\.\\d*)?(?:[eE][+\\-]?\\d*)?i?\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\.\\d+(?:[eE][+\\-]?\\d*)?i?\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"keyword",b:"(?:tryCatch|library|setGeneric|setGroupGeneric)\\b",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\.\\.\\.",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\.\\.\\d+(?![\\w.])",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\b(?:function)",e:hljs.IMMEDIATE_RE,r:2},{cN:"keyword",b:"(?:if|in|break|next|repeat|else|for|return|switch|while|try|stop|warning|require|attach|detach|source|setMethod|setClass)\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"literal",b:"(?:NA|NA_integer_|NA_real_|NA_character_|NA_complex_)\\b",e:hljs.IMMEDIATE_RE,r:10},{cN:"literal",b:"(?:NULL|TRUE|FALSE|T|F|Inf|NaN)\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"identifier",b:"[a-zA-Z.][a-zA-Z0-9._]*\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"operator",b:"<\\-(?!\\s*\\d)",e:hljs.IMMEDIATE_RE,r:2},{cN:"operator",b:"\\->|<\\-",e:hljs.IMMEDIATE_RE,r:1},{cN:"operator",b:"%%|~",e:hljs.IMMEDIATE_RE},{cN:"operator",b:">=|<=|==|!=|\\|\\||&&|=|\\+|\\-|\\*|/|\\^|>|<|!|&|\\||\\$|:",e:hljs.IMMEDIATE_RE,r:0},{cN:"operator",b:"%",e:"%",i:"\\n",r:1},{cN:"identifier",b:"`",e:"`",r:0},{cN:"string",b:'"',e:'"',c:[hljs.BE],r:0},{cN:"string",b:"'",e:"'",c:[hljs.BE],r:0},{cN:"paren",b:"[[({\\])}]",e:hljs.IMMEDIATE_RE,r:0}]}};
hljs.initHighlightingOnLoad();
</script>
<!-- MathJax scripts -->
<script type="text/javascript" src="https://cdn.bootcss.com/mathjax/2.7.0/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<style type="text/css">
body, td {
font-family: sans-serif;
background-color: white;
font-size: 13px;
}
body {
max-width: 800px;
margin: auto;
padding: 1em;
line-height: 20px;
}
tt, code, pre {
font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
}
h1 {
font-size:2.2em;
}
h2 {
font-size:1.8em;
}
h3 {
font-size:1.4em;
}
h4 {
font-size:1.0em;
}
h5 {
font-size:0.9em;
}
h6 {
font-size:0.8em;
}
a:visited {
color: rgb(50%, 0%, 50%);
}
pre, img {
max-width: 100%;
}
pre {
overflow-x: auto;
}
pre code {
display: block; padding: 0.5em;
}
code {
font-size: 92%;
border: 1px solid #ccc;
}
code[class] {
background-color: #F8F8F8;
}
table, td, th {
border: none;
}
blockquote {
color:#666666;
margin:0;
padding-left: 1em;
border-left: 0.5em #EEE solid;
}
hr {
height: 0px;
border-bottom: none;
border-top-width: thin;
border-top-style: dotted;
border-top-color: #999999;
}
@media print {
* {
background: transparent !important;
color: black !important;
filter:none !important;
-ms-filter: none !important;
}
body {
font-size:12pt;
max-width:100%;
}
a, a:visited {
text-decoration: underline;
}
hr {
visibility: hidden;
page-break-before: always;
}
pre, blockquote {
padding-right: 1em;
page-break-inside: avoid;
}
tr, img {
page-break-inside: avoid;
}
img {
max-width: 100% !important;
}
@page :left {
margin: 15mm 20mm 15mm 10mm;
}
@page :right {
margin: 15mm 10mm 15mm 20mm;
}
p, h2, h3 {
orphans: 3; widows: 3;
}
h2, h3 {
page-break-after: avoid;
}
}
</style>
</head>
<body>
<h1>Working with large datasets in SQL, R, and Python</h1>
<h2>Querying and manipulating databases and datasets in R and Python</h2>
<p>Chris Paciorek, Department of Statistics, UC Berkeley</p>
<p>Last updated: January 2021</p>
<h1>0) This Tutorial</h1>
<p>This tutorial covers tools for manipulating large datasets, including those living in SQL databases or in data frames and related objects in R and Python. The focus is on querying rather than creating and administering databases as the intended audience is for statisticians/data analysis/data scientists who are carrying out analyses. A major emphasis is on how to do queries efficiently and how to use SQL effectively. At the moment, this tutorial is somewhat more focused on R than Python, but the manipulation of databases from R and Python are very similar because the core reliance is on SQL.</p>
<p>This tutorial assumes you have a working knowledge of R or Python. </p>
<p>Materials for this tutorial, including the R markdown file and associated code files that were used to create this document are available on Github at <a href="https://github.com/berkeley-scf/tutorial-databases">https://github.com/berkeley-scf/tutorial-databases</a>. You can download the files by doing cloning the Git repository. E.g.,from a terminal window on a UNIX-like machine, you can do this:</p>
<pre><code class="r">git clone https://github.com/berkeley-scf/tutorial-databases
</code></pre>
<p>Alternatively you can simply download a <a href="https://github.com/berkeley-scf/tutorial-databases/archive/master.zip">zip file</a> containing all the materials.</p>
<p>The example data files are not part of the Github repository. You can get the example data files (both Stack Overflow data and Wikipedia webtraffic data for the year 2016) <a href="http://www.stat.berkeley.edu/share/paciorek/tutorial-databases-data.zip">here</a>.</p>
<p>To create this HTML document, simply compile the corresponding R Markdown file in R as follows on the command line (or execute this R code within R or RStudio):</p>
<pre><code class="r">Rscript -e "library(knitr); knit2html('databases.Rmd')"
</code></pre>
<p>Solutions to the SQL challenges are available on request. </p>
<h2>Using PostgreSQL on Mac or Windows</h2>
<p>To replicate the (non-essential) PostgreSQL administration portion of this tutorial, you'll need access to a machine on which you can run a PostgreSQL server. While there are a variety of ways to do this, this tutorial assumes that you are running PostgreSQL on an Ubuntu (or Debian) Linux machine. If you are a Windows or Mac user, there are several options for accessing a Linux environment:</p>
<ul>
<li>You could run Ubuntu in a Docker container; Docker can be installed on Windows or Mac. Once you've installed Docker and have access to a terminal command line, please see the commands in <code>docker.sh</code> in this repository. </li>
<li>You could run an Amazon EC2/Google Cloud/Azure virtual machine instance, using a image that supports R and/or Python and then installing PostgreSQL as discussed in this tutorial.</li>
<li>You could try to use an Ubuntu Linux virtual machine (VM) developed here at Berkeley, the <a href="http://bce.berkeley.edu">Berkeley Common Environment (BCE)</a>, though this is no longer maintained/supported. BCE can be run through VirtualBox on your computer. Once you've installed VirtualBox and started a BCE virtual machine and have access to a terminal command line, please see the commands in <code>bce.sh</code> in this repository.</li>
</ul>
<p>Also note that in recent years the big cloud providers have created specific database services, so you are using a cloud provider, you'd probably want to take advantage of those rather than 'manually' running a database via a virtual machine. </p>
<p>This tutorial by Christopher Paciorek is licensed under a Creative Commons Attribution 3.0 Unported License (CC BY).</p>
<h1>1) Background</h1>
<h2>1.1) Data size</h2>
<p>The techniques and tools discussed here are designed for datasets in the range of gigabytes to tens of gigabytes, though they may scale to larger if you have a machine with a lot of memory or simply have enough disk space and are willing to wait. If you have 10s of gigabytes of data, you'll be better off if your machine has 10s of GBs of memory, as discussed in this tutorial. </p>
<p>If you're scaling to 100s of GBs, terabytes or petabytes, tools such as Spark may be your best bet, or possibly carefully-administered databases. Those topics are beyond the scope of this tutorial. However, this tutorial will be useful if you're doing SQL queries on Spark datasets or professionally-administered databases.</p>
<h2>1.2) Memory vs. disk</h2>
<p>On a computer there is a hierarchy of locations where data can be stored. The hierarchy has the trade-off that the locations that the CPU can access most quickly can store the least amount of data. The hierarchy looks like this:</p>
<ul>
<li> cpu cache </li>
<li> main memory</li>
<li> disk</li>
<li> local network (data stored on other machines)</li>
<li> general internet access</li>
</ul>
<p>For our purposes here the key question is whether the data resides in memory or on disk, but when considering Spark and distributed systems, one gets into issues of moving data across the network between machines. </p>
<p>Formally, databases are stored on disk, while R and Python store datasets in memory. This would suggest that databases will be slow to access their data but will be able to store more data than can be loaded into an R or Python session. However, databases can be quite fast due in part to disk caching by the operating system as well as careful implementation of good algorithms for database operations. For more information about disk caching see Section 2.6.5.</p>
<p>And conversely, R (and probably Python) have mechanisms for storing large datasets on disk in a way that they can be accessed fairly quickly.</p>
<h1>2) Database systems and SQL</h1>
<h2>2.1) Overview of databases</h2>
<p>Basically, standard SQL databases are <em>relational</em> databases that are a collection of rectangular format datasets (<em>tables</em>, also called <em>relations</em>), with each table similar to R or Pandas data frames, in that a table is made up of columns, which are called <em>fields</em> or <em>attributes</em>, each containing a single <em>type</em> (numeric, character, date, currency, enumerated (i.e., categorical), …) and rows or records containing the observations for one entity. Some of these tables generally have fields in common so it makes sense to merge (i.e., join) information from multiple tables. E.g., you might have a database with a table of student information, a table of teacher information and a table of school information.</p>
<p>One principle of databases is that if a set of fields contain duplicated information about a given category, you can more efficiently store information about each level of the category in a separate table. Consider information about people living in a state and information about each state - you don't want to include variables that only vary by state in the table containing information about individuals (at least until you're doing the actual analysis that needs the information in a single table). Or consider students nested within classes nested within schools.</p>
<p>Databases are set up to allow for fast querying and merging (called joins in database terminology). </p>
<p>You can interact with databases in a variety of database systems (DBMS=database management system). Some popular systems are SQLite, MySQL, PostgreSQL, Oracle and Microsoft Access. We'll concentrate on accessing data in a database rather than management of databases. SQL is the Structured Query Language and is a special-purpose high-level language for managing databases and making queries. Variations on SQL are used in many different DBMS.</p>
<p>Queries are the way that the user gets information (often simply subsets of tables or information merged across tables). The result of an SQL query is in general another table, though in some cases it might have only one row and/or one column.</p>
<p>Many DBMS have a client-server model. Clients connect to the server, with some authentication, and make requests (i.e., queries).</p>
<p>There are often multiple ways to interact with a DBMS, including directly using command line tools provided by the DBMS or via Python or R, among others. </p>
<h3>2.1.1) Relational Database Management Systems (DBMS)</h3>
<p>There are a variety of relational database management systems (DBMS). Some that are commonly used by the intended audience of this tutorial are SQLite, PostgreSQL, and mySQL. We'll concentrate on SQLite (because it is simple to use on a single machine) and PostgreSQL (because is is a popular open-source DBMS that is a good representative of a client-server model and has some functionality that SQLite lacks).</p>
<p>SQLite is quite nice in terms of being self-contained - there is no server-client model, just a single file on your hard drive that stores the database and to which you can connect to using the SQLite shell, R, Python, etc. However, it does not have some useful functionality that other DBMS have. For example, you can't use <code>ALTER TABLE</code> to modify column types or drop columns. </p>
<h3>2.1.2) NoSQL databases</h3>
<p>NoSQL (not only SQL) systems have to do with working with datasets that are not handled well in traditional DBMS, and not specifically about the use or non-use of SQL itself. In particular data might not fit well within the rectangular row-column data model of one or more tables in a database. And one might be in a context where a full DBMS is not needed. Or one might have more data or need faster responses than can be handled well by standard DBMS.</p>
<p>While these systems tend to scale better, they generally don't have a declarative query language so you end up having to do more programming yourself. For example in the Stanford database course referenced at the end of this tutorial, the noSQL video gives the example of web log data that records visits to websites. One might have the data in the form of files and not want to go through the trouble of data cleaning and extracting fields from unstructured text. In addition, one may need to do only simple queries that involve looking at each record separately and therefore can be easily done in parallel, which noSQL systems tend to be designed to do. Or one might have document data, such as Wikipedia pages, where the unstructured text on each page is not really suited for a DBMS. </p>
<p>Some NoSQL systems include</p>
<ul>
<li>Hadoop/Spark-style MapReduce systems,</li>
<li>key-value storage systems (e.g., with data stored as pairs of keys (i.e., ids) and values, such as in JSON),</li>
<li>document storage systems (like key-value systems but where the value is a document), and</li>
<li>graph storage systems (e.g., for social networks). </li>
</ul>
<h2>2.2) Concepts in SQL</h2>
<h3>2.2.1) Simple queries for choosing rows and columns from a table</h3>
<p>SQL is a declarative language that tells the database system what results you want. The system then parses the SQL syntax and determines how to implement the query.</p>
<p>Later we'll introduce a database of Stack Overflow questions and answers. The <em>questions</em> table has a field <em>viewcount</em> that indicates how many times each question was viewed. </p>
<p>Here is a simple query that selects the first five rows (and all columns, based on the <code>*</code> wildcard) from the questions table.</p>
<pre><code>select * from questions limit 5
</code></pre>
<p>Now let's see some more interesting usage of other SQL syntax.</p>
<pre><code>## find the largest viewcounts in the questions table
select distinct viewcount from questions order by viewcount desc limit 20
## get the questions that are viewed the most
select * from questions where viewcount > 100000
</code></pre>
<p>Let's lay out the various verbs in SQL. Here's the form of a standard query (though the ORDER BY is often not used and sorting is computationally expensive):</p>
<pre><code>SELECT <column(s)> FROM <table> WHERE <condition(s) on column(s)> ORDER BY <column(s)>
</code></pre>
<p>SQL keywords are often written in ALL CAPITALS though I won't necessarily do that in this tutorial. </p>
<p>And here is a table of some important keywords:</p>
<table><thead>
<tr>
<th>Keyword</th>
<th>What it does</th>
</tr>
</thead><tbody>
<tr>
<td>SELECT</td>
<td>select columns</td>
</tr>
<tr>
<td>FROM</td>
<td>which table to operate on</td>
</tr>
<tr>
<td>WHERE</td>
<td>filter (choose) rows satisfying certain conditions</td>
</tr>
<tr>
<td>LIKE, IN, <, >, =, etc.</td>
<td>used as part of conditions</td>
</tr>
<tr>
<td>ORDER BY</td>
<td>sort based on columns</td>
</tr>
</tbody></table>
<p>For comparisons in a WHERE clause, some common syntax for setting conditions includes LIKE (for patterns), =, >, <, >=, <=, !=.</p>
<p>Some other keywords are: DISTINCT, ON, JOIN, GROUP BY, AS, USING, UNION, INTERSECT, SIMILAR TO, SUBSTR in SQLite and SUBSTRING in PostgreSQL. </p>
<h3>2.2.2.) Schema and normalization</h3>
<p>To truly leverage the conceptual and computational power of a database you'll want to have your data in a normalized form, which means spreading your data across multiple tables in such a way that you don't repeat information unnecessarily.</p>
<p>The schema is the metadata about the tables in the database and the fields (and their types) in those tables.</p>
<p>Let's consider this using an educational example. Suppose we have a school with multiple teachers teaching multiple classes and multiple students taking multiple classes. If we put this all in one table organized per student, the data might have the following fields:</p>
<ul>
<li>student ID</li>
<li>student grade level</li>
<li>student name</li>
<li>class 1</li>
<li>class 2 </li>
<li>…</li>
<li>class n</li>
<li>grade in class 1</li>
<li>grade in class 2</li>
<li>…</li>
<li>grade in class n</li>
<li>teacher ID 1</li>
<li>teacher ID 2 </li>
<li>…</li>
<li>teacher ID n</li>
<li>teacher department 1</li>
<li>teacher department 2</li>
<li>…</li>
<li>teacher department n</li>
<li>teacher age 1</li>
<li>teacher age 2 </li>
<li>…</li>
<li>teacher age n</li>
</ul>
<p>There are a lot of problems with this.</p>
<ol>
<li>'n' needs to be the maximum number of classes a student might take. If one ambitious student takes many classes, there will be a lot of empty data slots.</li>
<li>All the information about individual teachers (department, age, etc.) is repeated many times, meaning we use more storage than we need to.</li>
<li>If we want to look at the data on a per teacher basis, this is very poorly organized for that.</li>
<li>If one wants to change certain information (such as the age of a teacher) one needs to do it in many locations, which can result in errors and is inefficient. </li>
</ol>
<p>It would get even worse if there was a field related to teachers for which a given teacher could have multiple values (e.g., teachers could be in multiple departments). This would lead to even more redundancy - each student-class-teacher combination would be crossed with all of the departments for the teacher (so-called multivalued dependency in database theory).</p>
<p>An alternative organization of the data would be to have each row represent the enrollment of a student in a class, with as many rows per student as the number of classes the student is taking.</p>
<ul>
<li>student ID</li>
<li>student name</li>
<li>class</li>
<li>grade in class</li>
<li>student grade level</li>
<li>teacher ID</li>
<li>teacher department</li>
<li>teacher age</li>
</ul>
<p>This has some advantages relative to our original organization in terms of not having empty data slots, but it doesn't solve the other three issues above.</p>
<p>Instead, a natural way to order this database is with the following tables.</p>
<ul>
<li><p>Student</p>
<ul>
<li>ID</li>
<li>name</li>
<li>grade_level</li>
</ul></li>
<li><p>Teacher</p>
<ul>
<li>ID</li>
<li>name</li>
<li>department</li>
<li>age</li>
</ul></li>
<li><p>Class</p>
<ul>
<li>ID</li>
<li>topic</li>
<li>class_size</li>
<li>teacher_ID</li>
</ul></li>
<li><p>ClassAssignment</p>
<ul>
<li>student_ID</li>
<li>class_ID</li>
<li>grade</li>
</ul></li>
</ul>
<p>Then we do queries to pull information from multiple tables. We do the joins based on 'keys', which are the fields in each table that allow us to match rows from different tables. </p>
<p>(That said, if all anticipated uses of a database will end up recombining the same set of tables, we may want to have a denormalized schema in which those tables are actually combined in the database. It is possible to be too pure about normalization! We can also create a virtual table, called a <em>view</em>, as discussed later.)</p>
<h3>2.2.3) Keys</h3>
<p>A key is a field or collection of fields that give(s) a unique value for every row/observation. A table in a database should then have a primary key that is the main unique identifier used by the DBMS. Foreign keys are columns in one table that give the value of the primary key in another table. When information from multiple tables is joined together, the matching of a row from one table to a row in another table is generally done by equating the primary key in one table with a foreign key in a different table.</p>
<p>In our educational example, the primary keys would presumably be: Student.ID, Teacher.ID, Class.ID, and for ClassAssignment two fields: {ClassAssignment.studentID, ClassAssignment.class_ID}.</p>
<p>Some examples of foreign keys would be:</p>
<ul>
<li>student_ID as the foreign key in ClassAssignment for joining with Student on Student.ID</li>
<li>teacher_ID as the foreign key in Class for joining with Teacher based on Teacher.ID</li>
<li>class_ID as the foreign key in ClassAssignment for joining with Class based on Class.ID</li>
</ul>
<h3>2.2.4) Queries that join data across multiple tables</h3>
<p>Suppose we want a result that has the grades of all students in 9th grade. For this we need information from the Student table (to determine grade level) and information from the ClassAssignment table (to determine the class grade). More specifically we need a query that joins 'Student' with 'ClassAssignment' based on 'Student.ID' and 'ClassAssignment.student_ID' and filters the rows based on 'Student.grade_level':</p>
<pre><code>SELECT Student.ID, grade FROM Student, ClassAssignment
WHERE Student.ID = ClassAssignment.student_ID
AND Student.grade_level = 9;
</code></pre>
<p>If we wanted to include information about the teachers who gave those grades we'd also join in the Teacher and Class tables. (We need the Class table to be able to match from ClassAssignment to Teacher.) It would look something like this:</p>
<pre><code>SELECT Student.ID, grade, Teacher.name, Teacher.department FROM
Student, ClassAssignment, Teacher, Class
WHERE Student.ID = ClassAssignment.student_ID
AND ClassAssignment.class_ID = Class.ID
AND Class.teacher_ID = teacher.ID
AND Student.grade_level = 9;
</code></pre>
<p>Note that both of these queries are <em>joins</em> (specifically <em>inner joins</em>), which are like <code>merge()</code> in R. We don't specifically use the JOIN keyword, but one could do these queries explicitly using JOIN, as we'll see later.</p>
<h2>2.3) Using SQL</h2>
<h3>2.3.1) Stack Overflow example database</h3>
<p>I've obtained data from <a href="https://stackoverflow.com">Stack Overflow</a>, the popular website for asking coding questions, and placed it into a normalized database. The SQLite version (also in CSVs as one CSV per table) has metadata (i.e., it lacks the actual text of the questions and answers) on all of the questions and answers posted in 2016.</p>
<p>We'll explore SQL functionality using this example database. </p>
<p>Now let's consider the Stack Overflow data. Each question may have multiple answers and each question may have multiple (topic) tags.</p>
<p>If we tried to put this into a single table, the fields could look like this if we have one row per question:</p>
<ul>
<li>question ID</li>
<li>ID of user submitting question</li>
<li>question title</li>
<li>tag 1</li>
<li>tag 2 </li>
<li>…</li>
<li>tag n</li>
<li>answer 1 ID</li>
<li>ID of user submitting answer 1</li>
<li>answer 2 ID</li>
<li>ID of user submitting answer 2 </li>
<li>…</li>
</ul>
<p>or like this if we have one row per question-answer pair:</p>
<ul>
<li>question ID</li>
<li>ID of user submitting question</li>
<li>question title</li>
<li>tag 1</li>
<li>tag 2</li>
<li>…</li>
<li>tag n</li>
<li>answer ID</li>
<li>ID of user submitting answer</li>
</ul>
<p>As we've discussed neither of those schema is particularly desirable. </p>
<p><strong><em>Question</em></strong>: How would you devise a schema to normalize the data. I.e., what set of tables do you think we should create?</p>
<p>Don't peek until after you've thought about it, but you can view one <a href="normalized_example.png">reasonable schema here</a>. The lines between tables indicate the relationship of foreign keys in one table to primary keys in another table. The schema in the actual databases of Stack Overflow data we'll use in this tutorial is similar to but not identical to that. </p>
<h4>Getting the database</h4>
<p>You can download a copy of the SQLite version of the Stack Overflow database (only data for the year 2016) from <a href="http://www.stat.berkeley.edu/share/paciorek/tutorial-databases-data.zip">here</a> as part of the overall zip with all of the example datasets as discussed in the introduction of this tutorial. </p>
<p>In the next section I'll assume the .db file is placed in the subdirectory of the repository called <code>data</code>.</p>
<p>Note that all of the code used to download the data from the Stack Overflow website and to manipulate it to create a complete Postgres database and (for the year 2016 only) an SQLite database and CSVs for each table is in the <code>data/prep_stackoverflow</code> subdirectory of this repository. Note that as of January 2020, <a href="https://archive.org/download/stackexchange">the data are still being kept up to date online</a>.</p>
<h3>2.3.2) Accessing SQL from other languages</h3>
<p>Although DBMS have their own interfaces (we'll see a bit of this later), databases are commonly accessed from other programs. For data analysts this would often be Python or R, as seen next.</p>
<p>Most of our examples of making SQL queries on a database will be done from R, but they could just as easily have been done from Python or other programs.</p>
<h4>Using SQL from R</h4>
<p>The <em>DBI</em> package provides a front-end for manipulating databases from a variety of DBMS (SQLite, MySQL, PostgreSQL, among others).
Basically, you tell the package what DBMS is being used on the back-end, link to the actual database, and then you can use the standard functions in the package regardless of the back-end.</p>
<p>With SQLite, R processes make calls against the stand-alone SQLite database (.db) file, so there are no SQLite-specific processes. With PostgreSQL, R processes call out to separate Postgres processes; these are started from the overall Postgres background process</p>
<p>You can access and navigate an SQLite database from R as follows.</p>
<pre><code class="r">library(RSQLite)
drv <- dbDriver("SQLite")
dir <- 'data' # relative or absolute path to where the .db file is
dbFilename <- 'stackoverflow-2016.db'
db <- dbConnect(drv, dbname = file.path(dir, dbFilename))
dbGetQuery(db, "select * from questions limit 5") # simple query to get 5 rows from a table
</code></pre>
<pre><code>## questionid creationdate score viewcount
## 1 34552550 2016-01-01 00:00:03 0 108
## 2 34552551 2016-01-01 00:00:07 1 151
## 3 34552552 2016-01-01 00:00:39 2 1942
## 4 34552554 2016-01-01 00:00:50 0 153
## 5 34552555 2016-01-01 00:00:51 -1 54
## title
## 1 Scope between methods
## 2 Rails - Unknown Attribute - Unable to add a new field to a form on create/update
## 3 Selenium Firefox webdriver won't load a blank page after changing Firefox preferences
## 4 Android Studio styles.xml Error
## 5 Java: reference to non-finial local variables inside a thread
## ownerid
## 1 5684416
## 2 2457617
## 3 5732525
## 4 5735112
## 5 4646288
</code></pre>
<p>We can easily see the tables and their fields:</p>
<pre><code class="r">dbListTables(db)
</code></pre>
<pre><code>## [1] "answers" "questions" "questions_tags"
## [4] "users"
</code></pre>
<pre><code class="r">dbListFields(db, "questions")
</code></pre>
<pre><code>## [1] "questionid" "creationdate" "score" "viewcount"
## [5] "title" "ownerid"
</code></pre>
<pre><code class="r">dbListFields(db, "answers")
</code></pre>
<pre><code>## [1] "answerid" "questionid" "creationdate" "score"
## [5] "ownerid"
</code></pre>
<p>One can either make the query and get the results in one go or make the query and separately fetch the results. Here we've selected the first five rows (and all columns, based on the <code>*</code> wildcard) and brought them into R as a data frame.</p>
<pre><code class="r">results <- dbGetQuery(db, 'select * from questions limit 5')
class(results)
</code></pre>
<pre><code>## [1] "data.frame"
</code></pre>
<pre><code class="r">query <- dbSendQuery(db, "select * from questions")
query
</code></pre>
<pre><code>## <SQLiteResult>
## SQL select * from questions
## ROWS Fetched: 0 [incomplete]
## Changed: 0
</code></pre>
<pre><code class="r">results2 <- fetch(query, 5)
identical(results, results2)
</code></pre>
<pre><code>## [1] TRUE
</code></pre>
<pre><code class="r">dbClearResult(query) # clear to prepare for another query
</code></pre>
<p>To disconnect from the database:</p>
<pre><code class="r">dbDisconnect(db)
</code></pre>
<p>To access a PostgreSQL database instead, you can do the following, assuming the database has been created and you have a username and password that allow you to access the particular database.</p>
<pre><code class="r">library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
db <- dbConnect(drv, dbname = 'stackoverflow', user = 'paciorek', password = 'test')
</code></pre>
<p>Apart from the different manner of connecting, all of the queries above are the same regardless of whether the back-end DBMS is SQLite, PostgreSQL, etc.</p>
<h4>Using SQL from Python</h4>
<p>For SQLite:</p>
<pre><code class="python">import sqlite3 as sq
dir <- 'data' # relative or absolute path to where the .db file is
dbFilename <- 'stackoverflow-2016.db'
import os
db = sq.connect(os.path.join('data', dbFilename))
c = db.cursor()
c.execute("select * from questions limit 5") # simple query
results = c.fetchall() # retrieve results
</code></pre>
<p>To disconnect:</p>
<pre><code>c.close()
</code></pre>
<p>Here's how you would connect to PostgreSQL instead:</p>
<pre><code class="python">import psycopg2 as pg
db = pg.connect("dbname = 'stackoverflow' user = 'paciorek' host = 'localhost' password = 'test'")
c = db.cursor()
</code></pre>
<h4>Questions</h4>
<p><strong><em>Challenge</em></strong>: Return a few rows from the users, questions, answers, and tags tables so you can get a sense for what the entries in the tables are like.</p>
<p><strong><em>Challenge</em></strong>: Find the youngest users in the database.</p>
<h3>2.3.3) Simple joins</h3>
<p>It turns out that the syntax of using multiple tables we've seen can be viewed formally as a table join and could also be implemented using the JOIN keyword.</p>
<p>The syntax generally looks like this (again the WHERE and ORDER BY are optional):</p>
<pre><code>SELECT <column(s)> FROM <table1> JOIN <table2> ON <columns to match on>
WHERE <condition(s) on column(s)> ORDER BY <column(s)>
</code></pre>
<p>Let's see some joins using the different syntax on the Stack Overflow database. In particular let's select only the questions with the tag “python”.</p>
<pre><code class="r">## a join with JOIN
result1 <- dbGetQuery(db, "select * from questions join questions_tags
on questions.questionid = questions_tags.questionid where tag = 'python'")
## a join without JOIN
result2 <- dbGetQuery(db, "select * from questions, questions_tags
where questions.questionid = questions_tags.questionid and tag = 'python'")
head(result2)
</code></pre>
<pre><code>## questionid creationdate score viewcount
## 1 34553559 2016-01-01 04:34:34 3 96
## 2 34556493 2016-01-01 13:22:06 2 30
## 3 34557898 2016-01-01 16:36:04 3 143
## 4 34560088 2016-01-01 21:10:32 1 126
## 5 34560213 2016-01-01 21:25:26 1 127
## 6 34560740 2016-01-01 22:37:36 0 455
## title
## 1 Python nested loops only working on the first pass
## 2 bool operator in for Timestamp in Series does not work
## 3 Pairwise haversine distance calculation
## 4 Stopwatch (chronometre) doesn't work
## 5 How to set the type of a pyqtSignal (variable of class X) that takes a X instance as argument
## 6 Flask: Peewee model_to_dict helper not working
## ownerid questionid..7 tag
## 1 845642 34553559 python
## 2 4458602 34556493 python
## 3 2927983 34557898 python
## 4 5736692 34560088 python
## 5 5636400 34560213 python
## 6 3262998 34560740 python
</code></pre>
<pre><code class="r">identical(result1, result2)
</code></pre>
<pre><code>## [1] TRUE
</code></pre>
<p>Here's a three-way join with some additional use of aliases to abbreviate table names. What does this query ask for?</p>
<pre><code class="r">result1 <- dbGetQuery(db, "select * from questions Q
join questions_tags T on Q.questionid = T.questionid
join users U on Q.ownerid = U.userid
where tag = 'python' and age < 18")
result2 <- dbGetQuery(db, "select * from questions Q, questions_tags T, users U
where Q.questionid = T.questionid
and Q.ownerid = U.userid
and tag = 'python'
and age < 18")
identical(result1, result2)
</code></pre>
<pre><code>## [1] TRUE
</code></pre>
<p><strong><em>Challenge</em></strong>: Write a query that would return all the answers to questions with the Python tag.</p>
<p><strong><em>Challenge</em></strong>: Write a query that would return the users who have answered a question with the Python tag.</p>
<h3>2.3.4) Grouping / stratifying</h3>
<p>A common pattern of operation is to stratify the dataset, i.e., collect it into mutually exclusive and exhaustive subsets. One would then generally do some operation on each subset. In SQL this is done with the GROUP BY keyword.</p>
<p>Here's a basic example where we count the occurrences of different tags. </p>
<pre><code>dbGetQuery(db, "select tag, count(*) as n from questions_tags
group by tag order by n desc limit 100")
</code></pre>
<p><strong><em>Challenge</em></strong>: What specifically does that query do? Describe the table that would be returned.</p>
<p>In general <code>GROUP BY</code> statements will involve some aggregation operation on the subsets. Options include: COUNT, MIN, MAX, AVG, SUM.</p>
<p>Note that to filter the result of a grouping operation, we need to use <code>having</code> rather than <code>where</code>.</p>
<p>Also note the use of <code>as</code> to define a name for the new column.</p>
<pre><code>dbGetQuery(db, "select tag, count(*) as n from questions_tags
group by tag having n > 100000 limit 10")
</code></pre>
<p><strong><em>Challenge</em></strong>: Write a query that will count the number of answers for each question, returning the most answered questions. </p>
<h3>2.3.5) Getting unique results (DISTINCT)</h3>
<p>A useful SQL keyword is DISTINCT, which allows you to eliminate duplicate rows from any table (or remove duplicate values when one only has a single column or set of values).</p>
<pre><code class="r">tagNames <- dbGetQuery(db, "select distinct tag from questions_tags")
dbGetQuery(db, "select count(distinct tag) from questions_tags")
</code></pre>
<pre><code>## count(distinct tag)
## 1 41006
</code></pre>
<h3>2.3.6) Indexes</h3>
<p>An index is an ordering of rows based on one or more fields. DBMS use indexes to look up values quickly, either when filtering (if the index is involved in the WHERE condition) or when doing joins (if the index is involved in the JOIN condition). So in general you want your tables to have indexes.</p>
<p>DBMS use indexing to provide sub-linear time lookup. Without indexes, a database needs to scan through every row sequentially, which is called linear time lookup – if there are n rows, the lookup is \(O(n)\) in computational cost. With indexes, lookup may be logarithmic – O(log(n)) – (if using tree-based indexes) or constant time – O(1) – (if using hash-based indexes). A binary tree-based search is logarithmic; at each step through the tree you can eliminate half of the possibilities. </p>
<p>Here's how we create an index, with some time comparison for a simple query.</p>
<pre><code class="r">system.time(dbGetQuery(db, "select * from questions where viewcount > 10000")) # 2.4 seconds
system.time(dbExecute(db, "create index count_index on questions (viewcount)")) # 5.6 seconds
system.time(dbGetQuery(db, "select * from questions where viewcount > 10000")) # 0.9 seconds
## restore earlier state by removing index
system.time(dbExecute(db, "drop index count_index"))
</code></pre>
<p>In other contexts, an index can save huge amounts of time. So if you're working with a database and speed is important, check to see if there are indexes.</p>
<p>That being said, using indexes in a lookup is not always advantageous, as discussed in Section 2.6 on efficient SQL queries.</p>
<h3>2.3.7) Temporary tables and views</h3>
<p>You can think of a view as a temporary table that is the result of a query and can be used in subsequent queries. In any given query you can use both views and tables. The advantage is that they provide modularity in our querying. For example, if a given operation (portion of a query) is needed repeatedly, one could abstract that as a view and then make use of that view.</p>
<p>Suppose we always want the age and displayname of question owners available. Once we have the view we can query it like a regular table.</p>
<pre><code class="r">## note there is a creationdate in users too, hence disambiguation
dbExecute(db, "create view questionsAugment as
select questionid, questions.creationdate, score, viewcount, title, ownerid, age, displayname
from questions join users on questions.ownerid = users.userid")
## don't be confused by the "0" response --
## it just means that nothing is returned to R; the view _has_ been created
dbGetQuery(db, "select * from questionsAugment where age < 15 limit 5")
</code></pre>
<p>One use of a view would be to create a mega table that stores all the information from multiple tables in the (unnormalized) form you might have if you simply had one data frame in R or Python.</p>
<pre><code class="r">dbExecute(db, "drop view questionsAugment") # drop so can create again when rerun the code above
</code></pre>
<h3>2.3.8) Creating database tables</h3>
<p>One can create tables from within the <code>sqlite</code> and <code>psql</code> command line interfaces (discussed later), but often one would do this from R or Python. Here's the syntax from R.</p>
<pre><code>## Option 1: pass directly from CSV to database
dbWriteTable(conn = db, name = "student", value = "student.csv", row.names = FALSE, header = TRUE)
## Option 2: pass from data in an R data frame
## First create your data frame:
# student <- data.frame(...)
## or
# student <- read.csv(...)
dbWriteTable(conn = db, name = "student", value = student, row.names = FALSE, append = FALSE)
</code></pre>
<h2>2.4) More advanced SQL</h2>
<h3>2.4.1) More on joins</h3>
<p>We've seen a bunch of joins but haven't discussed the full taxonomy of types of joins. There are various possibilities for how to do a join depending on whether there are rows in one table that do not match any rows in another table.</p>
<p><em>Inner joins</em>: In database terminology an inner join is when the result has a row for each match of a row in one table with the rows in the second table, where the matching is done on the columns you indicate. If a row in one table corresponds to more than one row in another table, you get all of the matching rows in the second table, with the information from the first table duplicated for each of the resulting rows. For example in the Stack Overflow data, an inner join of questions and answers would pair each question with each of the answers to that question. However, questions without any answers or (if this were possible) answers without a corresponding question would not be part of the result.</p>
<p><em>Outer joins</em>: Outer joins add additional rows from one table that do not match any rows from the other table as follows. A <em>left outer join</em> gives all the rows from the first table but only those from the second table that match a row in the first table. A <em>right outer join</em> is the converse, while a <em>full outer join</em> includes at least one copy of all rows from both tables. So a left outer join of the Stack Overflow questions and answers tables would, in addition to the matched questions and their answers, include a row for each question without any answers, as would a full outer join. In this case there should be no answers that do not correspond to question, so a right outer join should be the same as an inner join. </p>
<p><em>Cross joins</em>: A cross join gives the Cartesian product of the two tables, namely the pairwise combination of every row from each table, analogous to <code>expand.grid</code> in R. I.e., take a row from the first table and pair it with each row from the second table, then repeat that for all rows from the first table. Since cross joins pair each row in one table with all the rows in another table, the resulting table can be quite large (the product of the number of rows in the two tables). In the Stack Overflow database, a cross join would pair each question with every answer in the database, regardless of whether the answer is an answer to that question.</p>
<p>Here's a table of the different kinds of joins:</p>
<table><thead>
<tr>
<th>Type of join</th>
<th>Rows from first table</th>
<th>Rows from second table</th>
</tr>
</thead><tbody>
<tr>
<td>inner (default)</td>
<td>all that match on specified condition</td>
<td>all that match on specified condition</td>
</tr>
<tr>
<td>left outer</td>
<td>all</td>
<td>all that match first</td>
</tr>
<tr>
<td>right outer</td>
<td>all that match second</td>
<td>all</td>
</tr>
<tr>
<td>full outer</td>
<td>all</td>
<td>all</td>
</tr>
<tr>
<td>cross</td>
<td>all combined pairwise with second</td>
<td>all combined pairwise with first</td>
</tr>
</tbody></table>
<p>A 'natural' join is an inner join that doesn't require you to specify the common columns between tables on which to enforce equality, but it's often good practice to not use a natural join and to explicitly indicate which columns are being matched on.</p>
<p>Simply listing two or more tables separated by commas as we saw earlier is the same as a <em>cross join</em>. Alternatively, listing two or more tables separated by commas, followed by conditions that equate rows in one table to rows in another is the same as an <em>inner join</em>. </p>
<p>In general, inner joins can be seen as a form of cross join followed by a condition that enforces matching between the rows of the table. More broadly, here are five equivalent joins that all perform the equivalent of an inner join:</p>
<pre><code>select * from table1 join table2 on table1.id = table2.id ## explicit inner join
select * from table1, table2 where table1.id = table2.id ## without explicit JOIN
select * from table1 cross join table2 where table1.id = table2.id
select * from table1 join table2 using(id)
select * from table1 natural join table2
</code></pre>
<p>Note that in the last query the join would be based on all common columns, which could be a bit dangerous if you don't look carefully at the schema of both tables. Assuming <code>id</code> is the common column, then the last of these queries is the same as the others.</p>
<p><strong><em>Challenge</em></strong>: Create a view with one row for every question-tag pair, including questions without any tags.</p>
<p><strong><em>Challenge</em></strong>: Write a query that would return the displaynames of all of the users who have <em>never</em> posted a question. The NULL keyword will come in handy – it's like <code>NA</code> in R. Hint: NULLs should be produced if you do an outer join.</p>
<p><strong><em>Challenge</em></strong>: How many questions tagged with 'random-forest' were unanswered? (You should need two different kinds of joins to answer this.)</p>
<h3>2.4.2) Joining a table with itself (self joins)</h3>
<p>Sometimes we want to query information across rows of the same table. For example supposed we want to analyze the time lags between when the same person posts a question. Do people tend to post in bursts or do they tend to post uniformly over the year? To do this we need contrasts between the times of the different posts. (One can also address this using window functions, discussed later.)</p>
<p>So we need to join two copies of the same table, which means dealing with resolving the multiple copies of each column.</p>
<p>This would look like this:</p>
<pre><code>dbGetQuery(db, "create view question_contrasts as
select * from questions Q1 join questions Q2
on Q1.ownerid = Q2.ownerid")
</code></pre>
<p>That should create a new table (actually a view) with all pairs of questions asked by a single person.</p>
<p>Actually, there's a problem here.</p>
<p><strong><em>Challenge</em></strong>: What kinds of rows will we get that we don't want?</p>
<p>A solution to that problem of having the same question paired with itself is:</p>
<pre><code>dbGetQuery(db, "create view question_contrasts as
select * from questions Q1 join questions Q2
on Q1.ownerid = Q2.ownerid
where Q1.creationdate != Q2.creationdate")
</code></pre>
<p><strong><em>Challenge</em></strong>: There's actually a further similar problem. What is the problem and how can we fix it by changing two characters in the query above? Hint, even as character strings, the creationdate column has an ordering.</p>
<h3>2.4.3) Set operations: UNION, INTERSECT, EXCEPT</h3>
<p>You can do set operations like union, intersection, and set difference using the UNION, INTERSECT, and EXCEPT keywords on tables that have the same schema (same column names and types), though most often these would be used on single columns (i.e., single-column tables).</p>
<p>Note that one can often set up an equivalent query without using INTERSECT or UNION.</p>
<p>Here's an example of a query that can be done with or without an intersection. Suppose we want to know the names of all individuals who have asked both an R question and a Python question. We can do this with INTERSECT:</p>
<pre><code class="r">system.time(
result1 <- dbGetQuery(db, "select displayname, userid from
questions Q join users U on U.userid = Q.ownerid
join questions_tags T on Q.questionid = T.questionid
where tag = 'r'
intersect
select displayname, userid from
questions Q join users U on U.userid = Q.ownerid
join questions_tags T on Q.questionid = T.questionid
where tag = 'python'")
)
</code></pre>
<pre><code>## user system elapsed
## 7.872 2.907 29.394
</code></pre>
<p>Alternatively we can do a self-join. Note that the syntax gets complicated as we are doing multiple joins.</p>
<pre><code class="r">system.time(
result2 <- dbGetQuery(db, "select displayname, userid from
(questions Q1 join questions_tags T1