-
Notifications
You must be signed in to change notification settings - Fork 32
/
dataman.tex
429 lines (359 loc) · 13.7 KB
/
dataman.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
\chapter{Managing Data}\label{s:dataman}
There's not much point creating interactive web pages
if they don't have something to interact with.
To provide that,
we need something to store data and something to serve it.
We could build one program to do both,
but experience teaches that it's better to create one for each
so that they are easier to understand, test, and maintain.
After tossing a coin,
we decide to start with the data store;
\chapref{s:server} will look at how to build a server.
\section{Data Formats}\label{s:dataman-formats}
The most widely used text format for tabular data is undoubtedly
\gref{g:csv}{comma-separated values} (CSV).
Each row of the table is a line in the file;
the values within each row---i.e., the columns---are separated by commas.
Numbers appear as themselves;
strings may or may not be wrapped in quotation marks,
unless they contain commas themselves,
in which case they definitely are:
\begin{minted}{text}
"maroon",128,0,0
"olive",128,128,0
"aqua",0,255,255
"fuchsia",255,0,255
\end{minted}
The first line of a CSV file is often a \gref{g:header-row}{header row}
that defines the names of the columns.
For example,
the small table shown above would better be represented as:
\begin{minted}{text}
"name","red","green","blue"
"maroon",128,0,0
"olive",128,128,0
"aqua",0,255,255
"fuchsia",255,0,255
\end{minted}
Tragically,
CSV doesn't require the first row to be a header,
and CSV files usually don't specify units or data types.
We can guess that the values in the table above are integers,
but it's all too common to have a CSV file whose columns are labelled ``height'' and ``weight''
without any indication of whether the heights are in feet or meters
or the weights in pounds or kilograms.
CSV is good for tabular data,
but a lot of data doesn't neatly fit into rows and columns.
Many programmers use \gref{g:json}{JSON} instead:
it supports a subset of the syntax for values, arrays, and objects in JavaScript,
so that (for example)
we can store configuration values for a program like this:
\begin{minted}{js}
{
"name" : "DataExplorer",
"version" : "1.2.1",
"preferences" : {
"colorscheme" : "dark",
"autofill" : true
},
"last_opened" : [
"raw/biotic.dat",
"raw/genomic.dat",
"cooked/inferred.dat"
]
}
\end{minted}
JSON can be used for tabular data as well.
The whole table is an array,
and each record is an object with name-value pairs:
\begin{minted}{js}
const colors = [
{"name": "maroon", "red": 128, "green": 0, "blue": 0},
{"name": "olive", "red": 128, "green": 128, "blue": 0},
{"name": "aqua", "red": 0, "green": 255, "blue": 255},
{"name": "fuchsia", "red": 255, "green": 0, "blue": 255}
]
\end{minted}
\noindent
Repeating field names like this is wasteful compared to listing them once at the top of a table,
but it does mean that the fields within rows can be accessed directly
using expressions like \texttt{colors[1].red}.
\section{Slicing Data}\label{s:dataman-slicing}
The data we will use as an example is available in a variety of formats
from the \hreffoot{https://figshare.com/articles/Portal\_Project\_Teaching\_Database/1314459}{Portal Project Teaching Database}.
We will focus on \texttt{surveys.csv},
which has over 35,500 records.
That's a lot to look at,
so we will create a 10-record slice for testing.
Although it would be easy to take the first ten,
or the last,
there's a good chance that neither would be representative of the data as a whole.
Instead,
we will write a little script that selects some records at random.
Since it doesn't need to be efficient,
we will read everything,
pair each line with a random number,
sort the lines using those random numbers as keys,
then take the top few lines.
\begin{minted}{js}
const fs = require('fs')
const [inputFile, numLines, outputFile] = process.argv.splice(2)
const lines = fs.readFileSync(inputFile, 'utf-8')
.split('\n')
header = lines[0]
const sample = lines.slice(1)
.map(line => [Math.random(), line])
.sort((left, right) => { return left[0] - right[0] })
.slice(0, parseInt(numLines))
.map(pair => pair[1])
fs.writeFileSync(outputFile, header + '\n' + sample.join('\n'))
\end{minted}
When we run this on the command line:
\begin{minted}{shell}
$ node select-random.js ../../data/surveys.csv 10 slice.csv
\end{minted}
\noindent
we get:
\begin{minted}{text}
record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
18501,3,14,1991,13,OT,M,21,28
2283,1,15,1980,11,OL,M,21,23
19941,5,2,1992,1,PP,M,22,13
27413,12,29,1997,5,,,,
16002,5,9,1989,19,SC,,,
28813,11,21,1998,12,DO,M,35,56
9338,7,4,1984,11,DO,F,35,57
28336,8,22,1998,7,PB,M,26,23
25323,3,16,1997,9,DM,F,33,26
6785,10,23,1982,5,DM,F,37,45
\end{minted}
Running it again will probably generate a different data slice,
since we're not specifying a random number generation \gref{g:seed}{seed}.
We are bad people, and will fix this in the exercises.
\begin{aside}{Slicing Command-Line Arguments}
When we run:
\begin{minted}{shell}
$ node select-random.js ../../data/surveys.csv 10 slice.csv
\end{minted}
the array \texttt{process.argv} contains five strings:
the \texttt{node} command,
the name of our script \texttt{select-random.js},
and then the name of the input file,
the number of lines we can,
and the name of the output file.
\texttt{process.argv.slice(2)} discards elements 0 and 1 from this list,
leaving us the three values we need to assign to \texttt{inputFile},
\texttt{numLines},
and \texttt{outputFile} respectively.
We will explore a better way to do this in the exercises.
\end{aside}
\section{Data Manager}\label{s:dataman-manager}
Rather arbitrarily,
we decide that our data manager will be able to answer two questions:
\begin{enumerate}
\item
How many records do we have and what range of years do they cover?
This is the kind of opening question that many client programs will ask.
\item
What are the minimum, average, and maximum values
for weight and hindfoot length by year
for a given range of years?
This would be very specific to a particular kind of client program;
a good service would either provide many such specialized queries
or provide a way to apply common \gref{g:aggregation-function}{aggregation functions}
to particular columns.
\end{enumerate}
We will use \hreffoot{https://www.papaparse.com/}{PapaParse}\index{PapaParse} to parse our CSV,
so our first step is to install it:
\begin{minted}{shell}
$ npm install papaparse
\end{minted}
After loading the library and reading our test data file a couple of times,
we break down and read the documentation,
then come up with this as the first version of our data manager:
\begin{minted}{js}
const fs = require('fs')
const papa = require('papaparse')
class DataManager {
constructor (filename) {
const raw = fs.readFileSync(filename, 'utf-8')
const options = {header: true, dynamicTyping: true}
this.data = papa.parse(raw, options).data
}
}
module.exports = DataManager
\end{minted}
\texttt{papa.parse} takes two arguments:
the CSV file to be parsed
and a configuration object that controls how the parser behaves.
This configuration object is highly customizable.;
here,
our \texttt{options} instruct the parser
to interpret the first row as a header (which sets column names)
and to convert things that look like numbers to numbers (the \texttt{dynamicTyping} option).
The output of \texttt{papa.parse} looks like this:
\begin{minted}{js}
{ data:
[ { record_id: 18501,
month: 3,
day: 14,
year: 1991,
plot_id: 13,
species_id: 'OT',
sex: 'M',
hindfoot_length: 21,
weight: 28 },
...eight more records...
{ record_id: 6785,
month: 10,
day: 23,
year: 1982,
plot_id: 5,
species_id: 'DM',
sex: 'F',
hindfoot_length: 37,
weight: 45 } ],
errors: [],
meta:
{ delimiter: ',',
linebreak: '\n',
aborted: false,
truncated: false,
cursor: 350,
fields:
[ 'record_id',
'month',
'day',
'year',
'plot_id',
'species_id',
'sex',
'hindfoot_length',
'weight' ] } }
\end{minted}
\noindent
so \texttt{papa.parse(raw, options).data} gets the data we want as JSON.
Let's write a method to get some overall statistics:
\begin{minted}{js}
getSurveyStats () {
return {
year_low : this._get(this.data, 'year', Math.min),
year_high : this._get(this.data, 'year', Math.max),
record_count : this.data.length
}
}
// ...other methods...
_get(values, field, func) {
return func(...values.map(rec => rec[field]).filter(val => !isNaN(val)))
}
\end{minted}
Functions like \texttt{Math.min} and \texttt{Math.max} take any number of scalar values as arguments,
but do not directly process arrays.
Enter \gref{g:spread-syntax}{spread syntax} \texttt{...}:
the notation \texttt{func(...array)} means
``pass all the values in the array as separate arguments'',
which saves us from writing our own minimum and maximum functions.
Thus,
\texttt{func(...this.data.map(rec\ =\textgreater{}\ rec[field]))} means
``select the specified field from each record in \texttt{this.data} to create an array of fields,
then pass all of those values as arguments to \texttt{func}.
We include an underscore at the start of the name of \texttt{\_get} to indicate that we
intend it to be used only inside \texttt{DataManager} and not to be called elsewhere.
Adding the method to get weight and hindfoot length for a range of years
is comparatively straightforward.
First,
we write a function to calculate the average of one or more arguments:
\begin{minted}{js}
const _average = (...values) => {
let sum = 0
for (let v of values) {
sum += v
}
return sum / values.length
}
\end{minted}
It would be more natural for \texttt{\_average} to take an array rather than a variable number of arguments,
but we want to be able to use it in the same way that we use \texttt{Math.min} and \texttt{Math.max},
so we have to conform to their signature.
After some thought we realize that it's possible for \texttt{subset} to be empty---i.e.,
it's possible that there are years that have no data in our dataset.
We should filter these out,
to prevent unnecessary effort being made to render summary statistics
with \texttt{NaN} values.\index{NaN@\texttt{NaN} (Not a Number)}
Remembering that empty arrays are not falsy in JavaScript (\chapref{s:basics}),
we decide to test that the \texttt{subset} returned by filtering for each year
contains at least one entry.
The last thing that we need to ensure is that each data object has a unique key,
which will make it much easier for React to efficiently update the display of
the data when we are ready to render it.
The method to get the values for a range of years is now:
\begin{minted}{js}
getSurveyRange (minYear, maxYear) {
return Array(1 + maxYear - minYear)
.fill(0)
.map((v, i) => minYear + i)
.map(year => {
const subset = this.data.filter(r => r.year === year)
if (subset.length) {
return {
key : toString(year),
year : year,
min_hindfoot_length :this._get(subset,
'hindfoot_length', Math.min),
ave_hindfoot_length : this._get(subset,
'hindfoot_length', _average),
max_hindfoot_length : this._get(subset,
'hindfoot_length', Math.max),
min_weight : this._get(subset, 'weight', Math.min),
ave_weight : this._get(subset, 'weight', _average),
max_weight : this._get(subset, 'weight', Math.max)
}
}
})
}
\end{minted}
\section{Exercises}\label{s:dataman-exercises}
\exercise{Tracing Data}
Trace the execution of the utility program that creates a small sample of the original data,
explaining what is passed into each of the chained methods calls.
\exercise{Unrandom}
Programs that rely on random numbers are impossible to test
because there's (deliberately) no way to predict their output.
Luckily, computer programs don't actually use random numbers:
they use \gref{g:pseudo-random-number}{pseudo-random numbers}
that are generated in a repeatable but unpredictable way.
Given the same initial \gref{g:seed}{seed},
a pseudo-random number generator will always produce the same sequence of values.
There is no way to set a seed for \texttt{Math.random} out of the box,
but the \hreffoot{https://www.npmjs.com/package/seedrandom}{seedrandom} package provides an add-on function for this purpose.
Install the package and modify the slice selection utility
so that it takes a word or phrase as a command-line argument
and uses it to seed the random number generator.
\exercise{One Record Per Year}
Another way to slice the data for testing purposes is to select one record from each year.
Write a small command-line JavaScript program that:
\begin{enumerate}
\item
Reads all the data from the CSV.
\item
Keeps the first record it finds for each year.
\item
Prints these records formatted as SQL \texttt{insert} statements.
\end{enumerate}
\exercise{Error Handling}
Modify \texttt{DataManager}'s constructor so that it checks for errors.
\exercise{Generalization}
Modify \texttt{getSurveyRange} so that it can be called like this:
\begin{minted}{js}
getSurveyRange(minYear, maxYear, 'hindfoot_length', 'weight')
\end{minted}
\noindent
i.e., so that the names of the fields whose minimum, average, and maximum values are wanted
can be passed as strings,
and the method will automatically create the right names and values in its result.
\exercise{Handling Command-Line Arguments}
Read the documentation for
the \hreffoot{https://www.npmjs.com/package/command-line-args}{\texttt{command-line-args}} package
and rewrite the data slicing script to use it instead of \texttt{process.argv.slice}.
\section*{Key Points}
\input{keypoints/dataman}