dataman.tex

\chapter{Managing Data}\label{s:dataman}

There's not much point creating interactive web pages
if they don't have something to interact with.
To provide that,
we need something to store data and something to serve it.
We could build one program to do both,
but experience teaches that it's better to create one for each
so that they are easier to understand, test, and maintain.
After tossing a coin,
we decide to start with the data store;
\chapref{s:server} will look at how to build a server.

\section{Data Formats}\label{s:dataman-formats}

The most widely used text format for tabular data is undoubtedly
\gref{g:csv}{comma-separated values} (CSV).
Each row of the table is a line in the file;
the values within each row---i.e., the columns---are separated by commas.
Numbers appear as themselves;
strings may or may not be wrapped in quotation marks,
unless they contain commas themselves,
in which case they definitely are:

\begin{minted}{text}
"maroon",128,0,0
"olive",128,128,0
"aqua",0,255,255
"fuchsia",255,0,255
\end{minted}

The first line of a CSV file is often a \gref{g:header-row}{header row}
that defines the names of the columns.
For example,
the small table shown above would better be represented as:

\begin{minted}{text}
"name","red","green","blue"
"maroon",128,0,0
"olive",128,128,0
"aqua",0,255,255
"fuchsia",255,0,255
\end{minted}

Tragically,
CSV doesn't require the first row to be a header,
and CSV files usually don't specify units or data types.
We can guess that the values in the table above are integers,
but it's all too common to have a CSV file whose columns are labelled ``height'' and ``weight''
without any indication of whether the heights are in feet or meters
or the weights in pounds or kilograms.

CSV is good for tabular data,
but a lot of data doesn't neatly fit into rows and columns.
Many programmers use \gref{g:json}{JSON} instead:
it supports a subset of the syntax for values, arrays, and objects in JavaScript,
so that (for example)
we can store configuration values for a program like this:

\begin{minted}{js}
{
  "name" : "DataExplorer",
  "version" : "1.2.1",
  "preferences" : {
    "colorscheme" : "dark",
    "autofill" : true
  },
  "last_opened" : [
    "raw/biotic.dat",
    "raw/genomic.dat",
    "cooked/inferred.dat"
  ]
}
\end{minted}

JSON can be used for tabular data as well.
The whole table is an array,
and each record is an object with name-value pairs:

\begin{minted}{js}
const colors = [
  {"name": "maroon", "red": 128, "green": 0, "blue": 0},
  {"name": "olive", "red": 128, "green": 128, "blue": 0},
  {"name": "aqua", "red": 0, "green": 255, "blue": 255},
  {"name": "fuchsia", "red": 255, "green": 0, "blue": 255}
]
\end{minted}

\noindent
Repeating field names like this is wasteful compared to listing them once at the top of a table,
but it does mean that the fields within rows can be accessed directly
using expressions like \texttt{colors[1].red}.

\section{Slicing Data}\label{s:dataman-slicing}

The data we will use as an example is available in a variety of formats
from the \hreffoot{https://figshare.com/articles/Portal\_Project\_Teaching\_Database/1314459}{Portal Project Teaching Database}.
We will focus on \texttt{surveys.csv},
which has over 35,500 records.
That's a lot to look at,
so we will create a 10-record slice for testing.

Although it would be easy to take the first ten,
or the last,
there's a good chance that neither would be representative of the data as a whole.
Instead,
we will write a little script that selects some records at random.
Since it doesn't need to be efficient,
we will read everything,
pair each line with a random number,
sort the lines using those random numbers as keys,
then take the top few lines.

\begin{minted}{js}
const fs = require('fs')

const [inputFile, numLines, outputFile] = process.argv.splice(2)
const lines = fs.readFileSync(inputFile, 'utf-8')
    .split('\n')
header = lines[0]
const sample = lines.slice(1)
    .map(line => [Math.random(), line])
    .sort((left, right) => { return left[0] - right[0] })
    .slice(0, parseInt(numLines))
    .map(pair => pair[1])
fs.writeFileSync(outputFile, header + '\n' + sample.join('\n'))
\end{minted}

When we run this on the command line:

\begin{minted}{shell}
$ node select-random.js ../../data/surveys.csv 10 slice.csv
\end{minted}

\noindent
we get:

\begin{minted}{text}
record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
18501,3,14,1991,13,OT,M,21,28
2283,1,15,1980,11,OL,M,21,23
19941,5,2,1992,1,PP,M,22,13
27413,12,29,1997,5,,,,
16002,5,9,1989,19,SC,,,
28813,11,21,1998,12,DO,M,35,56
9338,7,4,1984,11,DO,F,35,57
28336,8,22,1998,7,PB,M,26,23
25323,3,16,1997,9,DM,F,33,26
6785,10,23,1982,5,DM,F,37,45
\end{minted}

Running it again will probably generate a different data slice,
since we're not specifying a random number generation \gref{g:seed}{seed}.
We are bad people, and will fix this in the exercises.

\begin{aside}{Slicing Command-Line Arguments}
  When we run:

\begin{minted}{shell}
$ node select-random.js ../../data/surveys.csv 10 slice.csv
\end{minted}

  the array \texttt{process.argv} contains five strings:
  the \texttt{node} command,
  the name of our script \texttt{select-random.js},
  and then the name of the input file,
  the number of lines we can,
  and the name of the output file.
  \texttt{process.argv.slice(2)} discards elements 0 and 1 from this list,
  leaving us the three values we need to assign to \texttt{inputFile},
  \texttt{numLines},
  and \texttt{outputFile} respectively.
  We will explore a better way to do this in the exercises.
\end{aside}

\section{Data Manager}\label{s:dataman-manager}

Rather arbitrarily,
we decide that our data manager will be able to answer two questions:

\begin{enumerate}
\item
  How many records do we have and what range of years do they cover?
  This is the kind of opening question that many client programs will ask.
\item
  What are the minimum, average, and maximum values
  for weight and hindfoot length by year
  for a given range of years?
  This would be very specific to a particular kind of client program;
  a good service would either provide many such specialized queries
  or provide a way to apply common \gref{g:aggregation-function}{aggregation functions}
  to particular columns.
\end{enumerate}

We will use \hreffoot{https://www.papaparse.com/}{PapaParse}\index{PapaParse} to parse our CSV,
so our first step is to install it:

\begin{minted}{shell}
$ npm install papaparse
\end{minted}

After loading the library and reading our test data file a couple of times,
we break down and read the documentation,
then come up with this as the first version of our data manager:

\begin{minted}{js}
const fs = require('fs')
const papa = require('papaparse')

class DataManager {

  constructor (filename) {
    const raw = fs.readFileSync(filename, 'utf-8')
    const options = {header: true, dynamicTyping: true}
    this.data = papa.parse(raw, options).data
  }
}

module.exports = DataManager
\end{minted}

\texttt{papa.parse} takes two arguments: 
the CSV file to be parsed
and a configuration object that controls how the parser behaves. 
This configuration object is highly customizable.;
here,
our \texttt{options}  instruct the parser 
to interpret the first row as a header (which sets column names)
and to convert things that look like numbers to numbers (the \texttt{dynamicTyping} option).
The output of \texttt{papa.parse} looks like this:

\begin{minted}{js}
{ data:
   [ { record_id: 18501,
       month: 3,
       day: 14,
       year: 1991,
       plot_id: 13,
       species_id: 'OT',
       sex: 'M',
       hindfoot_length: 21,
       weight: 28 },

     ...eight more records...

     { record_id: 6785,
       month: 10,
       day: 23,
       year: 1982,
       plot_id: 5,
       species_id: 'DM',
       sex: 'F',
       hindfoot_length: 37,
       weight: 45 } ],
  errors: [],
  meta:
   { delimiter: ',',
     linebreak: '\n',
     aborted: false,
     truncated: false,
     cursor: 350,
     fields:
      [ 'record_id',
        'month',
        'day',
        'year',
        'plot_id',
        'species_id',
        'sex',
        'hindfoot_length',
        'weight' ] } }
\end{minted}

\noindent
so \texttt{papa.parse(raw, options).data} gets the data we want as JSON.
Let's write a method to get some overall statistics:

\begin{minted}{js}
  getSurveyStats () {
    return {
      year_low : this._get(this.data, 'year', Math.min),
      year_high : this._get(this.data, 'year', Math.max),
      record_count : this.data.length
    }
  }

  // ...other methods...

  _get(values, field, func) {
    return func(...values.map(rec => rec[field]).filter(val => !isNaN(val)))
  }
\end{minted}

Functions like \texttt{Math.min} and \texttt{Math.max} take any number of scalar values as arguments,
but do not directly process arrays.
Enter \gref{g:spread-syntax}{spread syntax} \texttt{...}:
the notation \texttt{func(...array)} means
``pass all the values in the array as separate arguments'',
which saves us from writing our own minimum and maximum functions.
Thus,
\texttt{func(...this.data.map(rec\ =\textgreater{}\ rec[field]))} means
``select the specified field from each record in \texttt{this.data} to create an array of fields,
then pass all of those values as arguments to \texttt{func}.
We include an underscore at the start of the name of \texttt{\_get} to indicate that we
intend it to be used only inside \texttt{DataManager} and not to be called elsewhere.

Adding the method to get weight and hindfoot length for a range of years
is comparatively straightforward.
First,
we write a function to calculate the average of one or more arguments:

\begin{minted}{js}
const _average = (...values) => {
  let sum = 0
  for (let v of values) {
    sum += v
  }
  return sum / values.length
}
\end{minted}

It would be more natural for \texttt{\_average} to take an array rather than a variable number of arguments,
but we want to be able to use it in the same way that we use \texttt{Math.min} and \texttt{Math.max},
so we have to conform to their signature.

After some thought we realize that it's possible for \texttt{subset} to be empty---i.e.,
it's possible that there are years that have no data in our dataset.
We should filter these out,
to prevent unnecessary effort being made to render summary statistics
with \texttt{NaN} values.\index{NaN@\texttt{NaN} (Not a Number)}
Remembering that empty arrays are not falsy in JavaScript (\chapref{s:basics}),
we decide to test that the \texttt{subset} returned by filtering for each year
contains at least one entry.

The last thing that we need to ensure is that each data object has a unique key,
which will make it much easier for React to efficiently update the display of
the data when we are ready to render it.

The method to get the values for a range of years is now:

\begin{minted}{js}
  getSurveyRange (minYear, maxYear) {
    return Array(1 + maxYear - minYear)
      .fill(0)
      .map((v, i) => minYear + i)
      .map(year => {
        const subset = this.data.filter(r => r.year === year)
        if (subset.length) {
          return {
            key  : toString(year),
            year : year,
            min_hindfoot_length :this._get(subset,
                                           'hindfoot_length', Math.min),
            ave_hindfoot_length : this._get(subset,
                                            'hindfoot_length', _average),
            max_hindfoot_length : this._get(subset,
                                            'hindfoot_length', Math.max),
            min_weight : this._get(subset, 'weight', Math.min),
            ave_weight : this._get(subset, 'weight', _average),
            max_weight : this._get(subset, 'weight', Math.max)
          }
        }
      })
  }
\end{minted}

\section{Exercises}\label{s:dataman-exercises}

\exercise{Tracing Data}

Trace the execution of the utility program that creates a small sample of the original data,
explaining what is passed into each of the chained methods calls.

\exercise{Unrandom}

Programs that rely on random numbers are impossible to test
because there's (deliberately) no way to predict their output.
Luckily, computer programs don't actually use random numbers:
they use \gref{g:pseudo-random-number}{pseudo-random numbers}
that are generated in a repeatable but unpredictable way.
Given the same initial \gref{g:seed}{seed},
a pseudo-random number generator will always produce the same sequence of values.

There is no way to set a seed for \texttt{Math.random} out of the box,
but the \hreffoot{https://www.npmjs.com/package/seedrandom}{seedrandom} package provides an add-on function for this purpose.
Install the package and modify the slice selection utility
so that it takes a word or phrase as a command-line argument
and uses it to seed the random number generator.

\exercise{One Record Per Year}

Another way to slice the data for testing purposes is to select one record from each year.
Write a small command-line JavaScript program that:

\begin{enumerate}
\item
  Reads all the data from the CSV.
\item
  Keeps the first record it finds for each year.
\item
  Prints these records formatted as SQL \texttt{insert} statements.
\end{enumerate}

\exercise{Error Handling}

Modify \texttt{DataManager}'s constructor so that it checks for errors.

\exercise{Generalization}

Modify \texttt{getSurveyRange} so that it can be called like this:

\begin{minted}{js}
getSurveyRange(minYear, maxYear, 'hindfoot_length', 'weight')
\end{minted}

\noindent
i.e., so that the names of the fields whose minimum, average, and maximum values are wanted
can be passed as strings,
and the method will automatically create the right names and values in its result.

\exercise{Handling Command-Line Arguments}

Read the documentation for
the \hreffoot{https://www.npmjs.com/package/command-line-args}{\texttt{command-line-args}} package
and rewrite the data slicing script to use it instead of \texttt{process.argv.slice}.

\section*{Key Points}

\input{keypoints/dataman}