- Learn about data types (columns)
- Spreadsheets <> Data (Rows <> Observations)
- Best practices for recording data
- Common mistakes
- Data Types
- Cardinal rules
- Common mistakes
- Unravelling Data
- Exporting to CSV
- Testing your spreadsheet
If in doubt, aim for consistency in every column. Never try to record more than one "type" of thing in a column.
...,-3,-2,0,1,2,3,...
versus3.141529
- Integers are any whole number
- Decimals include any number with a decimal point
A string is any sequence of characters.
- Literally anything you can type can be represented as a string.
- Default type in Excel
- Be careful
180mg
is not a number!
- A number as far as the computer is concerned
- Often (but not always)
- Dates are integers counted from 1 Jan 1970
- Times are fractions of a day
- Other possibilities
- milliseconds since 1960
- Days since January 0 1900 (Excel!)
TRUE
orFALSE
statements.1
or0
is a common shorthand
- 'Categorical' (ordered or unordered)
- Integers with labels
- An unordered (nominal) factor
- Named but not ordered
- apples
- oranges
- pears
- R stores this as
1,2,3
for convenience but not because1<2<3
.
- An ordered (ordinal) factor such as a Likert scale
- Ordered and named
- Strongly disagree
- Disagree
- Neither agree/disagree
- Agree
- Strongly agree
- R stores this as
1,2,3,4,5
for convenience and understands that1<2<3<4<5
In R studio,
integers <- as.integer(c(1, 3, 15, 16))
decimals <- c(1.4, 3.5, 15.55, 16.4)
bools <- c(T, T, F, T)
dates <- as.Date(c("22/04/2016", "13/05/1997"), format = "%d/%m/%Y")
strings <- c("These are", "Strings")
factors <- as.factor(c("Apples", "Pears", "Lemons"))
factors <- as.factor(c("Good", "Better", "Best"), ordered=TRUE)
then use str()
to see the data 'structure'
The cardinal rules of using spreadsheet programs for data:
Put all your variables in columns - the thing you're measuring, like 'weight', 'temperature' or 'SBP'. Break things down into their most basic constituents, and keep units in your headers only.
Put each observation in its own row. Think very carefully about what constitutes your basic observation. Often it's your patient, but it may not be as intuitive as you think.
Have a single 'header' row to label your columns
Don't combine multiple pieces of information in one cell.
Leave the raw data raw - don't mess with it! That means no formulas anywhere in your spreadsheet!
Export the cleaned data to a text based format like CSV. This ensures that anyone can use the data, and is the format required by most data repositories.
just add some white space and dividers
And all excel does is present it to us in an easy to use format.
But you always need to remember that you need to go back and forth between both formats.
So merged cells, colours, comments will both be lost and confuse.
- Raw data from an RCT on pain relief following mastectomy
- Download from FigShare.
Identify and fix these common mistakes
- Multiple tables
- Multiple tabs
- Not filling in zeros
- Using bad null values
- Using formatting to convey information
- Using formatting to make the data sheet look pretty
- Placing comments or units in cells
- More than one piece of information in a cell
- Field name problems
- Special characters in data
- Inclusion of metadata in data table
- Date formatting
- 3 sheets: readme, dictionary, data ... then export,share the data sheet
- Data validation in Excel
- Learn to export to CSV