Cleaning and Combining Free Code Camp Survey Data

Introduction

The survey data was collected in two parts, which need to be combined into a single file to simplify downstream analyses. Both parts also need some cleaning, because survey responses (free-text ones in particular) are often inconsistent.

Prerequisites to Rerun Data Manipulations

R (>= 3.3.0)
dplyr (>= 0.5.0) CRAN
Rcpp (>= 0.12.6) CRAN

Reproduce Cleaning and Combining of Data

Running the following commands creates the file 2016-FCC-New-Coders-Survey-Data.csv in the clean-data/ directory.

git clone https://github.com/FreeCodeCamp/2016-new-coder-survey.git
cd 2016-new-coder-survey/clean-data
Rscript clean-data.R

Notable Data Transformations

Obvious Outliers

In some of the free-text numeric answers, values beyond a reasonable threshold were filtered out. For example, an answer claiming 100,000 months of coding experience would be removed.
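As a minimal sketch of this kind of threshold filter (not the code in clean-data.R), the snippet below blanks out values above an upper bound. The helper name `remove_outliers`, the 744-month cutoff, and the choice to set the value to NA rather than drop the row are all assumptions made for illustration.

```r
# Hypothetical helper: replace values above an assumed upper bound with NA.
remove_outliers <- function(x, max_value) {
  x[!is.na(x) & x > max_value] <- NA
  x
}

# Example answers for months of programming; 100000 is an obvious outlier.
months_programming <- c(6, 24, 100000, NA, 36)

# 744 months (62 years) is an arbitrary cutoff chosen for this sketch.
cleaned <- remove_outliers(months_programming, max_value = 744)
print(cleaned)
```

Setting the value to NA keeps the respondent's other answers available for analysis.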

Numeric Ranges

Some answers were given as ranges. For example, a question about months of programming might have been answered with "9-10". Where possible, the average of the range was used.
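The range handling described above could be sketched roughly as follows. This is an illustration in base R, not the code in clean-data.R; the helper name `average_range` and the exact pattern used to recognise a range are assumptions.

```r
# Hypothetical helper: turn "low-high" answers into their midpoint.
# Plain numbers pass through; anything else becomes NA.
average_range <- function(x) {
  x <- trimws(x)
  number <- "[0-9]+(\\.[0-9]+)?"
  is_range <- grepl(paste0("^", number, "\\s*-\\s*", number, "$"), x)

  # Plain numeric answers convert directly; ranges and text become NA here.
  out <- suppressWarnings(as.numeric(x))

  # Split each range on the dash and average its two ends.
  parts <- strsplit(x[is_range], "\\s*-\\s*")
  out[is_range] <- vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
  out
}

answers <- c("9-10", "12", "3 - 5", "a while")
print(average_range(answers))
```

Answers that are neither a number nor a recognisable range are left as NA in this sketch, which mirrors the "when possible" caveat.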

Years to Months

Some answers to a question asking about months were given in years. These were converted to months if possible.
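A rough sketch of such a conversion is shown below. It is not the code in clean-data.R; the helper name `years_to_months` and the unit spellings it recognises ("year", "years", "yr", "yrs") are assumptions.

```r
# Hypothetical helper: convert answers given in years to months.
# Answers without a recognised year unit are treated as already in months.
years_to_months <- function(x) {
  x <- tolower(trimws(x))
  is_years <- grepl("^[0-9]+(\\.[0-9]+)?\\s*(years?|yrs?)$", x)

  # Drop any trailing unit word, then parse what is left as a number.
  value <- suppressWarnings(as.numeric(sub("\\s*[a-z]+$", "", x)))

  ifelse(is_years, value * 12, value)
}

answers <- c("2 years", "18", "1.5 yrs", "6 months")
print(years_to_months(answers))
```

Answers whose numeric part cannot be parsed (for example "two years") come back as NA here rather than being guessed.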

Normalization of Answers

Some free-text answers were nearly identical, differing only by a space or two. These register as different answers unless you look for them. Answers like "Cybersecurity" and "Cyber Security" mean the same thing and were changed to a consistent form. Some may have been missed.
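One way to catch near-duplicates like these is to compare answers on a simplified key. The sketch below is an illustration, not the approach used in clean-data.R: the helper name `normalize_answer`, the key (lowercase with all non-alphanumeric characters removed), and the rule of keeping the first spelling seen are all assumptions.

```r
# Hypothetical helper: collapse answers that differ only in case,
# spacing, or punctuation onto the first spelling encountered.
normalize_answer <- function(x) {
  # Comparison key: lowercase, letters and digits only.
  key <- tolower(gsub("[^[:alnum:]]", "", x))

  # match(key, key) gives the index of the first answer sharing each key.
  x[match(key, key)]
}

answers <- c("Cybersecurity", "Cyber Security", "cyber security", "Game Development")
print(normalize_answer(answers))
```

A key this aggressive can also merge answers that are genuinely different, so the collapsed groups are worth reviewing by hand.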

Getting Started Analyzing Data with R

Loading Data

For an initial look at the data, you can load the data into R with the following commands.

> library(dplyr)  # Used for ease of use and manipulation of data
> setwd("directory-where-clean-survey-data-is")  # Change this to your path
> survey <- read.csv(file = "2016-FCC-New-Coders-Survey-Data.csv",
+                   header = TRUE,
+                   na.strings = NA,
+                   stringsAsFactors = FALSE) %>% tbl_df()

Example: Age

> survey %>% select(Age) %>% filter(!is.na(Age)) %>% summary()
#       Age
#  Min.   :10.00
#  1st Qu.:23.00
#  Median :27.00
#  Mean   :29.18
#  3rd Qu.:33.00
#  Max.   :86.00
> library(ggplot2)  # Used for data visualizations
> survey %>% filter(!is.na(Age)) %>%
+ ggplot(aes(x = Age)) + geom_histogram(binwidth = 5)

Changelog

2016 Aug 1st
- Set minimum commute time to 300 minutes
- Set minimum home mortgage to $1000 and maximum home mortgage to $1000000
- Set minimum student debt to $1000 and maximum student debt to $500000
- Check for consistent answers between number of children and the yes/no question about having children, i.e. if you answered yes to having children, you should have a number for number of children
- Fix spelling mistake in IsReceiveDisabilitiesBenefits (original: IsReceiveDiabilitiesBenefits)
- Update R and R package versions
2016 May 18th
- Initial dataset combine and cleaning
