Getting Started in Jupyter Notebook

Sylvia Tran edited this page Mar 10, 2020 · 2 revisions

To get started in Jupyter Notebook, there are a few key steps that you need to follow:

I. Download the Data

Note that there are several Zillow datasets on Kaggle. This link will take you to the correct one for this project.

Note that you do not have to unzip the files when you download them to your local machine.
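One reason unzipping is optional: pandas can read a zipped CSV directly, inferring the compression from the `.zip` extension. A minimal self-contained sketch (the file name and sample data here are made up for illustration; the real Kaggle file names will differ):

```python
import os
import tempfile
import zipfile

import pandas as pd

# Write a small zipped CSV, then let pandas read it without unzipping.
# (Stand-in for a zipped Kaggle download in data/raw/.)
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, 'sample.csv.zip')
with zipfile.ZipFile(zip_path, 'w') as zf:
    zf.writestr('sample.csv', 'parcelid,value\n1,100\n2,200\n')

# pandas infers zip compression from the extension; works when the
# archive contains exactly one file
df = pd.read_csv(zip_path)
print(df.shape)  # → (2, 2)
```

Note that this works only when the archive contains a single file; multi-file archives still need to be extracted first.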

II. Upload the Data

This assumes the following: (A) you already have Anaconda or Jupyter Notebook installed on your machine, (B) you have already downloaded/cloned this git repository to your local machine, and (C) you are in the repository's working directory on the command line.

Since data/ is listed in .gitignore, any data a teammate has downloaded onto their local machine will not be available to you. Therefore, you must run the following commands from your terminal (in the project directory):

computer_name:rent-v-buy user$ mkdir data
computer_name:rent-v-buy user$ cd data/
computer_name:rent-v-buy/data user$ mkdir raw
computer_name:rent-v-buy/data user$ mkdir interim
computer_name:rent-v-buy/data user$ mkdir processed
  • Navigate back to your project root directory (rent-v-buy). From the command line type: computer_name:rent-v-buy user$ jupyter notebook
  • Your default browser should open with a GUI view of your repository directory.
  • Navigate to the data/raw/ folder
  • Upload each of the downloaded files from Kaggle
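The directory setup above can also be done in a single command; a minimal sketch, run from the project root (`-p` creates missing parent directories and does not error if they already exist):

```shell
# Create the full data directory tree in one step
mkdir -p data/raw data/interim data/processed
```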

III. Load & Serialize the Data

Once the upload is complete:

  • Navigate back to the notebooks/ directory
  • Open the 00-Load-Data.ipynb notebook
  • Run all the cells in 00-Load-Data.ipynb, which should take approximately 10-20 minutes to complete (depending on your machine's specs & available RAM).
  • Alternatively, instead of navigating to notebooks/, navigate to src/ in your terminal and run the following command: computer_name:rent-v-buy/src user$ python make_dataset.py This should also take approximately 10-20 minutes to complete.
  • Thereafter, you should be able to load the files into a separate notebook that you create, and do any necessary data transformations/cleaning there.

In the new notebook, make the same package imports as in 00-Load-Data.ipynb and execute that cell. Data can then be loaded with commands such as: cities_crosswalk = pd.read_pickle('../data/interim/city_crosswalk.pickle')

  • Store the cleaned/transformed data in the data/interim/ directory and DO NOT overwrite the files that were uploaded to data/raw/.
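As a sketch of that save-to-interim workflow, the snippet below builds a toy table, writes it to data/interim/, and reads it back to confirm the round trip. The file name and columns are illustrative, not the project's actual data; the point is that cleaned output goes to data/interim/ while data/raw/ stays untouched:

```python
from pathlib import Path

import pandas as pd

# Cleaned output goes to data/interim/ -- never back into data/raw/
interim = Path('data/interim')
interim.mkdir(parents=True, exist_ok=True)

# Toy stand-in for a cleaned table (real data comes from the raw pickles)
df = pd.DataFrame({'city': ['Austin', 'Boston'], 'zip': ['73301', '02101']})

# Serialize the cleaned table
df.to_pickle(interim / 'city_crosswalk_clean.pickle')

# Round-trip to confirm the pickle loads back intact
loaded = pd.read_pickle(interim / 'city_crosswalk_clean.pickle')
print(loaded.equals(df))  # → True
```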

Please reach out to Sylvia for help if necessary.