02_exercises.qmd

# Applied Exercises {-}

```{r}
library(reticulate)
os <- import("os")
os$listdir(".")
```


```{python}

import numpy as np
import matplotlib.pyplot as plt
#%matplotlib inline
from matplotlib.pyplot import subplots
import pandas as pd
from ISLP import load_data

```

```{python}
pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)
```


## 8. This exercise relates to the `College` data set, which can be found in the file `College.csv` on the book website. It contains a number of variables for 777 different universities and colleges in the US. The variables are

- `Private` : Public/private indicator
- `Apps` : Number of applications received
- `Accept` : Number of applicants accepted
- `Enroll` : Number of new students enrolled
- `Top10perc` : New students from top 10 % of high school class
- `Top25perc` : New students from top 25 % of high school class
- `F.Undergrad` : Number of full-time undergraduates
- `P.Undergrad` : Number of part-time undergraduates
- `Outstate` : Out-of-state tuition
- `Room.Board` : Room and board costs
- `Books` : Estimated book costs
- `Personal` : Estimated personal spending
- `PhD` : Percent of faculty with Ph.D.s
- `Terminal` : Percent of faculty with terminal degree
- `S.F.Ratio` : Student/faculty ratio
- `perc.alumni` : Percent of alumni who donate
- `Expend` : Instructional expenditure per student
- `Grad.Rate` : Graduation rate

Before reading the data into `Python`, it can be viewed in Excel or a
text editor.

### (a) Use the `pd.read_csv()` function to read the data into `Python`. Call the loaded data `college`. Make sure that you have the directory set to the correct location for the data.

```{python}
college = pd.read_csv('ISLP_data/College.csv')
college
```


### (b) Look at the data used in the notebook by creating and running a new cell with just the code `college` in it. You should notice that the first column is just the name of each university in a column named something like `Unnamed: 0`. We don’t really want `pandas` to treat this as data. However, it may be handy to have these names for later. Try the following commands and similarly look at the resulting data frames:


```{python}

college2 = pd.read_csv('ISLP_data/College.csv', index_col=0)
#college2

college3 = college.rename({'Unnamed: 0': 'college'},
  axis=1)
#college3

college3 = college3.set_index('college')
#college3
```


This has used the first column in the file as an `index` for the data frame. This means that `pandas` has given each row a name corresponding to the appropriate university. Now you should see that the first data column is `Private`. Note that the names of the colleges appear on the left of the table. We also introduced a new python object above: a *dictionary*, which is specified by `(key, value)` pairs. Keep your modified version of the data with the following:

```{python}
college = college3
college
```

### (c) Use the `describe()` method to produce a numerical summary of the variables in the data set.

```{python}

college.describe()

```


### (d) Use the `pd.plotting.scatter_matrix()` function to produce a scatterplot matrix of the first columns `[Top10perc, Apps, Enroll]`. Recall that you can reference a list C of columns of a data frame `A` using `A[C]`.


```{python}
#fig, ax = subplots(figsize=(8, 8))
pd.plotting.scatter_matrix(college[['Top10perc','Apps','Enroll']])
#plt.show()
```


### (e) Use the boxplot() method of college to produce side-by-side boxplots of Outstate versus Private.

```{python}

```


### (f) Create a new qualitative variable, called Elite, by binning the Top10perc variable into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

```{python}
college['Elite'] = pd.cut(college['Top10perc'],
  [0,0.5,1],
  labels=['No', 'Yes'])
```


Use the `value_counts()` method of `college['Elite']` to see how many elite universities there are. Finally, use the `boxplot()` method again to produce side-by-side boxplots of `Outstate` versus `Elite`.


```{python}
college['Elite'].value_counts()
```

```{r}

```


### (g) Use the `plot.hist()` method of `college` to produce some histograms with difering numbers of bins for a few of the quantitative variables. The command `plt.subplots(2, 2)` may be useful: it will divide the plot window into four regions so that four plots can be made simultaneously. By changing the arguments you can divide the screen up in other combinations.

```{python}

```


### (h) Continue exploring the data, and provide a brief summary of what you discover.

```{python}

```


## 9. This exercise involves the `Auto` data set studied in the lab. Make sure that the missing values have been removed from the data.

```{python}

Auto = pd.read_csv('ISLP_data/Auto.csv',
                    na_values=['?'])
Auto

```


### (a) Which of the predictors are quantitative, and which are qualitative?

Mpg, Displacement, Horsepower, Weight and Acceleration are quantitative. Cylinders, Year, Origin, and Name are qualitative.

```{python}
Auto.info()
```

```{python}
Auto['cylinders'] = Auto['cylinders'].astype('object') 
Auto['cylinders']
```


### (b) What is the range of each quantitative predictor? You can answer this using the `min()` and `max()` methods in `numpy`. 

```{python}
mpg_min = Auto['mpg'].min( )
mpg_max = Auto['mpg'].max( )

print('The min and max miles per gallon are', (mpg_min, mpg_max))
```


```{python}
dsp_min = Auto['displacement'].min( )
dsp_max = Auto['displacement'].max( )

print('The min and max displacement are', (dsp_min, dsp_max))
```


```{python}
hpwr_min = Auto['horsepower'].min( )
hpwr_max = Auto['horsepower'].max( )

print('The min and max horsepower are', (hpwr_min, hpwr_max))
```

```{python}
wt_min = Auto['weight'].min( )
wt_max = Auto['weight'].max( )

print('The min and max weights are', (wt_min, wt_max))
```

```{python}
acc_min = Auto['acceleration'].min( )
acc_max = Auto['acceleration'].max( )

print('The min and max accelerations are', (acc_min, acc_max))
```

### (c) What is the mean and standard deviation of each quantitative predictor?


```{python}

mpg_mean = Auto['mpg'].mean( )
mpg_sd = Auto['mpg'].std( )

print('The mean and standard deviation of miles per gallon are', mpg_mean,'and', mpg_sd)
```


```{python}
dsp_mean = Auto['displacement'].mean( )
dsp_sd = Auto['displacement'].std( )

print('The mean and standard deviation of weight are', dsp_mean,'and', dsp_sd)
```


```{python}
hpwr_mean = Auto['horsepower'].mean( )
hpwr_sd = Auto['horsepower'].std( )

print('The mean and standard deviation of horsepower are', hpwr_mean,'and', hpwr_sd)
```

```{python}
wt_mean = Auto['weight'].mean( )
wt_sd = Auto['weight'].std( )

print('The mean and standard deviation of weight are', wt_mean,'and', wt_sd)
```

```{python}
acc_mean = Auto['acceleration'].mean( )
acc_sd = Auto['acceleration'].std( )

print('The mean and standard deviation of acceleration are', acc_mean,'and', acc_sd)
```


### (d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

```{python}
Auto_new = Auto.drop(Auto.index[10:85])
Auto_new
```


```{python}
mpg_min = Auto_new['mpg'].min( )
mpg_max = Auto_new['mpg'].max( )

print('The min and max miles per gallon of the subsetted data are', (mpg_min, mpg_max))

mpg_mean = Auto_new['mpg'].mean( )
mpg_sd = Auto_new['mpg'].std( )

print('The mean and standard deviation of miles per gallon of the subsetted data are', mpg_mean,'and', mpg_sd)
```


```{python}
dsp_min = Auto_new['displacement'].min( )
dsp_max = Auto_new['displacement'].max( )

print('The min and max displacement of the subsetted data are', (dsp_min, dsp_max))

dsp_mean = Auto_new['displacement'].mean( )
dsp_sd = Auto_new['displacement'].std( )

print('The mean and standard deviation of weight of the subsetted data are', dsp_mean,'and', dsp_sd)
```


```{python}
hpwr_min = Auto['horsepower'].min( )
hpwr_max = Auto['horsepower'].max( )

print('The min and max horsepower of the subsetted data are', (hpwr_min, hpwr_max))

hpwr_mean = Auto['horsepower'].mean( )
hpwr_sd = Auto['horsepower'].std( )

print('The mean and standard deviation of horsepower of the subsetted data are', hpwr_mean,'and', hpwr_sd)
```

```{python}
wt_min = Auto['weight'].min( )
wt_max = Auto['weight'].max( )

print('The min and max weights of the subsetted data are', (wt_min, wt_max))

wt_mean = Auto['weight'].mean( )
wt_sd = Auto['weight'].std( )

print('The mean and standard deviation of weight of the subsetted data are', wt_mean,'and', wt_sd)
```

```{python}
acc_min = Auto['acceleration'].min( )
acc_max = Auto['acceleration'].max( )

print('The min and max accelerations of the subsetted data are', (acc_min, acc_max))

acc_mean = Auto['acceleration'].mean( )
acc_sd = Auto['acceleration'].std( )

print('The mean and standard deviation of acceleration of the subsetted data are', acc_mean,'and', acc_sd)
```


### (e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.


### (f) Suppose that we wish to predict gas mileage (`mpg`) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting `mpg`? Justify your answer.


## 10. This exercise involves the `Boston` housing data set.

### (a) To begin, load in the `Boston` data set, which is part of the `ISLP` library.


```{python}
Boston = load_data("Boston")
Boston
```


### (b) How many rows are in this data set? How many columns? What do the rows and columns represent?


### (c) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your fndings.


### (d) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.


### (e) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.


### (f) How many of the suburbs in this data set bound the Charles river?


### (g) What is the median pupil-teacher ratio among the towns in this data set?


### (h) Which suburb of Boston has lowest median value of owneroccupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your fndings.


### (i) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.