diff --git a/_episodes/01-working-with-openrefine.md b/_episodes/01-working-with-openrefine.md
index 145788b1..af88fb20 100644
--- a/_episodes/01-working-with-openrefine.md
+++ b/_episodes/01-working-with-openrefine.md
@@ -65,7 +65,9 @@ along with a number representing how many times that value occurs in the column.
 4. Try sorting this facet by name and by count. Do you notice any problems with the data? What are they?
 5. Hover the mouse over one of the names in the `Facet` list. You should see that you have an `edit` function available.
 6. You could use this to fix an error immediately, and OpenRefine will ask whether you want to make the same correction to every value it finds like that one. But OpenRefine offers even better ways to find and fix these errors, which we'll use instead. We'll learn about these when we talk about clustering.
+
 > ## Solution
+>
 > There will be several near-identical entries in `scientificName`. For example, there is one entry for `Ammospermophilis harrisi` and
 > one entry for `Ammospermophilus harrisii`. These are both misspellings of `Ammospermophilus harrisi`. We will see how to correct these
 > misspelled and mistyped entries in a later exercise.
@@ -78,14 +80,17 @@ along with a number representing how many times that value occurs in the column.
 > 2. Is the column formatted as Number, Date, or Text? How does changing the format change the faceting display?
 >
 > 3. Which years have the most and least observations?
+>
 > > ## Solution
+> >
 > > 1. For the column `yr` do `Facet` > `Text facet`. A box will appear in the left panel showing that there are 26 unique entries in
 > > this column.
 > > 2. By default, the column `yr` is formatted as Text. You can change the format by doing `Edit cells` > `Common transforms` >
 > > `To number`. Doing `Facet` > `Numeric facet` creates a box in the left panel that shows a histogram of the number of
 > > entries per year. Notice that the data is shown as a number, not a date. If you instead transform the column to a date, the
 > > program will assume all entries are on January 1st of the year.
-> > 3. After creating a facet, click `Sort by count` in the facet box. The year with the most observations is 1997. The least is 1977.
+> > 3. After creating a facet, click `Sort by count` in the facet box. The year with the most observations is 1997. The least is 1977.
+> >
 > {: .solution}
 {: .challenge}
 
@@ -120,6 +125,7 @@ If data in a column needs to be split into multiple columns, and the parts are s
 5. Click `OK`. You'll get some new columns called `scientificName 1`, `scientificName 2`, and so on.
 6. Notice that in some cases `scientificName 1` and `scientificName 2` are empty. Why is this? What do you think we
 can do to fix this?
+
 > ## Solution
 >
 > The entries that have data in `scientificName 3` and `scientificName 4` but not the first two `scientificName` columns
@@ -131,6 +137,7 @@ can do to fix this?
 > ## Exercise
 >
 > Try to change the name of the second new column to "species". How can you correct the problem you encounter?
+>
 > > ## Solution
 > >
 > > On the `scientificName 2` column, click the down arrow and then `Edit column` > `Rename this column`. Type "species" into the box
@@ -160,6 +167,7 @@ Words with spaces at the beginning or end are particularly hard for we humans to
 1. In the header for the column `scientificName`, choose `Edit cells` > `Common transforms` > `Trim leading and trailing whitespace`.
 2. Notice that the `Split` step has now disappeared from the `Undo / Redo` pane on the left and is replaced with a `Text transform on 3 cells`
 3. Perform the same `Split` operation on `scientificName` that you undid earlier. This time you should only get two new columns. Why?
+
 > ## Solution
 >
 > Removing the leading white spaces means that each entry in this column has exactly one space (between the genus and species names).
diff --git a/_episodes/02-filter-exclude-sort.md b/_episodes/02-filter-exclude-sort.md
index a7822d95..92ca8d06 100644
--- a/_episodes/02-filter-exclude-sort.md
+++ b/_episodes/02-filter-exclude-sort.md
@@ -27,12 +27,14 @@ There are many entries in our data table. We can filter it to work on a subset o
 >
 > 1. What scientific names (genus and species) are selected by this procedure?
 > 2. How would you restrict this to one of the species selected?
+>
 > > ## Solution
 > > 1. Do `Facet` > `Text facet` on the `scientificName` column after filtering. This will show that
 > > two names match your filter criteria. They are `Baiomys taylori` and `Chaetodipus baileyi`.
 > > 2. To restrict to only one of these two species, you could make the search case sensitive or
 > > you could split the `scientificName` column into species and genus before filtering or
 > > you could include more letters in your filter.
+> >
 > {: .solution}
 {: .challenge}
 
@@ -56,7 +58,8 @@ is currently selected, while filtering allows you to select a subset of your dat
 > > 2. Click `include`. This will explicitly include this species, and exclude others that are not expicitly included. Notice that the
 > option now changes to `exclude`.
 > > 3. Click `include` and `exclude` on the other species (`Chaetodipus baileyi`) and notice how the two entries appear and disappear
-> from the table.
+> > from the table.
+> >
 > {: .solution}
 {: .challenge}
 
@@ -82,9 +85,12 @@ If you try to re-sort a column that you have already used, the drop-down menu ch
 * > `Sort` > `Remove sort` - This option allows you to undo your sort.
 
 > ## Exercise
+>
 > Sort the data by `plot`. What year(s) were observations recorded for plot 1 in this filtered dataset.
+>
 > > ## Solution
 > > In the `plot` column, select `Sort...` > `numbers` and select `smallest first`. The years represented are 1990 and 1995.
+> >
 > {: .solution}
 {: .challenge}
 
@@ -98,6 +104,7 @@ You can sort by multiple columns by performing sort on additional columns. The s
The s > You might like to look for trends in your data by month of collection across years. > 1. How do you sort your data by month? > 2. How would you do this differently if you were instead trying to see all of your entries in chronological order? +> > > ## Solution > > > > 1. For the `mo` column, click on `Sort...` and then `numbers`. This will group all entries made in, for example, January, @@ -105,7 +112,8 @@ You can sort by multiple columns by performing sort on additional columns. The s > > 2. For the `yr` column, click on `Sort` > `Sort...` > `numbers` and select `sort by this column alone`. This will undo the > > sorting by month step. Once you've sorted by `yr` you can then apply another sorting step to sort by month within year. To do this > > for the `mo` column, click on `Sort` > `numbers` but do not select `sort by this column alone`. To ensure that all entries are shown -> > chronologically, you will need to add a third sorting step by day within month. +> > chronologically, you will need to add a third sorting step by day within month. +> > > {: .solution} {: .challenge} diff --git a/_episodes/03-numbers.md b/_episodes/03-numbers.md index 2c756b4b..b7e9a940 100644 --- a/_episodes/03-numbers.md +++ b/_episodes/03-numbers.md @@ -26,10 +26,13 @@ To transform cells in the `recordID` column to numbers, click the down arrow for > ## Exercise > > Transform three more columns, including `period`, from text to numbers. Can all columns be transformed to numbers? +> > > ## Solution +> > > > Only observations that include only numerals (0-9) can be transformed to numbers. If you apply a number transformation to > > a column that doesn't meet this criteria, and then click the `Undo / Redo` tab, you will see a step that starts with > > `Text transform on 0 cells`. This means that the data in that column was not transformed. 
+> >
 > {: .solution}
 {: .challenge}
 
@@ -59,7 +62,7 @@ Now that we have multiple columns representing numbers, we can see how they rela
 
 ## Examine pair of columns in detail
 
-We can examine one pair of columns by clicking on its square in the `Scatterplot Matrix`` A new facet with only that pair will appear in the left margin.
+We can examine one pair of columns by clicking on its square in the `Scatterplot Matrix` A new facet with only that pair will appear in the left margin.
 
 > ## Exercise
 >