Numbered exercises, broken links amended
episodes\01-regular-expressions.md
- Elevated "Learning common regex metacharacters" to improve the layout of the episode
- Numbered challenges / exercises (LibraryCarpentry#198)
- Removed decayed link, replaced with regex101.com library of community-submitted regex (LibraryCarpentry#210)

episodes\02-match-extract-strings.md
- Numbered exercises (LibraryCarpentry#198)
- Clarified that the learner should add a space after `community` for the first challenge in exercise 2.1 (LibraryCarpentry#220 (comment))
- Added exercise 2.4 on the use of regex in R (in a self-organised workshop, this lesson is taught after https://datacarpentry.org/r-socialsci/)
aforestsomewhere committed Apr 27, 2024
1 parent 14a7da0 commit 24cc373
Showing 2 changed files with 96 additions and 26 deletions.
38 changes: 20 additions & 18 deletions episodes/01-regular-expressions.md
@@ -46,7 +46,7 @@ Most regular expression implementations employ similar syntaxes and metacharacte

A very simple use of a regular expression would be to locate the same word spelled two different ways. For example the regular expression `organi[sz]e` matches both `organise` and `organize`. But because it locates all matches for the pattern in the file, not just for that word, it would also match `reorganise`, `reorganize`, `organises`, `organizes`, `organised`, `organized`, etc.
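
The behaviour described above can be checked with a short sketch in base R. The word list is hypothetical and chosen only for illustration; `grepl()` reports, for each word, whether the pattern occurs anywhere inside it.

```R
# Hypothetical sample words, chosen only for illustration
words <- c("organise", "organize", "reorganised", "organizes",
           "organic", "organisation")

# grepl() returns TRUE for each element that contains a match for the pattern
is_match <- grepl("organi[sz]e", words)

# Keep only the words that contain "organise" or "organize" somewhere inside them
matched_words <- words[is_match]
print(matched_words)
```

Here `organic` and `organisation` are left out because neither has an `s` or `z` followed directly by `e` after `organi`, while `reorganised` is kept because the pattern is not anchored to the start of the word.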

### Learning common regex metacharacters
## Learning common regex metacharacters

Square brackets can be used to define a list or range of characters to be found. So:

@@ -100,6 +100,8 @@ Or, any other string that starts a line, begins with a letter `o` in lower or ca

::::::::::::::::::::::::::::::::::::::::::::::::::

## Additional regex metacharacters

Other useful special characters are:

- `*` matches the preceding element zero or more times. For example, ab\*c matches "ac", "abc", "abbbc", etc.
@@ -113,7 +115,7 @@ So, what are these going to match?

::::::::::::::::::::::::::::::::::::::: challenge

## `^[Oo]rgani.e\w*`
## 1. `^[Oo]rgani.e\w*`

What will the regular expression `^[Oo]rgani.e\w*` match?

@@ -138,7 +140,7 @@ Or, any other string that starts a line, begins with a letter `o` in lower or ca

::::::::::::::::::::::::::::::::::::::: challenge

## `[Oo]rgani.e\w+$`
## 2. `[Oo]rgani.e\w+$`

What will the regular expression `[Oo]rgani.e\w+$` match?

@@ -163,7 +165,7 @@ Or, any other string that ends a line, begins with a letter `o` in lower or capi

::::::::::::::::::::::::::::::::::::::: challenge

## `^[Oo]rgani.e\w?\b`
## 3. `^[Oo]rgani.e\w?\b`

What will the regular expression `^[Oo]rgani.e\w?\b` match?

@@ -188,7 +190,7 @@ Or, any other string that starts a line, begins with a letter `o` in lower or ca

::::::::::::::::::::::::::::::::::::::: challenge

## `^[Oo]rgani.e\w?$`
## 4. `^[Oo]rgani.e\w?$`

What will the regular expression `^[Oo]rgani.e\w?$` match?

@@ -213,7 +215,7 @@ Or, any other string that starts and ends a line, begins with a letter `o` in lo

::::::::::::::::::::::::::::::::::::::: challenge

## `\b[Oo]rgani.e\w{2}\b`
## 5. `\b[Oo]rgani.e\w{2}\b`

What will the regular expression `\b[Oo]rgani.e\w{2}\b` match?

@@ -238,7 +240,7 @@ Or, any other string that begins with a letter `o` in lower or capital case afte

::::::::::::::::::::::::::::::::::::::: challenge

## `\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b`
## 6. `\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b`

What will the regular expression `\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b` match?

@@ -261,7 +263,7 @@ Or, any other string that begins with a letter `o` in lower or capital case afte

::::::::::::::::::::::::::::::::::::::::::::::::::

This logic is useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. It can be used for looking at cells in spreadsheets for certain values, or for extracting some data from a column of a spreadsheet to make new columns. There are many other contexts in which regex is useful when using a computer to search through a document, spreadsheet, or file structure. Some real-world use cases for regex are included on a [ACRL Tech Connect blog post](https://acrl.ala.org/techconnect/post/fear-no-longer-regular-expressions/) .
This logic is useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. It can be used for looking at cells in spreadsheets for certain values, or for extracting some data from a column of a spreadsheet to make new columns. There are many other contexts in which regex is useful when using a computer to search through a document, spreadsheet, or file structure. You can browse real-world use cases in the [regex101.com library of community-submitted regex patterns](https://regex101.com/library).

To embed this knowledge we will not - however - be using computers. Instead we'll use pen and paper for now.

@@ -275,7 +277,7 @@ Then test each other on the answers. If you want to check your logic use [regex1

::::::::::::::::::::::::::::::::::::::: challenge

## Using square brackets
## 1. Using square brackets

What will the regular expression `Fr[ea]nc[eh]` match?

@@ -300,7 +302,7 @@ Note that the way this regular expression is constructed, it will match misspell

::::::::::::::::::::::::::::::::::::::: challenge

## Using dollar signs
## 2. Using dollar signs

What will the regular expression `Fr[ea]nc[eh]$` match?

@@ -325,7 +327,7 @@ This will match the pattern only when it appears at the end of a line. It will a

::::::::::::::::::::::::::::::::::::::: challenge

## Introducing options
## 3. Introducing options

What would match the strings `French` and `France` that appear at the beginning of a line?

@@ -347,7 +349,7 @@ This will also find words where there were characters after `French` such as `Fr

::::::::::::::::::::::::::::::::::::::: challenge

## Case insensitivity
## 4. Case insensitivity

How do you match the whole words `colour` and `color` (case insensitive)?

@@ -375,7 +377,7 @@ so `/colou?r/i` will match all case insensitive variants of `colour` and `color`

::::::::::::::::::::::::::::::::::::::: challenge

## Word boundaries
## 5. Word boundaries

How would you find the whole word `headrest` or `head rest` but not <code>head  rest</code> (that is, with two spaces between `head` and `rest`)?

@@ -397,7 +399,7 @@ Note that although `\bhead\s?rest\b` does work, it will also match zero or one t

::::::::::::::::::::::::::::::::::::::: challenge

## Matching non-linguistic patterns
## 6. Matching non-linguistic patterns

How would you find a string that ends with four letters preceded by at least one zero?

@@ -415,7 +417,7 @@ How would you find a string that ends with four letters preceded by at least one

::::::::::::::::::::::::::::::::::::::: challenge

## Matching digits
## 7. Matching digits

How do you match any four-digit string anywhere?

@@ -437,7 +439,7 @@ Note: this will also match four-digit strings within longer strings of numbers a

::::::::::::::::::::::::::::::::::::::: challenge

## Matching dates
## 8. Matching dates

How would you match the date format `dd-MM-yyyy`?

@@ -459,7 +461,7 @@ Depending on your data, you may choose to remove the word bounding.

::::::::::::::::::::::::::::::::::::::: challenge

## Matching multiple date formats
## 9. Matching multiple date formats

How would you match the date format `dd-MM-yyyy` or `dd-MM-yy` at the end of a line only?

@@ -481,7 +483,7 @@ Note this will also find strings such as `31-01-198` at the end of a line, so yo

::::::::::::::::::::::::::::::::::::::: challenge

## Matching publication formats
## 10. Matching publication formats

How would you match publication formats such as `British Library : London, 2015` and `Manchester University Press: Manchester, 1999`?

84 changes: 76 additions & 8 deletions episodes/02-match-extract-strings.md
@@ -17,21 +17,21 @@ exercises: 30

::::::::::::::::::::::::::::::::::::::::::::::::::

## Exercise Using Regex101.com
## Exercise: Using Regex101.com

For this exercise, open a browser and go to [https://regex101.com](https://regex101.com). Regex101.com is a free regular expression debugger with real time explanation, error detection, and highlighting.

Open the [swcCoC.md file](https://github.com/LibraryCarpentry/lc-data-intro/tree/main/episodes/data/swcCoC.md), copy the text, and paste that into the test string box.

For a quick test to see if it is working, type the string `community` into the regular expression box.
For a quick test to see if it is working, type the string `community ` into the regular expression box.

If you look in the box on the right of the screen, you see that the expression matches six instances of the string 'community' (the instances are also highlighted within the text).

::::::::::::::::::::::::::::::::::::::: challenge

### Taking spaces into consideration

Type `community `. You get three matches. Why not six?
Add a space after `community`. You get three matches. Why not six?

::::::::::::::: solution

@@ -135,7 +135,7 @@ Find all of the words starting with Comm or comm that are plural.

::::::::::::::::::::::::::::::::::::::::::::::::::

## Exercise finding email addresses using regex101.com
## Exercise: finding email addresses

For this exercise, open a browser and go to [https://regex101.com](https://regex101.com).

@@ -217,7 +217,7 @@ See the previous exercise for the explanation of the expression up to the `+`

::::::::::::::::::::::::::::::::::::::::::::::::::

## Exercise finding phone numbers, Using regex101.com
## Exercise: finding phone numbers

Does this Code of Conduct contain a phone number?

@@ -355,9 +355,79 @@ This expression should find one match in the document.

One of the reasons we stress the value of consistent and predictable directory and filenaming conventions is that working in this way enables you to use the computer to select files based on the characteristics of their file names. For example, if you have a bunch of files where the first four digits are the year and you only want to do something with files from '2017', then you can. Or if you have 'journal' somewhere in a filename when you have data about journals, you can use the computer to select just those files. Equally, using plain text formats means that you can go further and select files or elements of files based on characteristics of the data *within* those files. See Workshop Overview: [File Naming \& Formatting](https://librarycarpentry.org/lc-overview/06-file-naming-formatting) for further background.


::::::::::::::::::::::::::::::::::::::::::::::::::
## Exercise: Extracting substrings in R using regex

You can use regular expressions in many functions in base R, for example **`grep`** and **`sub`**. We will look at some functions from **`stringr`**, a powerful package for character strings that works well with packages we saw already like **`dplyr`** and **`tidyr`**. To learn more about **`stringr`** after the workshop, you may want to check out this handy [string manipulation with stringr cheatsheet](https://rstudio.github.io/cheatsheets/html/strings.html).

We will look at just two functions in **`stringr`** that can take regular expressions as an argument:

* `str_extract(string, pattern)`: returns the first pattern match found in each string, as a character vector

* `str_replace(string, pattern, replacement)`: finds the first pattern match in each string and replaces it with the replacement string

These functions will return the first pattern match only. To return all possible matches, we can use **`str_extract_all()`** and **`str_replace_all()`**.
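
As a minimal sketch of the difference between the "first match" and "all matches" variants, the following uses a hypothetical sentence containing two dates (the sentence and variable names are illustrative only):

```R
library(stringr)

# Hypothetical example string containing two dates
text <- "Interviews on 2024-05-09 and 2024-06-12"
date_pattern <- "\\d{4}-\\d{2}-\\d{2}"

# First match only: a character vector with one element per input string
first_date <- str_extract(text, date_pattern)

# All matches: a list with one character vector per input string
all_dates <- str_extract_all(text, date_pattern)

# Replace only the first "-" versus every "-"
first_replaced <- str_replace(text, "-", "/")
all_replaced <- str_replace_all(text, "-", "/")

print(first_date)
print(all_dates)
print(first_replaced)
print(all_replaced)
```

Note that `str_extract_all()` returns a list rather than a plain vector, because each input string can yield a different number of matches.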

::::::::::::::::::::::::::::::::::::::::: callout
### ESCAPING METACHARACTERS IN REGULAR EXPRESSIONS IN R
Regular expressions in R follow the general syntax we have seen so far, with one main exception. In R, strings use a backslash `\` to escape special behavior, and regular expressions are themselves written as strings. This creates a problem when we use a metacharacter like **`\d`**: R tries to read `\d` as a string escape sequence, does not recognise it, and reports an error before the regex engine ever sees the pattern. We get around this by adding an extra `\` beforehand, like **`\\d`**.
::::::::::::::::::::::::::::::::::::::::::::::::::
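
The escaping rule can be seen directly in a short sketch. The example string `"Room 42"` is hypothetical, and the raw-string syntax `r"(...)"` assumes R 4.0.0 or later.

```R
library(stringr)

# Typed with two backslashes, but the string R stores holds only one
pattern <- "\\d"
pattern_length <- nchar(pattern)   # counts a backslash and the letter d
writeLines(pattern)                # shows the text the regex engine receives

# Raw strings (R 4.0.0 and later) avoid the doubled backslash
raw_pattern <- r"(\d)"
same_pattern <- identical(pattern, raw_pattern)

# Either form can be passed to stringr
digits <- str_extract("Room 42", "\\d+")
print(digits)
```

Both spellings produce the same two-character pattern, so the choice between them is a matter of readability.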

## Extracting a substring in Google Sheets using regex
Let's build a regular expression to extract the month from a date, written as a string in the format YYYY-MM-DD:

```R
library(stringr)
str_extract("2024-05-09", "-\\d{2}-")
```
```output
[1] "-05-"
```
This returns the month (MM), with the dashes on either side. There are more advanced ways to remove these, but a simple approach would be to use **`str_replace_all()`** to replace every "-" with "" (that is, nothing):
```R
str_replace_all("-05-", "-", "")
```
```output
[1] "05"
```
We could go one step further and complete both steps in one line of code by wrapping our first **`str_extract()`** function inside **`str_replace_all()`**:
```R
str_replace_all(str_extract("2024-05-09", "-\\d{2}-"), "-", "")
```
```output
[1] "05"
```
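
For comparison, here is a sketch of two of the more advanced approaches, which return the month without a separate clean-up step. These are alternatives suggested here for illustration, not part of the exercise: a capture group read back with `str_match()`, and lookarounds that require the dashes without including them in the match.

```R
library(stringr)

date <- "2024-05-09"

# Option 1: capture group. str_match() returns a character matrix in which
# column 1 is the full match and column 2 is the first parenthesised group.
month_group <- str_match(date, "-(\\d{2})-")[, 2]

# Option 2: lookarounds. (?<=-) requires a dash before the digits and
# (?=-) requires a dash after them, but neither dash is part of the match.
month_lookaround <- str_extract(date, "(?<=-)\\d{2}(?=-)")

print(month_group)
print(month_lookaround)
```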
::::::::::::::::::::::::::::::::::::::: challenge

### Using regex on a dataframe in R with stringr

Open the "SAFI_clean.csv" dataset we worked with on days 1 and 2 in R. It contains a column "interview_date" with the date of each interview in the format "YYYY-MM-DD". How can we apply the regular expression we just built to add a new column "interview_month" containing just the month (MM)?

Hint: the `mutate()` function from **`dplyr`** can create new columns containing modified values.

::::::::::::::: solution
### Solution
```R
library(tidyverse)
library(here)

# Read the data
interviews <- read_csv(
  here("data", "SAFI_clean.csv"),
  na = "NULL"
)

# Add a new column "interview_month", containing the month (MM) extracted from the interview_date (YYYY-MM-DD)
df <- interviews %>%
  mutate(interview_month = str_replace_all(str_extract(interview_date, "-(\\d{2})-"), "-", ""))
```
:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::

## Exercise: Extracting substrings in Google Sheets using regex

You can also use regular expressions in Google Sheets.


::::::::::::::::::::::::::::::::::::::: challenge

@@ -381,9 +451,7 @@ This is one way to solve this challenge. You might have found others. Inside the
Latitude and longitude are in decimal degree format and can be positive or negative, so we start with an optional dash for negative values, then use `\d+` to match one or more digits, followed by a period `\.`. Note we had to escape the period using `\`. After the period we look for one or more digits `\d+` again, followed by a literal comma `,`. We then have a literal space match followed by an optional dash `-` (there are a few `0.0` latitude/longitudes that are probably errors, but we'd want to retain them so we can deal with them). We then repeat the `\d+\.\d+` we used for the latitude match.

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::
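
The latitude/longitude pattern described in the solution can also be tried in R. This is a sketch only: the pattern below is reconstructed from the prose description (the exact expression used in the lesson is not shown in this hunk), and the sample strings are hypothetical.

```R
library(stringr)

# Reconstructed from the description: optional dash, digits, escaped period,
# digits, comma, space, then the same again for the longitude.
coord_pattern <- "-?\\d+\\.\\d+, -?\\d+\\.\\d+"

# Hypothetical sample values, for illustration only
samples <- c("Site A: -33.8688, 151.2093", "0.0, 0.0", "12, 34")

has_coords <- str_detect(samples, coord_pattern)
first_coords <- str_extract(samples[1], coord_pattern)

print(has_coords)
print(first_coords)
```

The third sample is not matched because it has no decimal point, which the pattern requires in both numbers.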

:::::::::::::::::::::::::::::::::::::::: keypoints

- Regular expressions are useful for searching and cleaning data.