diff --git a/episodes/01-regular-expressions.md b/episodes/01-regular-expressions.md index 4bd87c4d..50739b44 100644 --- a/episodes/01-regular-expressions.md +++ b/episodes/01-regular-expressions.md @@ -46,7 +46,7 @@ Most regular expression implementations employ similar syntaxes and metacharacte A very simple use of a regular expression would be to locate the same word spelled two different ways. For example the regular expression `organi[sz]e` matches both `organise` and `organize`. But because it locates all matches for the pattern in the file, not just for that word, it would also match `reorganise`, `reorganize`, `organises`, `organizes`, `organised`, `organized`, etc. -### Learning common regex metacharacters +## Learning common regex metacharacters Square brackets can be used to define a list or range of characters to be found. So: @@ -100,6 +100,8 @@ Or, any other string that starts a line, begins with a letter `o` in lower or ca :::::::::::::::::::::::::::::::::::::::::::::::::: +## Additional regex metacharacters + Other useful special characters are: - `*` matches the preceding element zero or more times. For example, ab\*c matches "ac", "abc", "abbbc", etc. @@ -113,7 +115,7 @@ So, what are these going to match? ::::::::::::::::::::::::::::::::::::::: challenge -## `^[Oo]rgani.e\w*` +## 1. `^[Oo]rgani.e\w*` What will the regular expression `^[Oo]rgani.e\w*` match? @@ -138,7 +140,7 @@ Or, any other string that starts a line, begins with a letter `o` in lower or ca ::::::::::::::::::::::::::::::::::::::: challenge -## `[Oo]rgani.e\w+$` +## 2. `[Oo]rgani.e\w+$` What will the regular expression `[Oo]rgani.e\w+$` match? @@ -163,7 +165,7 @@ Or, any other string that ends a line, begins with a letter `o` in lower or capi ::::::::::::::::::::::::::::::::::::::: challenge -## `^[Oo]rgani.e\w?\b` +## 3. `^[Oo]rgani.e\w?\b` What will the regular expression `^[Oo]rgani.e\w?\b` match? 
@@ -188,7 +190,7 @@ Or, any other string that starts a line, begins with a letter `o` in lower or ca ::::::::::::::::::::::::::::::::::::::: challenge -## `^[Oo]rgani.e\w?$` +## 4. `^[Oo]rgani.e\w?$` What will the regular expression `^[Oo]rgani.e\w?$` match? @@ -213,7 +215,7 @@ Or, any other string that starts and ends a line, begins with a letter `o` in lo ::::::::::::::::::::::::::::::::::::::: challenge -## `\b[Oo]rgani.e\w{2}\b` +## 5. `\b[Oo]rgani.e\w{2}\b` What will the regular expression `\b[Oo]rgani.e\w{2}\b` match? @@ -238,7 +240,7 @@ Or, any other string that begins with a letter `o` in lower or capital case afte ::::::::::::::::::::::::::::::::::::::: challenge -## `\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b` +## 6. `\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b` What will the regular expression `\b[Oo]rgani.e\b|\b[Oo]rgani.e\w{1}\b` match? @@ -261,7 +263,7 @@ Or, any other string that begins with a letter `o` in lower or capital case afte :::::::::::::::::::::::::::::::::::::::::::::::::: -This logic is useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. It can be used for looking at cells in spreadsheets for certain values, or for extracting some data from a column of a spreadsheet to make new columns. There are many other contexts in which regex is useful when using a computer to search through a document, spreadsheet, or file structure. Some real-world use cases for regex are included on a [ACRL Tech Connect blog post](https://acrl.ala.org/techconnect/post/fear-no-longer-regular-expressions/) . +This logic is useful when you have lots of files in a directory, when those files have logical file names, and when you want to isolate a selection of files. It can be used for looking at cells in spreadsheets for certain values, or for extracting some data from a column of a spreadsheet to make new columns. 
There are many other contexts in which regex is useful when using a computer to search through a document, spreadsheet, or file structure. You can browse real-world use cases in the [regex101.com library of community-submitted regex patterns](https://regex101.com/library). To embed this knowledge we will not - however - be using computers. Instead we'll use pen and paper for now. @@ -275,7 +277,7 @@ Then test each other on the answers. If you want to check your logic use [regex1 ::::::::::::::::::::::::::::::::::::::: challenge -## Using square brackets +## 1. Using square brackets What will the regular expression `Fr[ea]nc[eh]` match? @@ -300,7 +302,7 @@ Note that the way this regular expression is constructed, it will match misspell ::::::::::::::::::::::::::::::::::::::: challenge -## Using dollar signs +## 2. Using dollar signs What will the regular expression `Fr[ea]nc[eh]$` match? @@ -325,7 +327,7 @@ This will match the pattern only when it appears at the end of a line. It will a ::::::::::::::::::::::::::::::::::::::: challenge -## Introducing options +## 3. Introducing options What would match the strings `French` and `France` that appear at the beginning of a line? @@ -347,7 +349,7 @@ This will also find words where there were characters after `French` such as `Fr ::::::::::::::::::::::::::::::::::::::: challenge -## Case insensitivity +## 4. Case insensitivity How do you match the whole words `colour` and `color` (case insensitive)? @@ -375,7 +377,7 @@ so `/colou?r/i` will match all case insensitive variants of `colour` and `color` ::::::::::::::::::::::::::::::::::::::: challenge -## Word boundaries +## 5. Word boundaries How would you find the whole word `headrest` and or `head rest` but not head  rest (that is, with two spaces between `head` and `rest`? @@ -397,7 +399,7 @@ Note that although `\bhead\s?rest\b` does work, it will also match zero or one t ::::::::::::::::::::::::::::::::::::::: challenge -## Matching non-linguistic patterns +## 6. 
Matching non-linguistic patterns How would you find a string that ends with four letters preceded by at least one zero? @@ -415,7 +417,7 @@ How would you find a string that ends with four letters preceded by at least one ::::::::::::::::::::::::::::::::::::::: challenge -## Matching digits +## 7. Matching digits How do you match any four-digit string anywhere? @@ -437,7 +439,7 @@ Note: this will also match four-digit strings within longer strings of numbers a ::::::::::::::::::::::::::::::::::::::: challenge -## Matching dates +## 8. Matching dates How would you match the date format `dd-MM-yyyy`? @@ -459,7 +461,7 @@ Depending on your data, you may choose to remove the word bounding. ::::::::::::::::::::::::::::::::::::::: challenge -## Matching multiple date formats +## 9. Matching multiple date formats How would you match the date format `dd-MM-yyyy` or `dd-MM-yy` at the end of a line only? @@ -481,7 +483,7 @@ Note this will also find strings such as `31-01-198` at the end of a line, so yo ::::::::::::::::::::::::::::::::::::::: challenge -## Matching publication formats +## 10. Matching publication formats How would you match publication formats such as `British Library : London, 2015` and `Manchester University Press: Manchester, 1999`? diff --git a/episodes/02-match-extract-strings.md b/episodes/02-match-extract-strings.md index a1cad54b..a45ae875 100644 --- a/episodes/02-match-extract-strings.md +++ b/episodes/02-match-extract-strings.md @@ -17,13 +17,13 @@ exercises: 30 :::::::::::::::::::::::::::::::::::::::::::::::::: -## Exercise Using Regex101.com +## Exercise: Using Regex101.com For this exercise, open a browser and go to [https://regex101.com](https://regex101.com). Regex101.com is a free regular expression debugger with real time explanation, error detection, and highlighting. Open the [swcCoC.md file](https://github.com/LibraryCarpentry/lc-data-intro/tree/main/episodes/data/swcCoC.md), copy the text, and paste that into the test string box. 
-For a quick test to see if it is working, type the string `community` into the regular expression box. +For a quick test to see if it is working, type the string `community` into the regular expression box. If you look in the box on the right of the screen, you see that the expression matches six instances of the string 'community' (the instances are also highlighted within the text). @@ -31,7 +31,7 @@ If you look in the box on the right of the screen, you see that the expression m ### Taking spaces into consideration -Type `community `. You get three matches. Why not six? +Add a space after `community`. You get three matches. Why not six? ::::::::::::::: solution @@ -135,7 +135,7 @@ Find all of the words starting with Comm or comm that are plural. :::::::::::::::::::::::::::::::::::::::::::::::::: -## Exercise finding email addresses using regex101.com +## Exercise: finding email addresses For this exercise, open a browser and go to [https://regex101.com](https://regex101.com). @@ -217,7 +217,7 @@ See the previous exercise for the explanation of the expression up to the `+` :::::::::::::::::::::::::::::::::::::::::::::::::: -## Exercise finding phone numbers, Using regex101.com +## Exercise: finding phone numbers Does this Code of Conduct contain a phone number? @@ -355,9 +355,79 @@ This expression should find one match in the document. One of the reasons we stress the value of consistent and predictable directory and filenaming conventions is that working in this way enables you to use the computer to select files based on the characteristics of their file names. For example, if you have a bunch of files where the first four digits are the year and you only want to do something with files from '2017', then you can. Or if you have 'journal' somewhere in a filename when you have data about journals, you can use the computer to select just those files. 
Equally, using plain text formats means that you can go further and select files or elements of files based on characteristics of the data *within* those files. See Workshop Overview: [File Naming \& Formatting](https://librarycarpentry.org/lc-overview/06-file-naming-formatting) for further background. + +:::::::::::::::::::::::::::::::::::::::::::::::::: +## Exercise: Extracting substrings in R using regex + +You can use regular expressions in many functions in base R, for example **`grep`** and **`sub`**. We will look at some functions from **`stringr`**, a powerful package for character strings that works well with packages we have already seen, such as **`dplyr`** and **`tidyr`**. To learn more about **`stringr`** after the workshop, you may want to check out this handy [string manipulation with stringr cheatsheet](https://rstudio.github.io/cheatsheets/html/strings.html). + +We will look at just two functions in **`stringr`** that can take a regular expression as an argument: + +* `str_extract(string, pattern)`: returns the first pattern match found in each string, as a vector + +* `str_replace(string, pattern, replacement)`: finds the first pattern match in each string, and replaces it with a replacement string + +These functions act on the first pattern match only. To extract or replace every match, we can use **`str_extract_all()`** and **`str_replace_all()`**. + +::::::::::::::::::::::::::::::::::::::::: callout +### ESCAPING METACHARACTERS IN REGULAR EXPRESSIONS IN R +Regular expressions in R follow the general syntax we have seen so far, with one main exception. In R, strings use a backslash `\` to escape special behavior - but regular expressions are themselves written as strings in R. This creates a problem when we use a metacharacter like **`\d`**: R tries to read `\d` as a string escape sequence and, because it does not recognize one, raises an error. We get around this by adding an extra `\` beforehand, as in **`\\d`**, so that the pattern R passes to the regex engine is `\d`. 
:::::::::::::::::::::::::::::::::::::::::::::::::: -## Extracting a substring in Google Sheets using regex +Let's build a regular expression to extract the month from a date, written as a string in the format YYYY-MM-DD: + +```R +library(stringr) +str_extract("2024-05-09", "-\\d{2}-") +``` +```output +[1] "-05-" +``` +This returns the month (MM), with the dashes on either side. There are more advanced ways to remove these, but a simple approach would be to use **`str_replace_all()`** to replace every "-" with "" (that is, nothing): +```R +str_replace_all("-05-", "-", "") +``` +```output +[1] "05" +``` +We could go one step further and complete both steps in one line of code by wrapping our first **`str_extract()`** call inside **`str_replace_all()`**: +```R +str_replace_all(str_extract("2024-05-09", "-\\d{2}-"), "-", "") +``` +```output +[1] "05" +``` +::::::::::::::::::::::::::::::::::::::: challenge + +### Using regex on a dataframe in R with stringr + +Open the "SAFI_clean.csv" dataset we worked with on days 1 and 2 in R. It contains a column "interview_date" with the date of each interview in the format "YYYY-MM-DD". How can we apply the regular expression we just built to add a new column "interview_month" containing just the month (MM)? + +Hint: the `mutate()` function from dplyr can create new columns containing modified values. + +::::::::::::::: solution +### Solution +```R +library(tidyverse) +library(here) + +# Read the data +interviews <- read_csv( + here("data", "SAFI_clean.csv"), + na = "NULL") + +# Add a new column "interview_month", containing the month (MM) extracted from the interview_date (YYYY-MM-DD) +df <- interviews %>% + mutate(interview_month = str_replace_all(str_extract(interview_date, "-\\d{2}-"), "-", "")) +``` +::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::::: + +## Exercise: Extracting substrings in Google Sheets using regex + +You can also use regular expressions in Google Sheets. 
+ ::::::::::::::::::::::::::::::::::::::: challenge @@ -381,9 +451,7 @@ This is one way to solve this challenge. You might have found others. Inside the Latitude and longitude are in decimal degree format and can be positive or negative, so we start with an optional dash for negative values then use `\d+` for a one or more digit match followed by a period `\.`. Note we had to escape the period using `\`. After the period we look for one or more digits `\d+` again followed by a literal comma `,`. We then have a literal space match followed by an optional dash `-` (there are few `0.0` latitude/longitudes that are probably errors, but we'd want to retain so we can deal with them). We then repeat our `\d+\.\d+` we used for the latitude match. ::::::::::::::::::::::::: - :::::::::::::::::::::::::::::::::::::::::::::::::: - :::::::::::::::::::::::::::::::::::::::: keypoints - Regular expressions are useful for searching and cleaning data.
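
Note on the R exercise added to `episodes/02-match-extract-strings.md`: the new text mentions that there are "more advanced ways" to drop the dashes around the month but does not show one. The block below is a minimal sketch of two possible options, not part of the change itself. It assumes the **`stringr`** package is installed; the variable names `date_string`, `month_lookaround`, and `month_capture` are illustrative only.

```R
library(stringr)

# Example date in the YYYY-MM-DD format used in the exercise
date_string <- "2024-05-09"

# Option 1: lookarounds. (?<=-) and (?=-) require a dash on each side
# of the two digits but keep the dashes out of the returned match.
month_lookaround <- str_extract(date_string, "(?<=-)\\d{2}(?=-)")

# Option 2: a capture group. str_match() returns a character matrix whose
# first column is the full match and whose second column is the first group.
month_capture <- str_match(date_string, "-(\\d{2})-")[, 2]

print(month_lookaround)
print(month_capture)
```

Both options return the month in a single step, so the follow-up `str_replace_all()` call is not needed. The two-step version in the exercise is probably easier for learners who have not yet met groups or lookarounds.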