This file mostly describes regular expressions as in R (POSIX BRE and ERE), but also partly covers vim patterns and flags
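All escapes in R's regexps must be double-escaped, i.e. written with two backslashes in the string literal (unlike the single backslash used in ordinary string escapes passed to, e.g., cat() or writeLines()).
This also applies to replacement patterns, e.g. \\1 for backreferencing.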
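A minimal sketch of the double-escaping rule, using made-up input strings: the R parser consumes one level of backslashes, so the regex engine only sees what is left.

```r
# The literal "\\d+" holds three characters once R has parsed it: \ d +
pattern <- "\\d+"
n_chars <- nchar(pattern)   # 3
writeLines(pattern)         # prints \d+ (what the regex engine receives)

# Backreferences in the replacement are doubled in the same way
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")
print(swapped)              # "world hello"
```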
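. ( ) [ ] and similar characters are special symbols in R when left unescaped (unlike in vim).
In vim (with the default 'magic' setting) all special symbols except . * ^ $ must be single-escaped to become special, e.g. \( \) \| \+ \= (the bracket expression [ ] and ~ also work unescaped).
In vim, unescaped . * ^ $ are special symbols; when escaped, they stand for the literal characters.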
\f
- FF (form feed)
\n
- LF (line feed)
\r
- CR (carriage return)
\d
- digit class
\D
- its negation
\s
- space class
\S
- its negation
\<
- empty string at the beginning of the word ('word' is locale-dependent)
\>
- empty string at the end of the word
\b
- empty string at either edge of a word (the previous two combined)
\B
- empty string provided it is not at an edge of a word
\w
- word character (a synonym for [[:alnum:]_])
\W
- its negation
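A short sketch of the escapes above in R, on invented sample strings. It uses R's default regex engine; the inputs are chosen so that no two matches are adjacent.

```r
x <- "cat concat  cats"

# \b on both sides: only the standalone word "cat" is replaced
whole_word <- gsub("\\bcat\\b", "dog", x)
print(whole_word)    # "dog concat  cats"

# \< anchors at the beginning of a word, so "cats" matches but "concat" does not
word_start <- gsub("\\<cat", "dog", x)
print(word_start)    # "dog concat  dogs"

# \s+ collapses runs of whitespace, \D drops every non-digit
squeezed    <- gsub("\\s+", " ", x)
digits_only <- gsub("\\D", "", "tel: 555-0199")

# \w includes the underscore but not the space
is_word <- grepl("^\\w+$", c("snake_case", "two words"))
```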
\1 ... \9
- numbered backreference
\U \L
- change the text inserted by all following backreferences to uppercase or lowercase [Ex.2]
\E
- insert the following backreferences without any change of case
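A hedged illustration of the case-conversion escapes in a replacement string. They are only honoured with perl=TRUE; the taxon names are sample data.

```r
# Capitalise the first letter of every word and lowercase the rest
title_case <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "macoma BALTHICA", perl = TRUE)
print(title_case)   # "Macoma Balthica"

# \E switches the conversion off again, so \2 is inserted unchanged
genus_upper <- sub("(\\w+) (\\w+)", "\\U\\1\\E \\2", "macoma balthica", perl = TRUE)
print(genus_upper)  # "MACOMA balthica"
```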
[abcde]
- a OR b OR c OR d OR e
[^ab]
- neither a nor b
( )
- grouping with capturing
(?: )
- grouping without capturing
(regex)*
- the quantifier * applies to the whole group in ( )
(regex1)|(regex2)
- regex1 OR regex2 [Ex.4]
{ }
- quantifier (see below)
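The bracket, grouping and alternation constructs can be tried out as follows. This is only a sketch on invented strings; the non-capturing group is run with perl=TRUE as a cautious assumption.

```r
# Bracket expressions: remove listed characters, or everything that is NOT listed
only_g  <- gsub("[abcde]", "", "badge")    # "g"
only_ab <- gsub("[^ab]", "", "abcabd")     # "abab"

# A quantifier after a group applies to the whole group
repeated <- grepl("^(ab)*$", c("abab", "aba"))        # TRUE FALSE

# Alternation between two sub-expressions
pets <- grepl("^(cat|dog)$", c("cat", "dog", "cow"))  # TRUE TRUE FALSE

# (?: ) groups without capturing, so \1 refers to the surname
surname <- sub("(?:Mr|Ms) (\\w+)", "\\1", "Ms Sexton", perl = TRUE)  # "Sexton"
```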
[:alnum:]
- Alphanumeric characters: [:alpha:] and [:digit:]
[:alpha:]
- Alphabetic characters: [:lower:] and [:upper:]
[:blank:]
- Blank characters: space and tab, and possibly other locale-dependent characters such as non-breaking space
[:cntrl:]
- Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL). In another character set, these are the equivalent characters, if any
[:digit:]
- Digits: 0 1 2 3 4 5 6 7 8 9
[:graph:]
- Graphical characters: [:alnum:] and [:punct:]
[:lower:]
- Lower-case letters in the current locale
[:print:]
- Printable characters: [:alnum:], [:punct:] and space
[:punct:]
- Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[:space:]
- Space characters: tab, newline, vertical tab, form feed, carriage return, space and possibly other locale-dependent characters
[:upper:]
- Upper-case letters in the current locale
[:xdigit:]
- Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
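The POSIX classes are only valid inside a bracket expression, hence the doubled brackets in [[:punct:]]. A small sketch on sample strings:

```r
# Strip punctuation from an authority string
no_punct <- gsub("[[:punct:]]", "", "Sexton, 1939 (orig.)")
print(no_punct)     # "Sexton 1939 orig"

# Collapse any run of space characters (space, tab, newline) into one space
one_space <- gsub("[[:space:]]+", " ", "a \t\n b")

# Classes can be negated and combined like any other bracket content
alnum_only <- gsub("[^[:alnum:]]", "", "a-b_c 1")

# Check whether a string consists of hexadecimal digits only
is_hex <- grepl("^[[:xdigit:]]+$", c("1aF", "xyz"))
```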
.
- any single character
^
- empty string at the beginning of the line
$
- empty string at the end of the line
?
- The preceding item is optional and will be matched at most once (in vim's search syntax \= is used for that, e.g. /files\=)
*
- The preceding item will be matched zero or more times
+
- The preceding item will be matched one or more times
{n}
- The preceding item is matched exactly ‘n’ times
{n,}
- The preceding item is matched ‘n’ or more times
{n,m}
- The preceding item is matched at least ‘n’ times, but not more than ‘m’ times
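To see the quantifiers at work, the patterns below are anchored with ^ and $ so that the whole string has to match. The test strings are made up.

```r
files <- c("fil", "file", "files", "filess")
optional_s <- grepl("^files?$", files)    # FALSE TRUE TRUE FALSE

ab <- c("a", "abb", "ac")
zero_or_more <- grepl("^ab*$", ab)        # TRUE TRUE FALSE
one_or_more  <- grepl("^ab+$", ab)        # FALSE TRUE FALSE

a <- c("a", "aa", "aaa")
exactly_two  <- grepl("^a{2}$", a)        # FALSE TRUE FALSE
two_or_more  <- grepl("^a{2,}$", a)       # FALSE TRUE TRUE
one_to_two   <- grepl("^a{1,2}$", a)      # TRUE TRUE FALSE
```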
Normally, a repeated expression is greedy, that is, it matches as many characters as possible.
Appending ? to a quantifier, as in { }?, makes it minimal, or non-greedy (this also works for one-character quantifiers, like *?). A non-greedy subexpression matches as few characters as possible.
(vim uses \{-n,m} for that, e.g. /ab\{-1,3} matches "ab" in "abbb".)
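A sketch of the difference between greedy and non-greedy matching on an invented HTML-like string. The brace example is run with perl=TRUE, which is a cautious assumption rather than a requirement stated in these notes.

```r
x <- "<b>bold</b>"

# Greedy: .* runs to the LAST ">", so the whole string is removed
greedy <- sub("<.*>", "", x)
print(greedy)   # ""

# Non-greedy: .*? stops at the FIRST ">", so only the opening tag is removed
lazy <- sub("<.*?>", "", x)
print(lazy)     # "bold</b>"

# The same applies to brace quantifiers
most  <- regmatches("aaaa", regexpr("a{2,3}",  "aaaa", perl = TRUE))   # "aaa"
least <- regmatches("aaaa", regexpr("a{2,3}?", "aaaa", perl = TRUE))   # "aa"
```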
- matches any word between asterisks
Set(Value)?
- matches 'Set' or 'SetValue'
color=(red|green|blue)
- matches 'color=red', 'color=green' or 'color=blue'
- matches duplicated words, like "the the"
\<\u\l\+\>
- matches a word beginning with an uppercase letter (vim syntax)
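The examples above can be tried in R roughly as follows. The patterns for words between asterisks and for duplicated words are not given in these notes, so the ones used here are assumptions (one possible pattern each). The last pattern is an R approximation of the vim pattern \<\u\l\+\>, not a translation supplied by the source.

```r
# ASSUMED pattern: a word between asterisks
starred <- regmatches("this is *very* important",
                      regexpr("\\*\\w+\\*", "this is *very* important"))

# Optional group
set_ok <- grepl("^Set(Value)?$", c("Set", "SetValue", "SetVal"))

# Alternation inside a group; \1 holds the matched colour
colour <- sub("color=(red|green|blue)", "\\1", "color=blue")

# ASSUMED pattern: a word followed by a space and the same word again
deduped <- sub("\\b(\\w+) \\1\\b", "\\1", "Paris in the the spring")

# R approximation of vim's \<\u\l\+\> (uppercase letter, then lowercase letters)
capitalised <- regmatches("the genus Macoma",
                          regexpr("\\b[[:upper:]][[:lower:]]+\\b", "the genus Macoma"))
```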
sub("(a+)", "z\\U\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
#[1] "zAzbc" "def" "cbzAz a" "zAAz"
gsub("(a+)", "z\\U\\1z", c("abc", "def", "cba a", "aa"), perl=TRUE)
#[1] "zAzbc" "def" "cbzAz zAz" "zAAz"
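# Remove extra spaces and turn the space between genus and species into a plus sign
taxa <- c("Limulus polyphemus ", "Gammaridae ", "Amphipoda", " macoma balthica", "Babr baikali")
# The inner sub() joins genus and species with "+"; the outer one drops a trailing "+" when there is no species
taxa <- sub('\\+$', '', sub('^\\s*(\\w+)\\s*(\\w*)\\s*', '\\1+\\2', taxa))
#[1] "Limulus+polyphemus" "Gammaridae" "Amphipoda" "macoma+balthica" "Babr+baikali"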
Species <- "Gammarus tigrinus Sexton, 1939"
Auth <- regmatches(Species, regexpr("\\S*, \\d{4}\\)*", Species))
# "Sexton, 1939"
# Splitting/removing "Species/Subgenus" and cutting occasional "tails" (like "\n» ")
childBlock <- "Macoma (Macalia) H. Adams, 1861 represented as Macalia H. Adams, 1861 (alternate representation) » Species Macoma (Macalia) bruguieri (Hanley, 1844) represented as Macalia bruguieri (Hanley, 1844) (alternate representation)\nSubgenus Macoma (Macoma) Leach, 1819 represented as Macoma Leach, 1819 (alternate representation) » Species Macoma (Macoma) coani Kafanov & Lutaenko, 1999 represented as Macoma coani Kafanov & Lutaenko, 1999 (alternate representation)\n» "
Children <- sub('(representation\\)).*$', '\\1', unlist(strsplit(childBlock, '[\n]?Species |Subgenus ')))[-1]