This is still in Beta, we'd love to get your feedback on the syntax.
Anything outside of brackets is a literal:
This is a (short) literal :-)
You can use macros like #digit
(short: #d
) or #any
(#a
):
This is a [#lowercase #lc #lc #lc] regex :-)
You can repeat with n
, n+
or n-m
:
This is a [1+ #lc] regex :-)
If you want either of several options, use |
:
This is a ['Happy' | 'Short' | 'readable'] regex :-)
Capture with [capture <kleenexp>]
(short: [c <kleenexp>]
, named: [c:name <kleenexp>]
):
This is a [capture:adjective 1+ [#letter | ' ' | ',']] regex :-)
Reverse a pattern that matches a single character with not
:
[#start_line [0+ #space] [not ['-' | #digit | #space]] [0+ not #space]]
Define your own macros with #name=[<regex>]
:
This is a [#trochee #trochee #trochee] regex :-)[
[comment 'see xkcd 856']
#trochee=['Robot' | 'Ninja' | 'Pirate' | 'Doctor' | 'Laser' | 'Monkey']
]
Lookeahead and lookbehind:
[#start_string
[lookahead [0+ #any] #lowercase]
[lookahead [0+ #any] #uppercase]
[lookahead [0+ #any] #digit]
[not lookahead [0+ #any] ["123" | "pass" | "Pass"]]
[6+ #token]
#end_string
]
[")" [not lookbehind "()"]]
Add comments with the comment
operator:
[[comment "Custom macros can help document intent"]
#has_lower=[lookahead [0+ not #lowercase] #lowercase]
#has_upper=[lookahead [0+ not #uppercase] #uppercase]
#has_digit=[lookahead [0+ not #digit] [capture #digit]]
#no_common_sequences=[not lookahead [0+ #any] ["123" | "pass" | "Pass"]]
#start_string #has_lower #has_upper #has_digit #no_common_sequences [6+ #token_character] #end_string
]
#any, #letter, #lowercase, #uppercase, #digit, #newline, #space, #not_newline, #not_space, #integer, #token_character (digit or letter or underscore), #letters (one or more letters), #a..f (or with other letters), #1..5 (or with other numbers), #word_boundry, #start_line, #start_string
Detailed Table of Macros by Category
This is a literal. Anything outside of brackets is a literal (even text in parentethes and 'quoted' text)
Brackets may contain whitespace-separated #macros: [#macro #macro #macro]
Brackets may contain literals: ['I am a literal' "I am also a literal"]
Brackets may contain pipes to mean "one of these": [#letter | '_'][#digit | #letter | '_'][#digit | #letter | '_']
If they don't, they may begin with an operator: [0-1 #digit][not 'X'][capture #digit #digit #digit]
This is not a legal kleenexp: [#digit capture #digit] because the operator is not at the beginning
This is not a legal kleenexp: [capture #digit | #letter] because it has both an operator and a pipe
Brackets may contain brackets: [[#letter | '_'] [1+ [#digit | #letter | '_']]]
This is a special macro that matches either "c", "d", "e", or "f": [#c..f]
You can define your own macros (note the next '#' is a litral #): ['#' [[6 #hex] | [3 #hex]] #hex=[#digit | #a..f]]
There is a "comment" operator: ['(' [3 #d] ')' [0-1 #s] [3 #d] '.' [4 #d] [comment "ignore extensions for now" [0-1 '#' [1-4 #d]]]]
* Definitions /wrapped in slashes/
are in old regex syntax
Long Name | Short Name | Definition* | Notes |
---|---|---|---|
#any | #a | /./ |
May or may not match newlines depending on your engine and whether the kleenexp is compiled in multiline mode, see your regex engine's documentation |
#any_at_all | #aaa | [#any | #newline] |
|
#digit | #d | /\d/ |
|
#not_digit | #nd | [not #d] |
|
#letter | #l | /[A-Za-z]/ |
When in unicode mode, this will be translated as \p{L} in languages that support it (and throw an error elsewhere) |
#not_letter | #nl | [not #l] |
|
#lowercase | #lc | /[a-z]/ |
Unicode: \p{Ll} |
#not_lowercase | #nlc | [not #lc] |
|
#uppercase | #uc | /[A-Z]/ |
Unicode: \p{Lu} |
#not_uppercase | #nuc | [not #uc] |
|
#newline | #n | [#newline_character | #crlf] |
Note that this may match 1 or 2 characters! |
#space | #s | /\s/ |
|
#not_space | #ns | [not #space] |
|
#token_character | #tc | [#letter | #digit | '_'] |
|
#not_token_character | #ntc | [not #tc] |
|
#token | [#letter | '_'][0+ #token_character] |
||
#<char1>..<char2>, e.g. #a..f , #1..9 |
[<char1>-<char2>] |
char1 and char2 must be of the same class (lowercase english, uppercase english, numbers) and char1 must be strictly below char2 , otherwise it's an error (e.g. these are errors: #a..a , #e..a , #0..f , #!..@ ) |
|
#letters | [1+ #letter] |
||
#token | [#letter | '_'][0+ #token_character] |
Long Name | Short Name | Definition* | Notes |
---|---|---|---|
#newline_character | #nc | /[\r\n\u2028\u2029]/ |
Any of #cr , #lf , and in unicode a couple more (explanation] |
#newline | #n | [#newline_character | #crlf] |
Note that this may match 1 or 2 characters! |
#not_newline | #nn | [not #newline_character] |
Note that this may only match 1 character, and is not the negation of #n but of #nc ! |
#linefeed | #lf | /\n/ |
See also #n (explanation] |
#carriage_return | #cr | /\r/ |
See also #n (explanation] |
#windows_newline | #crlf | /\r\n/ |
Windows newline (explanation] |
#tab | #t | /\t/ |
|
#not_tab | #nt | [not #tab] |
|
#vertical_tab | /\v/ |
Long Name | Short Name | Definition* | Notes |
---|---|---|---|
#word_boundary | #wb | /\b/ |
|
#not_word_boundary | #nwb | [not #wb] |
|
#start_string | #ss | /\A/ (this is the same as #sl unless the engine is in multiline mode) |
|
#end_string | #es | /\Z/ (this is the same as #el unless the engine is in multiline mode) |
|
#start_line | #sl | /^/ (this is the same as #ss unless the engine is in multiline mode) |
|
#end_line | #el | /$/ (this is the same as #es unless the engine is in multiline mode) |
Long Name | Short Name | Definition* | Notes |
---|---|---|---|
#quote | #q | ' |
|
#double_quote | #dq | " |
|
#left_brace | #lb | [ '[' ] |
|
#right_brace | #rb | [ ']' ] |
Long Name | Short Name | Definition* | Notes |
---|---|---|---|
#integer | #int | [[0-1 '-'] [1+ #digit]] |
|
#digits | #ds | [1+ #digit] |
|
#decimal | [#int [0-1 '.' #digits] |
||
#float | [[0-1 '-'] [[#digits '.' [0-1 #digits] | '.' #digits] [0-1 #exponent] | #int #exponent] #exponent=[['e' | 'E'] [0-1 ['+' | '-']] #digits]] |
||
#hex_digit | #hexd | [#digit | #a..f | #A..F] |
|
#hex_number | #hexn | [1+ #hex_digit] |
Long Name | Short Name | Definition* | Notes |
---|---|---|---|
#bell | /\a/ |
||
#backspace | /[\b]/ |
||
#formfeed | /\f/ |
Long Name | Short Name | Definition* | Notes |
---|---|---|---|
#capture_0+_any | #c0 | [capture 0+ #any] |
|
#capture_1+_any | #c1 | [capture 1+ #any] |
* Definitions /wrapped in slashes/
are in old regex syntax (because the macro isn't simply a short way to express something you could express otherwise)
"[not ['a' | 'b']]" => /[^ab]/
"[#digit | [#a..f]]" => /[0-9a-f]/
Trying to compile the empty string raises an error (because this is more often a mistake than not). In the rare case you need it, use []
.
#integer
,#ip
, ...,#a..f
numbers: #number_scientific
- improve readability insice brackets scope with
#dot
,#hash
,#tilde
... abc[ignore_case 'de' #lowercase]
(which translates toabc[['D' | 'd'] ['E'|'e'] [[A-Z] | [a-z]]
, today you just wouldn't try)[#0..255]
(which translates to['25' #0..5 | '2' #0..4 #d | '1' #d #d | #1..9 #d | #d]
[capture:name ...]
,[1+:fewest ...]
(for non-greedy repeat)- unicode support. Full PCRE feature support (lookahead/lookback, some other stuff)
- Option to add your macros permanently.
ke.add_macro("#camelcase=[1+ [#uppercase [0+ lowercase]]], path_optional)
,[add_macro #month=['january', 'January', 'Jan', ....]]
ke.import_macros("./apache_logs_macros.ke")
,ke.export_macros("./my_macros.ke")
, and maybe arrange built-in ke macros in packages
#month
,#word
,#year_month_day
or#yyyy-mm-dd
- See TODO.txt.