
Tutorial and Documentation

Tutorial

This is still in Beta, and we'd love to get your feedback on the syntax.

Anything outside of brackets is a literal:

This is a (short) literal :-)

You can use macros like #digit (short: #d) or #any (#a):

This is a [#lowercase #lc #lc #lc] regex :-)

You can repeat with n, n+ or n-m:

This is a [1+ #lc] regex :-)
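
To see what the first three examples mean in old regex syntax, here is a minimal sketch using Python's standard `re` module. The patterns are hand-written equivalents (an assumption, not the output of the kleenexp compiler), and the `matches` helper is hypothetical:

```python
import re

# Hand-written classic-regex equivalents of the kleenexps above (assumed).
# "This is a (short) literal :-)" -> every character is escaped as a literal.
literal = re.compile(re.escape("This is a (short) literal :-)"))
# "This is a [#lowercase #lc #lc #lc] regex :-)" -> four lowercase letters.
four_lowercase = re.compile(r"This is a [a-z]{4} regex :-\)")
# "This is a [1+ #lc] regex :-)" -> one or more lowercase letters.
one_or_more = re.compile(r"This is a [a-z]+ regex :-\)")


def matches(pattern, text):
    """Return True when the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None
```

Note how the parentheses and the smiley need escaping in the old syntax, while the kleenexp versions are written as-is.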

If you want either of several options, use |:

This is a ['Happy' | 'Short' | 'readable'] regex :-)

Capture with [capture <kleenexp>] (short: [c <kleenexp>], named: [c:name <kleenexp>]):

This is a [capture:adjective 1+ [#letter | ' ' | ',']] regex :-)
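
For comparison, the alternation and named-capture examples could look like this in Python's `re` syntax. This is a hand-written sketch (assumed equivalents, not compiler output), and the `adjective` helper is hypothetical:

```python
import re

# ['Happy' | 'Short' | 'readable'] -> a non-capturing alternation.
either = re.compile(r"This is a (?:Happy|Short|readable) regex :-\)")
# [capture:adjective 1+ [#letter | ' ' | ',']] -> a named group.
named = re.compile(r"This is a (?P<adjective>[A-Za-z ,]+) regex :-\)")


def adjective(text):
    """Return the captured adjective, or None when the text does not match."""
    m = named.fullmatch(text)
    return m.group("adjective") if m else None
```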

Negate a pattern that matches a single character with not:

[#start_line [0+ #space] [not ['-' | #digit | #space]] [0+ not #space]]
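
In old regex syntax, `not` over single-character patterns becomes a negated character class. A hand-written Python `re` sketch of the example above (an assumed equivalent, not compiler output):

```python
import re

# [#start_line [0+ #space] [not ['-' | #digit | #space]] [0+ not #space]]
# Optional leading whitespace, then a first character that is not a dash,
# digit, or space, then the rest of the word.
first_word = re.compile(r"^\s*[^-\d\s]\S*")
```

For instance, this would find `"  hello"` in `"  hello world"` but nothing in `"- item"` or `"42 apples"`.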

Define your own macros with #name=[<regex>]:

This is a [#trochee #trochee #trochee] regex :-)[
    [comment 'see xkcd 856']
    #trochee=['Robot' | 'Ninja' | 'Pirate' | 'Doctor' | 'Laser' | 'Monkey']
]
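
Classic regex has no macros, so the closest thing in plain Python is to build the pattern from a named string. A rough sketch of the `#trochee` example (the `TROCHEE` name and the translation are assumptions for illustration):

```python
import re

# #trochee=['Robot' | 'Ninja' | 'Pirate' | 'Doctor' | 'Laser' | 'Monkey']
TROCHEE = r"(?:Robot|Ninja|Pirate|Doctor|Laser|Monkey)"

# [#trochee #trochee #trochee]: whitespace between macros is only a
# separator, so the three words are matched back to back.
pattern = re.compile(rf"This is a {TROCHEE}{TROCHEE}{TROCHEE} regex :-\)")
```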

Lookahead and lookbehind:

[#start_string
  [lookahead [0+ #any] #lowercase]
  [lookahead [0+ #any] #uppercase]
  [lookahead [0+ #any] #digit]
  [not lookahead [0+ #any] ["123" | "pass" | "Pass"]]
  [6+ #token]
  #end_string
]
[")" [not lookbehind "()"]]

Add comments with the comment operator:

[[comment "Custom macros can help document intent"]
  #has_lower=[lookahead [0+ not #lowercase] #lowercase]
  #has_upper=[lookahead [0+ not #uppercase] #uppercase]
  #has_digit=[lookahead [0+ not #digit] [capture #digit]]
  #no_common_sequences=[not lookahead [0+ #any] ["123" | "pass" | "Pass"]]

  #start_string #has_lower #has_upper #has_digit #no_common_sequences [6+ #token_character] #end_string
]
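
To make the password example concrete, here is how the same checks might be spelled with Python's `re`, keeping one named string per macro. This is a hand-written sketch under the assumption that `#token_character` means `[A-Za-z0-9_]`; it is not the output of the kleenexp compiler:

```python
import re

# One string per macro from the example above (assumed translations).
HAS_LOWER = r"(?=[^a-z]*[a-z])"          # #has_lower
HAS_UPPER = r"(?=[^A-Z]*[A-Z])"          # #has_upper
HAS_DIGIT = r"(?=\D*(\d))"               # #has_digit, captures the first digit
NO_COMMON = r"(?!.*(?:123|pass|Pass))"   # #no_common_sequences

PASSWORD = re.compile(
    r"\A" + HAS_LOWER + HAS_UPPER + HAS_DIGIT + NO_COMMON
    + r"[A-Za-z0-9_]{6,}\Z"              # [6+ #token_character]
)
```

Each lookahead checks one requirement without consuming input, so all of them are tested from the start of the string before the final `[6+ #token_character]` part does the actual matching.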

Common Macros

#any, #letter, #lowercase, #uppercase, #digit, #newline, #space, #not_newline, #not_space, #integer, #token_character (digit or letter or underscore), #letters (one or more letters), #a..f (or with other letters), #1..5 (or with other numbers), #word_boundary, #start_line, #start_string

Detailed Table of Macros by Category

Cheatsheet

This is a literal. Anything outside of brackets is a literal (even text in parentheses and 'quoted' text)
Brackets may contain whitespace-separated #macros: [#macro #macro #macro]
Brackets may contain literals: ['I am a literal' "I am also a literal"]
Brackets may contain pipes to mean "one of these": [#letter | '_'][#digit | #letter | '_'][#digit | #letter | '_']
If they don't, they may begin with an operator: [0-1 #digit][not 'X'][capture #digit #digit #digit]
This is not a legal kleenexp: [#digit capture #digit] because the operator is not at the beginning
This is not a legal kleenexp: [capture #digit | #letter] because it has both an operator and a pipe
Brackets may contain brackets: [[#letter | '_'] [1+ [#digit | #letter | '_']]]
This is a special macro that matches either "c", "d", "e", or "f": [#c..f]
You can define your own macros (note the next '#' is a literal #): ['#' [[6 #hex] | [3 #hex]] #hex=[#digit | #a..f]]
There is a "comment" operator: ['(' [3 #d] ')' [0-1 #s] [3 #d] '.' [4 #d] [comment "ignore extensions for now" [0-1 '#' [1-4 #d]]]]
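
As a quick check of the custom-macro line in the cheatsheet, the hex color pattern could be written like this with Python's `re`. The `HEX` name and the translation are assumptions made for illustration:

```python
import re

# #hex=[#digit | #a..f]
HEX = r"[\da-f]"
# ['#' [[6 #hex] | [3 #hex]]] -> a literal '#', then 6 or 3 hex digits.
hex_color = re.compile(rf"#(?:{HEX}{{6}}|{HEX}{{3}})")
```

Since `#a..f` only covers lowercase letters, a color like `#FF8800` would not match this sketch.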

Detailed Table of Macros by Category

* Definitions /wrapped in slashes/ are in old regex syntax

Basic

| Long Name | Short Name | Definition* | Notes |
| --- | --- | --- | --- |
| `#any` | `#a` | `/./` | May or may not match newlines depending on your engine and whether the kleenexp is compiled in multiline mode, see your regex engine's documentation |
| `#any_at_all` | `#aaa` | `[#any \| #newline]` | |
| `#digit` | `#d` | `/\d/` | |
| `#not_digit` | `#nd` | `[not #d]` | |
| `#letter` | `#l` | `/[A-Za-z]/` | When in unicode mode, this will be translated as `\p{L}` in languages that support it (and throw an error elsewhere) |
| `#not_letter` | `#nl` | `[not #l]` | |
| `#lowercase` | `#lc` | `/[a-z]/` | Unicode: `\p{Ll}` |
| `#not_lowercase` | `#nlc` | `[not #lc]` | |
| `#uppercase` | `#uc` | `/[A-Z]/` | Unicode: `\p{Lu}` |
| `#not_uppercase` | `#nuc` | `[not #uc]` | |
| `#newline` | `#n` | `[#newline_character \| #crlf]` | Note that this may match 1 or 2 characters! |
| `#space` | `#s` | `/\s/` | |
| `#not_space` | `#ns` | `[not #space]` | |
| `#token_character` | `#tc` | `[#letter \| #digit \| '_']` | |
| `#not_token_character` | `#ntc` | `[not #tc]` | |
| `#token` | | `[#letter \| '_'][0+ #token_character]` | |
| `#<char1>..<char2>`, e.g. `#a..f`, `#1..9` | | `[<char1>-<char2>]` | char1 and char2 must be of the same class (lowercase english, uppercase english, numbers) and char1 must be strictly below char2, otherwise it's an error (e.g. these are errors: `#a..a`, `#e..a`, `#0..f`, `#!..@`) |
| `#letters` | | `[1+ #letter]` | |
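
The rules for range macros such as `#a..f` can be sketched as a small validator. This is only an illustration of the rules stated in the table; `range_macro_to_class` is a hypothetical helper, not part of the kleenexp API:

```python
import re
import string

# The three character classes a range macro may draw from.
_CLASSES = (string.ascii_lowercase, string.ascii_uppercase, string.digits)
_RANGE = re.compile(r"#(.)\.\.(.)")


def range_macro_to_class(macro):
    """Translate e.g. '#a..f' to '[a-f]', raising ValueError on bad ranges."""
    m = _RANGE.fullmatch(macro)
    if m is None:
        raise ValueError(f"not a range macro: {macro!r}")
    low, high = m.groups()
    # Both ends must belong to the same class (lowercase, uppercase, digits).
    if not any(low in cls and high in cls for cls in _CLASSES):
        raise ValueError(f"range ends are not of the same class: {macro!r}")
    # The first character must be strictly below the second.
    if low >= high:
        raise ValueError(f"range start must be below range end: {macro!r}")
    return f"[{low}-{high}]"
```

With this sketch, `#a..f` and `#1..9` translate to `[a-f]` and `[1-9]`, while `#a..a`, `#e..a`, `#0..f` and `#!..@` are all rejected.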

Whitespace

| Long Name | Short Name | Definition* | Notes |
| --- | --- | --- | --- |
| `#newline_character` | `#nc` | `/[\r\n\u2028\u2029]/` | Any of `#cr`, `#lf`, and in unicode a couple more (explanation) |
| `#newline` | `#n` | `[#newline_character \| #crlf]` | Note that this may match 1 or 2 characters! |
| `#not_newline` | `#nn` | `[not #newline_character]` | Note that this may only match 1 character, and is not the negation of `#n` but of `#nc`! |
| `#linefeed` | `#lf` | `/\n/` | See also `#n` (explanation) |
| `#carriage_return` | `#cr` | `/\r/` | See also `#n` (explanation) |
| `#windows_newline` | `#crlf` | `/\r\n/` | Windows newline (explanation) |
| `#tab` | `#t` | `/\t/` | |
| `#not_tab` | `#nt` | `[not #tab]` | |
| `#vertical_tab` | | `/\v/` | |
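
The difference between `#newline` (1 or 2 characters) and `#not_newline` (always exactly 1) is easy to see with hand-written Python `re` equivalents. These translations are assumptions for illustration; `fullmatch` is used so the order of the alternatives does not affect the result:

```python
import re

# #newline = [#newline_character | #crlf]
NEWLINE = re.compile(r"(?:[\r\n\u2028\u2029]|\r\n)")
# #not_newline = [not #newline_character]
NOT_NEWLINE = re.compile(r"[^\r\n\u2028\u2029]")


def newline_length(text):
    """Return how many characters #newline consumed, or None if no match."""
    m = NEWLINE.fullmatch(text)
    return len(m.group()) if m else None
```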

Boundaries

| Long Name | Short Name | Definition* | Notes |
| --- | --- | --- | --- |
| `#word_boundary` | `#wb` | `/\b/` | |
| `#not_word_boundary` | `#nwb` | `[not #wb]` | |
| `#start_string` | `#ss` | `/\A/` | (this is the same as `#sl` unless the engine is in multiline mode) |
| `#end_string` | `#es` | `/\Z/` | (this is the same as `#el` unless the engine is in multiline mode) |
| `#start_line` | `#sl` | `/^/` | (this is the same as `#ss` unless the engine is in multiline mode) |
| `#end_line` | `#el` | `/$/` | (this is the same as `#es` unless the engine is in multiline mode) |
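
The multiline note is easiest to understand by running the underlying old-syntax anchors directly. A short sketch with Python's `re` (it shows Python's engine only; other engines may differ in details):

```python
import re

TEXT = "first\nsecond"

# ^ (#start_line) only matches at the start of the string by default...
line_starts_default = re.findall(r"^\w+", TEXT)
# ...but at the start of every line in multiline mode.
line_starts_multiline = re.findall(r"^\w+", TEXT, re.MULTILINE)
# \A (#start_string) ignores multiline mode.
string_starts_multiline = re.findall(r"\A\w+", TEXT, re.MULTILINE)

# The same holds for $ (#end_line) versus \Z (#end_string).
line_ends_multiline = re.findall(r"\w+$", TEXT, re.MULTILINE)
string_ends_multiline = re.findall(r"\w+\Z", TEXT, re.MULTILINE)
```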

Special characters

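| Long Name | Short Name | Definition* | Notes |
| --- | --- | --- | --- |
| `#quote` | `#q` | `'` | |
| `#double_quote` | `#dq` | `"` | |
| `#left_brace` | `#lb` | `[ '[' ]` | |
| `#right_brace` | `#rb` | `[ ']' ]` | |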

Numbers

| Long Name | Short Name | Definition* | Notes |
| --- | --- | --- | --- |
| `#integer` | `#int` | `[[0-1 '-'] [1+ #digit]]` | |
| `#digits` | `#ds` | `[1+ #digit]` | |
| `#decimal` | | `[#int [0-1 '.' #digits]]` | |
| `#float` | | `[[0-1 '-'] [[#digits '.' [0-1 #digits] \| '.' #digits] [0-1 #exponent] \| #int #exponent] #exponent=[['e' \| 'E'] [0-1 ['+' \| '-']] #digits]]` | |
| `#hex_digit` | `#hexd` | `[#digit \| #a..f \| #A..F]` | |
| `#hex_number` | `#hexn` | `[1+ #hex_digit]` | |
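
To get a feel for what the number macros accept, here is a literal, hand-written translation of the definitions above into Python `re` strings. The constant names are hypothetical and the translation is an assumption, not compiler output:

```python
import re

DIGITS = r"\d+"                        # #digits
INTEGER = rf"-?{DIGITS}"               # #integer
DECIMAL = rf"{INTEGER}(?:\.{DIGITS})?"  # #decimal
EXPONENT = rf"[eE][+-]?{DIGITS}"       # #exponent (local to #float)
# #float: digits with a dot (and optional exponent), or an integer
# followed by a mandatory exponent.
FLOAT = (
    rf"-?(?:(?:{DIGITS}\.(?:{DIGITS})?|\.{DIGITS})(?:{EXPONENT})?"
    rf"|{INTEGER}{EXPONENT})"
)
HEX_NUMBER = r"[\da-fA-F]+"            # #hex_number


def is_full_match(pattern, text):
    """Return True when the whole text matches the pattern string."""
    return re.fullmatch(pattern, text) is not None
```

Note that under these definitions a plain integer such as `42` is a `#decimal` but not a `#float`, since `#float` requires either a dot or an exponent.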

Very rare characters

| Long Name | Short Name | Definition* | Notes |
| --- | --- | --- | --- |
| `#bell` | | `/\a/` | |
| `#backspace` | | `/[\b]/` | |
| `#formfeed` | | `/\f/` | |

Capture shortcuts

| Long Name | Short Name | Definition* | Notes |
| --- | --- | --- | --- |
| `#capture_0+_any` | `#c0` | `[capture 0+ #any]` | |
| `#capture_1+_any` | `#c1` | `[capture 1+ #any]` | |

* Definitions /wrapped in slashes/ are in old regex syntax (because the macro isn't simply a short way to express something you could express otherwise)

"[not ['a' | 'b']]" => /[^ab]/
"[#digit | [#a..f]]" => /[0-9a-f]/

Trying to compile the empty string raises an error (because this is more often a mistake than not). In the rare case you need it, use [].

Coming soon:

  • #integer, #ip, ..., #a..f
  • numbers: #number_scientific
  • improve readability inside brackets scope with #dot, #hash, #tilde...
  • abc[ignore_case 'de' #lowercase] (which translates to abc[['D' | 'd'] ['E' | 'e'] [[A-Z] | [a-z]]]; today you just wouldn't try)
  • [#0..255] (which translates to ['25' #0..5 | '2' #0..4 #d | '1' #d #d | #1..9 #d | #d])
  • [capture:name ...], [1+:fewest ...] (for non-greedy repeat)
  • unicode support. Full PCRE feature support (lookahead/lookbehind, some other stuff)
  • Option to add your macros permanently. ke.add_macro("#camelcase=[1+ [#uppercase [0+ #lowercase]]]", path_optional), [add_macro #month=['january', 'January', 'Jan', ....]]
    • ke.import_macros("./apache_logs_macros.ke"), ke.export_macros("./my_macros.ke"), and maybe arrange built-in ke macros in packages
  • #month, #word, #year_month_day or #yyyy-mm-dd
  • See TODO.txt.
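
The planned `[#0..255]` translation from the list above can already be tried out in old regex syntax. A hand-written Python `re` sketch of that expansion (the `OCTET` name is hypothetical):

```python
import re

# ['25' #0..5 | '2' #0..4 #d | '1' #d #d | #1..9 #d | #d]
OCTET = re.compile(r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d)")


def is_octet(text):
    """Return True when text is a number from 0 to 255 without leading zeros."""
    return OCTET.fullmatch(text) is not None
```

The longer alternatives come first so that, when searching inside a larger string, `255` is not cut short to `25` or `2`.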