diff --git a/CHANGELOG.md b/CHANGELOG.md index 63eb64fe..85b2719a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -12,6 +12,7 @@ The format mostly follows [Keep a Changelog](http://keepachangelog.com/en/1.0.0/ - New option `ignore_incomplete_reads` (Requested in #725 by wschoot, contributed in #787 by wfrisch) - New option `wait_for` in browser jobs (Requested in #763 by yuis-ice, contributed in #810 by jamstah) - Added tags to jobs and the ability to select them at the command line (#789 by jamstah) +- New filter `re.findall` (Requested in #804 by f0sh, contributed in #805 by jamstah) ### Changed diff --git a/docs/source/filters.rst b/docs/source/filters.rst index 65148d02..9b109e60 100644 --- a/docs/source/filters.rst +++ b/docs/source/filters.rst @@ -77,6 +77,7 @@ At the moment, the following filters are built-in: - **ical2text**: Convert `iCalendar`_ to plaintext - **ocr**: Convert text in images to plaintext using Tesseract OCR - **re.sub**: Replace text with regular expressions using Python's re.sub +- **re.findall**: Find all non-overlapping matches using Python's re.findall - **reverse**: Reverse input items - **sha1sum**: Calculate the SHA-1 checksum of the content - **shellpipe**: Filter using a shell command @@ -485,12 +486,13 @@ Alternatively, ``jq`` can be used for filtering: filter: - jq: '.[0].name' -Remove or replace text using regular expressions ------------------------------------------------- +Find, remove or replace text using regular expressions +------------------------------------------------------ -Just like Python’s ``re.sub`` function, there’s the possibility to apply -a regular expression and either remove of replace the matched text. The -following example applies the filter 3 times: +You can use ``re.sub`` and ``re.findall`` to apply regular expressions. + +``re.sub`` can be used to remove of replace all non-overlapping instances +of matched text. The following example applies the filter 3 times: 1. 
Just specifying a string as the value will replace the matches with the empty string. @@ -499,11 +501,7 @@ following example applies the filter 3 times: 3. You can use groups (``()``) and back-reference them with ``\1`` (etc..) to put groups into the replacement string. -All features are described in Python’s -`re.sub `__ -documentation (the ``pattern`` and ``repl`` values are passed to this -function as-is, with the value of ``repl`` defaulting to the empty -string). +``repl`` defaults to the empty string, which will remove matched strings. .. code:: yaml @@ -517,15 +515,40 @@ string). pattern: ']*)>' repl: '' -If you want to enable certain flags (e.g. ``re.MULTILINE``) in the -call, this is possible by inserting an "inline flag" documented in -`flags in re.compile`_, here are some examples: +``re.findall`` can be used to find all non-overlapping matches of a +regular expression. Each match is output on its own line. The following +example applies the filter twice: + +1. It uses a group (``()``) and back-reference (``\1``) to extract a + date from the input string. +2. It breaks the numbers in the date out into separate lines. + +If ``repl`` is not specified, the full match will be included in the output. + +.. code:: yaml + + url: https://example.com/regex-findall.html + filter: + - re.findall: + pattern: 'The next draw is on (\d\d\d\d-\d\d-\d\d).' + repl: '\1' + - re.findall: '[0-9]+' + +Note: When using HTML or XML, it is usually better to use CSS selectors or +XPATH expressions. HTML and XML cannot be parsed properly using regular +expressions. If the CSS selector or XPATH cannot provide the targeted +selection required, using an ``html2text`` filter first then using +``re.findall`` can be a good pattern. + +If you want to enable flags (e.g. 
``re.MULTILINE``) in ``re.sub`` +or ``re.findall`` filters, use an "inline flag", here are some +examples: * ``re.MULTILINE``: ``(?m)`` (Makes ``^`` match start-of-line and ``$`` match end-of-line) * ``re.DOTALL``: ``(?s)`` (Makes ``.`` also match a newline) * ``re.IGNORECASE``: ``(?i)`` (Perform case-insensitive matching) -.. _flags in re.compile: https://docs.python.org/3/library/re.html#re.compile +.. _full re syntax: https://docs.python.org/3/library/re.html#regular-expression-syntax This allows you, for example, to remove all leading spaces (only space character and tab): diff --git a/lib/urlwatch/filters.py b/lib/urlwatch/filters.py index ed21b4c0..e4fd25ab 100644 --- a/lib/urlwatch/filters.py +++ b/lib/urlwatch/filters.py @@ -848,6 +848,26 @@ def filter(self, data, subfilter): return re.sub(subfilter['pattern'], subfilter.get('repl', ''), data) +class RegexFindall(FilterBase): + """Pick out regular expressions using Python's re.findall""" + + __kind__ = 're.findall' + + __supported_subfilters__ = { + 'pattern': 'Regular expression to search for (required)', + 'repl': 'Replacement string (default: full match)', + } + + __default_subfilter__ = 'pattern' + + def filter(self, data, subfilter): + if 'pattern' not in subfilter: + raise ValueError('{} needs a pattern'.format(self.__kind__)) + + # Default: Replace with full match if no "repl" value is set + return "\n".join(match.expand(subfilter.get('repl', '\\g<0>')) for match in re.finditer(subfilter['pattern'], data)) + + class SortFilter(FilterBase): """Sort input items""" diff --git a/lib/urlwatch/tests/data/filter_documentation_testdata.yaml b/lib/urlwatch/tests/data/filter_documentation_testdata.yaml index cdcf8dfa..cbaf509f 100644 --- a/lib/urlwatch/tests/data/filter_documentation_testdata.yaml +++ b/lib/urlwatch/tests/data/filter_documentation_testdata.yaml @@ -285,6 +285,19 @@ https://example.com/regex-substitute.html: HEADING 1: Welcome to this webpage Some Link +https://example.com/regex-findall.html: + 
input: |- + Welcome to the lottery webpage. + The numbers for 2020-07-11 are: + + 4, 8, 15, 16, 23 and 42 + + The next draw is on 2020-07-13. + Thank you for visiting the lottery webpage. + output: |- + 2020 + 07 + 13 https://example.net/shellpipe-grep.txt: input: |-

Welcome to our price watching page!

diff --git a/lib/urlwatch/tests/data/filter_tests.yaml b/lib/urlwatch/tests/data/filter_tests.yaml index 081cf5e4..72629d23 100644 --- a/lib/urlwatch/tests/data/filter_tests.yaml +++ b/lib/urlwatch/tests/data/filter_tests.yaml @@ -326,6 +326,32 @@ re_sub_multiline: One Line Another Line +re_findall: + filter: + - re.findall: '-[a-z][a-z][a-z]-' + data: |- + Some-abc-things-def-on-ghi-this-line-and + some-jkl-more-mno-here + expected_result: |- + -abc- + -def- + -ghi- + -jkl- + -mno- +re_findall_repl: + filter: + - re.findall: + pattern: '-([a-z])([a-z])([a-z])-' + repl: '\3\2\1' + data: |- + Some-abc-things-def-on-ghi-this-line-and + some-jkl-more-mno-here + expected_result: |- + cba + fed + ihg + lkj + onm strip: filter: strip data: " The rose is red; \n\nthe violet's blue.\nSugar is sweet, \nand so are you. "
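
Reviewer note: the new `RegexFindall.filter` body can be tried outside urlwatch with the minimal sketch below. The helper name `regex_findall` and its signature are hypothetical and used only for illustration; the one-line body mirrors the expression added in `lib/urlwatch/filters.py`, and the sample text is taken from the `re_findall` test cases in this diff. Although the filter is named `re.findall`, the added code iterates with `re.finditer` and expands `repl` as a template on each match, so a pattern with groups still yields the full match by default.

```python
import re


def regex_findall(data, pattern, repl='\\g<0>'):
    """Standalone sketch of the filter body from the diff (hypothetical helper)."""
    # One output line per non-overlapping match. The repl template is expanded
    # per match: '\1' refers to group 1, '\g<0>' to the whole match (the default).
    return "\n".join(m.expand(repl) for m in re.finditer(pattern, data))


# Sample input copied from the re_findall test cases in filter_tests.yaml
text = "Some-abc-things-def-on-ghi-this-line-and\nsome-jkl-more-mno-here"

# Without repl: each full match on its own line
print(regex_findall(text, '-[a-z][a-z][a-z]-'))

# With repl: groups are reordered through back-references
print(regex_findall(text, '-([a-z])([a-z])([a-z])-', r'\3\2\1'))
```

Chaining two calls, as the documentation example does, corresponds to feeding the output of the first call into the second as `data`.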