Add re.findall to pick out re matches
Actually using re.finditer so we can apply a repl to the result. This
allows users to pick out matches and reformat them in one step.

Fixes #804

Signed-off-by: James Hewitt <[email protected]>
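
To illustrate the choice described in the commit message, here is a minimal sketch (not part of the commit) comparing `re.findall` with `re.finditer` plus `Match.expand`. The sample text, pattern, and variable names are made up for illustration.

```python
import re

# Hypothetical sample input and pattern, for illustration only.
text = "draw 2020-07-13, draw 2020-07-20"
pattern = r"draw (\d{4})-(\d{2})-(\d{2})"

# re.findall returns plain tuples of group strings when the pattern has groups,
# so there is no match object left to apply a replacement template to.
tuples = re.findall(pattern, text)

# re.finditer yields match objects; Match.expand applies a template with
# back-references to each match, which is what a "repl" option needs.
reformatted = [m.expand(r"\3/\2/\1") for m in re.finditer(pattern, text)]

print(tuples)
print(reformatted)
```

With match objects, picking out a match and reformatting it happen in one pass, which appears to be the motivation given above.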
Jamstah committed May 5, 2024
1 parent 17d02c4 commit 89d9c3a
Showing 5 changed files with 97 additions and 14 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -12,6 +12,7 @@ The format mostly follows [Keep a Changelog](http://keepachangelog.com/en/1.0.0/
- New option `ignore_incomplete_reads` (Requested in #725 by wschoot, contributed in #787 by wfrisch)
- New option `wait_for` in browser jobs (Requested in #763 by yuis-ice, contributed in #810 by jamstah)
- Added tags to jobs and the ability to select them at the command line (#789 by jamstah)
- New filter `re.findall` (Requested in #804 by f0sh, contributed in #805 by jamstah)

### Changed

51 changes: 37 additions & 14 deletions docs/source/filters.rst
@@ -77,6 +77,7 @@ At the moment, the following filters are built-in:
- **ical2text**: Convert `iCalendar`_ to plaintext
- **ocr**: Convert text in images to plaintext using Tesseract OCR
- **re.sub**: Replace text with regular expressions using Python's re.sub
- **re.findall**: Find all non-overlapping matches using Python's re.findall
- **reverse**: Reverse input items
- **sha1sum**: Calculate the SHA-1 checksum of the content
- **shellpipe**: Filter using a shell command
@@ -485,12 +486,13 @@ Alternatively, ``jq`` can be used for filtering:
filter:
- jq: '.[0].name'
Remove or replace text using regular expressions
------------------------------------------------
Find, remove or replace text using regular expressions
------------------------------------------------------

Just like Python’s ``re.sub`` function, there’s the possibility to apply
a regular expression and either remove or replace the matched text. The
following example applies the filter 3 times:
You can use ``re.sub`` and ``re.findall`` to apply regular expressions.

``re.sub`` can be used to remove or replace all non-overlapping instances
of matched text. The following example applies the filter 3 times:

1. Just specifying a string as the value will replace the matches with
the empty string.
@@ -499,11 +501,7 @@ following example applies the filter 3 times:
3. You can use groups (``()``) and back-reference them with ``\1``
(etc.) to put groups into the replacement string.

All features are described in Python’s
`re.sub <https://docs.python.org/3/library/re.html#re.sub>`__
documentation (the ``pattern`` and ``repl`` values are passed to this
function as-is, with the value of ``repl`` defaulting to the empty
string).
``repl`` defaults to the empty string, which will remove matched strings.

.. code:: yaml
@@ -517,15 +515,40 @@ string).
pattern: '</([^>]*)>'
repl: '<END OF TAG \1>'
If you want to enable certain flags (e.g. ``re.MULTILINE``) in the
call, this is possible by inserting an "inline flag" documented in
`flags in re.compile`_, here are some examples:
``re.findall`` can be used to find all non-overlapping matches of a
regular expression. Each match is output on its own line. The following
example applies the filter twice:

1. It uses a group (``()``) and back-reference (``\1``) to extract a
date from the input string.
2. It breaks the numbers in the date out into separate lines.

If ``repl`` is not specified, the full match will be included in the output.

.. code:: yaml

   url: https://example.com/regex-findall.html
   filter:
     - re.findall:
         pattern: 'The next draw is on (\d\d\d\d-\d\d-\d\d).'
         repl: '\1'
     - re.findall: '[0-9]+'

Note: When using HTML or XML, it is usually better to use CSS selectors or
XPATH expressions. HTML and XML cannot be parsed properly using regular
expressions. If the CSS selector or XPATH cannot provide the targeted
selection required, using an ``html2text`` filter first then using
``re.findall`` can be a good pattern.

If you want to enable flags (e.g. ``re.MULTILINE``) in ``re.sub``
or ``re.findall`` filters, use an "inline flag". Here are some
examples:

* ``re.MULTILINE``: ``(?m)`` (Makes ``^`` match start-of-line and ``$`` match end-of-line)
* ``re.DOTALL``: ``(?s)`` (Makes ``.`` also match a newline)
* ``re.IGNORECASE``: ``(?i)`` (Perform case-insensitive matching)
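
As a rough illustration of how these inline flags behave in Python's ``re`` module, the sketch below combines ``(?i)`` and ``(?m)`` in one pattern. The sample text and variable names are hypothetical and not taken from the urlwatch documentation.

```python
import re

# Hypothetical multi-line input, for illustration only.
text = "Error: disk\nerror: net\nWarning: cpu"

# Without flags, ^ and $ anchor only at the ends of the whole string
# and matching is case-sensitive, so nothing is found here.
plain = [m.group(0) for m in re.finditer(r"^error: \w+$", text)]

# (?i) ignores case and (?m) lets ^ and $ match at every line boundary.
flagged = [m.group(0) for m in re.finditer(r"(?im)^error: \w+$", text)]

print(plain)
print(flagged)
```

Because the flags are written into the pattern itself, they need no extra filter option.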

.. _flags in re.compile: https://docs.python.org/3/library/re.html#re.compile
.. _full re syntax: https://docs.python.org/3/library/re.html#regular-expression-syntax

This allows you, for example, to remove all leading spaces (only
space character and tab):
20 changes: 20 additions & 0 deletions lib/urlwatch/filters.py
@@ -848,6 +848,26 @@ def filter(self, data, subfilter):
        return re.sub(subfilter['pattern'], subfilter.get('repl', ''), data)


class RegexFindall(FilterBase):
    """Pick out regular expressions using Python's re.findall"""

    __kind__ = 're.findall'

    __supported_subfilters__ = {
        'pattern': 'Regular expression to search for (required)',
        'repl': 'Replacement string (default: full match)',
    }

    __default_subfilter__ = 'pattern'

    def filter(self, data, subfilter):
        if 'pattern' not in subfilter:
            raise ValueError('{} needs a pattern'.format(self.__kind__))

        # Default: Replace with full match if no "repl" value is set
        return "\n".join(match.expand(subfilter.get('repl', '\\g<0>')) for match in re.finditer(subfilter['pattern'], data))


class SortFilter(FilterBase):
    """Sort input items"""

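
The filter logic in this hunk can be tried outside urlwatch with a standalone sketch. The function name `findall_filter` and the sample inputs are hypothetical; the real class relies on `FilterBase`, which is not reproduced here, so this only mirrors the body of `filter()` under that assumption.

```python
import re


def findall_filter(data, subfilter):
    """Standalone stand-in for RegexFindall.filter (hypothetical helper)."""
    if 'pattern' not in subfilter:
        raise ValueError('re.findall needs a pattern')

    # '\g<0>' is the template for "the whole match", used when no repl is given.
    repl = subfilter.get('repl', r'\g<0>')
    return '\n'.join(m.expand(repl) for m in re.finditer(subfilter['pattern'], data))


# Default repl: every full match ends up on its own line.
default_out = findall_filter('a1b22c333', {'pattern': '[0-9]+'})

# No matches: joining an empty sequence gives an empty string.
no_match = findall_filter('abc', {'pattern': '[0-9]+'})

print(repr(default_out))
print(repr(no_match))
```

One consequence worth noting is that input without any match produces empty output rather than an error.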
13 changes: 13 additions & 0 deletions lib/urlwatch/tests/data/filter_documentation_testdata.yaml
@@ -285,6 +285,19 @@ https://example.com/regex-substitute.html:
    HEADING 1: Welcome to this webpage<END OF TAG h1>
    <a>Some Link<END OF TAG a>
    <END OF TAG div>
https://example.com/regex-findall.html:
  input: |-
    Welcome to the lottery webpage.
    The numbers for 2020-07-11 are:
    4, 8, 15, 16, 23 and 42
    The next draw is on 2020-07-13.
    Thank you for visiting the lottery webpage.
  output: |-
    2020
    07
    13
https://example.net/shellpipe-grep.txt:
  input: |-
    <h1>Welcome to our price watching page!</h1>
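
The documented two-step example and its expected output above can be reproduced with plain `re` calls. This is only a sketch of what the two chained filters compute; the variable names are hypothetical, and the input is the test data shown in this hunk.

```python
import re

# Input copied from the regex-findall test data above.
page = "\n".join([
    "Welcome to the lottery webpage.",
    "The numbers for 2020-07-11 are:",
    "4, 8, 15, 16, 23 and 42",
    "The next draw is on 2020-07-13.",
    "Thank you for visiting the lottery webpage.",
])

# Step 1: keep only the captured date of the next draw (repl '\1').
step1 = "\n".join(
    m.expand(r"\1")
    for m in re.finditer(r"The next draw is on (\d\d\d\d-\d\d-\d\d).", page)
)

# Step 2: split the date into its numbers, one per line (no repl, full match).
step2 = "\n".join(m.group(0) for m in re.finditer(r"[0-9]+", step1))

print(step1)
print(step2)
```

The first step discards the earlier date and the lottery numbers, so the second step only sees the one extracted date.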
26 changes: 26 additions & 0 deletions lib/urlwatch/tests/data/filter_tests.yaml
@@ -326,6 +326,32 @@ re_sub_multiline:
    One Line
    Another Line
re_findall:
  filter:
    - re.findall: '-[a-z][a-z][a-z]-'
  data: |-
    Some-abc-things-def-on-ghi-this-line-and
    some-jkl-more-mno-here
  expected_result: |-
    -abc-
    -def-
    -ghi-
    -jkl-
    -mno-
re_findall_repl:
  filter:
    - re.findall:
        pattern: '-([a-z])([a-z])([a-z])-'
        repl: '\3\2\1'
  data: |-
    Some-abc-things-def-on-ghi-this-line-and
    some-jkl-more-mno-here
  expected_result: |-
    cba
    fed
    ihg
    lkj
    onm
strip:
  filter: strip
  data: " The rose is red; \n\nthe violet's blue.\nSugar is sweet, \nand so are you. "
