Added HTML parsing for content from Threatmatch #2846

pietrocapece · 2024-10-25T11:15:52Z

Proposed changes

Added a function that parses the content being fetched from Threatmatch and removes any HTML tags

Related issues

None

Checklist

I consider the submitted work as finished
I tested the code for its functionality using different use cases
I added/updated the relevant documentation (either on GitHub or on Notion)
Where necessary I refactored code to improve the overall quality

Further comments

flavienSindou

I have reviewed and tested part of this implementation, and it works well overall.

I left a few suggestions that I believe would help make the code simpler.

Thank you for adding information and error logs.

The connector still follows an older connector template. Newer features, such as pycti.OpenCTIConnectorHelper.schedule_iso (called as self.helper.schedule_iso inside a connector), could help you manage the connector's scheduled runs, for instance.

flavienSindou · 2024-10-28T15:45:44Z

external-import/threatmatch/src/threatmatch.py

+    def remove_html_tags(self, text):
+        class HTMLTagRemover(HTMLParser):
+            def __init__(self):
+                super().__init__()
+                self.fed = []
+
+            def handle_data(self, data):
+                self.fed.append(data)
+
+            def get_data(self):
+                return "".join(self.fed)
+
+        parser = HTMLTagRemover()
+        parser.feed(text)
+        return parser.get_data()
+


I don't think this is necessary: you could use the popular BeautifulSoup library later in the code to remove all tags.

Suggested change

-    def remove_html_tags(self, text):
-        class HTMLTagRemover(HTMLParser):
-            def __init__(self):
-                super().__init__()
-                self.fed = []
-
-            def handle_data(self, data):
-                self.fed.append(data)
-
-            def get_data(self):
-                return "".join(self.fed)
-
-        parser = HTMLTagRemover()
-        parser.feed(text)
-        return parser.get_data()
-
+    object["description"] = bs4.BeautifulSoup(object["description"], "html.parser").get_text()
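For anyone comparing the two approaches, here is a minimal standalone sketch, not the connector's actual code. The names `strip_tags`, `strip_tags_stdlib` and `_TextCollector` are hypothetical and chosen only for illustration. It assumes `beautifulsoup4` may or may not be installed and falls back to the standard-library `HTMLParser` approach from the original diff when it is missing.

```python
from html.parser import HTMLParser

try:
    # Third-party dependency (pip install beautifulsoup4); optional in this sketch.
    from bs4 import BeautifulSoup
except ImportError:
    BeautifulSoup = None


class _TextCollector(HTMLParser):
    """Collects only the text nodes seen while parsing (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_stdlib(text):
    """Standard-library variant, equivalent in spirit to the original diff."""
    parser = _TextCollector()
    parser.feed(text)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.chunks)


def strip_tags(text):
    """Use BeautifulSoup when available, as suggested in the review."""
    if BeautifulSoup is not None:
        return BeautifulSoup(text, "html.parser").get_text()
    return strip_tags_stdlib(text)


if __name__ == "__main__":
    sample = "<p>Suspicious <b>login</b> from <a href='https://example.com'>host</a></p>"
    print(strip_tags(sample))
```

Both variants decode character references such as `&amp;`, so for simple descriptions the results should match. The BeautifulSoup call is shorter, at the cost of an extra dependency in the connector's requirements.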

external-import/threatmatch/src/threatmatch.py

Co-authored-by: flavienSindou <[email protected]>

helene-nguyen · 2024-11-15T05:20:52Z

@pietrocapece Thank you for your contribution. Could you resolve the conflicts?

pietrocapece and others added 5 commits August 28, 2024 11:04

Clean Branch for submission

ee32771

Merge branch 'OpenCTI-Platform:master' into master

a0ad3ab

Added HTML parsing

00f4af1

Merge branch 'OpenCTI-Platform:master' into master

fce8ba6

Added HTML Parsing for content from Threatmatch

11f9826

flavienSindou approved these changes Oct 28, 2024


richard-julien force-pushed the master branch from 416a305 to 982a01c October 28, 2024 22:36

pietrocapece and others added 5 commits November 1, 2024 10:50

Update external-import/threatmatch/src/threatmatch.py

e3f25ef

Co-authored-by: flavienSindou <[email protected]>

Update external-import/threatmatch/src/threatmatch.py

ebcd701

Co-authored-by: flavienSindou <[email protected]>

Added BEautifulsoup Parsing

f156b27

Added Beautifulsoup and Linted to pass checks

6b2525f

Removed unused import

0e3fb64

Merge branch 'master' into master

385ca99
