Added HTML parsing for content from Threatmatch #2846

Open
wants to merge 11 commits into
base: master

Conversation

pietrocapece
Contributor

Proposed changes

  • Added a function that parses the content being fetched from Threatmatch and removes any HTML tags
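A minimal standalone sketch of this kind of tag stripping, using only the standard library's html.parser. The names `_TextCollector` and `strip_html_tags` and the sample input are illustrative assumptions, not code taken from the connector:

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Keeps only text nodes; tags themselves are discarded."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &amp;
        # reach handle_data already decoded.
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)


def strip_html_tags(text):
    """Return `text` with HTML tags removed (hypothetical helper)."""
    parser = _TextCollector()
    parser.feed(text)
    parser.close()  # flush any text the parser may still be buffering
    return parser.get_text()


print(strip_html_tags("<p>Phishing <b>campaign</b> &amp; malware</p>"))
# prints: Phishing campaign & malware
```

Note that this keeps the text content of every element, so the contents of script or style elements would also be retained; whether that matters depends on what Threatmatch actually returns.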

Related issues

  • None

Checklist

  • I consider the submitted work as finished
  • I tested the code for its functionality using different use cases
  • I added/updated the relevant documentation (either on GitHub or on Notion)
  • Where necessary I refactored code to improve the overall quality

Further comments

Contributor

@flavienSindou left a comment

I have reviewed and tested part of this implementation, and it works well overall.

I left a few suggestions that I believe would help make the code simpler.

Thank you for adding information and error logs.

The connector still follows an older version of the connector template. Newer features, such as pycti.OpenCTIConnectorHelper.helper.schedule_iso, could help you manage the connector's scheduled runs.

Comment on lines 85 to 100
    def remove_html_tags(self, text):
        class HTMLTagRemover(HTMLParser):
            def __init__(self):
                super().__init__()
                self.fed = []

            def handle_data(self, data):
                self.fed.append(data)

            def get_data(self):
                return "".join(self.fed)

        parser = HTMLTagRemover()
        parser.feed(text)
        return parser.get_data()

I think this is not necessary, as you could use the popular BeautifulSoup library later in the code to remove all tags.

Suggested change
-    def remove_html_tags(self, text):
-        class HTMLTagRemover(HTMLParser):
-            def __init__(self):
-                super().__init__()
-                self.fed = []
-            def handle_data(self, data):
-                self.fed.append(data)
-            def get_data(self):
-                return "".join(self.fed)
-        parser = HTMLTagRemover()
-        parser.feed(text)
-        return parser.get_data()
+object["description"] = bs4.BeautifulSoup(object["description"], "html.parser").get_text()

external-import/threatmatch/src/threatmatch.py (outdated)
@helene-nguyen
Member

@pietrocapece Thank you for your contribution. Could you resolve the conflicts?


3 participants