
Commit

Merge pull request #273 from massi-ang/fix_crawler
fix(crawler): only process "text/html" Content-Type pages
massi-ang authored Dec 16, 2023
2 parents 7a3e918 + 17d6da4 commit f9277a7
Showing 1 changed file with 2 additions and 0 deletions.
@@ -101,6 +101,8 @@ def parse_url(url: str):
    base_url = f"{root_url_parse.scheme}://{root_url_parse.netloc}"

    response = requests.get(url, timeout=20)
    if response.headers["Content-Type"] != "text/html":
        raise Exception(f"Invalid content type {response.headers['Content-Type']}")
    soup = BeautifulSoup(response.content, "html.parser")
    content = soup.text
    content = re.sub(r"[ \n]+", " ", content)
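
Note on the check above: the comparison is an exact string match, so a header such as "text/html; charset=utf-8" would be rejected, and a response with no Content-Type header would raise a KeyError before the intended exception. A minimal sketch of a more tolerant check follows. The helper names `media_type` and `is_html` are hypothetical and are not part of this repository; the sketch only assumes a mapping of header names to values.

```python
from typing import Mapping, Optional


def media_type(content_type: Optional[str]) -> str:
    """Return the bare media type, lowercased, without parameters such as charset."""
    if not content_type:
        return ""
    # "text/html; charset=utf-8" -> "text/html"
    return content_type.split(";", 1)[0].strip().lower()


def is_html(headers: Mapping[str, str]) -> bool:
    """Hypothetical helper: True only when the Content-Type media type is text/html."""
    # .get() avoids a KeyError when the server sends no Content-Type header.
    return media_type(headers.get("Content-Type")) == "text/html"
```

Inside `parse_url`, the guard could then read `if not is_html(response.headers): raise ...`, which keeps the behaviour of skipping non-HTML pages while accepting HTML responses that declare a charset.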