Web Scraper does not support application/xml Type #3513

wwwDESIGN-basti · 2024-04-14T10:55:44Z

Self Checks

This is only for bug reports; if you would like to ask a question, please head to Discussions.
I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
Please do not modify this template :) and fill in all the required fields.

Dify version

0.6.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Hey guys,

I tried to fetch an RSS feed (XML) with the Web Scraper module, but the call fails with an unsupported content-type error.
How can I fix this?

Input

{
  "url": "https://news.google.com/rss/search?q=Cloud&hl=de&gl=DE&ceid=DE:de",
  "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.1000.0 Safari/537.36"
}

Output:

{
  "text": "Unsupported content-type [application/xml] of URL.\n",
  "files": []
}

✔️ Expected Behavior

A successful call that returns the feed content.

❌ Actual Behavior

The application/xml content type is not supported.

@dosu-bot · 2024-04-14T10:56:11Z

To address the issue of the Dify web scraper module not supporting the application/xml content type when trying to scrape an RSS feed, you'll need to modify the code to include support for this content type. Specifically, you should update the get_url function to recognize application/xml as a supported content type. Here's a brief overview of what needs to be done:

Add "application/xml" to the list of supported content types within your function. This ensures that when your scraper encounters content of type application/xml, it knows to process it rather than reject it.
Implement logic to handle application/xml content appropriately. If your current processing logic (such as ExtractProcessor) doesn't support XML, you'll need to integrate XML parsing and handling. This could involve using an XML parser to extract the necessary data from the RSS feed.

Here's a snippet to illustrate how you might modify the get_url function:

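from typing import Optional

def get_url(url: str, user_agent: Optional[str] = None) -> str:
    headers = {"User-Agent": user_agent or "default_user_agent"}
    supported_content_types = ["text/html", "application/xml"]
    # Fetch the URL with these headers, then check the response's
    # Content-Type against supported_content_types.
    # Add a branch that parses application/xml content (e.g. an RSS feed).
    raise NotImplementedError("fetching and parsing are not shown in this sketch")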

This modification ensures that URLs with content type application/xml are processed instead of being rejected. If you're not familiar with handling XML in Python, consider using libraries such as xml.etree.ElementTree for parsing XML content.
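As a rough illustration of the XML handling step, here is a minimal sketch that turns an RSS 2.0 document into plain text with xml.etree.ElementTree. The function name rss_to_text and the sample feed are hypothetical and are not part of Dify; the sketch also assumes the feed uses plain RSS 2.0 item, title, and link elements without namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, used only to demonstrate the parsing step.
SAMPLE_RSS = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/2</link>
    </item>
  </channel>
</rss>
"""


def rss_to_text(xml_bytes: bytes) -> str:
    """Return one 'title (link)' line per <item> in an RSS 2.0 document.

    Assumption: this helper is illustrative and not an existing Dify function.
    """
    root = ET.fromstring(xml_bytes)
    lines = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing.
        title = (item.findtext("title", default="") or "").strip()
        link = (item.findtext("link", default="") or "").strip()
        lines.append(f"{title} ({link})")
    return "\n".join(lines)


if __name__ == "__main__":
    print(rss_to_text(SAMPLE_RSS))
```

Passing the raw response bytes rather than a decoded string lets the parser honor the encoding declared in the XML header. Atom feeds use namespaced entry elements and would need a different lookup.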

Remember, the key is to ensure that your scraper can recognize and process application/xml content types, allowing you to successfully scrape RSS feeds or similar XML-based resources.
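The recognition step could look something like the sketch below. Servers often send the header with parameters (for example "application/xml; charset=utf-8"), so the check strips those before comparing. The names SUPPORTED_CONTENT_TYPES, main_content_type, and is_supported are hypothetical, and including text/xml and application/rss+xml is an assumption that goes beyond what this thread reports; only application/xml appears in the error above.

```python
# Assumption: this set is illustrative; only "text/html" and "application/xml"
# are mentioned in the thread. The other two are common feed types added here.
SUPPORTED_CONTENT_TYPES = {
    "text/html",
    "application/xml",
    "text/xml",
    "application/rss+xml",
}


def main_content_type(header_value: str) -> str:
    """Strip parameters such as charset and normalize case."""
    return header_value.split(";", 1)[0].strip().lower()


def is_supported(header_value: str) -> bool:
    """Return True if the Content-Type header names a supported media type."""
    return main_content_type(header_value) in SUPPORTED_CONTENT_TYPES


if __name__ == "__main__":
    for value in ("application/xml; charset=utf-8", "text/html", "image/png"):
        print(value, is_supported(value))
```

Comparing the normalized media type rather than the raw header avoids rejecting a feed only because the server appended a charset.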

Sources

api/core/tools/utils/web_reader_tool.py

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.
