ChantalG1 edited this page Jan 28, 2020 · 12 revisions

**Note: This page is under construction. We are working on better documentation for the webscraping feature.**

# Webscraping with CSS and XPath

If you set the response format to "text", after downloading HTML files you will find the HTML source code of the pages in the `text` property. You can use CSS selectors and XPath expressions to scrape data with the pipe & modifier syntax inside keys: the pipe operator `|` passes a value on to the next step, and the modifiers `css:` and `xpath:` apply a selector to that value.

For example, `text|css:div.article` will first select the `text` key and then pass the value to an XML parser in order to select all elements matching the CSS selector `div.article` (all `div` elements whose class attribute contains "article").
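To make the selection step concrete, here is a minimal sketch of what matching `div.article` amounts to, using only Python's standard library. The HTML snippet is made up for illustration, and the token-based class check emulates CSS class matching; the tool's actual parser may differ.

```python
from xml.etree import ElementTree

# Hypothetical downloaded page standing in for the value of the "text" key.
html = ("<html><body>"
        "<div class='article'>First</div>"
        "<div class='teaser'>Ad</div>"
        "</body></html>")
root = ElementTree.fromstring(html)

# The CSS selector div.article matches div elements whose class attribute
# contains the token "article", so we check the split class list.
matches = [el for el in root.iter("div")
           if "article" in el.get("class", "").split()]
print([el.text for el in matches])  # only the first div matches
```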

The same may be achieved using XPath: `text|xpath://div[@class='article']` will first select the `text` key and then pass the value to an XML parser in order to select all elements matching the given XPath expression (all `div` elements whose class attribute equals "article").
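The XPath variant can be sketched the same way. Python's `ElementTree` supports a small XPath subset, and its exact-match attribute predicate mirrors `//div[@class='article']`; the snippet below is again an illustration with made-up HTML, not the tool's implementation.

```python
from xml.etree import ElementTree

# Hypothetical page content; note the footer div does not match.
html = ("<html><body>"
        "<div class='article'>Breaking news</div>"
        "<div class='footer'>About</div>"
        "</body></html>")
root = ElementTree.fromstring(html)

# ElementTree's XPath subset handles exact attribute predicates,
# matching the semantics of //div[@class='article'].
matches = root.findall(".//div[@class='article']")
print([el.text for el in matches])
```

Note the subtle difference from the CSS example: `@class='article'` requires the attribute to equal "article" exactly, while `div.article` also matches elements with additional class tokens.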

You can even chain functions. For example, `text|css:div.article|xpath://text()` will first select the `text` key, then select `div` elements with class "article", and finally extract all text in those elements.
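The chained expression can be emulated by running the element selection first and then collecting descendant text nodes, which is what `//text()` does. In the sketch below (illustrative HTML, stdlib only) `itertext()` plays the role of the final `//text()` step.

```python
from xml.etree import ElementTree

html = ("<html><body>"
        "<div class='article'>Headline <b>bold</b> tail</div>"
        "<div class='nav'>Menu</div>"
        "</body></html>")
root = ElementTree.fromstring(html)

texts = []
for div in root.findall(".//div[@class='article']"):
    # itertext() walks every descendant text node in document
    # order, analogous to //text() relative to the element.
    texts.extend(div.itertext())
print(texts)
```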

If the extracted HTML or XML elements contain JSON, you can convert it from text to JSON with the modifier `json:`. Add a key if you are only interested in specific values of the object.
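A common case is JSON embedded in a script element. The sketch below shows the idea behind the `json:` step: extract the element's text, parse it, and pick out one key. The snippet and the `headline` key are made-up examples, not part of the tool.

```python
import json
from xml.etree import ElementTree

# Hypothetical page with embedded JSON, e.g. structured metadata.
html = ('<html><body><script type="application/ld+json">'
        '{"headline": "Example", "author": {"name": "Jane"}}'
        '</script></body></html>')
root = ElementTree.fromstring(html)

raw = root.find(".//script").text   # the element's text content
data = json.loads(raw)              # what the json: modifier does
# Adding a key, such as "headline", selects one value of the object.
print(data["headline"])
```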

The pipe & modifier syntax can be used inside keys in several places:

  1. You can extract elements and values while downloading data, using the fields Key to extract and Key for Object ID in the Generic Module. New child nodes are created from the extracted data.
  2. To show HTML data in the data view and export it as a CSV file, you can scrape data in the column setup.
  3. The function Extract data, right above the Detail View, lets you extract data after downloading. This function follows the same logic as the downloading step. New child nodes are created from the extracted data.