Datafields which happen to hold data like '&.+$' (regex - ampersands) are not escaped and destroy XML output #118

johannespostler · 2013-02-15T12:16:25Z

Any data field in the database whose content matches the regex '&.+$' breaks the output of all XSAMS files. This regex also matches all HTML entities, e.g. &auml; (ä). If such content is written to the output, the ampersand is not escaped, which breaks most browsers and the validator (unless the text happens to form a valid entity). After an ampersand, browsers expect an entity name terminated by a semicolon.

Testcase:
http://ideadb.uibk.ac.at/view/107/

The url field of this scan contains the following characters (within the link):
52fed736-74fc-11e2-9a8e-00000aacb35f&acdnat=1360663964_abbc8fd43c6ff547c477bb7648e5250d

Since this is a rather common pattern for URLs, this is a real problem.
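The failure can be reproduced outside the node with a few lines of Python. This is only a sketch: the element name SourceURI is a placeholder chosen for illustration, not necessarily the element XSAMS uses, and is_well_formed is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# URL fragment from the test case above
url_fragment = (
    "52fed736-74fc-11e2-9a8e-00000aacb35f"
    "&acdnat=1360663964_abbc8fd43c6ff547c477bb7648e5250d"
)


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# "SourceURI" is a placeholder element name (assumption)
raw_doc = "<SourceURI>" + url_fragment + "</SourceURI>"
escaped_doc = "<SourceURI>" + escape(url_fragment) + "</SourceURI>"

print(is_well_formed(raw_doc))      # the bare "&acdnat=" is rejected by the parser
print(is_well_formed(escaped_doc))  # "&amp;acdnat=" parses fine
```

After parsing the escaped document, the element text is the original URL fragment again, so escaping loses no information.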

ivh · 2013-02-15T12:33:58Z

This is indeed a problem, and it is related to #83. However, the NodeSoftware cannot know whether the database content is already escaped, and we certainly do not want to escape twice. Therefore each node has to make sure itself that it does not deliver content that breaks validation. This can be done either in the database itself (by adding an escaped copy of the column in question) or in models.py, with a small method that applies the escape function to the field.
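A minimal sketch of the models.py variant could look like the following. The class Source here is a plain-Python stand-in for a Django model, and the names url and escaped_url are assumptions for illustration; a real node would attach the method to its own model and point its dictionary at it.

```python
from xml.sax.saxutils import escape


class Source:
    """Plain stand-in for a model class in models.py (hypothetical)."""

    def __init__(self, url):
        # Raw database content, assumed to be NOT yet escaped
        self.url = url

    def escaped_url(self):
        # escape() replaces &, < and > with &amp;, &lt; and &gt;
        return escape(self.url)


src = Source("http://example.org/get?id=52fed736&acdnat=1360663964")
print(src.escaped_url())
```

This only works if the column is known to hold unescaped text. Applied to content that is already escaped, the method would escape it a second time.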

johannespostler · 2013-02-15T14:35:38Z

I agree - we cannot just escape by default. Sometimes even I as a database provider don't know what content a field has - e.g. a comment field for one piece of data. I can't rule out that somebody puts a series of ampersands there...

However, we could check whether the content of the URL field in a Source is already encoded. The escaping function used (xml.sax.saxutils.escape) seems to be rather intelligent. My workaround will be to unescape and then escape all content of the URL field. This should leave all content in an escaped state.
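The unescape-then-escape workaround can be sketched as below; normalize_for_xml is a hypothetical helper name. Note that, as far as the standard library goes, escape() on its own does not detect existing entities, so it is the preceding unescape() that prevents double escaping.

```python
from xml.sax.saxutils import escape, unescape


def normalize_for_xml(value):
    """Bring a string into escaped form, whether or not it was escaped before."""
    # unescape() turns &amp;, &lt; and &gt; back into &, < and >;
    # escape() then encodes every such character exactly once.
    return escape(unescape(value))


raw = "id=52fed736&acdnat=1360663964"
already_escaped = "id=52fed736&amp;acdnat=1360663964"

print(normalize_for_xml(raw))
print(normalize_for_xml(already_escaped))

# For comparison: escaping already-escaped content a second time
print(escape(already_escaped))
```

Both inputs end up in the same escaped form. The trade-off is that a field which is meant to contain the literal text "&amp;" cannot be told apart from an escaped ampersand, which seems acceptable for a URL field but may not be for free-text comments.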
