Datafields which happen to hold data like '&.+$' (regex - ampersands) are not escaped and destroy XML output #118

Open
johannespostler opened this issue Feb 15, 2013 · 2 comments

Comments

@johannespostler
Contributor

Any data field in the database whose content matches the regex '&.+$' breaks the output of all XSAMS files. This regex matches all HTML entities, e.g. the entity for ä. When such content is written to the output, the ampersand is not escaped, which breaks most browsers and the validator (unless the text happens to be an HTML entity). Browsers expect a semicolon as the sixth character after the ampersand.

Testcase:
http://ideadb.uibk.ac.at/view/107/

The url field of this scan contains the following characters (within the link):
52fed736-74fc-11e2-9a8e-00000aacb35f&acdnat=1360663964_abbc8fd43c6ff547c477bb7648e5250d

Since this is a rather common pattern for URLs this is a problem.
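
The failure can be reproduced outside the node with a minimal sketch like the one below. It assumes a made-up element name (`Url`), not the real XSAMS element, and uses the standard-library parser as a stand-in for the browser or validator:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The characters from the url field quoted above.
url_fragment = (
    "52fed736-74fc-11e2-9a8e-00000aacb35f"
    "&acdnat=1360663964_abbc8fd43c6ff547c477bb7648e5250d"
)


def is_well_formed(xml_text):
    """Return True if the standard-library parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# "Url" is a hypothetical element name used only for this illustration.
raw_ok = is_well_formed("<Url>%s</Url>" % url_fragment)
escaped_ok = is_well_formed("<Url>%s</Url>" % escape(url_fragment))
```

With the raw value the parser rejects the document, because `&acdnat` is read as the start of an entity reference that is never closed by a semicolon. With the escaped value it parses.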

@ivh
Member

ivh commented Feb 15, 2013

This is indeed a problem, and related to #83. However, the NodeSoftware cannot know whether the database content is already escaped, and we certainly do not want to escape twice. Therefore the node itself needs to make sure that it does not deliver anything that breaks validation. This can be done either in the database itself (make an escaped copy of the column in question) or in models.py, with a small method that applies the escape function to the field.
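
A rough sketch of the models.py variant, assuming a plain class as a stand-in for the real Django model. The class name `Source`, the field `url` and the method name `url_escaped` are hypothetical and not taken from any node's actual code:

```python
from xml.sax.saxutils import escape


class Source:
    """Stand-in for a model class in models.py (hypothetical names)."""

    def __init__(self, url):
        self.url = url  # raw, unescaped value as stored in the database

    def url_escaped(self):
        """Return the url with &, < and > replaced by XML entities."""
        return escape(self.url)


source = Source("http://example.org/?a=1&b=2")
escaped_url = source.url_escaped()
```

The node would then point its output at the method instead of the raw field. This only works if the stored value is known to be unescaped, since `escape` applied to already-escaped text escapes it a second time.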

@johannespostler
Contributor Author

I agree - we cannot just escape by default. Sometimes even I as a database provider don't know what content a field has - e.g. a comment field for one piece of data. I can't rule out that somebody puts a series of ampersands there...

However, we could check whether the content of the URL field in a Source is already encoded. The escaping function used (xml.sax.saxutils.escape) seems to be rather intelligent. My workaround will be to unescape and then escape all content of the URL field. This should leave all content in an escaped state.
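
The unescape-then-escape workaround could look roughly like this. The helper name `normalize_escaping` is made up for this sketch; only `escape` and `unescape` come from the standard library:

```python
from xml.sax.saxutils import escape, unescape


def normalize_escaping(text):
    """Undo any existing &amp; / &lt; / &gt; escaping, then escape once."""
    return escape(unescape(text))


raw = "id=52fed736&acdnat=1360663964"
already_escaped = "id=52fed736&amp;acdnat=1360663964"

from_raw = normalize_escaping(raw)
from_escaped = normalize_escaping(already_escaped)

# For comparison: escape() on its own escapes already-escaped text again.
double_escaped = escape(already_escaped)
```

Both inputs end up in the same singly escaped form, so the helper can be applied whether or not the stored value was escaped before. One caveat: a field that is meant to contain the literal text `&amp;` would be changed by the unescape step, so this fits URL fields better than free-text comment fields.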
