
Parse functions: consistency #748

Open
ChristianGruen opened this issue Oct 15, 2023 · 13 comments
Labels
  • Enhancement: A change or improvement to an existing feature
  • PRG-easy: Categorized as "easy" at the Prague f2f, 2024
  • PRG-required: Categorized as "required for 4.0" at the Prague f2f, 2024
  • XQFO: An issue related to Functions and Operators

Comments


ChristianGruen commented Oct 15, 2023

The functions for parsing input have been defined by different people, and the current state is quite inconsistent:

| Function | Parameters |
| --- | --- |
| fn:parse-xml | $value as xs:string? |
| fn:doc | $href as xs:string? |
| fn:parse-json | $value as xs:string?, $options as map(*) |
| fn:json-doc | $href as xs:string?, $options as map(*) |
| fn:parse-html | $html as union(xs:string, xs:hexBinary, xs:base64Binary)?, $options as map(*) |
| fn:parse-csv | $csv as xs:string?, $options as map(*) |

I believe there’s some need to unify the functions, and we could at least:

  • introduce a fn:XYZ-doc($href, $options) function for each input format (with at least one encoding option), and
  • restrict the type of the input parameter of fn:parse-XYZ to xs:string? and always name it $value.

And I wonder if we should tag all fn:XYZ-doc functions as ·nondeterministic· (if it’s not too late)?
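As an illustration only, the unified surface proposed above might look as follows in XQuery; fn:csv-doc and the option names here are hypothetical, not part of any spec:

```xquery
(: Hypothetical, for illustration: each format gets a resource-based   :)
(: XYZ-doc($href, $options) function with at least an encoding option. :)
fn:csv-doc("data.csv", map { "encoding": "UTF-8" })

(: ... and a string-based parse-XYZ whose input parameter is always :)
(: typed xs:string? and named $value.                               :)
fn:parse-csv(fn:unparsed-text("data.csv"))
```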

@ChristianGruen ChristianGruen added XQFO An issue related to Functions and Operators Enhancement A change or improvement to an existing feature labels Oct 15, 2023
@ChristianGruen ChristianGruen changed the title Parse functions: Consistency Parse functions: consistency Oct 15, 2023

dnovatchev commented Oct 15, 2023

I believe there’s some need to unify the functions, and we could at least:

  • introduce a fn:XYZ-doc($href, $options) function for each input format (with at least one encoding option), and
  • restrict the type of the input parameter of fn:parse-XYZ to xs:string? and always name it $value.

Isn't there also fn:parse-xml-fragment?
So, shall we have two groups of parsing functions: one for docs and one for fragments? Aren't docs also kind of fragments themselves?

As for the name of the input parameter, it should be obvious that the name "input" is more precise than "value". In fact, "value" seems a most generic and useless name: everything can be regarded as the value of something.

@ChristianGruen

Isn't there also fn:parse-xml-fragment?

Yes, there are various functions that I didn’t list here, including fn:json-to-xml and the additional CSV functions. I’m not sure if we need a dedicated fn:doc-fragments function?

As for the name of the input parameter, it should be obvious that the name "input" is more precise than "value". In fact, "value" seems a most generic and useless name: everything can be regarded as the value of something.

I agree, but this would conflict with the current conventions for naming function parameters in the spec. I’ve forgotten where the semantics were specified; simply put, atomic parameters are called $value.

@benibela

Could rename doc to xml-doc, and add a new function doc that can load any kind of input and detect the kind automatically (e.g. from the http content-type header)

@ChristianGruen

Could rename doc to xml-doc, and add a new function doc that can load any kind of input and detect the kind automatically (e.g. from the http content-type header)

Probably too late, as we cannot change the behavior of existing functions. We could introduce an options parameter to fn:doc, but it will be difficult to do justice to everyone, as the exact behavior of the function depends a lot on the implementation (for example, the referenced input can be stored in a database or in the file system).

However, we could introduce an fn:xml-doc function with much stricter semantics. Possible options could be:

  • encoding
  • strip-whitespaces
  • strip-namespaces
  • parse-dtd
  • parse-xinclude
  • catalog

(would be topic for another issue)
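A sketch of how such a call could look; the function and every option name are simply the proposals listed above, nothing more:

```xquery
(: Hypothetical fn:xml-doc with stricter semantics; all option names :)
(: below are proposals only.                                         :)
fn:xml-doc(
  "input.xml",
  map {
    "encoding":          "UTF-8",
    "strip-whitespaces": true(),
    "parse-dtd":         false()
  }
)
```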

@michaelhkay

See also issue #490


rhdunn commented Oct 17, 2023

The rationale for allowing fn:parse-html to take binary objects is so that it can use the HTML encoding detection/conversion logic and be compatible with content sent as binary, in which case the user does not need to implement their own encoding detection/decoding logic.

@ChristianGruen

The rationale for allowing fn:parse-html to take binary objects is so that it can use the HTML encoding detection/conversion logic and be compatible with content sent as binary, in which case the user does not need to implement their own encoding detection/decoding logic.

Hi Reece, I agree that’s a good idea. I believe it would be similarly relevant for XML input, so I think we should either restrict binary input to a new fn:html-doc function or also allow binary items for fn:parse-xml and (ideally) the other parse functions.


rhdunn commented Oct 17, 2023

My reservation about restricting encoding to a fn:*-doc function is that binary/encoded text could come from other sources -- network requests, zipped/compressed files, etc.

Allowing binary items on other parse functions would be useful for a similar reason.
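For illustration, the fn:parse-html signature quoted at the top of this issue already admits binary input, so bytes could be passed through without manual decoding (a sketch):

```xquery
(: fn:parse-html's input parameter is typed                        :)
(: union(xs:string, xs:hexBinary, xs:base64Binary)?, so a binary   :)
(: payload (e.g. from a network request) can be parsed directly,   :)
(: letting the HTML encoding-detection logic handle the decoding.  :)
let $bytes := xs:base64Binary("PGh0bWw+PC9odG1sPg==")  (: "<html></html>" :)
return fn:parse-html($bytes, map { })
```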

@ChristianGruen

As we currently have no standard function that allows us to read binary contents (which could then be processed with fn:parse-XYZ), I’ve just updated #557.


fidothe commented Oct 24, 2023

Following on from the QTCG meeting of 2023-10-17, I've tried to summarise the discussion, which started with parse-csv but ranged much wider. Part of it was about parse functions in general:

Parsing functions in general

There were a lot of questions regarding the scope and naming of parsing functions, and the approaches that had been, or could potentially be, taken.

The two scope approaches were, broadly, several single-purpose/single-output-format functions, or one multi-purpose function whose output format was controlled with an option passed in an options parameter map.

There were specific questions about CSV and why there were two functions proposed that had XDM output instead of one.

fn:parse-csv, as proposed, produced very basic output that could be used to build more complex processing on, while fn:csv-to-xdm and fn:csv-to-xml produced a more generalised, but richer, output that could be processed immediately.

With fn:parse-json, fn:parse-html, and fn:parse-xml, the parse-* function returns the immediately useful output.

This confusion suggests to me that when functions are added to support consuming a new data format, they should include a parse-* function that produces immediately useful output. If the precedent established by parse-json and json-* is followed, any extra functions should prefix their name with the format, following the format-verb-output naming structure used by json-to-xml where possible. (I am biased in favour of more functions with limited scope over fewer functions that do more...)

Input sources

There was discussion about what input parse functions should be able to accept. json-doc acts almost, but not exactly, like parse-json(unparsed-text('uri.json')), which was offered both as a potential convention to follow and as a source of confusion/proliferation.
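Spelled out, the near-equivalence looks like this (a sketch):

```xquery
(: Almost, but not exactly, equivalent: :)
fn:json-doc("uri.json")

fn:parse-json(fn:unparsed-text("uri.json"))

(: One known difference: unparsed-text rejects files containing :)
(: non-XML characters, while json-doc does not.                 :)
```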

Since that discussion, further discussion about handling binary data, from the proposal for fn:unparsed-binary in #557, has happened there and in this issue's comments.

For myself, I have been thinking about this and wondering if unparsed-text (and unparsed-binary, or whatever fills that need) could be used as input to the parse-* (and json-to-*, csv-to-*, and friends) functions. They ostensibly return xs:string, but in my wondering they are somewhat lazy, which permits streaming data as it arrives if the parse functions support that. Currently only json-doc is in a published standard; perhaps we could avoid using it as a precedent if there's a way to compose unparsed-text/unparsed-binary and the parse functions that doesn't require the special-case shim within json-doc.

@ChristianGruen

Thanks, Matt, for the summary.

For myself, I have been thinking about this and wondering if unparsed-text (and unparsed-binary, or whatever fills that need) could be used as input to the parse-* (and json-to-*, csv-to-*, and friends) functions. They ostensibly return xs:string, but in my wondering they are somewhat lazy, which permits streaming data as it arrives if the parse functions support that. Currently only json-doc is in a published standard; perhaps we could avoid using it as a precedent if there's a way to compose unparsed-text/unparsed-binary and the parse functions that doesn't require the special-case shim within json-doc.

Interestingly, I had similar thoughts in the past: it seemed simple enough to me to combine fn:unparsed-text/file:read-text and file:read-binary with the subsequent parse function to convert heterogeneous input to XML. I also agree that it should be up to the implementation to stream input between functions whenever possible. It was only because of repeated user requests that we added the convenience functions json:doc, csv:doc, html:doc, etc. in BaseX, and I imagine there could have been similar reasons for introducing fn:json-doc (I was not involved).

@michaelhkay

There was another reason for introducing json-doc() - it was a way to bypass the inconvenient fact that unparsed-text() rejects files containing non-XML characters.


fidothe commented Oct 24, 2023

There was another reason for introducing json-doc() - it was a way to bypass the inconvenient fact that unparsed-text() rejects files containing non-XML characters.

In my unchained-by-reality wondering, I imagine an unparsed-text() that returns a function. That function takes an argument which specifies what to do with non-XML characters. parse-json asks for json-style escaping, parse-csv asks for something else...

There's a 2-argument form that returns the text, with the second argument specifying what to do with non-XML characters, allowing someone to skip the indirection...

@ndw ndw added PRG-easy Categorized as "easy" at the Prague f2f, 2024 PRG-required Categorized as "required for 4.0" at the Prague f2f, 2024 labels Jun 4, 2024