
Parse functions: consistency #748

Open
ChristianGruen opened this issue Oct 15, 2023 · 13 comments
Labels
  • Enhancement: A change or improvement to an existing feature
  • PRG-easy: Categorized as "easy" at the Prague f2f, 2024
  • PRG-required: Categorized as "required for 4.0" at the Prague f2f, 2024
  • XQFO: An issue related to Functions and Operators

Comments


ChristianGruen commented Oct 15, 2023

The functions for parsing input have been defined by different people, and the current state is quite inconsistent:

| Function | Parameters |
| --- | --- |
| fn:parse-xml | $value as xs:string? |
| fn:doc | $href as xs:string? |
| fn:parse-json | $value as xs:string?, $options as map(*) |
| fn:json-doc | $href as xs:string?, $options as map(*) |
| fn:parse-html | $html as union(xs:string, xs:hexBinary, xs:base64Binary)?, $options as map(*) |
| fn:parse-csv | $csv as xs:string?, $options as map(*) |

I believe there’s some need to unify the functions, and we could at least:

  • introduce a fn:XYZ-doc($href, $options) function for each input format (with at least one encoding option), and
  • restrict the type of the input parameter of fn:parse-XYZ to xs:string? and always name it $value.

And I wonder if we should tag all fn:XYZ-doc functions as ·nondeterministic· (if it’s not too late)?
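As an illustration only, the unified surface proposed above might look as follows in XQuery; fn:csv-doc and the option names here are hypothetical, not part of any spec:

```xquery
(: Hypothetical, for illustration: each format gets a resource-based   :)
(: XYZ-doc($href, $options) function with at least an encoding option. :)
fn:csv-doc("data.csv", map { "encoding": "UTF-8" })

(: ... and a string-based parse-XYZ whose input parameter is always :)
(: typed xs:string? and named $value.                               :)
fn:parse-csv(fn:unparsed-text("data.csv"))
```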

@ChristianGruen ChristianGruen added XQFO An issue related to Functions and Operators Enhancement A change or improvement to an existing feature labels Oct 15, 2023
@ChristianGruen ChristianGruen changed the title Parse functions: Consistency Parse functions: consistency Oct 15, 2023

dnovatchev commented Oct 15, 2023

I believe there’s some need to unify the functions, and we could at least:

  • introduce a fn:XYZ-doc($href, $options) function for each input format (with at least one encoding option), and
  • restrict the type of the input parameter of fn:parse-XYZ to xs:string? and always name it $value.

Isn't there also fn:parse-xml-fragment?
So, shall we have two groups of parsing functions: one for docs and one for fragments? Aren't docs also kind of fragments themselves?

As for the name of the input parameter, it should be obvious that the name "input" is more precise than "value". In fact, "value" seems a most generic and useless name: everything can be regarded as the value of something.

@ChristianGruen

Isn't there also fn:parse-xml-fragment?

Yes, there are various functions that I didn’t list here, including fn:json-to-xml and the additional CSV functions. I’m not sure if we need a dedicated fn:doc-fragments function?

As for the name of the input parameter, it should be obvious that the name "input" is more precise than "value". In fact, "value" seems a most generic and useless name: everything can be regarded as the value of something.

I agree, but this would conflict with the current conventions for naming function parameters in the spec. I’ve forgotten where the semantics were specified; simply put, atomic parameters are called $value.

@benibela

Could rename doc to xml-doc, and add a new function doc that can load any kind of input and detect the kind automatically (e.g. from the http content-type header)

@ChristianGruen

Could rename doc to xml-doc, and add a new function doc that can load any kind of input and detect the kind automatically (e.g. from the http content-type header)

Probably too late, as we cannot change the behavior of existing functions. We could introduce an options parameter to fn:doc, but it will be difficult to do justice to everyone, as the exact behavior of the function depends a lot on the implementation (for example, the referenced input can be stored in a database or in the file system).

However, we could introduce an fn:xml-doc function with much stricter semantics. Possible options could be:

  • encoding
  • strip-whitespaces
  • strip-namespaces
  • parse-dtd
  • parse-xinclude
  • catalog

(would be topic for another issue)
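A sketch of how such a call could look; the function and every option name are simply the proposals listed above, nothing more:

```xquery
(: Hypothetical fn:xml-doc with stricter semantics; all option names :)
(: below are proposals only.                                         :)
fn:xml-doc(
  "input.xml",
  map {
    "encoding":          "UTF-8",
    "strip-whitespaces": true(),
    "parse-dtd":         false()
  }
)
```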

@michaelhkay

See also issue #490


rhdunn commented Oct 17, 2023

The rationale for allowing fn:parse-html to take binary objects is so that it can use the HTML encoding detection/conversion logic and be compatible with content sent as binary, in which case the user does not need to implement their own encoding detection/decoding logic.

@ChristianGruen

The rationale for allowing fn:parse-html to take binary objects is so that it can use the HTML encoding detection/conversion logic and be compatible with content sent as binary, in which case the user does not need to implement their own encoding detection/decoding logic.

Hi Reece, I agree that’s a good idea. I believe it would be similarly relevant for XML input, so I think we should either restrict binary input to a new fn:html-doc function or also allow binary items for fn:parse-xml and (ideally) the other parse functions.


rhdunn commented Oct 17, 2023

My reservation about restricting encoding to a fn:*-doc function is that binary/encoded text could come from other sources -- network requests, zipped/compressed files, etc.

Allowing binary items on other parse functions would be useful for a similar reason.
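For illustration, the fn:parse-html signature quoted at the top of this issue already admits binary input, so bytes could be passed through without manual decoding (a sketch):

```xquery
(: fn:parse-html's input parameter is typed                        :)
(: union(xs:string, xs:hexBinary, xs:base64Binary)?, so a binary   :)
(: payload (e.g. from a network request) can be parsed directly,   :)
(: letting the HTML encoding-detection logic handle the decoding.  :)
let $bytes := xs:base64Binary("PGh0bWw+PC9odG1sPg==")  (: "<html></html>" :)
return fn:parse-html($bytes, map { })
```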

@ChristianGruen

As we currently have no standard function that allows us to read binary contents (which could then be processed with fn:parse-XYZ), I’ve just updated #557.


fidothe commented Oct 24, 2023

Following on from the QTCG meeting of 2023-10-17, I've tried to summarise the discussion, which started with parse-csv but ranged much wider. Part of it was about parse functions in general:

Parsing functions in general

There were a lot of questions regarding the scope and naming of parsing functions, and the approaches that had been, or could potentially be, taken.

The two scope approaches were, broadly, several single-purpose/single-output-format functions, or one multi-purpose function whose output format was controlled with an option passed in an options parameter map.

There were specific questions about CSV and why there were two functions proposed that had XDM output instead of one.

fn:parse-csv, as proposed, produced very basic output that could be used to build more complex processing on, while fn:csv-to-xdm and fn:csv-to-xml produced a more generalised, but richer, output that could be processed immediately.

With fn:parse-json, fn:parse-html, and fn:parse-xml, the parse-* function returns the immediately useful output.

This confusion suggests to me that when functions are added to support consuming a new data format, they should include a parse-* function that produces immediately useful output. If the precedent established by parse-json and json-* is followed, any extra functions should prefix their name with the format, following the format-verb-output naming structure used by json-to-xml where possible. (I am biased in favour of more functions with limited scope over fewer functions that do more...)

Input sources

There was discussion about what input parse functions should be able to accept. json-doc acts almost, but not exactly, like parse-json(unparsed-text('uri.json')), which was offered both as a potential convention to follow and as a source of confusion/proliferation.
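Spelled out, the near-equivalence looks like this (a sketch):

```xquery
(: Almost, but not exactly, equivalent: :)
fn:json-doc("uri.json")

fn:parse-json(fn:unparsed-text("uri.json"))

(: One known difference: unparsed-text rejects files containing :)
(: non-XML characters, while json-doc does not.                 :)
```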

Since that discussion, further discussion about handling binary data, from the proposal for fn:unparsed-binary in #557, has happened there and in this issue's comments.

For myself, I have been thinking about this and wondering if unparsed-text (and unparsed-binary, or whatever fills that need) could be used as input to the parse-* (and json-to-*, csv-to-*, and friends) functions. They ostensibly return xs:string, but in my wondering they are somewhat lazy, which permits streaming data as it arrives if the parse functions support that. Currently only json-doc is in a published standard; perhaps we could avoid using it as a precedent if there's a way to compose unparsed-text/unparsed-binary and the parse functions that doesn't require the special-case shim within json-doc.

@ChristianGruen

Thanks, Matt, for the summary.

For myself, I have been thinking about this and wondering if unparsed-text (and unparsed-binary, or whatever fills that need) could be used as input to the parse-* (and json-to-*, csv-to-*, and friends) functions. They ostensibly return xs:string, but in my wondering they are somewhat lazy, which permits streaming data as it arrives if the parse functions support that. Currently only json-doc is in a published standard; perhaps we could avoid using it as a precedent if there's a way to compose unparsed-text/unparsed-binary and the parse functions that doesn't require the special-case shim within json-doc.

Interestingly, I had similar thoughts in the past: it seemed simple enough to me to combine fn:unparsed-text/file:read-text and file:read-binary with the subsequent parse function to convert heterogeneous input to XML. I also agree that it should be up to the implementation to stream input between functions whenever possible. It was only because of repeated user requests that we added the convenience functions json:doc, csv:doc, html:doc, etc. in BaseX, and I imagine there could have been similar reasons for introducing fn:json-doc (I was not involved).

@michaelhkay

There was another reason for introducing json-doc() - it was a way to bypass the inconvenient fact that unparsed-text() rejects files containing non-XML characters.


fidothe commented Oct 24, 2023

There was another reason for introducing json-doc() - it was a way to bypass the inconvenient fact that unparsed-text() rejects files containing non-XML characters.

In my unchained-by-reality wondering, I imagine an unparsed-text() that returns a function. That function takes an argument which specifies what to do with non-XML characters. parse-json asks for json-style escaping, parse-csv asks for something else...

There's a 2-argument form that returns the text, with the second argument specifying what to do with non-XML characters, allowing someone to skip the indirection...

@ndw ndw added PRG-easy Categorized as "easy" at the Prague f2f, 2024 PRG-required Categorized as "required for 4.0" at the Prague f2f, 2024 labels Jun 4, 2024