Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1651 Ordered Maps #1609

Closed
wants to merge 15 commits into from
Closed

Conversation

michaelhkay
Copy link
Contributor

@michaelhkay michaelhkay commented Nov 25, 2024

Fix #564

Introduces ordered maps: specifically, sorted maps which return entries in order sorted by key, and fifo maps which return entries in the order of insertion.

Although this has been on the TODO-list for a long time and has many useful applications, raising a PR at this stage is particularly motivated by comments on the elements-to-maps() function pointing out that having a predictable order of properties in serialized JSON can be very useful, and that many existing XML-to-JSON converters achieve this. This gives the opportunity, for example, to parse JSON into a representation that retains input order, delete and/or add some properties, and then serializate the JSON with the order retained.

@michaelhkay michaelhkay added XPath An issue related to XPath XQuery An issue related to XQuery XSLT An issue related to XSLT XQFO An issue related to Functions and Operators XDM An issue related to the XPath Data Model Feature A change that introduces a new feature Serialization An issue related to the Serialization spec Tests Needed Tests need to be written or merged labels Nov 25, 2024
@ChristianGruen
Copy link
Contributor

Impressive. First feedback:

  • fifo lets me think of queues, which are emptied again when outputting data. Maybe insertion is more intuitive?
  • random…maybe unsorted? The order is not really random (even though that may not be discernible by most users).
  • I would suggest having retain-order dropped from fn:json-to-xml, and enforce the result to be ordered instead. The (string) input is ordered, too.
  • Maybe map:range($min, $max) (with an exclusive max value, if supplied) would be sufficient and easier to use?
  • “The result of fn:parse-json has ordering=fifo” …probably wrong: the current (reasonable) default for the function is “random”.

@benibela
Copy link

That sounds overly complicated

With a hidden property, the user forgets which kind of map he has. Then the users gets very confused when the order is not what he expects. Or worse, there is a map with ordering "random", but the user thinks it is "sorted", tests his query, and it randomly produces the correct output, because the keys come out in sorted order by happenstance. But then the next day the query fails, because it is randomly a different order

One could just make all maps in fifo order (insertion order)

Javascript and Python did that afaik

If everything is fifo, there can still be a map:sort function to sort the keys ( without affecting keys that are inserted later)

Reminds me of the unordered { .. } syntax in xquery 1. One could have said the map is fifo, unless it is used in an unorder {..} block. Except the unordered block got removed because no one needed unordered things (I can see that happen again. fifo gets added in XQuery 4, but random is the default. Then fifo becomes the default in XQuery 5 because no one needed unordered things. Then random gets removed in XQuery 6 because no one needed unordered things. )

@ndw
Copy link
Contributor

ndw commented Nov 26, 2024

I confess, my knee-jerk reaction to this issue is, "maps aren't ordered. If you care about the order don't put it in a map."

An unpopular opinion, I'm sure, and one that's not consonant with other languages that already preserve insertion order.

I assume that main argument against always using an order-preserving implementation is one of performance. Is that the case? How bad is the performance hit? How bad a hit would you be willing to accept as a trade-off for the simplicity of not having two kinds of maps?

I guess a kind of counter argument would be, if most users want order-preserving maps most of the time, so that's what they end up always using out of habit and cutting-and-pasting from previous code, and because the results are never a surprise, then does the performance benefit of having unordered maps available justify the increased complexity and maintenance cost imposed on implementors to support a rarely used feature?

@michaelhkay
Copy link
Contributor Author

From Christian:

All things that I considered, and where I don't feel strongly that my solution is the only one acceptable.

  • fifo lets me think of queues, which are emptied again when outputting data. Maybe insertion is more intuitive?

"insertion order" is widely used, but what if we want some day to introduce "insert at start" rather than "insert at end"

  • random…maybe unsorted? The order is not really random (even though that may not be discernible by most users).

Indeed so. But the term "hashed random" is widely used. Well, it used to be.

  • I would suggest having retain-order dropped from fn:json-to-xml, and enforce the result to be ordered instead. The (string) input is ordered, too.

Yes, seems reasonable.

  • Maybe map:range($min, $max) (with an exclusive max value, if supplied) would be sufficient and easier to use?

Yes, probably. I also wondered about how to get the last key in the range without retrieving all of them, perhaps limit:=-1

  • “The result of fn:parse-json has ordering=fifo” …probably wrong: the current (reasonable) default for the function is “random”.

Yes, I changed my mind while writing so it's inconsistent.

@michaelhkay
Copy link
Contributor Author

Norm:

I assume that main argument against always using an order-preserving implementation is one of performance.

Well there are two kinds of ordered maps - sorted, and fifo, and both are useful. I don't know what the performance hit is, but probably measurable (space as well as time). Also sorted maps restrict the keys you can use to those that have a defined ordering.

@ChristianGruen
Copy link
Contributor

How bad is the performance hit? How bad a hit would you be willing to accept as a trade-off for the simplicity of not having two kinds of maps?

@ndw Difficult to tell in general, but too bad to make it the default. Immutable maps (such as the HAMT that we have implemented) are very efficient, but pretty complex data structures, at least compared to conventional mutable maps, and it would be a considerable effort to enhance it to support orderedness. We would probably introduce a new data structure and use it as a fallback for ordered contents. As far as I know, Haskell doesn’t use HAMT for ordered maps either.

@michaelhkay
Copy link
Contributor Author

Benibela:

With a hidden property, the user forgets which kind of map he has.

I've never seen that being a problem in Java, where you choose a map implementation at the time of instantiation and that affects its subsequent behavior. And of course, for most interactions with the map, it doesn't matter because if you assume the order of keys is unpredictable, your code will always work. The only thing that will fail is if you try to put the wrong kind of key into a sorted map, and I think that will be rather uncommon - people tend to know what kind of keys a map is designed to contain.

@ChristianGruen
Copy link
Contributor

I hate to be hesitant, but after my nightly reflection, I think we should take one step at a time, and consider whether we need the full range of features proposed in this PR. We have two challenges that are closely related on the one hand, but that are completely different on the other:

  1. preserve the original insertion order in maps, and
  2. organize map contents in a sorted manner.

Challenge 1 is something we cannot simulate with the existing map. I can see various additional use cases apart from representing JSON data. One common way to process ordered map entries is to create sequences of singleton entries; it would be great to be able to represent them in a single map.

Challenge 2 is something that I clearly regard as a convenience feature. It can easily be simulated by iterating over the map entries:

let $map := map:build(1 to 1_000_000, string#1)
return map:keys($map)
  => sort()
  => filter(fn { . >= '5' and . < '6' })
  =!> $map()

In addition, …

  • custom iterations give you much more flexibility than the proposed map:sort and map:range functions how to sort and filter the map contents.
  • With regard to performance, building an unordered map will generally be faster than building an ordered one, so (depending on the implementation) the query above may often be more efficient than an equivalent version with a sorted map. It’s already very fast with Saxon and BaseX.

Adding multiple map variant is not only yet-another-non-trivial-4.0-todo for implementers. Its implications will also be challenging for users. It is not obvious/intuitive to understand that the parsing of JSON data that retains the order leads to a FIFO map. Most people would rather call the result “ordered” (which would be in line with the proposed retain-order option).

The proposal itself is definitely impressive: it does a great job of trying to solve multiple problems at once. Still, I hope we will spend some more time understanding its long-term consequences before we accept and merge it. Personally, I would be perfectly fine if we only introduced an “ordered map” that takes the insertion order into account.

@ChristianGruen ChristianGruen mentioned this pull request Nov 27, 2024
@michaelhkay
Copy link
Contributor Author

One observation I would make is that, given the availability of libraries providing a variety of persistent collections with similar APIs, it's actually very easy to implement. The difference between creating a random, sorted, or insertion-ordered map can be a single line of code at the point where you create the underlying data structure. (OK, I'm exaggerating slightly: a sorted map requires an appropriate comparator, which is slightly different from the context-sensitive comparators used for fn:sort() - but it's not far off).

@michaelhkay
Copy link
Contributor Author

Building an unsorted map, and then sorting the keys on retrieval is fine if you only have to do it once. The gain comes if you do it repeatedly; and of course it gives you a basis for functionality like finding "nearby" values.

@ChristianGruen
Copy link
Contributor

One observation I would make is that, given the availability of libraries providing a variety of persistent collections with similar APIs, it's actually very easy to implement.

It’s probably no coincidence that we have so few implementations today ;·), so I would only tend to be persuaded only if at least 1, 2 other implementers (possibly non-Java) agreed on it… In terms of our processor, I know it is no trivial extension. It might be easier indeed to drop our own implementation and resort to a library, but there have been various reasons why we did it by ourselves (at least back then).

One general question that arises frequently should be asked again: Do we want the specs to be a document for Saxon and BaseX, or do we believe in embracing other implementers as well?

Building an unsorted map, and then sorting the keys on retrieval is fine if you only have to do it once. The gain comes if you do it repeatedly; and of course it gives you a basis for functionality like finding "nearby" values.

It would really be interesting to learn more about the use cases that have been encountered so far. Which applications would be really hard to solve with unordered maps? If performance has been an issue here in the past (it hasn’t been for us), couldn’t it as well be a chance to to tweak existing unordered-map implementations?

@michaelhkay
Copy link
Contributor Author

michaelhkay commented Nov 27, 2024

It would really be interesting to learn more about the use cases [for sorted maps] that have been encountered so far.

The design is inspired by Saxon's range-key maps, which were introduced to meet a specific customer requirement.

https://www.saxonica.com/documentation12/index.html#!functions/saxon/key-map

It's sufficiently long ago that I forget the details, but it was a publishing requirement along the lines of dividing an encyclopedia into sections covering alphabetic ranges. xsl:key gives you an index that allows you to find specific entries by key, but it doesn't allow you to find all the keys in a given range.

I also recollect a customer using this for ranges of dates: find all the appointments in a given week (or find the next appointment after a given date/time). The only other way to achieve that is with a serial search.

@ChristianGruen
Copy link
Contributor

Sorry for cross-posting. I have moved my subsequent comment on sorted maps to #564 (comment) ff.

@michaelhkay michaelhkay force-pushed the 564-ordered-maps branch 2 times, most recently from 0fa6594 to b4b2720 Compare December 4, 2024 17:03
@michaelhkay
Copy link
Contributor Author

I have revised the PR in response to discussion at the WG meeting yesterday. Main changes:

  • "random" becomes "undefined" (with some reservation, phrases like "the order is undefined" can be confusing)
  • added a lot more introductory/justification prose
  • added an interrogative function map:ordering
  • constructor functions for named record types now produce a map with order=insertion (so they are serialized with the fields in declaration order)

@michaelhkay
Copy link
Contributor Author

PR now revised to drop sorted maps.

<xtermref spec="DM40" ref="dt-entry-order">entry order</xtermref>
is <termref def="implementation-dependent"/>.
</fos:value>
<fos:value value="insertion">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<fos:value value="insertion">
<fos:value value="true">

…maybe just ordered instead of retain-order, to describe the result of the function call and not the procedural part?

@ChristianGruen ChristianGruen changed the title 564 Ordered Maps 1651 Ordered Maps Dec 11, 2024
@ChristianGruen
Copy link
Contributor

I have decided to change the title of the PR to reference #1651.

@ChristianGruen ChristianGruen added the Blocked PR is blocked (has merge conflicts, doesn't format, etc.) label Jan 14, 2025
@michaelhkay michaelhkay deleted the 564-ordered-maps branch January 14, 2025 17:23
@michaelhkay michaelhkay restored the 564-ordered-maps branch January 14, 2025 22:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocked PR is blocked (has merge conflicts, doesn't format, etc.) Feature A change that introduces a new feature Serialization An issue related to the Serialization spec Tests Needed Tests need to be written or merged XDM An issue related to the XPath Data Model XPath An issue related to XPath XQFO An issue related to Functions and Operators XQuery An issue related to XQuery XSLT An issue related to XSLT
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sorted maps
4 participants