
Hash Data Structure #33

Open · wants to merge 1 commit into main

Conversation

jabolina (Member):

Adding the initial design draft for the Hash data structure. I think there are some unknown unknowns for me here, but I tried to cover at least a bit of everything.

[source,java]
.Hash.java
----
public interface Hash<K, V> { // <1>
Contributor:

How is this different from a cache?

Member:

Because it is stored as a single entry in a cache.
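To illustrate the point (a sketch with assumed names, not the proposed API): the whole hash lives under a single cache key, whereas a plain cache would spread each field across separate entries.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: the entire hash is stored as ONE entry in
// the cache, keyed by the hash's name, rather than one cache entry per field.
public class SingleEntrySketch {
    // The backing cache: hash name -> whole field map (a single entry per hash).
    static final Map<String, Map<String, String>> cache = new HashMap<>();

    static void putHash(String name, Map<String, String> fields) {
        cache.put(name, new HashMap<>(fields)); // one cache entry holds the whole hash
    }

    static String getField(String name, String field) {
        Map<String, String> hash = cache.get(name);
        return hash == null ? null : hash.get(field);
    }

    public static void main(String[] args) {
        putHash("users:42", Map.of("auth", "token", "role", "user"));
        System.out.println(getField("users:42", "role")); // prints "user"
    }
}
```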

Contributor:

ok, now I get it

Member:

Looking at this, it looks awfully close to our MultiMap; just instead of a Collection it is a Map. Could we not build on top of that?

Member Author:

We could build upon MultiMap without duplicates, storing key-value pairs. But I am unsure if we can compare for equality based only on the key. On the other hand, maybe we could abstract some of the MultiMap implementations which work over a Collection to work on another structure instead of creating everything from scratch. I would need to look at that, though.

Member:

Isn't it the same as FineGrainedAtomicMap, which we dropped and replaced with the Grouping API?

----

<1> The Hash data structure is a generic interface over `K` and `V`. Should we extend `Map<K, V>`?
This way we could provide a wide range of operations over the Hash.
Member:

I would avoid extending Map as that has a million methods that have made it a pain in the past for our Cache.

Contributor:

Yes, why not use Map directly if we do that?

Member:

Some further information:

  • Map is a sync-only API, so this does not help with the async and mutiny variants
  • Some of the Map methods have effects, such as returning the previous value on put and remove, which are costly in remote scenarios.
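To make the cost concrete (a hypothetical sketch, not the proposed interface): splitting put into a void variant and an explicit getAndPut lets the common case skip the remote read entirely, and a narrow interface can also offer async shapes that java.util.Map cannot express.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

// Sketch only: all names here are assumptions. A void put avoids reading the
// previous value remotely; getAndPut is an explicit opt-in to that cost.
public class NarrowHash<K, V> {
    private final Map<K, V> data = new HashMap<>();

    public void put(K key, V value) {        // fire-and-forget: no previous value returned
        data.put(key, value);
    }

    public V getAndPut(K key, V value) {     // opt-in to the costly read-before-write
        return data.put(key, value);
    }

    public CompletionStage<V> getAsync(K key) { // async shape, unavailable on java.util.Map
        return CompletableFuture.completedFuture(data.get(key));
    }
}
```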

To create and use a `HashCache<?>`:

```java
EmbeddedCacheManager cm = ...
```

Contributor:

new API?


```java
EmbeddedCacheManager cm = ...
HashCacheManager hashCacheManager = EmbeddedHashCacheManagerFactory.from(cm);
```

Contributor:

This is the legacy API; we need this in the new API only.

Member:

Let's not implement anything on the legacy API please.




A `Hash<?, ?>` structure is local to a node, meaning that operations applied to the local copy must be submitted
through the public API before taking effect. After that, the `Hash<?, ?>` is distributed/replicated according to
the configuration.


What is the intended use case?
The user operates on a Hash object and then uses a HashCache object to submit the changes?

Member:

I would say that individual operations are applied immediately to the underlying storage (remote cache). When using tx/batching, these would be collected and sent in a single operation.

Member Author:

> The user operates on a Hash object and then uses a HashCache object to submit the changes?

Yeah, I was thinking of something of that sort. I'll update the documentation to include single operations without transferring the whole object. For multiple operations, I was thinking of keeping track of the local changes, and once the user submits the changes to the server, we merge the local Hash with the server's Hash.

I'll update the docs.
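One way the change-tracking described above could look (purely a sketch with assumed names; the draft leaves this open): the local Hash records the writes, and submitting replays just that delta against the server's copy instead of shipping the whole object.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: LocalHash and submitTo are assumed names, not the proposed API.
// The local copy records writes so that a submit can replay just the delta,
// with local changes taking precedence over the server's content.
public class LocalHash {
    private final Map<String, String> pending = new HashMap<>();

    public void put(String field, String value) {
        pending.put(field, value); // tracked locally, not yet visible remotely
    }

    // Merge the tracked changes into the server's copy; local changes win.
    public Map<String, String> submitTo(Map<String, String> serverHash) {
        serverHash.putAll(pending);
        pending.clear();
        return serverHash;
    }
}
```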

jabolina (Member Author):

Thanks, everyone, for the comments. I updated the document with all the suggestions. While at it, I described more about how we handle the operations. A point of attention is how we should handle operations that directly reference a Hash, for example `HashCache.put("hash", "property", "value")`. I added the two options I see for that in the document.

----

<1> Returning a `Map<HK, HV>` to contain all the varargs keys.
<2> The `submit` method is used to apply the local changes to the server's Hash. The server's `Hash<?, ?>` is returned.


Do we really need the return value? Hash could be very big. Returning on request or only if submit fails?

Member Author:

Yeah, but not strictly. I'll add some void methods, too. In one case, for example, the RESP connector needs to return what has changed. But I was aiming at something more general instead of focusing only on the RESP needs.

<1> The Hash data structure is a generic interface over `K` and `V`. We do not extend `Map<K, V>`, instead we add methods
as needed.

A `Hash<?, ?>` structure is local to a node. Meaning that operations applied to the local copy must be submitted


I feel a disturbance about Hash. I mean, if it's detached from the cache, isn't it just a Map? Do we need a new type? On the other side, if individual operations are submitted immediately, then some methods seem to overlap with HashCache.

Member:

I agree, I did not like the naming on the RESP commands either. A Hash to me is not a Map structure but rather a numeric value derived from some other fields.

Member Author:

To be completely honest, even after writing the word "Hash" so many times, it still sounds weird as a structure name. I thought the same, i.e., a hash is a value after applying a hashing function (or even the function)... Regarding naming, I would not mind changing it to Map(ping), Document, Structure, Container, or something else.

And the Hash structure, which the user interacts with, would help to keep track of the operations. For example, if we return a Map and the user removes one of the keys, we wouldn't notice that when merging with the server's content. Or, to better reflect that, rename the structure to HashView, HashCopy, or something of that sort.

----
public interface HashCache<K> {

<HK, HV> HV put(K key, HK hashKey, HV hashValue);
Member:

We should probably name this method getAndPut and have just a put method that has no return value.

[source, java]
.HashCache.java
----
public interface HashCache<K> {
Member:

This entire interface is blocking; we probably need NonBlocking versions first.

Reading further it looks like we would have all 3 of the variants 👍


<HK, HV> HV put(K key, HK hashKey, HV hashValue);

<HK, HV> Hash<HK, HV> put(K key, Map<HK, HV> values);
Member:

Same here, we really need just a put and getAndPut.

* Enable/Disable rebalancing
* Availability get/set
* Replace not supported
* Listeners support to enable interactions with near-caching
Member:

Listeners are going to be very, very expensive for this without some sort of delta notification, which we do not have for listeners. Also, the listener API was never solidified for the new API, sadly :(

Member Author:

I'll mark it optional. When we establish the listener API, we can come back to this.

HashCache<String> users = hashes.get("users");

users.put("42", Map.of("auth", "token", "role", "user"));
System.out.println(hashCache.get("42")); // {"auth": "token", "role": "user"}
Member:

Is hashCache here supposed to be users?

Also, from this code I assume that updates to the returned HashCache instance are immediately applied remotely to the Hash's actual contents? We will need to make sure the interface documents that it works that way.

Only the Hash interface is the one that isn't propagated, correct?

jabolina (Member Author, Apr 18, 2023):

> Is hashCache here supposed to be users?

Yes, I'll fix it.

> Also, from this code I assume that updates to the returned HashCache instance are immediately applied remotely to the Hash's actual contents? We will need to make sure the interface documents that it works that way.
>
> Only the Hash interface is the one that isn't propagated, correct?

Correct. Only the operations directly on the HashCache take effect immediately. When operating on a Hash, it needs to be sent to the HashCache to take effect.

I've been thinking more about this and I'll do an update to the HashCache API.

Hashes hashes = ispn.sync().hashes();
HashCache<String> users = hashes.get("users");

Hash<String, String> hash = users.get("42");
Member:

So the Hash returned will always be non-null, I assume?

Member:

This makes me wonder if we should have some sort of putHashIfAbsent type of operation, where you can do an initial request to populate the Hash without having to do an initial get round trip.
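A sketch of what that could look like (putHashIfAbsent is only the name floated in the comment above; everything here is assumed): a single call either installs the initial fields or returns the existing hash, avoiding the initial get round trip.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: putHashIfAbsent is a hypothetical name from the review comment.
// One round trip either creates the hash with the given initial content or
// returns the hash that already exists, ignoring the initial values.
public class HashStore {
    private final Map<String, Map<String, String>> store = new HashMap<>();

    public Map<String, String> putHashIfAbsent(String name, Map<String, String> initial) {
        return store.computeIfAbsent(name, k -> new HashMap<>(initial));
    }
}
```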

as needed.

A `Hash<?, ?>` structure is local to a node. Meaning that operations applied to the local copy must be submitted
through the public API before taking effect. After that, the `Hash<?, ?>` is distributed/replicated according to
Member:

I am debating about this. It seems we could also implement this in a slightly different fashion where the key is just a name of a cache and the underlying Hash is just a normal Cache which itself can be distributed etc. This would then give us delta changes between keys and we can use all of the existing features of Caches. I am not quite sold yet, especially as I have no idea what the query API would look like or listeners.

hash.put("auth", "token");
hash.put("role", "user");

users.submit(hash);
Member:

So will this completely overwrite the hash if it already exists, or only apply the new values? It isn't clear, or doesn't seem defined yet, what this would do. I am guessing it completely overwrites those values, in which case you would lose any concurrent writes without a respective remove.

Member Author:

For this one, I was thinking of implementing it with compute and merging the values. We keep track of the operations with the Hash, but effectively all the local changes take precedence. I'll update the docs and API.
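A possible shape for that merge (assumed semantics, not the final design: local puts and removes both take precedence over concurrent server changes, while untouched server fields survive):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch only. Tracking removals separately means a submit does NOT blindly
// overwrite the server hash: fields the client never touched survive, while
// local puts and removes win over concurrent server writes.
public class TrackedMerge {
    static Map<String, String> merge(Map<String, String> server,
                                     Map<String, String> localPuts,
                                     Set<String> localRemoves) {
        Map<String, String> result = new HashMap<>(server);
        result.keySet().removeAll(localRemoves); // local removes win
        result.putAll(localPuts);                // local puts win
        return result;
    }
}
```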

jabolina (Member Author):

@rigazilla and @wburns, thanks for the comments! I am going to update the doc and the HashCache API, introducing your suggestions. I'll try to document the intents of the API better, making clear that the Hash needs to be sent to take effect. And rename that submit method to something meaningful.

----
public interface Hash<K, V> { // <1>

String getName();
pruivo (Member, Apr 19, 2023):

What's the name? I don't see any reference to it.

through the public API before taking effect. After that, the `Hash<?, ?>` is distributed/replicated according to
the configuration.

We follow a state-based CRDT approach, merging the local Hash with the server's Hash. Since the `Hash<?, ?>` structures
Member:

How are we going to merge when network partitions happen? Are we going to support cross-site?
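For context on why the merge function matters here: a state-based merge needs to be deterministic, commutative, and idempotent so both sides converge after a partition heals. A last-writer-wins-per-field map is one common instance; the sketch below is purely illustrative, since the draft does not pick a merge function.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a last-writer-wins merge per field. Each field carries a
// timestamp; on merge, the newer write wins. Because the rule is commutative
// and idempotent, both partitions converge regardless of merge order.
public class LwwHash {
    record Stamped(String value, long timestamp) {}

    static Map<String, Stamped> merge(Map<String, Stamped> a, Map<String, Stamped> b) {
        Map<String, Stamped> result = new HashMap<>(a);
        b.forEach((field, theirs) -> result.merge(field, theirs,
                (mine, other) -> other.timestamp() > mine.timestamp() ? other : mine));
        return result;
    }
}
```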
