Wishing list: Pulsar SQL support user defined indexes #18763

KannarFr · 2020-05-09T16:41:35Z

KannarFr
May 9, 2020

Is your feature request related to a problem? Please describe.
Currently, no index is used when querying a topic through Presto. __publish_time__ can be treated as one because of the way ledgers are stored, but it is not a real index.

Describe the solution you'd like
The AvroSchema used to publish to a topic should come with an index definition. From there, we could have a managed ledger for indexes that references the regular managed ledgers or message IDs? We could then configure the Pulsar Presto implementation to use the user-defined indexes from the schema. (This is a suggestion to start the discussion; as @jerrypeng and I discussed, it's a large discussion to have.)

Describe alternatives you've considered
There are probably multiple ways to do it; feel free to suggest your point of view.

Additional context
Reduce the query runtime.

KannarFr · 2020-05-11T13:03:08Z

KannarFr
May 11, 2020
Author

And how would we handle this for the offloaded parts?

sijie · 2020-05-14T16:14:29Z

sijie
May 14, 2020
Collaborator

@KannarFr

The indexes can be built in a background process using the approach that was used for compaction. The "compacted" ledger is essentially an "index" to the original data.

The index maintains some form of mapping from "keys" to "offsets" in the original data. The "offset" is essentially the message ID, which references a ledger and an entry ID. It doesn't matter whether a ledger is still in BookKeeper or already offloaded to tiered storage.
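
The key-to-offset mapping described above can be sketched as follows. This is a minimal in-memory illustration, not Pulsar code: `MessageId` and `KeyIndex` are hypothetical names, and a real index would be persisted (for example in its own ledger) rather than held in a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MessageId:
    """Hypothetical stand-in for a message ID: a ledger ID plus an entry ID."""
    ledger_id: int
    entry_id: int


class KeyIndex:
    """Maps an indexed key to the message IDs of the entries that carry it."""

    def __init__(self) -> None:
        self._offsets: Dict[str, List[MessageId]] = {}

    def add(self, key: str, message_id: MessageId) -> None:
        # Record one more location for this key.
        self._offsets.setdefault(key, []).append(message_id)

    def lookup(self, key: str) -> List[MessageId]:
        # Return a copy so callers cannot mutate the index.
        return list(self._offsets.get(key, []))


index = KeyIndex()
index.add("user-42", MessageId(ledger_id=7, entry_id=0))
index.add("user-7", MessageId(ledger_id=7, entry_id=1))
index.add("user-42", MessageId(ledger_id=9, entry_id=3))
print(index.lookup("user-42"))
```

Because the index stores only ledger and entry IDs, a lookup result is the same whether ledger 7 is read from BookKeeper or from tiered storage.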

KannarFr · 2020-05-14T16:26:16Z

KannarFr
May 14, 2020
Author

@sijie
OK. About the index definition: what do you think of using the Avro schema to define indexes?

sijie · 2020-05-14T17:07:41Z

sijie
May 14, 2020
Collaborator

Are you talking about adding the index definition to the schema definition? Or about using the Avro schema specification to describe the indexes?

KannarFr · 2020-05-14T17:52:49Z

KannarFr
May 14, 2020
Author

Adding the index definition to the schema definition, but maybe that is not the best approach. I'm asking for your opinion.

sijie · 2020-05-15T03:33:43Z

sijie
May 15, 2020
Collaborator

I don't think it is a good idea to add an index definition to the schema definition. The schema definition defines the structure of the original data. The index definition depends on the schema definition but it is different from the original data. So the index definition should be associated with the storage that is used for storing the index data. For example, if we are using another managed ledger for storing the index, then the index definition should be the schema definition of the managed ledger. Does that make sense?
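
One way to picture that separation is sketched below, under stated assumptions: the `Order` record, the `index_definition` layout, and `validate_index` are all hypothetical and only illustrate an index definition that depends on the data schema without being embedded in it.

```python
import json

# Avro schema of the original data (illustrative record, not from this thread).
data_schema = json.loads("""
{"type": "record", "name": "Order",
 "fields": [{"name": "name", "type": "string"},
            {"name": "amount", "type": "long"}]}
""")

# Hypothetical index definition. It would be stored with the index's own
# managed ledger, not inside the data schema above.
index_definition = {
    "index_name": "orders_by_name",
    "source_record": "Order",
    "key_field": "name",
}


def validate_index(definition: dict, schema: dict) -> None:
    """Check that the index refers to a field that exists in the data schema."""
    if definition["source_record"] != schema["name"]:
        raise ValueError("index targets a different record type")
    field_names = {field["name"] for field in schema["fields"]}
    if definition["key_field"] not in field_names:
        raise ValueError(f"unknown field: {definition['key_field']}")


validate_index(index_definition, data_schema)  # passes silently
```

The data schema stays untouched when an index is added or dropped; only the index definition changes.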

pointearth · 2021-01-25T05:33:52Z

pointearth
Jan 25, 2021

I think user-defined indexes are very important for Pulsar SQL; they could be what really lets us use Pulsar as a database. I think they would make Pulsar more popular.
I also agree that user-defined indexes should be defined separately. We can extend "pulsar-admin topics" to manage indexes: create, read, update, delete, and reindex them.
Can we discuss this further and push it forward?

KannarFr · 2021-03-16T11:42:05Z

KannarFr
Mar 16, 2021
Author

@pointearth @sijie How do you picture the index being structured with respect to ledgers? For a first implementation, given that this is a pub/sub system, a timestamp-based index would be a good start. I think we should first be able to auto-create an index per topic, like:

Map[Date, LedgerId]

Or maybe

Map[Date, IndexItem]

IndexItem(PreviousLedgerId, LedgerId, NextLedgerId)

WDYT? Maybe we should point directly to the message rather than to the ledger.
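
A rough sketch of the `Map[Date, LedgerId]` variant follows. `TimestampIndex` is a hypothetical name, timestamps are plain milliseconds, and the sketch assumes each ledger is registered with the publish time of its first message and that those times strictly increase from ledger to ledger.

```python
import bisect
from typing import List


class TimestampIndex:
    """Sorted first-publish-time -> ledger ID entries, one per ledger of a topic."""

    def __init__(self) -> None:
        self._starts: List[int] = []      # first publish time of each ledger, ascending
        self._ledger_ids: List[int] = []  # ledger IDs, parallel to _starts

    def add_ledger(self, first_publish_time_ms: int, ledger_id: int) -> None:
        # Assumption: ledgers are registered in strictly increasing time order.
        if self._starts and first_publish_time_ms <= self._starts[-1]:
            raise ValueError("ledgers must be added in increasing time order")
        self._starts.append(first_publish_time_ms)
        self._ledger_ids.append(ledger_id)

    def ledgers_for_range(self, start_ms: int, end_ms: int) -> List[int]:
        """Return ledgers that may hold messages published in [start_ms, end_ms]."""
        if not self._starts or end_ms < start_ms:
            return []
        # The ledger containing start_ms is the last one starting at or before it.
        lo = max(bisect.bisect_right(self._starts, start_ms) - 1, 0)
        # Every ledger starting at or before end_ms may still be relevant.
        hi = bisect.bisect_right(self._starts, end_ms)
        return self._ledger_ids[lo:hi]


ts_index = TimestampIndex()
ts_index.add_ledger(1000, 10)
ts_index.add_ledger(2000, 11)
ts_index.add_ledger(3000, 12)
print(ts_index.ledgers_for_range(1500, 2500))
```

A query for a time range then only needs to open the returned ledgers instead of scanning the whole topic. Pointing at individual messages instead would make the index larger but remove the scan inside each ledger.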

KannarFr · 2021-03-17T12:12:12Z

KannarFr
Mar 17, 2021
Author

And where/how would we store this index?

pointearth · 2021-03-18T09:17:02Z

pointearth
Mar 18, 2021

I know querying by timestamp is very fast, because data in BookKeeper is saved with the timestamp as the key.
My suggestion is to create a non-clustered index based on the key in BookKeeper. For example, if we want to create an index on the field name, we can create the index item as
map[name, List<publish_time>]

Can you describe the source code around this? Then we can discuss it again.
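
The `map[name, List<publish_time>]` idea could look roughly like this. It is a sketch only: the `messages` list stands in for decoded topic data, and `build_field_index` is a hypothetical helper, not part of Pulsar or Presto.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical decoded messages: (publish_time_ms, record) pairs.
messages = [
    (1000, {"name": "alice", "amount": 5}),
    (1010, {"name": "bob", "amount": 7}),
    (1020, {"name": "alice", "amount": 9}),
]


def build_field_index(
    rows: Iterable[Tuple[int, dict]], field: str
) -> Dict[str, List[int]]:
    """Build map[field value, list of publish times] for one field."""
    index: Dict[str, List[int]] = defaultdict(list)
    for publish_time_ms, record in rows:
        index[record[field]].append(publish_time_ms)
    return dict(index)


name_index = build_field_index(messages, "name")
# Publish times for "alice", to be resolved through the timestamp-based path.
print(name_index["alice"])
```

A query such as `WHERE name = 'alice'` would first consult `name_index`, then fetch only the messages at the returned publish times.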

pointearth · 2021-03-18T09:21:35Z

pointearth
Mar 18, 2021

Can we store it in BookKeeper? It could then be enabled when a Presto enhancement switch is turned on.

golden-yang · 2021-12-18T05:42:34Z

golden-yang
Dec 18, 2021

Is there any progress on this issue?
Being able to support indexes in Pulsar SQL would be a very meaningful feature.

One way is to support it natively. The other, I think, is to go through tiered storage, for example by combining it with a data lake with the help of Apache Hudi and similar projects.

I saw some articles about combining Hudi and Pulsar. Is there any progress?
@sijie

tisonkun · 2022-12-06T10:49:26Z

tisonkun
Dec 6, 2022
Collaborator

Moved to general discussions since this is a wish-list item and not actionable.
