Replies: 13 comments
-
And how can we handle this for offloaded parts?
-
The indexes can be built in a background process, using the same approach that is used for compaction. The "compacted" ledger is essentially an "index" to the original data. The index maintains a mapping from "keys" to "offsets" in the original data. The "offset" is essentially the message ID, which references a ledger ID and an entry ID. It doesn't matter whether a ledger is still in BookKeeper or has already been offloaded to tiered storage.
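To make the keys-to-offsets mapping concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: `MessageId` and `KeyIndex` are hypothetical names, not Pulsar APIs, and the index is kept in memory, whereas a real one would have to be persisted somewhere (for example in its own ledger).

```python
from collections import defaultdict
from typing import NamedTuple


class MessageId(NamedTuple):
    """Hypothetical stand-in for a message ID: a ledger ID plus an entry ID."""
    ledger_id: int
    entry_id: int


class KeyIndex:
    """Hypothetical in-memory index mapping a key to the message IDs holding it."""

    def __init__(self):
        self._offsets = defaultdict(list)

    def add(self, key, message_id):
        # Record that the entry at `message_id` carries `key`.
        self._offsets[key].append(message_id)

    def lookup(self, key):
        # Return the message IDs for `key`, in insertion order; empty if unknown.
        return list(self._offsets.get(key, []))


index = KeyIndex()
index.add("user-42", MessageId(ledger_id=7, entry_id=0))
index.add("user-7", MessageId(ledger_id=7, entry_id=1))
index.add("user-42", MessageId(ledger_id=9, entry_id=3))

for mid in index.lookup("user-42"):
    print(mid.ledger_id, mid.entry_id)
```

Because each index entry is only a (ledger ID, entry ID) pair, the lookup itself does not depend on whether the referenced ledger is still in BookKeeper or has been offloaded.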
-
@sijie
-
Are you talking about adding the index definition to the schema definition? Or about using the Avro schema specification to describe the indexes?
-
Adding the index definition to the schema definition, but maybe that is not the best approach. I'm asking for your opinion.
-
I don't think it is a good idea to add an index definition to the schema definition. The schema definition defines the structure of the original data. The index definition depends on the schema definition, but it describes something different from the original data. So the index definition should be associated with the storage that holds the index data. For example, if we use another managed ledger to store the index, then the index definition should be the schema definition of that managed ledger. Does that make sense?
-
I think user-defined indexes are very important for Pulsar SQL; they could be what really lets us use Pulsar as a database. I think it would make Pulsar more popular.
-
@pointearth @sijie How do you imagine the index composition with respect to ledgers? As a first implementation, given that this is a pub/sub system, a timestamp-based index would be a good start. I think we should first be able to auto-create an index per topic, like:
Map[Date, LedgerId]
Or maybe:
Map[Date, IndexItem]
IndexItem(PreviousLedgerId, LedgerId, NextLedgerId)
WDYT? Maybe we should point directly to the message and not to the ledger.
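A rough sketch of what such a per-topic timestamp index could look like, in Python. This is only an illustration: `TimestampIndex` and its methods are hypothetical names, and it assumes each ledger is keyed by the publish time of its first entry, which the proposal above does not specify.

```python
import bisect
from datetime import datetime


class TimestampIndex:
    """Hypothetical Map[Date, LedgerId], kept sorted by timestamp."""

    def __init__(self):
        self._timestamps = []  # sorted publish time of each ledger's first entry (assumption)
        self._ledger_ids = []  # ledger IDs, parallel to _timestamps

    def add(self, first_publish_time, ledger_id):
        # Insert while keeping both lists sorted by timestamp.
        pos = bisect.bisect_right(self._timestamps, first_publish_time)
        self._timestamps.insert(pos, first_publish_time)
        self._ledger_ids.insert(pos, ledger_id)

    def ledger_for(self, ts):
        # Last ledger that started at or before `ts`, or None if `ts` is too early.
        pos = bisect.bisect_right(self._timestamps, ts) - 1
        return self._ledger_ids[pos] if pos >= 0 else None

    def ledgers_between(self, start, end):
        # Ledgers that may contain messages published in [start, end].
        lo = max(bisect.bisect_right(self._timestamps, start) - 1, 0)
        hi = bisect.bisect_right(self._timestamps, end)
        return self._ledger_ids[lo:hi]


ts_index = TimestampIndex()
ts_index.add(datetime(2024, 1, 1), 10)
ts_index.add(datetime(2024, 1, 2), 11)
ts_index.add(datetime(2024, 1, 3), 12)

print(ts_index.ledger_for(datetime(2024, 1, 2, 12)))
print(ts_index.ledgers_between(datetime(2024, 1, 1, 12), datetime(2024, 1, 2, 12)))
```

With a sorted structure like this, the previous and next ledger of any entry are simply its neighbours, so the PreviousLedgerId and NextLedgerId fields of IndexItem might not need to be stored explicitly. Pointing at messages instead of ledgers would mean storing a (ledger ID, entry ID) pair as the value.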
-
And where/how do we store this index?
-
I know querying by timestamp is very fast, because BookKeeper stores the data with the timestamp as the key. Can you describe the source code around this? Then we can discuss it again.
-
Can we store it in BookKeeper? It could then be enabled when the Presto enhancement switch is turned on.
-
Is there any progress on this issue? One way is to support it natively; the other, I think, is to achieve it through tiered storage, for example by combining it with a data lake with the help of Apache Hudi and so on. I saw some articles about combining Hudi and Pulsar. Is there any progress there?
-
Moved to general discussions since this is a wish-list item and not actionable.
-
Is your feature request related to a problem? Please describe.
Currently, there is no index used when querying a topic using Presto. __publish_time__ can be considered an index because of the way ledgers are stored, but it's not a real one.
Describe the solution you'd like
The AvroSchema used to insert into a topic should come with an index definition. From there, we should be able to have a managed ledger for indexes that references the classical managed ledgers, or message IDs? And then configure the Pulsar Presto implementation to use the user-defined indexes from the schema. (This is a suggestion to start the discussion; as @jerrypeng and I discussed, it's a large discussion to have.)
Describe alternatives you've considered
There are probably multiple ways to do it; feel free to suggest your point of view.
Additional context
Reduce the query runtime.