How does Lance compare to Vortex? #3130
-
I think there are a few points where they differ today. Lance has both a table format and a file format, while Vortex is focused on the file format at the moment, so I'll limit the comparison to the two file formats. I'm not really an expert in Vortex, so this will be mostly about what we've been focused on.

I think it is safe to say the Vortex team has put more effort into compressive encodings. This will probably remain true for a while. Compression hasn't been all that vital to Lance, as most of our customers are doing vector search and 90% or more of their data is pre-compressed anyway (e.g. vector embeddings, images, etc.). That being said, we're making sure we have a good story for string compression in 2.1, as large-string datasets (e.g. web crawls, NLP datasets, etc.) are a key use case for us. Performance-wise, the difference in compression will be most noticeable when doing OLAP-style queries.

Lately, the Lance file format has been more focused on structural encodings (list / struct), I/O scheduling, backpressure, combining columns, and large-ish objects (e.g. those over 1 KiB).

If your goal is to do OLAP with scalar data in memory or on NVMe, then my guess is Vortex would give you better speed. If you've got deeply nested data, very large objects, etc., then Lance will give you a fast and robust solution (I can't say that Vortex won't, because I really don't know).
Parquet has a few selling points that aren't going to fade anytime soon:
Ideally, you should make the storage format an abstraction that is completely hidden from your users, occasionally test which format works best for your use cases, and adapt as needed. As for today, I'd say Lance 2.0 is pretty solid and well tested for search solutions, with good enough OLAP to beat out any row-based alternative. You'd probably still want Parquet for pure OLAP. Lance 2.X and Vortex will be better than Lance 2.0 and, at some point, also stable and robust.
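A rough sketch of what "hide the format behind an abstraction and re-test occasionally" could look like. Everything here is hypothetical: `TableStore`, `time_round_trip`, and `pick_fastest` are made-up names, and the two backends are stdlib toy stand-ins (JSON Lines and CSV), not the Lance, Vortex, or Parquet APIs. A real setup would wrap each format's own reader and writer behind the same interface and time a workload that resembles production queries.

```python
import csv
import json
import tempfile
import time
from pathlib import Path
from typing import Protocol

Row = dict[str, str]


class TableStore(Protocol):
    """Hypothetical interface: the only storage API application code sees."""

    def write(self, path: Path, rows: list[Row]) -> None: ...
    def read(self, path: Path) -> list[Row]: ...


class JsonLinesStore:
    """Toy stand-in backend: one JSON object per line."""

    def write(self, path: Path, rows: list[Row]) -> None:
        with path.open("w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

    def read(self, path: Path) -> list[Row]:
        with path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f]


class CsvStore:
    """Toy stand-in backend: CSV with a header row (assumes non-empty input)."""

    def write(self, path: Path, rows: list[Row]) -> None:
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)

    def read(self, path: Path) -> list[Row]:
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))


def time_round_trip(store: TableStore, rows: list[Row]) -> float:
    """Write then read the rows with one backend; return elapsed seconds."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "table.data"
        start = time.perf_counter()
        store.write(path, rows)
        result = store.read(path)
        elapsed = time.perf_counter() - start
    if result != rows:
        raise ValueError("backend did not round-trip the data")
    return elapsed


def pick_fastest(stores: dict[str, TableStore], rows: list[Row]) -> str:
    """Return the name of the backend with the shortest round trip."""
    timings = {name: time_round_trip(s, rows) for name, s in stores.items()}
    return min(timings, key=timings.get)
```

Because callers only depend on `TableStore`, swapping the winning backend in later is a configuration change rather than a rewrite. A single write-then-read timing is a crude benchmark, so treat the result as a hint, not a verdict.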
-
Hi, @philippemnoel. This is Lei from LanceDB. The Lance format today combines a data format (the layer comparable to Parquet, ORC, or Vortex), a table format (schema, versioning, globally unique IDs, etc.), and secondary indices. Our primary optimization target is data serving (LanceDB) while maintaining a decent columnar scan speed for aggregation (at least not slower than Parquet). Because the Lance format is used in LanceDB, many I/O optimizations are done to improve query plans for search and reduce tail latency. We've seen amazing performance in the field for high-traffic search systems, i.e., <10 ms p50 for full-text, vector, and metadata searches. Because the table format, secondary indices, and data format are designed end-to-end, we can move faster across the stack. However, as Weston already mentioned, we don't mean to make this a pure OLAP engine / data warehouse. You can also watch our Ray Summit talk: https://www.youtube.com/watch?v=xmTFEzAh8ho&list=PLzTswPQNepXntmT8jr9WaNfqQ60QwW7-U&index=28
-
Hey everyone! Phil here from @paradedb. We're pretty interested in a Parquet/Arrow successor and are considering both Lance and Vortex for their fast random-access reads. Could you please share how Lance compares to Vortex in your own words? When should one consider one vs. the other?