v0.8.13 manifest bug fix and vector indexing perf improvements
Critical fix: tables written prior to v0.8.0 may have corrupted stats
If a table was written with a Lance version prior to v0.8.0 and then later written by a version >=0.8.0, <=0.8.13, it may have incorrect statistics. You can detect whether this affects your table using the LanceDataset.validate() method. If your table is affected, Lance versions prior to 0.8.13 may not be able to read it correctly. If you do not plan on using older versions of Lance going forward, no action is needed. To fix reads on older Lance versions, commit any write transaction to the table with Lance v0.8.13 or newer. A simple way to make a transaction without changing the data would be:
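import lance

# Open the affected table (replace '...' with your table's URI).
dataset = lance.dataset('...')

# An Append with no fragments adds no data but still creates a new commit.
operation = lance.LanceOperation.Append([])
dataset = lance.LanceDataset.commit(
    dataset.uri,
    operation,
    read_version=dataset.version,
)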
(This makes an empty Append commit)
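Before making the empty commit, it can help to check whether a table is affected at all. The following is a minimal sketch, not an official recipe. It assumes that validate() signals a problem by raising an exception; these notes do not state that behavior, so confirm it against your installed version. The helper name stats_look_valid is hypothetical and is not part of Lance.

```python
def stats_look_valid(dataset) -> bool:
    """Return True if dataset.validate() completes without raising.

    Assumption: validate() raises an exception when the table's metadata
    is inconsistent. This helper is hypothetical and only wraps that call.
    """
    try:
        dataset.validate()
    except Exception as err:
        # Report the reason so the caller can decide whether to re-commit.
        print(f"validation failed: {err}")
        return False
    return True


# Example usage (requires Lance v0.8.13 or newer; '...' is your table's URI):
#   import lance
#   dataset = lance.dataset('...')
#   if not stats_look_valid(dataset):
#       ...  # make the empty Append commit shown above
```

Because the helper only relies on a validate() method, it works with any object that exposes one, which keeps the check easy to exercise in tests without a real table.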
New features
- feat: generic cosine code by @eddyxu in #1537
- feat: make L2 generic to work with all float numbers by @eddyxu in #1532
- feat: safer API for physical_rows by @wjones127 in #1529
- feat: implement datafusion tableprovider trait for Dataset by @universalmind303 in #1526
- feat: expose Dataset.validate() in Python by @wjones127 in #1538
Bug fixes
- fix: add versioning and bypass broken row counts by @wjones127 in #1534
- fix: fix assertion of cosine values by @eddyxu in #1530
- fix: pq index does not handle dot product metric correctly during search by @rok in #1536
Performance improvements
Other changes
- chore: move scalar_index benchmark to break circular dependency by @westonpace in #1540
New Contributors
- @universalmind303 made their first contribution in #1526
Full Changelog: v0.8.12...v0.8.13