v0.9 release: stats-based predicate pushdown, scalar index, performance improvements, and bug fixes

@eddyxu released this 17 Dec 23:13

Summary

  • Stats-based predicate pushdown
  • Scalar index
  • TensorFlow and PyTorch data loaders
  • Pre / post filter combined with vector search
  • Performance improvements across the stack
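
To illustrate the idea behind stats-based predicate pushdown, here is a minimal pure-Python sketch. It is not Lance's implementation: the `PageStats` class, the `pages_to_scan` and `filtered_scan` helpers, and the in-memory page layout are all hypothetical names chosen for illustration. The only assumption taken from these notes is that per-page statistics (such as min and max) are collected on write, so a reader can skip pages that cannot match a range filter.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Min/max statistics for one page of a column (hypothetical layout)."""
    min_value: int
    max_value: int


def pages_to_scan(stats, lower, upper):
    """Return indices of pages whose [min, max] range may overlap [lower, upper]."""
    return [
        i for i, s in enumerate(stats)
        if s.max_value >= lower and s.min_value <= upper
    ]


def filtered_scan(pages, stats, lower, upper):
    """Read only the candidate pages, then apply the exact filter to their rows."""
    rows = []
    for i in pages_to_scan(stats, lower, upper):
        rows.extend(v for v in pages[i] if lower <= v <= upper)
    return rows


# Three pages of an integer column, with stats computed at "write" time.
pages = [[1, 5, 9], [12, 15, 19], [21, 30, 42]]
stats = [PageStats(min(p), max(p)) for p in pages]

print(pages_to_scan(stats, 10, 20))         # only the middle page can match
print(filtered_scan(pages, stats, 14, 25))  # rows from the two overlapping pages
```

The statistics only rule pages out; every surviving page is still filtered row by row, so the result is the same as a full scan.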

Breaking changes:

  • The IVF_PQ algorithm for cosine distance has changed. Existing indices built with cosine distance must be rebuilt.
  • Bump the minimum pyarrow version to 12.0

What's Changed

  • feat: add a cache for dynamodb schema validation by @chebbyChefNEQ in #1308
  • chore: speed up kmeans training for cosine by @eddyxu in #1334
  • feat: add take() method to RecordBatchExt by @eddyxu in #1337
  • feat(python): expose row id in python API by @eddyxu in #1339
  • feat: data generation of dbpedia dataset by @eddyxu in #1340
  • feat: build ivf partition using disk based shuffler by @eddyxu in #1312
  • feat: friendlier error messages in nearest api by @rok in #1336
  • chore: index / recall benchmark over dbpedia by @eddyxu in #1348
  • feat: support storing page-level stats by @wjones127 in #1316
  • fix: pq cosine fast lookup table by @eddyxu in #1354
  • chore: compute distance using pytorch and GPU/MPS by @eddyxu in #1351
  • feat: train kmeans using pytorch by @eddyxu in #1358
  • build: use larger runner for doc build by @eddyxu in #1364
  • feat(python): gpu based ivf partition training by @eddyxu in #1361
  • fix: stop reading latest manifest by @wjones127 in #1365
  • chore: improve kmeans training performance on CUDA by @eddyxu in #1368
  • chore: improve kmeans performance on MPS by @eddyxu in #1370
  • chore: use torch.index_add to compute new centroids, to improve training performance on MPS by @eddyxu in #1371
  • feat: schema::field_by_id() by @eddyxu in #1375
  • feat(python): design an image extension type by @rok in #1272
  • perf: improve KNN and ANN performance by @wjones127 in #1367
  • feat: collect int/float/boolean/date page-level statistics on write by @rok in #1346
  • chore: part 1/N of refactoring vector index into separate crate by @eddyxu in #1388
  • fix: handle larger arrays in take by @wjones127 in #1383
  • chore: cleanup tests to avoid errors when optional components are not present by @westonpace in #1374
  • chore: object write trait by @eddyxu in #1389
  • refactor: fast path to find fragments for flat scan by @eddyxu in #1394
  • refactor: move reader trait to lance-core by @eddyxu in #1393
  • refactor: move pq to lance-index by @eddyxu in #1400
  • refactor: make pq a batch transformer by @eddyxu in #1401
  • feat: run pq portion of ivf_pq in parallel by @westonpace in #1386
  • feat: generic shuffler over RecordBatchStream by @eddyxu in #1402
  • feat: remap indices on compaction by @westonpace in #1403
  • chore: remove unused crate by @eddyxu in #1405
  • docs: make overwrite row green by @wjones127 in #1409
  • feat: add removed_indices to CreateIndex transaction operation by @eddyxu in #1408
  • feat(rust): incremental index update by @eddyxu in #1406
  • feat(python): expose index optimization via python by @eddyxu in #1412
  • test: add test case to ensure optimize returns to flat KNN by @westonpace in #1416
  • fix: fix bug in index remapping when plan contained multiple rewrite groups by @westonpace in #1415
  • chore: upgrade to datafusion 32 by @wjones127 in #1391
  • ci: cross compile arm wheels by @wjones127 in #1407
  • test: add new ann scenarios to the python benchmarks by @westonpace in #1411
  • chore: instrument various steps in the ann search by @westonpace in #1404
  • refactor: refactor flat search to lance-index by @eddyxu in #1419
  • refactor: move encodings to lance-core by @eddyxu in #1425
  • feat: expose latest version id api by @chebbyChefNEQ in #1426
  • refactor: use function pointers instead of trait objects by @wjones127 in #1424
  • feat: automatically convert image to tensors in TF data pipeline by @rok in #1420
  • refactor: migrate schema and data types to lance-core by @eddyxu in #1429
  • perf: better parallelism in delete vector prefiltering by @westonpace in #1428
  • test: fix flaky tests involving tokio::fs::File by @westonpace in #1430
  • perf: use selection vector strategy to improve exact knn performance with deletions by @wjones127 in #1418
  • chore: use arrow 47 function by @eddyxu in #1439
  • refactor: move format definitions to lance-core by @eddyxu in #1440
  • refactor: migrate object reader and object writer by @eddyxu in #1442
  • fix: fix an issue where the GPU index trainer was taking too much data into memory by @westonpace in #1447
  • feat: store a separate tensor blob for IVF centroids by @eddyxu in #1446
  • refactor: move all Python operations to the same runtime by @wjones127 in #1445
  • chore: bump prost version to latest by @eddyxu in #1449
  • chore: update half to 2.3.1 by @jacobBaumbach in #1450
  • feat: allow prefiltering to be used with an index by @westonpace in #1435
  • feat: benchmark and improve L2 partition compute by @eddyxu in #1453
  • chore: increase ivf assignment parallelism during indexing by @eddyxu in #1451
  • feat: support keyboard interrupt in Python by @wjones127 in #1438
  • feat: add parameter to split by file size by @wjones127 in #1444
  • ci: fix ARM build due to Ring dependency by @wjones127 in #1462
  • refactor: move read and write manifest file to lance-core by @eddyxu in #1467
  • feat: create_index take torch.device object by @eddyxu in #1465
  • feat: added dataset stats api by @albertlockett in #1452
  • refactor: move commit traits to lance-core by @eddyxu in #1469
  • chore: use ruff format to replace isort and black by @eddyxu in #1472
  • refactor: move ObjectStore, FileReader and FileWriter to lance-core by @eddyxu in #1473
  • perf: support multi-threading shuffler by @eddyxu in #1474
  • feat: poor man's SIMD lib by @eddyxu in #1478
  • perf: use simd lib to implement dot by @eddyxu in #1480
  • feat: expose progress on write_fragments and write_dataset by @wjones127 in #1464
  • feat: split out datagen utilities, expand them, expose to python by @westonpace in #1315
  • chore: remove outdated warnings about prefiltering with a vector index by @westonpace in #1484
  • fix: fix L2 computation on GPU by @eddyxu in #1485
  • perf: improve kmeans and make pq training multi-threaded by @eddyxu in #1479
  • chore: mention GPU support in README by @eddyxu in #1489
  • fix: PQ training metric type is not appropriately propagated by @eddyxu in #1493
  • docs: clarify behaviour of refine_factor by @albertlockett in #1496
  • ci: cancel in progress runs on new push by @albertlockett in #1497
  • chore: remove unused value settings by @eddyxu in #1494
  • feat: provide a f32x16 abstraction to make unrolling 256-bit code easier by @eddyxu in #1495
  • fix: remove channel closed messages by @wjones127 in #1502
  • perf: dimension-based kernel for L2 and Cosine by @eddyxu in #1503
  • feat: add location for all errors by @Weijun-H in #1475
  • feat: add sorting to the scanner by @westonpace in #1498
  • feat: add tf.data APIs for reading batches by @wjones127 in #1488
  • feat: experimental avx512 features by @eddyxu in #1506
  • feat: add read ahead for take scan by @wjones127 in #1501
  • feat: use caller location in error conversion functions by @chebbyChefNEQ in #1510
  • chore(rust): reduce debug message log level by @changhiskhan in #1512
  • feat: collect page-level statistics on write by @rok in #1335
  • feat(rust): simd ops of reduce min, min, find and gather by @eddyxu in #1514
  • feat: add btree scalar index by @westonpace in #1476
  • feat: support true in deletion logic by @Weijun-H in #1515
  • fix: make sure we have physical rows by @wjones127 in #1511
  • chore: benchmark of large IVF partitions by @eddyxu in #1524
  • feat: make dot generic to support bf16/f16/f32 with one dot_distance interface by @eddyxu in #1522
  • chore: add same target-features to python pyo3 build by @eddyxu in #1527
  • feat: expose index cache configure via open dataset API by @eddyxu in #1523
  • fix: fix assertion of cosine values by @eddyxu in #1530
  • feat: generic cosine code by @eddyxu in #1537
  • perf: improve f16 performance for norm L2 on aarch64 by @eddyxu in #1539
  • feat: make L2 generic to work with all float numbers by @eddyxu in #1532
  • fix: pq index does not handle dot product metric correctly during search by @rok in #1536
  • chore: move scalar_index benchmark to break circular dependency by @westonpace in #1540
  • feat: safer API for physical_rows by @wjones127 in #1529
  • feat: implement datafusion tableprovider trait for Dataset by @universalmind303 in #1526
  • feat: expose Dataset.validate() in Python by @wjones127 in #1538
  • fix: add versioning and bypass broken row counts by @wjones127 in #1534
  • feat: generic kmeans that supports bf16 and f16 by @eddyxu in #1544
  • chore: disable avx512 for now by @eddyxu in #1546
  • chore: fix type inference errors in benchmarks by @westonpace in #1556
  • chore: provide a trait to dynamically dispatch different pq based on different vector data type by @eddyxu in #1555
  • chore: update the CI build to check/build all crates in the workspace and not just the lance crate by @westonpace in #1557
  • feat: make it possible to create and load scalar indices for a dataset by @westonpace in #1516
  • feat: generic Product Quantization by @eddyxu in #1560
  • test: add property-based testing for statistics by @wjones127 in #1554
  • feat: ffi to accelerate norm_l2 for f16 if the instruction set is available by @eddyxu in #1562
  • feat: extend FSL with sample by @eddyxu in #1572
  • feat: allow for more advanced storage options in objectstore by @universalmind303 in #1547
  • feat: implement as_slice for bfloat16 array by @eddyxu in #1574
  • perf: add bf16 benchmarks by @eddyxu in #1575
  • feat: f16 for L2 by @eddyxu in #1577
  • chore: update cc dependency to 1.0.83 by @westonpace in #1578
  • feat: make IVF model support f16 and bf16 by @eddyxu in #1573
  • feat: allow the scanner to take advantage of scalar indices by @westonpace in #1543
  • chore: dotprod should be on mac target, not haswell and better randomness for bf16 by @westonpace in #1579
  • feat: make Dataset::nearest() accept arbitrary query type by @eddyxu in #1582
  • chore(rust): remove extraneous dbg message by @changhiskhan in #1598
  • feat: torch cache-able dataset, with sampling support by @eddyxu in #1591
  • fix: tell writer correct schema when writing index file by @wjones127 in #1518
  • feat: add support for remapping scalar indices during compaction by @westonpace in #1571
  • refactor: switch to using DataFusions physical expr by @wjones127 in #1581
  • chore: various fixes for Python benchmarks by @wjones127 in #1513
  • feat: adaptive cuda allocation for l2/cosine distance computation by @eddyxu in #1601
  • fix: fix a memory leak where a dataset would not be fully deleted by @westonpace in #1606
  • fix: google objectstore uses proper gs configuration by @universalmind303 in #1608
  • perf: kmeans fit uses cached torch dataset by @eddyxu in #1603
  • fix: add migration for bad fragment bitmaps by @westonpace in #1611
  • feat: allow scalar indices to be updated with new data by @westonpace in #1576
  • feat: add python bindings for creating scalar indices by @westonpace in #1592
  • fix: handle no max value for string by @wjones127 in #1600
  • feat: expose index cache size by @rok in #1587
  • feat: track index cache hit rate by @rok in #1586
  • feat: serialize arbitrary float type of PQ to protobuf by @eddyxu in #1624
  • ci: use M1 runner for now for release by @wjones127 in #1623
  • feat: coerce float array for nearest query by @eddyxu in #1618
  • chore: expose avx512fp16 feature via main lance crate by @eddyxu in #1626
  • feat: make partition calculation parallel by @chebbyChefNEQ in #1625
  • feat(rust): simplify object store option API by @wjones127 in #1627
  • fix: fix chunk size issue by @wjones127 in #1630
  • perf: more efficient treemap implementation for row ids by @wjones127 in #1632
  • feat(python): add index_cache_hit_rate to index_stats() by @rok in #1631
  • chore: make lance-linalg benchmark ready to test bf16 data by @eddyxu in #1634
  • perf: fast L2 distance table build by @eddyxu in #1639
  • fix: correctly avg centroids in update logic in GPU IVF training by @chebbyChefNEQ in #1646
  • perf: add a fast path for converting bytes into array when the bytes have the correct alignment by @chebbyChefNEQ in #1652
  • fix: prevent OOM when IVF centroids are provided by @wjones127 in #1653
  • test: fix for test by @wjones127 in #1644
  • perf: minor change to cleanup allowing for size to be collected in parallel by @westonpace in #1649
  • perf: add type coercion for in-list expressions by @westonpace in #1655
  • chore: minor changes to tracing instrumentation by @westonpace in #1619
  • fix: fix error message for invalid nprobes by @albertlockett in #1666
  • feat: add support for update queries by @wjones127 in #1585
  • fix: support no-op filters again by @wjones127 in #1669
  • fix: row_id range fix for index training on gpu by @jerryyifei in #1663
  • feat: better warnings when the PQ assignment over cosine distance is wrong by @eddyxu in #1672
  • fix: add retries for failed response stream by @wjones127 in #1671
  • chore: add utility to compute ground truth for benchmarks by @eddyxu in #1668
  • fix: don't use scalar indices unless we are prefiltering by @westonpace in #1678
  • fix: lance pytorch dataset parameter to load with row_id by @eddyxu in #1676
  • feat: a tensor dataset that shares the same behavior as the Lance torch Dataset by @eddyxu in #1679
  • chore: add new python benchmarks for testing scalar indices by @westonpace in #1658
  • feat: add option to pass in precomputed row_id -> ivf partition mapping and compute partition on GPU by @chebbyChefNEQ in #1680
  • fix: make sure to prefilter the flat portion of a combined knn by @westonpace in #1583
  • perf: use datafusion to shuffle index partition data by @wjones127 in #1645
  • feat: add batch buffering and async loading to torch.LanceDataset by @chebbyChefNEQ in #1687
  • feat: optimized pushdown scanner by @wjones127 in #1328
  • fix: add shutdown to async loader by @chebbyChefNEQ in #1690
  • fix: use epsilon to handle all zero cosine values by @eddyxu in #1696
  • fix: prevent stats meta from breaking old readers by @wjones127 in #1699
  • fix: add _rowid when use_stats=False by @wjones127 in #1700
  • perf: revert back to hashmap by @chebbyChefNEQ in #1692
  • fix: remove default memory cap for index training by @wjones127 in #1702
  • feat: do not use residual vector for cosine similarity by @eddyxu in #1708
  • feat: add support for new and deleted data to scalar indices by @westonpace in #1689
  • fix: update list_indices to report if an index is vector or scalar by @westonpace in #1710
  • perf: allow take to process multiple fragments in parallel by @westonpace in #1713
  • feat: turn on argument tracking in tracing by @wjones127 in #1706
  • perf: make sure we use multiple threads when scanning by @wjones127 in #1705
  • chore: kmeans fit takes pyarrow FixedSizeListArray by @eddyxu in #1714
  • revert: use epsilon to handle all zero cosine values (#1696) by @eddyxu in #1715
  • chore: add ruff copyright check by @eddyxu in #1716
  • chore: compute pairwise cosine using pytorch by @eddyxu in #1717
  • chore: normalize vector kernel by @eddyxu in #1720
  • fix: fix l2 normalize by @eddyxu in #1722
  • perf: use an asynchronous open function even for local files by @westonpace in #1721
  • perf: small performance fixes for scan by @wjones127 in #1719
  • fix: cosine kmeans by @eddyxu in #1723
  • fix: cosine kmeans on GPU by @eddyxu in #1726
  • fix: pq code for cosine distance by @eddyxu in #1727
  • chore: adjust cosine value from l2 distance by @eddyxu in #1730
  • fix: various fixes to GPU kmeans by @chebbyChefNEQ in #1731
  • feat: handroll ivf partition shuffle by @chebbyChefNEQ in #1729

Full Changelog: v0.8.0...v0.9.0