-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend Arrow support to cover nullable data. #4049
Conversation
This pull request has been linked to Shortcut Story #27472: Extend arrow i/o operations to nullable columns. |
Looks like a few msvc errors, https://github.com/TileDB-Inc/TileDB/actions/runs/4758463351/jobs/8456556083?pr=4049#step:10:593
|
I also saw an Ubuntu build fail but figure that would be spurious. My changes so far are quite gingerly and limited. Let's see what happens once new tests get injected. |
e237b20
to
17c1492
Compare
@eddelbuettel if you see my last commit, we were basically never testing col sizes greater than zero.
or something similar. Do you have any idea? Spent some hours but couldn't find any solution. |
Howdy. Love that you are trying to resurrect the PR. Theses days it would probably be better to wrap some of the code in the nanoarrow (header-only) helper functions we already use in tiledb-r and tiledbsoma (mostly the R parts). See https://github.com/apache/arrow-nanoarrow and check out the recently expanded python side (to manipulate objects; nanoarrow is really three packages namely the header-only C/C++ helper, a somewhat mature R package to work on object s and a growing Python package). I think I wrote this PR after I had similar work in the R package where I since have changed things. (Just took a look.) Hm, that is on the other hand quite motivated by SOMA's libtiledbsoma and its ColumnBuffers etc. My pedestrian approach has usually been to setup a similar mock function standalone, or as an R extension, and ten abstract out the pure C++ side for Core. Back to your PR: I honestly do not know what the col size is mad about. Is there something driving this now and/or urgency? |
fb02a98
to
c9d0878
Compare
c9d0878
to
8597b33
Compare
Thanks for giving this PR some attention and care! I tried to take a good look today 'from first principles' to see if we could use the updated to code read columns containing nulls, and whether we can write. The overall summary is 'yes' and 'no'. Reading works, writing has an issue which I identified but have no fix for yet. (This is likely an old issue and not caused by this update.) The reading example worked. I created a (pretty simple) standalone C++ program which only relies on Core as in this PR, plus nanoarrow which we already vendor elsewhere. Because the interface we have is still on a column-by-column basis I opted to point a column of the standard 'penguins' data set which has two known NAs among the 344 int values. Relying on $ ./arrow_ex2
{"name": "body_mass_g", "count": 344, "VALIDITY": [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], "DATA": [3750, 3800, 3250, 0, 3450, 3650, 3625, 4675, 3475, 4250, 3300, 3700, 3200, 3800, 4400, 3700, 3450, 4500, 3325, 4200, 3400, 3600, 3800, 3950, 3800, 3800, 3550, 3200, 3150, 3950, 3250, 3900, 3300, 3900, 3325, 4150, 3950, 3550, 3300, 4650, 3150, 3900, 3100, 4400, 3000, 4600, 3425, 2975, 3450, 4150, 3500, 4300, 3450, 4050, 2900, 3700, 3550, 3800, 2850, 3750, 3150, 4400, 3600, 4050, 2850, 3950, 3350, 4100, 3050, 4450, 3600, 3900, 3550, 4150, 3700, 4250, 3700, 3900, 3550, 4000, 3200, 4700, 3800, 4200, 3350, 3550, 3800, 3500, 3950, 3600, 3550, 4300, 3400, 4450, 3300, 4300, 3700, 4350, 2900, 4100, 3725, 4725, 3075, 4250, 2925, 3550, 3750, 3900, 3175, 4775, 3825, 4600, 3200, 4275, 3900, 4075, 2900, 3775, 3350, 3325, 3150, 3500, 3450, 3875, 3050, 4000, 3275, 4300, 3050, 4000, 3325, 3500, 3500, 4475, 3425, 3900, 3175, 3975, 3400, 4250, 3400, 3475, 3050, 3725, 3000, 3650, 4250, 3475, 3450, 3750, 3700, 4000, 4500, 5700, 4450, 5700, 5400, 4550, 4800, 5200, 4400, 5150, 4650, 5550, 4650, 5850, 4200, 5850, 4150, 6300, 4800, 5350, 5700, 5000, 4400, 5050, 5000, 5100, 4100, 5650, 4600, 5550, 5250, 4700, 5050, 6050, 5150, 5400, 4950, 5250, 4350, 5350, 3950, 5700, 4300, 4750, 5550, 4900, 4200, 5400, 5100, 5300, 4850, 5300, 4400, 5000, 4900, 5050, 4300, 5000, 4450, 5550, 4200, 5300, 4400, 5650, 4700, 5700, 4650, 5800, 4700, 5550, 4750, 5000, 5100, 5200, 4700, 5800, 4600, 6000, 4750, 5950, 4625, 5450, 4725, 5350, 4750, 5600, 4600, 5300, 4875, 5550, 4950, 5400, 4750, 5650, 4850, 5200, 4925, 4875, 4625, 5250, 4850, 5600, 4975, 5500, 4725, 5500, 4700, 5500, 4575, 5500, 5000, 5950, 4650, 5500, 4375, 5850, 4875, 6000, 4925, 0, 4850, 5750, 5200, 5400, 3500, 3900, 3650, 3525, 3725, 3950, 3250, 3750, 4150, 3700, 3800, 3775, 3700, 4050, 3575, 4050, 3300, 3700, 3450, 4400, 3600, 3400, 2900, 3800, 3300, 4150, 3400, 3800, 3700, 4550, 3200, 4300, 3350, 4100, 3600, 3900, 3850, 4800, 2700, 4500, 3950, 3650, 3550, 3500, 3675, 4450, 3400, 4300, 3250, 3675, 3325, 3950, 3600, 4050, 3350, 3450, 3250, 4050, 3800, 3525, 3950, 3650, 3650, 4000, 3400, 3775, 4100, 3775]}
$ I am also quite please with how ArrowAdapter aa(&ctx, &query);
nanoarrow::UniqueArray arr;
nanoarrow::UniqueSchema sch;
struct ArrowError ec;
if (NANOARROW_OK != ArrowSchemaInitFromType(sch.get(), NANOARROW_TYPE_INT32)) error_exit("Bad init");
if (NANOARROW_OK != ArrowArrayInitFromSchema(arr.get(), sch.get(), &ec)) error_exit(ec.message);
aa.export_buffer("body_mass_g", arr.get(), sch.get()); which has our However, on the write side no such luck. Our issue is that the importer looks at a TypeInfo struct inferred from the schema. It has a boolean field nullable, but it is not correctly set to false for dimensions to $ ./arrow_ex3
$ r -pe 'tiledb::tiledb_array("/tmp/tiledb/arrow_ex_write/", return_as="data.table")[]'
row data
<int> <int>
1: 1 10
2: 21 20
3: 42 NA
4: 67 40
$ As this is not really an issue with the current PR, I think we could also pass it and merge, and address this in a follow-up. |
Found it. A column schema gets set nullable set by default in dsch.get()->flags &= ~ARROW_FLAG_NULLABLE; As that is 'client-side' this PR should be good to go. But as I started this PR I don't seem to be able to review and approve the subsequent changes. PS Issue filed here apache/arrow-nanoarrow#604 |
Can we rebase the PR branch to dev (or merge dev in -- I tend to prefer rebase) ? |
This allows to e.g. include the nanoarrow headers first.
I added another two-line commit that protects the Arrow C Data Interface definition we have from being redefined (which throws a compiler error) which can happen when, e.g., the nanoarrow headers were already included. |
Interesting: Approved, all tests passsed but merging blocked. Needs one more merge from dev, I presume? |
Co-authored-by: Isaiah Norton <[email protected]>
The
arrowio
header provides import export support from/to TileDB and Arrow with its interface of twovoid*
pointers. This PR extends the support to cover 'nullable' aka 'validity map' data.The PR is in need of some tests but the existing tests is a little involved between Python, pybind11 and C++ so @ihnorton has kindly 'volunteered' to add this.
The PR will remain a draft til we have tests.
[sc-27472]
TYPE: FEATURE
DESC: Extend Arrow support to cover nullable data.