[optimize] readRleBitPackedHybrid costs too much time; TypedArray for column results? #7

Open · simline opened this issue Apr 29, 2024 · 15 comments

simline commented Apr 29, 2024

If I can do something to help with this, that would be great.

platypii (Collaborator) commented:

I have plans to make readRleBitPackedHybrid faster by reducing the number of memory allocations. Instead of readVarInt returning an array with value and byte offset, I want to change it to use DataReader, like the rest of the read functions, as in f826bff7.
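
A minimal sketch of what that could look like, assuming a DataReader shaped like { view: DataView, offset: number } (not necessarily what f826bff7 actually does):

function readVarInt(reader) {
  // ULEB128: 7 payload bits per byte; the high bit means "more bytes follow".
  // Advancing reader.offset in place avoids allocating a [value, offset] pair.
  let value = 0
  let shift = 0
  let byte
  do {
    byte = reader.view.getUint8(reader.offset++)
    value |= (byte & 0x7f) << shift
    shift += 7
  } while (byte & 0x80)
  return value // fine for 32-bit headers; wider varints would need BigInt
}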

I like the idea of returning TypedArrays, but will need to think more about how to actually implement that. Suggestions and PRs welcome!

I've found the parsing to be quite fast in general. Can you share an example parquet file which is particularly slow?


simline commented Apr 29, 2024

> I've found the parsing to be quite fast in general. Can you share an example parquet file which is particularly slow?

An example that's more than 20x faster: the TPC-H lineitem table (sf1, almost 250MB). Without reading the definition/repetition levels in readDataPage, and returning a BigInt64Array, it only costs 570ms:

columnMetadata.codec {
  type: 'INT64',
  encodings: [ 'PLAIN' ],
  path_in_schema: [ 'l_extendedprice' ],
  codec: 'SNAPPY',
  num_values: 122880n,
  total_uncompressed_size: 983071n,
  total_compressed_size: 634625n,
  key_value_metadata: undefined,
  data_page_offset: 1580630n,
  index_page_offset: undefined,
  dictionary_page_offset: undefined,
  statistics: {
    max: '^\x10�\x00\x00\x00\x00\x00',
    min: '�i\x01\x00\x00\x00\x00\x00',
    null_count: 0n,
    distinct_count: undefined
  },
  encoding_stats: undefined
} {
  num_values: 122880,
  encoding: 'PLAIN',
  definition_level_encoding: 'RLE',
  repetition_level_encoding: 'RLE',
  statistics: undefined
}

But the metadata says codec: 'SNAPPY', so I removed that condition, because the decompressed length equals the column's data length (8-byte header + int64[122880]).

And lineitem is only a 6M-row table; with a 60M-row table (2.5GB), or a query over a Hive partition, performance matters even more.

I can do more, but I'm not good with the parquet format yet; the 20x speedup was just a first try.


simline commented Apr 29, 2024

With more ideas:

  • parallel support for Node.js and browsers
  • saving memory and time for data analysis: skip the dict map (keep keys and values separate), a DATE type parser (day offsets from 1970-01-01; see the sketch below), etc.
  • reading huge files beyond the browser's memory limit
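
For instance, a minimal sketch of that DATE parser (parquet's DATE logical type stores an int32 count of days since the epoch; the function name is just illustrative):

function parseDate(days) {
  // 86,400,000 ms per day; DATE is calendar-based, so no time zone handling
  return new Date(days * 86400000)
}
// parseDate(19843) → 2024-04-30T00:00:00.000Z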

platypii (Collaborator) commented:

Just pushed commit 488bfa7f9, which increases performance by 11.7% when parsing tpch on node.js! The improvement is statistically significant.

There is still much room for improvement. Nice job finding optimizations!


simline commented Apr 30, 2024

Yeah, I saw the commit, great job. Here are some small tests that might help:

function concatArrayBuffers(buffer1, buffer2) {
    // allocate once, then bulk-copy both inputs via Uint8Array views
    const newBuffer = new ArrayBuffer(buffer1.byteLength + buffer2.byteLength)
    const view = new Uint8Array(newBuffer)
    view.set(new Uint8Array(buffer1), 0)
    view.set(new Uint8Array(buffer2), buffer1.byteLength)
    return newBuffer
}
const buffer1 = new ArrayBuffer(40000000)
const buffer2 = new ArrayBuffer(40000000)
console.time('concatArrayBuffers')
var r = new Int32Array(concatArrayBuffers(buffer1, buffer2))
console.timeEnd('concatArrayBuffers')


function concatTypedArrays(array1, array2) {
    // same idea, but set() copies the typed arrays directly
    const newTypedArray = new array1.constructor(array1.length + array2.length)
    newTypedArray.set(array1, 0)
    newTypedArray.set(array2, array1.length)
    return newTypedArray
}
const typedArray1 = new Int32Array(10000000)
const typedArray2 = new Int32Array(10000000)
console.time('concatTypedArrays')
var r2 = concatTypedArrays(typedArray1, typedArray2)
console.timeEnd('concatTypedArrays')


var src1 = Array(10000000).fill(0)
var src2 = Array(10000000).fill(0)
console.time('array concat')
var dst = src1.concat(src2)
console.timeEnd('array concat')


// dst.push(...src) might throw a stack overflow on large arrays, so push
// one element at a time; pre-fill first so the timed loop appends to an
// already-large array
var dst2 = []
for (let i = 0; i < 10000000; i++) {
    dst2.push(i)
}
console.time('array push')
for (let i = 0; i < 10000000; i++) {
    dst2.push(i)
}
console.timeEnd('array push')

Node.js v22 on my machine:

`concatArrayBuffers`  50ms
`concatTypedArrays`   49ms
`array concat`        77ms
`array push`          98ms

If we create a TypedArray.prototype.concat() function, it would be suitable for any ArrayLike argument.
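
A sketch of that idea (patching a built-in prototype is shown for illustration only):

// %TypedArray%.prototype is the shared parent of all typed array prototypes
const TypedArrayProto = Object.getPrototypeOf(Int32Array.prototype)
TypedArrayProto.concat = function (other) {
  const out = new this.constructor(this.length + other.length)
  out.set(this, 0)
  out.set(other, this.length) // set() accepts any ArrayLike, not just TypedArrays
  return out
}
// usage: new Int32Array([1, 2]).concat([3, 4]) → Int32Array [1, 2, 3, 4]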

Casting an ArrayBuffer to a TypedArray is almost zero-cost, and SharedArrayBuffer is suitable for parallel web workers.
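
For example (a sketch; 'decode-worker.js' is a hypothetical worker script):

const sab = new SharedArrayBuffer(8 * 122880)
const shared = new BigInt64Array(sab) // zero-copy view over the shared bytes
const worker = new Worker('decode-worker.js')
worker.postMessage(sab) // the worker sees the same memory, no copying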

When to use a TypedArray:

  • numbers without RLE: convert to a TypedArray and use set() to concatenate
  • enum dict codes: wrap them in a DataView and pick items by index
  • ... anything that gets items by array position

When not to:

  • RLE / dict data: users loop over the data by RowGroup and might need the original form
  • nested object columns
  • text, which might be initialized with a global inverted index
  • JSON, which might have a user-injected parser
  • ... cases where it wouldn't optimize memory usage and performance

@simline
Copy link
Author

simline commented Apr 30, 2024

In v0.7.8, another performance issue is the readDataPage function in datapage; that code block costs far more time than you'd expect.

But I didn't test anything for datapageV2 or your new commits yet.

@simline
Copy link
Author

simline commented Apr 30, 2024

I'll go to sleep; here is a partial answer for readRleBitPackedHybrid, using a single output array instead:

export function readData(reader, encoding, count, bitWidth) {
  /** @type {any[]} */
  const values = []
  if (encoding === 'RLE') {
    let seen = 0
    while (seen < count) {
      seen = readRleBitPackedHybrid(reader, bitWidth, 0, count, values, seen)
    }
  } else {
    throw new Error(`parquet encoding not supported ${encoding}`)
  }
  return values
}

export function readRleBitPackedHybrid(reader, width, length, numValues, values, seen = 0) {
  if (!values) throw new Error('parquet readRleBitPackedHybrid expects an output array')
  if (!length) {
    length = reader.view.getInt32(reader.offset, true)
    reader.offset += 4
    if (length < 0) throw new Error(`parquet invalid rle/bitpack length ${length}`)
  }

  const startOffset = reader.offset
  while (reader.offset - startOffset < length && seen < numValues) {
    const [header, newOffset] = readVarInt(reader.view, reader.offset)
    reader.offset = newOffset

    if ((header & 1) === 0) {
      // rle: write the repeated value directly into the shared output array
      const [value, count] = readRle(reader, header, width)
      for (let i = 0; i < count; i++) values[seen++] = value
    } else {
      // bit-packed: copy the decoded run into the shared output array
      const res = readBitPacked(reader, header, width, numValues - seen)
      for (let i = 0; i < res.length; i++) values[seen++] = res[i]
    }
  }

  return seen
}

function readRle(reader, header, bitWidth) {
  const count = header >>> 1
  // rle values are byte-aligned: round the bit width up to whole bytes
  const width = (bitWidth + 7) >> 3

  let value
  if (width === 1) value = reader.view.getUint8(reader.offset)
  else if (width === 2) value = reader.view.getUint16(reader.offset, true)
  else if (width === 4) value = reader.view.getUint32(reader.offset, true)
  else throw new Error(`parquet invalid rle width ${width}`)
  reader.offset += width

  return [value, count]
}
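
For reference, a hypothetical sketch of the readBitPacked helper this calls (hyparquet's actual implementation may differ; this assumes bit widths small enough for a 32-bit shift buffer):

function readBitPacked(reader, header, bitWidth, remaining) {
  // the header's upper bits encode the number of 8-value groups
  let count = (header >>> 1) << 3
  const mask = (1 << bitWidth) - 1
  let data = 0 // bit buffer
  let left = 0 // bits currently in the buffer
  const values = []
  while (count > 0) {
    if (left < bitWidth) {
      // refill the buffer one byte at a time
      data |= reader.view.getUint8(reader.offset++) << left
      left += 8
    } else {
      // emit one value; groups can overrun numValues, so drop the excess
      if (remaining > 0) values.push(data & mask)
      remaining--
      data >>>= bitWidth
      left -= bitWidth
      count--
    }
  }
  return values
}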


platypii commented May 1, 2024

I just pushed some more optimizations in v0.7.10. It is about 37% faster than v0.7.8 on the TPCH dataset!

I tried switching to TypedArrays but things got slower for me.

Thanks for the investigations @simline! Let me know if you have more suggestions!


simline commented May 1, 2024

Yes I have. With TPCH lineitem it works fine on my machine; for INT64 (decimal) types (l_quantity, l_extendedprice, l_discount, l_tax, ...) it's 5x faster.

/**
 * Read `count` int32 values.
 *
 * @param {DataReader} reader - buffer to read data from
 * @param {number} count - number of values to read
 * @returns {Int32Array} array of int32 values
 */
function readPlainInt32(reader, count) {
  reader.offset += count * 4
  // zero-copy view over the page buffer; the hardcoded byte offset 8 skips
  // the level data at the start of this particular page (experiment only)
  return new Int32Array(reader.view.buffer, 8, count)
}

/**
 * Read `count` int64 values.
 *
 * @param {DataReader} reader - buffer to read data from
 * @param {number} count - number of values to read
 * @returns {BigInt64Array} array of int64 values
 */
function readPlainInt64(reader, count) {
  reader.offset += count * 8
  // same zero-copy trick for 8-byte values
  return new BigInt64Array(reader.view.buffer, 8, count)
}
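
The hardcoded byte offset 8 is specific to this experiment. A more general variant might look like this (a sketch, assuming reader.offset already points at the PLAIN-encoded data and the resulting byte offset is 8-aligned, which BigInt64Array requires):

function readPlainInt64(reader, count) {
  const { buffer, byteOffset } = reader.view
  // view exactly `count` values starting at the reader's current position
  const values = new BigInt64Array(buffer, byteOffset + reader.offset, count)
  reader.offset += count * 8
  return values
}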


simline commented May 1, 2024

For data analytics, another thing is the DICTIONARY_PAGE in readColumn(): if we could read the pages at different times, it would save even more time.

If a user filters column data with an expression like (colxxx in (valA, valB, valC)), and the values val[ABCDEFG] saved in the DICTIONARY_PAGE don't match the expression, there is no need to read the DATA_PAGE at all.
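
A sketch of that check (a hypothetical helper, not hyparquet API):

function canSkipDataPages(dictValues, wantedSet) {
  // for an `IN (...)` filter: if no dictionary entry matches,
  // no row in this column chunk can match, so skip its data pages
  return !dictValues.some(v => wantedSet.has(v))
}
// canSkipDataPages(dict, new Set(['valA', 'valB', 'valC']))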

An inverted index would also be helpful, but step by step.


platypii commented May 1, 2024

> it's 5x faster

I made those changes to readPlainInt32 and readPlainInt64 and speed was the same or slightly slower for me. Maybe I need to also change to .set? I don't see how you get a 5x speedup?

I tried on node, chrome, and firefox and saw the same slowdown with TypedArrays. What environment are you testing in?


simline commented May 1, 2024

@platypii sorry, I got the result from readRowGroup() rather than parquetRead(...); concat(a, idx, b) always costs too much time, so I didn't run that test, and too many steps may smooth out the actual performance differences.

If the whole result must be returned at once, something like the example above, newTypedArray.set(array2, array1.length), might be helpful.


platypii commented May 2, 2024

Got TypedArrays working!

Published v0.7.11 with a tpch parse time of 8,951 ms, compared to 19,358 ms for version v0.7.8. A speedup of 53.7%!


simline commented May 2, 2024

A fix for my test:

function concatTypedArrays(array1, array2) {
    const newTypedArray = new array1.constructor(array1.length + array2.length)
    newTypedArray.set(array1, 0)
    newTypedArray.set(array2, array1.length)
    return newTypedArray
}

var dst = new BigInt64Array(0)
for (let i = 0; i < 49; i++) {
    const newA = new BigInt64Array(122880).fill(BigInt(i + 1))

    console.time('concatTypedArrays')
    dst = concatTypedArrays(dst, newA)   // reallocates and copies everything each pass
    console.timeEnd('concatTypedArrays')
}

The time increases on every loop iteration, from 0.7ms to 19.6ms, because newTypedArray is a brand-new allocation each time. But what if we build the whole size up front? It seems the project already does this:

var dst = new BigInt64Array(122880 * 49)

for (let i = 0; i < 49; i++) {
    const newA = new BigInt64Array(122880).fill(BigInt(i + 1))

    console.time('concatTypedArrays')
    dst.set(newA, 122880 * i)   // bulk copy into the preallocated array
    console.timeEnd('concatTypedArrays')
}

It costs <1ms every time!


simline commented May 2, 2024

readDefinitionLevels bitWidth condition:

When reader.view.byteLength - 8 equals daph.num_values * valueBitSize, or something like that (the reality may be more complicated), reading the definition levels may not be necessary.
