[optimize] readRleBitPackedHybrid costs too much time; TypedArray for column results? #7

Open · simline opened this issue Apr 29, 2024 · 15 comments

simline commented Apr 29, 2024

If I can do something to help with this, that would be great.

platypii (Collaborator) commented:

I have plans to make readRleBitPackedHybrid faster by reducing the number of memory allocations. Instead of readVarInt returning an array with value and byte offset, I want to change it to use DataReader, like the rest of the read functions, as in f826bff7.
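
A minimal sketch of what that could look like, assuming a DataReader shaped like { view: DataView, offset: number } (not necessarily what f826bff7 actually does):

function readVarInt(reader) {
  // ULEB128: 7 payload bits per byte; the high bit means "more bytes follow".
  // Advancing reader.offset in place avoids allocating a [value, offset] pair.
  let value = 0
  let shift = 0
  let byte
  do {
    byte = reader.view.getUint8(reader.offset++)
    value |= (byte & 0x7f) << shift
    shift += 7
  } while (byte & 0x80)
  return value // fine for 32-bit headers; wider varints would need BigInt
}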

I like the idea of returning TypedArrays, but will need to think more about how to actually implement that. Suggestions and PRs welcome!

I've found the parsing to be quite fast in general. Can you share an example parquet file which is particularly slow?


simline commented Apr 29, 2024

> I've found the parsing to be quite fast in general. Can you share an example parquet file which is particularly slow?

An example that's more than 20x faster: the TPC-H lineitem table (sf1, almost 250MB). Without reading the definition/repetition levels in readDataPage, and returning a BigInt64Array, it only costs 570ms:

columnMetadata.codec {
  type: 'INT64',
  encodings: [ 'PLAIN' ],
  path_in_schema: [ 'l_extendedprice' ],
  codec: 'SNAPPY',
  num_values: 122880n,
  total_uncompressed_size: 983071n,
  total_compressed_size: 634625n,
  key_value_metadata: undefined,
  data_page_offset: 1580630n,
  index_page_offset: undefined,
  dictionary_page_offset: undefined,
  statistics: {
    max: '^\x10�\x00\x00\x00\x00\x00',
    min: '�i\x01\x00\x00\x00\x00\x00',
    null_count: 0n,
    distinct_count: undefined
  },
  encoding_stats: undefined
} {
  num_values: 122880,
  encoding: 'PLAIN',
  definition_level_encoding: 'RLE',
  repetition_level_encoding: 'RLE',
  statistics: undefined
}

But the metadata says codec: 'SNAPPY', so I removed that condition, because the decompressed length equals the column's data length (8-byte header + int64[122880]).

And lineitem is only a 6M-row table; with a 60M-row table (2.5GB), or a query over a Hive partition, performance matters even more.

I can do more, but I'm not good with the parquet format yet; the 20x speedup was just a first try.


simline commented Apr 29, 2024

With more ideas:

  • parallel support for Node.js and browsers
  • saving memory and time for data analysis: skip the dict map (keep keys and values separate), a DATE type parser (day offsets from 1970-01-01; see the sketch below), etc.
  • reading huge files beyond the browser's memory limit
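
For instance, a minimal sketch of that DATE parser (parquet's DATE logical type stores an int32 count of days since the epoch; the function name is just illustrative):

function parseDate(days) {
  // 86,400,000 ms per day; DATE is calendar-based, so no time zone handling
  return new Date(days * 86400000)
}
// parseDate(19843) → 2024-04-30T00:00:00.000Z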

platypii (Collaborator) commented:

Just pushed commit 488bfa7f9, which increases performance by 11.7% when parsing tpch on node.js! The improvement is statistically significant.

There is still much room for improvement. Nice job finding optimizations!


simline commented Apr 30, 2024

Yeah, I saw the commit, great job. Here are some small tests that might help:

function concatArrayBuffers(buffer1, buffer2) {
    // allocate once, then bulk-copy both inputs via Uint8Array views
    const newBuffer = new ArrayBuffer(buffer1.byteLength + buffer2.byteLength)
    const view = new Uint8Array(newBuffer)
    view.set(new Uint8Array(buffer1), 0)
    view.set(new Uint8Array(buffer2), buffer1.byteLength)
    return newBuffer
}
const buffer1 = new ArrayBuffer(40000000)
const buffer2 = new ArrayBuffer(40000000)
console.time('concatArrayBuffers')
var r = new Int32Array(concatArrayBuffers(buffer1, buffer2))
console.timeEnd('concatArrayBuffers')


function concatTypedArrays(array1, array2) {
    // same idea, but set() copies the typed arrays directly
    const newTypedArray = new array1.constructor(array1.length + array2.length)
    newTypedArray.set(array1, 0)
    newTypedArray.set(array2, array1.length)
    return newTypedArray
}
const typedArray1 = new Int32Array(10000000)
const typedArray2 = new Int32Array(10000000)
console.time('concatTypedArrays')
var r2 = concatTypedArrays(typedArray1, typedArray2)
console.timeEnd('concatTypedArrays')


var src1 = Array(10000000).fill(0)
var src2 = Array(10000000).fill(0)
console.time('array concat')
var dst = src1.concat(src2)
console.timeEnd('array concat')


// dst.push(...src) might throw a stack overflow on large arrays, so push
// one element at a time; pre-fill first so the timed loop appends to an
// already-large array
var dst2 = []
for (let i = 0; i < 10000000; i++) {
    dst2.push(i)
}
console.time('array push')
for (let i = 0; i < 10000000; i++) {
    dst2.push(i)
}
console.timeEnd('array push')

Node.js v22 on my machine:

`concatArrayBuffers`  50ms
`concatTypedArrays`   49ms
`array concat`        77ms
`array push`          98ms

If we create a TypedArray.prototype.concat() function, it would be suitable for any ArrayLike argument.
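
A sketch of that idea (patching a built-in prototype is shown for illustration only):

// %TypedArray%.prototype is the shared parent of all typed array prototypes
const TypedArrayProto = Object.getPrototypeOf(Int32Array.prototype)
TypedArrayProto.concat = function (other) {
  const out = new this.constructor(this.length + other.length)
  out.set(this, 0)
  out.set(other, this.length) // set() accepts any ArrayLike, not just TypedArrays
  return out
}
// usage: new Int32Array([1, 2]).concat([3, 4]) → Int32Array [1, 2, 3, 4]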

Casting an ArrayBuffer to a TypedArray is almost zero-cost, and SharedArrayBuffer is suitable for parallel web workers.
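
For example (a sketch; 'decode-worker.js' is a hypothetical worker script):

const sab = new SharedArrayBuffer(8 * 122880)
const shared = new BigInt64Array(sab) // zero-copy view over the shared bytes
const worker = new Worker('decode-worker.js')
worker.postMessage(sab) // the worker sees the same memory, no copying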

When to use a TypedArray:

  • numbers without RLE: convert to a TypedArray and use set() to concatenate
  • enum dict codes: wrap them in a DataView and pick items by index
  • ... anything that gets items by array position

When not to:

  • RLE / dict data: users loop over the data by RowGroup and might need the original form
  • nested object columns
  • text, which might be initialized with a global inverted index
  • JSON, which might have a user-injected parser
  • ... cases where it wouldn't optimize memory usage and performance

@simline
Copy link
Author

simline commented Apr 30, 2024

In v0.7.8, another performance issue is the readDataPage function in datapage; that code block costs far more time than you'd expect.

But I didn't test anything for datapageV2 or your new commits yet.

@simline
Copy link
Author

simline commented Apr 30, 2024

I'll go to sleep; here is a partial answer for readRleBitPackedHybrid, using a single output array instead:

export function readData(reader, encoding, count, bitWidth) {
  /** @type {any[]} */
  const values = []
  if (encoding === 'RLE') {
    let seen = 0
    while (seen < count) {
      seen = readRleBitPackedHybrid(reader, bitWidth, 0, count, values, seen)
    }
  } else {
    throw new Error(`parquet encoding not supported ${encoding}`)
  }
  return values
}

export function readRleBitPackedHybrid(reader, width, length, numValues, values, seen = 0) {
  if (!values) throw new Error('parquet readRleBitPackedHybrid expects an output array')
  if (!length) {
    length = reader.view.getInt32(reader.offset, true)
    reader.offset += 4
    if (length < 0) throw new Error(`parquet invalid rle/bitpack length ${length}`)
  }

  const startOffset = reader.offset
  while (reader.offset - startOffset < length && seen < numValues) {
    const [header, newOffset] = readVarInt(reader.view, reader.offset)
    reader.offset = newOffset

    if ((header & 1) === 0) {
      // rle: write the repeated value directly into the shared output array
      const [value, count] = readRle(reader, header, width)
      for (let i = 0; i < count; i++) values[seen++] = value
    } else {
      // bit-packed: copy the decoded run into the shared output array
      const res = readBitPacked(reader, header, width, numValues - seen)
      for (let i = 0; i < res.length; i++) values[seen++] = res[i]
    }
  }

  return seen
}

function readRle(reader, header, bitWidth) {
  const count = header >>> 1
  // rle values are byte-aligned: round the bit width up to whole bytes
  const width = (bitWidth + 7) >> 3

  let value
  if (width === 1) value = reader.view.getUint8(reader.offset)
  else if (width === 2) value = reader.view.getUint16(reader.offset, true)
  else if (width === 4) value = reader.view.getUint32(reader.offset, true)
  else throw new Error(`parquet invalid rle width ${width}`)
  reader.offset += width

  return [value, count]
}
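
For reference, a hypothetical sketch of the readBitPacked helper this calls (hyparquet's actual implementation may differ; this assumes bit widths small enough for a 32-bit shift buffer):

function readBitPacked(reader, header, bitWidth, remaining) {
  // the header's upper bits encode the number of 8-value groups
  let count = (header >>> 1) << 3
  const mask = (1 << bitWidth) - 1
  let data = 0 // bit buffer
  let left = 0 // bits currently in the buffer
  const values = []
  while (count > 0) {
    if (left < bitWidth) {
      // refill the buffer one byte at a time
      data |= reader.view.getUint8(reader.offset++) << left
      left += 8
    } else {
      // emit one value; groups can overrun numValues, so drop the excess
      if (remaining > 0) values.push(data & mask)
      remaining--
      data >>>= bitWidth
      left -= bitWidth
      count--
    }
  }
  return values
}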


platypii commented May 1, 2024

I just pushed some more optimizations in v0.7.10. It is about 37% faster than v0.7.8 on the TPCH dataset!

I tried switching to TypedArrays but things got slower for me.

Thanks for the investigations @simline! Let me know if you have more suggestions!


simline commented May 1, 2024

Yes I have. With TPCH lineitem it works fine on my machine; for INT64 (decimal) types (l_quantity, l_extendedprice, l_discount, l_tax, ...) it's 5x faster.

/**
 * Read `count` int32 values.
 *
 * @param {DataReader} reader - buffer to read data from
 * @param {number} count - number of values to read
 * @returns {Int32Array} array of int32 values
 */
function readPlainInt32(reader, count) {
  reader.offset += count * 4
  // zero-copy view over the page buffer; the hardcoded byte offset 8 skips
  // the level data at the start of this particular page (experiment only)
  return new Int32Array(reader.view.buffer, 8, count)
}

/**
 * Read `count` int64 values.
 *
 * @param {DataReader} reader - buffer to read data from
 * @param {number} count - number of values to read
 * @returns {BigInt64Array} array of int64 values
 */
function readPlainInt64(reader, count) {
  reader.offset += count * 8
  // same zero-copy trick for 8-byte values
  return new BigInt64Array(reader.view.buffer, 8, count)
}
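
The hardcoded byte offset 8 is specific to this experiment. A more general variant might look like this (a sketch, assuming reader.offset already points at the PLAIN-encoded data and the resulting byte offset is 8-aligned, which BigInt64Array requires):

function readPlainInt64(reader, count) {
  const { buffer, byteOffset } = reader.view
  // view exactly `count` values starting at the reader's current position
  const values = new BigInt64Array(buffer, byteOffset + reader.offset, count)
  reader.offset += count * 8
  return values
}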


simline commented May 1, 2024

For data analytics, another thing is the DICTIONARY_PAGE in readColumn(): if we could read the pages at different times, it would save even more time.

If a user filters column data with an expression like (colxxx in (valA, valB, valC)), and the values val[ABCDEFG] saved in the DICTIONARY_PAGE don't match the expression, there is no need to read the DATA_PAGE at all.
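
A sketch of that check (a hypothetical helper, not hyparquet API):

function canSkipDataPages(dictValues, wantedSet) {
  // for an `IN (...)` filter: if no dictionary entry matches,
  // no row in this column chunk can match, so skip its data pages
  return !dictValues.some(v => wantedSet.has(v))
}
// canSkipDataPages(dict, new Set(['valA', 'valB', 'valC']))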

An inverted index would also be helpful, but step by step.


platypii commented May 1, 2024

> it's 5x faster

I made those changes to readPlainInt32 and readPlainInt64 and speed was the same or slightly slower for me. Maybe I need to also change to .set? I don't see how you get a 5x speedup?

I tried on node, chrome, and firefox and saw the same slowdown with TypedArrays. What environment are you testing in?


simline commented May 1, 2024

@platypii sorry, I got the result from readRowGroup() rather than parquetRead(...); concat(a, idx, b) always costs too much time, so I didn't run that test, and too many steps may smooth out the actual performance differences.

If the whole result must be returned at once, something like the example above, newTypedArray.set(array2, array1.length), might be helpful.


platypii commented May 2, 2024

Got TypedArrays working!

Published v0.7.11 with a tpch parse time of 8,951 ms, compared to 19,358 ms for version v0.7.8. A speedup of 53.7%!


simline commented May 2, 2024

A fix for my test:

function concatTypedArrays(array1, array2) {
    const newTypedArray = new array1.constructor(array1.length + array2.length)
    newTypedArray.set(array1, 0)
    newTypedArray.set(array2, array1.length)
    return newTypedArray
}

var dst = new BigInt64Array(0)
for (let i = 0; i < 49; i++) {
    const newA = new BigInt64Array(122880).fill(BigInt(i + 1))

    console.time('concatTypedArrays')
    dst = concatTypedArrays(dst, newA)   // reallocates and copies everything each pass
    console.timeEnd('concatTypedArrays')
}

The time increases on every loop iteration, from 0.7ms to 19.6ms, because newTypedArray is a brand-new allocation each time. But what if we build the whole size up front? It seems the project already does this:

var dst = new BigInt64Array(122880 * 49)

for (let i = 0; i < 49; i++) {
    const newA = new BigInt64Array(122880).fill(BigInt(i + 1))

    console.time('concatTypedArrays')
    dst.set(newA, 122880 * i)   // bulk copy into the preallocated array
    console.timeEnd('concatTypedArrays')
}

It costs <1ms every time!


simline commented May 2, 2024

readDefinitionLevels bitWidth condition:

When reader.view.byteLength - 8 equals daph.num_values * valueBitSize, or something like that (the reality may be more complicated), reading the definition levels may not be necessary.
