Skip to content

Latest commit

 

History

History
382 lines (267 loc) · 12.7 KB

slides.md

File metadata and controls

382 lines (267 loc) · 12.7 KB
theme background title info class drawings transition mdc overviewSnapshots coverAuthor coverAuthorUrl
academic
YeSQL Report
## YeSQL Report Presentation slides for my colleague and only for academic purposes. Learn more at https://mocusez.site/zh-CN/posts/78ca1/
text-center
persist
slide-left
true
true
MocusEZ

YeSQL: “You extend SQL”

with Rich and Highly Performant User-Defined Functions in Relational Databases

Presenter: MocusEZ


transition: fade layout: two-cols

Authors

::right::


transition: fade

layout: two-cols


Motivation

  • Many programming language tools to assist developers design pipelines

    • But: Complicated ecosystem, unscalable processing
  • Relational DBMSs offer efficient large data processing

    • But: SQL has limited expressive power
  • UDFs in SQL merge relational and programming syntax and semantics

    • But: Impedance mismatch between declarative (SQL) and procedural (e.g., Python) operation

transition: fade

Main issues addressed

Mismatch between the relational (SQL) evaluation and the procedural (Python) execution

  • (a) Context switching overhead:

    • one facility needs to invoke the other through various levels of indirection. This is potentially expensive when performed frequently.
  • (b) Data conversion overhead:

    • data is represented differently in the two environments and need to be wrapped/unwrapped or checked (e.g., for overflow) and encoded/decoded.
<style> .slidev-vclick-target { transition: all 500ms ease; } .slidev-vclick-hidden { transform: scale(0); } </style>

transition: fade

Our Approach: YeSQL

  • Usability and expressiveness
    • Stateful, parametric, polymorphic, dynamically typed, scalar/aggregate/table UDFs
  • JIT-compiled UDFs and stateful UDFs
    • UDF parallelization and UDF fusion
  • Performance enhancements
    • Tracing JIT compilation
    • Seamless integration with the DBMS(Mainly dependent on CFFI)
    • UDF fusion
    • Parallelism
    • Stateful UDFs.

transition: fade layout: two-cols

Tracing JIT compilation & Parallelism

"Tracing Just-In-Time (JIT) compilation is an approach to dynamic code optimization that focuses on identifying and optimizing the most frequently executed paths, or "hot paths," within a program. Instead of compiling entire methods or functions, tracing JIT compilers record and compile a sequence of instructions as the program runs, specifically targeting these hot paths to produce highly optimized machine code. This optimized code can then be reused in future executions, making frequently executed code much faster."


Examples of tracing JIT implementations include PyPy (for Python) and LuaJIT (for Lua), both of which achieve significant performance gains by optimizing frequently executed paths in their respective interpreted languages.

::right::

Using PyPy Instead of CPython

transition: fade

Seamless Integration with DBMS

"UDFs are wrapped using embedded CFFI"


transition: fade

UDF Fusion

  • Fusable UDFs
    • The second UDF’s input data is the same as the first UDF’s output
    • The argument data types are available in the query plan
  • Example: fuse two scalar UDFs


transition: fade layout: two-cols

Stateful UDFs

Stateful UDFs are functions that can maintain state (memory) across multiple invocations during data processing. Unlike regular UDFs which process each row independently, stateful UDFs can remember information from previous rows and use it in processing subsequent rows.


Key characteristics of Stateful UDFs:

  1. State Maintenance
  • Can store and update variables between function calls
  • Useful for running calculations, aggregations, or pattern detection
  1. Support in Different Systems:
  • Apache Flink: Extensive support for stateful operations
  • Apache Spark: Supported through mapGroupsWithState and flatMapGroupsWithState
  • Many streaming systems support stateful processing

::right::

e.g., via a global dictionary

globaldict = {}
def var(arg1, arg2 = None):
if arg2 is not None:
  globaldict[arg1] = arg2
  return True
else:
  return globaldict[arg1]
SELECT var(‘a’, ’HELLO WORLD’);
>> 1
SELECT lower(var(‘a’));
>> hello world

transition: fade

Evaluation Results

  • Significant performance improvements, achieving up to 68x speedups compared to other traditional python methods.

layout: iframe-right url: https://mocusez.site/zh-CN/posts/78ca1/ class: my-cool-content-on-the-left

More Deatils Information

My Chinese Blog:
如何让数据库中的Python跑的更快-VLDB22-YeSQL文章阅读

Relative Information:
31st Symposium on Advanced Database Systems


layout: center

Thanks for Listening