theme | background | title | info | class | drawings | transition | mdc | overviewSnapshots | coverAuthor | coverAuthorUrl |
---|---|---|---|---|---|---|---|---|---|---|
academic | | YeSQL Report | YeSQL Report. Presentation slides for my colleague, for academic purposes only. Learn more at https://mocusez.site/zh-CN/posts/78ca1/ | text-center | | slide-left | true | true | MocusEZ | |
Presenter: MocusEZ
- Yannis Foufoulas
  - U. of Athens, Athena R.C.
- Alkis Simitsis
  - Athena Research Center
- Lefteris Stamatogiannakis
  - University of Athens
- Yannis Ioannidis
  - U. of Athens, Athena R.C.
::right::
transition: fade
- Many programming-language tools help developers design pipelines
  - But: complicated ecosystem, unscalable processing
- Relational DBMSs process large data efficiently
  - But: SQL has limited expressive power
- UDFs in SQL merge relational and programming syntax and semantics
  - But: impedance mismatch between declarative (SQL) and procedural (e.g., Python) operation
- (a) Context switching overhead:
  - One facility needs to invoke the other through various levels of indirection, which is potentially expensive when performed frequently.
- (b) Data conversion overhead:
  - Data is represented differently in the two environments and needs to be wrapped/unwrapped, checked (e.g., for overflow), and encoded/decoded.
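
A minimal sketch of overhead (b), assuming a hypothetical engine that hands text to Python as UTF-8 bytes and only accepts 64-bit integers back. The helper names (`unwrap_text`, `wrap_int`, `call_udf_per_row`) are illustrative and not taken from YeSQL.

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def unwrap_text(raw: bytes) -> str:
    """Decode the engine's byte representation into a Python str."""
    return raw.decode("utf-8")


def wrap_int(value: int) -> int:
    """Check that a Python int (arbitrary precision) fits a 64-bit column."""
    if not INT64_MIN <= value <= INT64_MAX:
        raise OverflowError(f"{value} does not fit in a 64-bit integer")
    return value


def call_udf_per_row(udf, raw_rows):
    """Invoke the UDF once per row, paying decode + check on every call."""
    return [wrap_int(udf(unwrap_text(raw))) for raw in raw_rows]


lengths = call_udf_per_row(len, [b"abc", "héllo".encode("utf-8")])
```

Every row pays for one decode and one range check in addition to the UDF body itself, which is why the conversion cost grows with the number of invocations.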
- Usability and expressiveness
  - Stateful, parametric, polymorphic, dynamically typed, scalar/aggregate/table UDFs
  - JIT-compiled UDFs and stateful UDFs
  - UDF parallelization and UDF fusion
- Performance enhancements
  - Tracing JIT compilation
  - Seamless integration with the DBMS (mainly dependent on CFFI)
  - UDF fusion
  - Parallelism
  - Stateful UDFs
"Tracing Just-In-Time (JIT) compilation is an approach to dynamic code optimization that focuses on identifying and optimizing the most frequently executed paths, or "hot paths," within a program. Instead of compiling entire methods or functions, tracing JIT compilers record and compile a sequence of instructions as the program runs, specifically targeting these hot paths to produce highly optimized machine code. This optimized code can then be reused in future executions, making frequently executed code much faster."
Examples of tracing JIT implementations include PyPy (for Python) and LuaJIT (for Lua), both of which achieve significant performance gains by optimizing frequently executed paths in their respective interpreted languages.
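
The idea can be illustrated with a toy sketch. This is not how PyPy or LuaJIT are implemented: real tracing JITs emit machine code and insert guards for branches, while this example only generates Python source for a branch-free loop body. The instruction set, `HOT_THRESHOLD`, and all function names are assumptions made for the illustration.

```python
HOT_THRESHOLD = 3  # hypothetical: interpreted iterations before the loop is "hot"
SYMBOLS = {"add": "+", "sub": "-", "mul": "*"}


def interpret_step(x, program, trace=None):
    """Slow path: dispatch each instruction; optionally record what ran."""
    for op, operand in program:
        if trace is not None:
            trace.append((op, operand))
        if op == "add":
            x = x + operand
        elif op == "sub":
            x = x - operand
        elif op == "mul":
            x = x * operand
        else:
            raise ValueError(f"unknown op: {op}")
    return x


def compile_trace(trace):
    """Turn the recorded instruction sequence into one straight-line function."""
    expr = "x"
    for op, operand in trace:
        expr = f"({expr} {SYMBOLS[op]} {operand!r})"
    namespace = {}
    exec(f"def compiled(x):\n    return {expr}", namespace)
    return namespace["compiled"]


def run_loop(x, program, iterations):
    """Interpret until the loop is hot, record one iteration, then reuse it."""
    compiled = None
    counts = {"interpreted": 0, "compiled": 0}
    for _ in range(iterations):
        if compiled is not None:
            x = compiled(x)  # fast path: no per-instruction dispatch
            counts["compiled"] += 1
            continue
        counts["interpreted"] += 1
        if counts["interpreted"] == HOT_THRESHOLD:
            trace = []
            x = interpret_step(x, program, trace)
            compiled = compile_trace(trace)
        else:
            x = interpret_step(x, program)
    return x, counts


result, counts = run_loop(1, [("add", 3), ("mul", 2)], iterations=5)
```

With these inputs the first three iterations run in the interpreter and the remaining two reuse the compiled trace, producing the same value as pure interpretation.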
::right::
- Fusable UDFs
  - The second UDF’s input data is the same as the first UDF’s output
  - The argument data types are available in the query plan
- Example: fuse two scalar UDFs
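
Fusing two scalar UDFs can be sketched as plain function composition. The `Boundary` class below is a hypothetical stand-in for the DBMS-to-Python call path, used only to count how often it is crossed; it does not reflect YeSQL's actual fusion mechanism, and the two UDFs are made up for the example.

```python
import string


class Boundary:
    """Hypothetical engine/Python boundary that counts crossings."""

    def __init__(self):
        self.crossings = 0

    def call(self, udf, value):
        self.crossings += 1
        return udf(value)


def strip_punct(text):
    """Scalar UDF 1: remove ASCII punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))


def to_lower(text):
    """Scalar UDF 2: lowercase the text."""
    return text.lower()


def fuse(first, second):
    """Build one scalar UDF equivalent to second(first(value))."""
    def fused(value):
        return second(first(value))
    return fused


rows = ["Hello, World!", "YeSQL."]

# Unfused: the intermediate result goes back through the boundary.
unfused = Boundary()
unfused_out = [unfused.call(to_lower, unfused.call(strip_punct, r)) for r in rows]

# Fused: the intermediate result never leaves Python.
fused = Boundary()
lower_strip = fuse(strip_punct, to_lower)
fused_out = [fused.call(lower_strip, r) for r in rows]
```

Both variants return the same rows, but the fused version crosses the boundary once per row instead of twice.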
Stateful UDFs are functions that can maintain state (memory) across multiple invocations during data processing. Unlike regular UDFs which process each row independently, stateful UDFs can remember information from previous rows and use it in processing subsequent rows.
Key characteristics of stateful UDFs:
- State maintenance
  - Can store and update variables between function calls
  - Useful for running calculations, aggregations, or pattern detection
- Support in different systems
  - Apache Flink: extensive support for stateful operations
  - Apache Spark: supported through `mapGroupsWithState` and `flatMapGroupsWithState`
  - Many streaming systems support stateful processing
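
A minimal sketch of state maintenance in plain Python, independent of any particular engine. `RunningTotal` is a hypothetical name; the point is only that the callable remembers a value between invocations.

```python
class RunningTotal:
    """Stateful scalar function: returns the sum of all values seen so far."""

    def __init__(self):
        self.total = 0

    def __call__(self, value):
        self.total += value  # state survives across calls
        return self.total


running_total = RunningTotal()
totals = [running_total(v) for v in [3, 4, 5]]
```

A stateless UDF would see each value in isolation; here the result for each row depends on the rows processed before it, so row order matters.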
::right::
e.g., via a global dictionary:

    globaldict = {}

    def var(arg1, arg2=None):
        # Two arguments: store arg2 under the key arg1.
        if arg2 is not None:
            globaldict[arg1] = arg2
            return True
        # One argument: return the value stored under arg1.
        return globaldict[arg1]

    SELECT var('a', 'HELLO WORLD');
    >> 1
    SELECT lower(var('a'));
    >> hello world
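
To try the same idea outside YeSQL, the function can be registered as a UDF with Python's standard `sqlite3` module. This is a sketch under that assumption: it uses `sqlite3`'s `create_function`, not YeSQL's own registration mechanism.

```python
import sqlite3

globaldict = {}


def var(arg1, arg2=None):
    """Set a variable when given two arguments, read it when given one."""
    if arg2 is not None:
        globaldict[arg1] = arg2
        return True
    return globaldict[arg1]


conn = sqlite3.connect(":memory:")
# narg=-1 lets the SQL function accept any number of arguments.
conn.create_function("var", -1, var)

stored = conn.execute("SELECT var('a', 'HELLO WORLD')").fetchone()[0]
lowered = conn.execute("SELECT lower(var('a'))").fetchone()[0]
conn.close()
```

The value written by the first query is still available to the second one because `globaldict` lives in the Python process, not in the query.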
- Significant performance improvements: up to 68x speedup compared to other, traditional Python methods.
layout: iframe-right
url: https://mocusez.site/zh-CN/posts/78ca1/
class: my-cool-content-on-the-left
My Chinese Blog:
How to make Python in the database run faster: reading the VLDB '22 YeSQL paper
Related information:
31st Symposium on Advanced Database Systems