
Improve Statement Splitter #135

Open
psteinroe opened this issue Aug 1, 2024 · 9 comments · May be fixed by #138
Assignees
Labels
enhancement New feature or request

Comments

@psteinroe
Collaborator

The statement splitter is a crate that receives SQL source and returns ranges for every statement in the file. It's supposed to work with incomplete and incorrect statements. Right now, the implementation is very simple and just looks at the first n tokens of a statement to check whether a new one starts at the current cursor position.

This is very fragile and should be improved. My current idea is, instead of using just the first n tokens, to allow any token to be part of the declaration, so that e.g. the create trigger syntax

CREATE [ OR REPLACE ] [ CONSTRAINT ] TRIGGER name { BEFORE | AFTER | INSTEAD OF } { event [ OR ... ] }
    ON table_name
    [ FROM referenced_table_name ]
    [ NOT DEFERRABLE | [ DEFERRABLE ] [ INITIALLY IMMEDIATE | INITIALLY DEFERRED ] ]
    [ REFERENCING { { OLD | NEW } TABLE [ AS ] transition_relation_name } [ ... ] ]
    [ FOR [ EACH ] { ROW | STATEMENT } ]
    [ WHEN ( condition ) ]
    EXECUTE { FUNCTION | PROCEDURE } function_name ( arguments )

could be defined in the parser as

CREATE [ OR REPLACE ] [ CONSTRAINT ] TRIGGER <string> { BEFORE | AFTER | INSTEAD OF } <any-tokens> ON <string> <any-tokens> EXECUTE { FUNCTION | PROCEDURE } <string> ( arguments )

We would need the following pattern definitions:

  • a specific token
  • optional token
  • one of token
  • any string token
  • any number of tokens
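Sketched in Rust (the splitter is a Rust crate), these pattern kinds might look roughly like the following. This is an illustrative assumption, not the crate's actual API: the `Pattern` enum, the `matches` function, and the simplified handling inside `Optional` are all made up for this sketch.

```rust
// Illustrative sketch only: a statement definition is a sequence of
// patterns matched against the token stream at the current cursor.
#[derive(Debug, Clone, PartialEq)]
enum Pattern {
    Exact(&'static str),      // a specific token, e.g. "CREATE"
    Optional(Box<Pattern>),   // [ CONSTRAINT ]
    OneOf(Vec<&'static str>), // { BEFORE | AFTER | INSTEAD OF }
    AnyString,                // <string>, e.g. a trigger name
    AnyTokens,                // <any-tokens>: skip until the rest matches
}

/// Returns true if `tokens` starting at `pos` match the whole pattern sequence.
fn matches(patterns: &[Pattern], tokens: &[&str], mut pos: usize) -> bool {
    let mut i = 0;
    while i < patterns.len() {
        match &patterns[i] {
            Pattern::Exact(t) => {
                if pos >= tokens.len() || !tokens[pos].eq_ignore_ascii_case(t) {
                    return false;
                }
                pos += 1;
            }
            // sketch: only handles a single Exact token inside Optional
            Pattern::Optional(inner) => {
                if let Pattern::Exact(t) = inner.as_ref() {
                    if pos < tokens.len() && tokens[pos].eq_ignore_ascii_case(t) {
                        pos += 1;
                    }
                }
            }
            Pattern::OneOf(opts) => {
                if pos >= tokens.len()
                    || !opts.iter().any(|o| tokens[pos].eq_ignore_ascii_case(o))
                {
                    return false;
                }
                pos += 1;
            }
            Pattern::AnyString => {
                if pos >= tokens.len() {
                    return false;
                }
                pos += 1;
            }
            Pattern::AnyTokens => {
                // consume tokens until the remaining patterns match
                let rest = &patterns[i + 1..];
                while pos < tokens.len() && !matches(rest, tokens, pos) {
                    pos += 1;
                }
                return matches(rest, tokens, pos);
            }
        }
        i += 1;
    }
    true
}
```

With something like this, the trigger definition above becomes a sequence of `Pattern` values, and the splitter can open a new statement range whenever the sequence matches at the cursor.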
@psteinroe psteinroe self-assigned this Aug 1, 2024
@psteinroe psteinroe added the enhancement New feature or request label Aug 1, 2024
@psteinroe psteinroe mentioned this issue Aug 1, 2024
14 tasks
@psteinroe psteinroe linked a pull request Aug 28, 2024 that will close this issue
4 tasks
@aleclarson

I see you've already put a lot of work into this, but I'm curious why you don't split statements with libpg_query?

@mmoncure

@aleclarson I think the basic issue is that libpg_query faults and doesn't parse on any error; it only produces output if the parse is clean. This behavior seems very hard-wired into the library, since it directly depends on the PG_TRY/PG_CATCH macros, as that's what the server wants.

For my part, one approach might be to try to get the 'files with errors' to parse clean by dropping lines until they do, and then, for an unfinished/unclosed statement, try to finish it with introduced tokens so that it completes without error if possible. Having a cache of a prior 'good' parse could help as well. I need to study the approach some more, but my suspicion is that there may be some challenges that are hard to solve.

The more parsing happens outside libpg_query and in the LSP server, the more fragile I think this implementation will be. I noticed parse errors from valid SQL statements; for example, != does not seem to parse in the library.

@psteinroe
Collaborator Author

That's correct. The libpg_query parser bails on any error in a file.

At first, I tried to find a "perfect" solution, which turned out to be a months-long rabbit hole with never-ending edge cases. The PR for that is still open in draft, but I abandoned it.

The current implementation is a very simple Pratt parser. The idea is to fall back to splitting on double newlines, while iteratively implementing custom parsers for the most important statements. In practice, it should be reasonable to ask the user to either properly close a statement with a semicolon or put a double newline between two statements if something is off.
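A minimal sketch of the double-newline fallback alone (not the actual Pratt parser; the function name and return type are illustrative, and unlike the real splitter this naive version would mis-split inside string literals and dollar-quoted bodies):

```rust
// Illustrative fallback only: split on semicolons, and treat a blank line
// (double newline) as a statement boundary when a semicolon is missing.
fn fallback_split(source: &str) -> Vec<&str> {
    let mut stmts = Vec::new();
    let bytes = source.as_bytes();
    let mut start = 0;
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b';' {
            // a semicolon always terminates the current statement
            stmts.push(source[start..=i].trim());
            start = i + 1;
            i += 1;
        } else if bytes[i] == b'\n' && i + 1 < bytes.len() && bytes[i + 1] == b'\n' {
            // double newline: close the current statement even without a `;`
            let chunk = source[start..i].trim();
            if !chunk.is_empty() {
                stmts.push(chunk);
            }
            start = i + 2;
            i += 2;
        } else {
            i += 1;
        }
    }
    let tail = source[start..].trim();
    if !tail.is_empty() {
        stmts.push(tail);
    }
    stmts.into_iter().filter(|s| !s.is_empty()).collect()
}
```

For example, `fallback_split("SELECT 1;\nSELECT 2\n\nSELECT 3;")` yields three statements even though the second one is never terminated with a semicolon.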

I also dig your idea to iteratively parse line by line. We could end up with a similarly endless rabbit hole of edge cases, though.

We currently focus on getting the infra rewrite done and will then gear up for an alpha release. Once that's out there, we will listen to feedback from early users to understand if the current approach is good enough, or if we have to go back to the drawing board again.

@mmoncure

Thanks. I may have a go at seeing what I can get done with libpg_query. This might be in a different implementation (no rust experience), but I might be able to give you some feedback on my research. I have some thoughts on how it might work, but the devil, as always, is in the details.

I'm an extremely prolific pl/pgsql developer and I understand the value of trying to get this project off the ground. Lack of good tooling integration is one of the two major problems remaining to take postgres to the next level, the other being a fully managed extension system. Good luck, and you'll hear from me now and then :).

@mmoncure

huh -- I went down the rabbit hole. I think you're right. I guess the options on the table might be one or more of:

  • hack (or have someone hack) libpg_query to return a partially parsed tree by customizing the error handler. I spent an hour on this and punted; it is made difficult partly because the library works on bison-compiled output. I was thinking about hacking the ereport macro to dump the contents of yyextra to see what could be grabbed out in flight.
  • write a custom statement splitter (your current choice). I poked around a bit and noted that psql depends on a front-end statement splitter, which might not be too difficult to externalize; see here for an example. This would be useless for plpgsql though. One helpful aspect is that you could get intrinsic handling of psql syntax (\if, \endif, etc.), which is a nice benefit.
  • customize the bison gram.y to get custom behaviors, and maybe coerce it into a stateless parser somehow.

I think you're right though: as libpg_query currently stands, it's a dead end. Trying to coerce input to parse clean with the output it gives you doesn't really seem possible.

@mmoncure

@psteinroe

I went ahead and had a hack at yanking the relevant stuff out of pgbench and came up with something that compiles. This approach has a lot of advantages: it follows exactly the same set of rules as psql and will compile scripts as psql does, so it would handle backslash commands; many scripts use various commands to do includes and whatnot.

To make this usable, the build dependencies would have to be worked out; this would have to live in a similar way to libpg_query, since it essentially extracts code out of the server, which is what pgbench does. This area of the server is pretty stable though, and once done it would have to change very rarely. Not sure if you're interested, but this might be a cleaner frame-up of a splitter. I also think this could parse function bodies once unwrapped.

/* includes added for printf/strlen; get_statements() is the entry point of
   the splitter extracted from pgbench (signature assumed for illustration) */
#include <stdio.h>
#include <string.h>

void get_statements(const char *script, const char *source_name);

int main()
{
  char *script =
    "SELECT\nfoo\nfrom biz baz\n;\n\n\n\n/* hello gents */\n"
    "SELECT 1;select 2;\nSELECT foo FROM bar;\n\\if foo\ncreate table x();\n\\endif\nDO $$ begin end;\n $$;";

  printf("length: %zu\n%s\n===================\n", strlen(script), script);

  get_statements(script, "the script\n");
  return 0;
}
mmoncure@DESKTOP-730K2QM:~/src/libpg_split$ ./pg_split
length: 141
SELECT
foo
from biz baz
;



/* hello gents */
SELECT 1;select 2;
SELECT foo FROM bar;
\if foo
create table x();
\endif
DO $$ begin end;
 $$;
===================
buf: 'SELECT
foo
from biz baz
;' line start: 1 end: 4 sr: semicolon offset 25
buf: '/* hello gents */
SELECT 1;' line start: 4 end: 9 sr: semicolon offset 56
buf: 'select 2;' line start: 9 end: 9 sr: semicolon offset 65
buf: 'SELECT foo FROM bar;' line start: 9 end: 10 sr: semicolon offset 86
buf: '' line start: 10 end: 11 sr: backslash offset 88
buf: 'create table x();' line start: 12 end: 12 sr: semicolon offset 112
buf: '' line start: 12 end: 13 sr: backslash offset 114
buf: 'DO $$ begin end;
 $$;' line start: 14 end: 15 sr: semicolon offset 141
buf: '' line start: 15 end: 15 sr: incomplete offset 141

@psteinroe
Collaborator Author

hey @mmoncure, that's awesome, thanks for jumping into the rabbit hole!

I am not proficient with C, but the results look promising. How does it behave when there is no semicolon? I wonder which path is easier to go down, since adding e.g. support for backslash commands would be relatively straightforward in a custom parser too. But I am definitely open to trying it. Since the API of the splitter is very straightforward, we could easily swap the implementations via a config flag.
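Since both candidates share the same input/output shape, the swap could be as simple as a trait object behind a config flag. All names here are hypothetical (this is not the project's real API), and `NativeSplitter` is just a placeholder semicolon splitter standing in for the real parser:

```rust
use std::ops::Range;

/// Common interface both implementations would satisfy.
trait StatementSplitter {
    /// Returns the byte range of every statement in `source`.
    fn split(&self, source: &str) -> Vec<Range<usize>>;
}

/// Placeholder for the built-in parser-based splitter.
struct NativeSplitter;

impl StatementSplitter for NativeSplitter {
    fn split(&self, source: &str) -> Vec<Range<usize>> {
        // stand-in logic: split on `;` only
        let mut out = Vec::new();
        let mut start = 0;
        for (i, c) in source.char_indices() {
            if c == ';' {
                out.push(start..i + 1);
                start = i + 1;
            }
        }
        if start < source.len() {
            out.push(start..source.len());
        }
        out
    }
}

/// Selected via a config flag; the `true` arm would wrap the scanner
/// extracted from psql/pgbench once it exists.
fn splitter_for(use_psql_scanner: bool) -> Box<dyn StatementSplitter> {
    let _ = use_psql_scanner; // both arms fall back to the native one for now
    Box::new(NativeSplitter)
}
```

Callers would only ever see `Box<dyn StatementSplitter>`, so the two backends stay interchangeable without touching the rest of the LSP server.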

@mmoncure

How does it behave when there is no semicolon?

I have to test for this (not at the right computer ATM), but psql treats the special case of one statement with no semicolon as a terminated command. So you would either get that or no statements at all; either ought to be fine.

For improperly completed statements (no ;), they would just come back as one statement, and that's fine. Backslash commands are handled differently than statements, in that it's up to you to scan them, whereas regular commands come out of the scanner completely scanned; this is done intentionally to handle client-side parsing cases, as psql might do.

So I think this might be exactly what is needed. While it may not provide tons of benefit, as statement scanning can likely be done with regex, a custom bison grammar is likely going to smoke, performance-wise, anything done with simpler approaches, and be much more reliable, since you are directly layering on top of the postgres grammar repacked for client parsing needs.

The challenge with this approach is not the scanning implementation but the build dependency; I need some time to work out the best approach for the build environment to be maintenance-free. I struggled to get a makefile working. None of the various postgres build environments (postgres-fe, PGXS, full server) seems to be a good fit for this, for various reasons; pgbench doesn't have this issue as it is packaged with postgresql itself.

Ultimately I think it will be some type of server repacking script along the lines of libpg_query, or (perhaps better) a fork and/or PR to libpg_query itself, since that library is much more valuable with the option of reducing failure cases from a large script down to a single statement, IMNSHO. This could be implemented as a separate library interface or as additional modes on the existing interfaces; I'll poke around a bit to verify what's feasible. That would also be ideal for you, since the libpg_query integration is already worked out. So hang 10 on that.

@psteinroe
Collaborator Author

That sounds really exciting. If you'd like to continue going down the rabbit hole, I can put you in contact with the pganalyze team, which is behind libpg_query. Maybe they can support you here?
