-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Statement Splitter #135
Comments
I see you've already put a lot of work into this, but I'm curious why you don't split statements with libpg_query? |
@aleclarson I think the basic issue is that lib_pgquery faults and doesn't parse on any error; it only produces output if the parse is clean. This behavior seems very hard wired into the library since it directly depends on PG_TRY/PG_CATCH macros since that's what the server wants. For my part, one approach might be to try to get the 'files with errors' to parse clean by dumping lines until it does parse clean, then for an unfinished / unclosed statement, try to finish it with introduced tokens so that it does complete without error if possible. Having a cache of a prior 'good' parse can help as well, I need to study the approach some more but my suspicion is that there may be some challenges that may be hard to solve. the more parsing happening out of libpg_query and in the LSP server, i think the more fragile this implementation will be. I noticed parse errors from valid sql statements, for example, != does not seem to be parsing in the library. |
That's correct. The libg_query parser bails on any error in a file. At first, I tried to find a "perfect" solution that ended up to be a months-long rabbit hole with never ending edge cases. The PR for that is still open in draft, but I abandoned it. The current implementation is a very simple Pratt Parser. The idea is to "fallback" to splitting on double newlines, while iteratively implementing custom parsers for the most important statements. In practice, it should be reasonable to ask the user to either properly close a statement with a semicolon or put a double newline between two statements if something is off. I also dig your idea to iteratively parse line-by-line. We could end up with a similarly endless rabbit hole of edge cases though. We currently focus on getting the infra rewrite done and will then gear up for an alpha release. Once that's out there, we will listen to feedback from early users to understand if the current approach is good enough, or if we have to go back to the drawing board again. |
Thanks. I may have a go at seeing what I can get done with libpg_query. This might be in a different implementation (no rust experience), but I might be able to give you some feedback on my research. I have some thoughts on how it might work, but the devil, as always, is in the details. I'm an extremely prolific pl/pgsql developer and I understand the value of trying to get this project off the ground. Lack of good tooling integration is one of the two major problems remaining to take postgres to the next level, the other being a fully managed extension system. Good luck, and you'll hear from me now and then :). |
huh -- I went down the rabbit hole. I think you're right. I guess the options on the table might be one or more of:
I think you're right though, as libpg_query currently stands, it's a dead end. Trying to coerce input to parse clean with the output it gives you doesn't really seem possible. |
I went ahead and had a hack at yanking the relevant stuff out of pgbench and came up with something that compiles. This approach has a lot of advantages. It follows exactly the same set of rules as psql and will compile scripts as psql does -- it would handle backslash commands...many scripts might use various commands to do includes and whatnot. To make this usable build dependences would have to be worked out-- this would have to live in similar way as libpg_query since it essentially extracts code out of server which pgbench does. This area of the server is pretty stable though, and once done would have to change very rarely. Not sure if you're interested, but this might be a cleaner frame-up of a splitter. I also think this could parse function bodies once unwrapped.
|
hey @mmoncure, that's awesome, thanks for jumping into the rabbit hole! I am not proficient with C++, but the results look promising. How does it behave when there is no semicolon? I wonder what path is easier to go down, since adding eg support for backslash commands would be relatively straightforward in a custom parser too. But I am definitely open to try it. Since the api of the splitter is very straightforward, we could easily swap the implementations via a config flag. |
have to test for this (not at right computer ATM), but psql treats the special case of one statement with no semicolon as a terminated command. So you would either get that or no statements at all; either ought to be fine. for improperly completed statements (no ;) they would just come back as one statement, and that's fine. Backslash commands are handled differently than statements in that it's up to you to scan them where for regular commands they come out of the scanner completely scanned; this is done intentionally to handle client side parsing cases as psql might do. So I think this might be exactly what is needed. While it may not provide tons of benefit as statement scanning can likely be done with regex, a custom bison grammar is likely going to smoke performance-wise anything done with simple approaches and be much more reliable as you are directly laying on top of the postgres grammar that is repacked for client parsing needs. The challenge with this approach is not the scanning implementation but the build dependency; I need some time to workout the best approach for the build environment to be maintenance free...I struggled to get a makefile working. None of the various postgres build environments (postgres-fe, PGXS, full server) seem to be a good fit for this for various reasons, pgbench doesn't have this issue as it is packed with postgresql itself. Ultimately I think it will be some type of server repacking script along the lines of libpg_query, (or perhaps better) a fork and/or PR to lib_pgquery itself since that library is much more valuable with the option of reducing failure cases from a large script down to a single statement IMNSHO. This could be implemented as a separate library interface or additional modes to the existing interfaces; I'll poke around a bit to verify what's feasible. That would also be ideal for you, since the libpg_query integration is already worked out. So hang 10 on that. |
that sounds really exciting. If you like to continue going down the rabbit hole, I can put you in contact with the pg_analyze team which is behind libpg_query. Maybe they can support you here? |
The statement splitter is a crate that receives sql source and returns ranges for every statement in the file. It's supposed to work with incomplete and incorrect statements. Right now, the implementation is very simple and just looks at the first n tokens of a statement to check if a new one starts at the current cursor position.
This is very fragile, and should be improved. My current idea is to instead of just using the first n tokens, allow any tokens to be part of the declaration, so that e.g. the create function syntax
could be defined in the parser as
CREATE [ OR REPLACE ] [ CONSTRAINT ] TRIGGER <string> { BEFORE | AFTER | INSTEAD OF } <any-tokens> ON <string> <any-tokens> EXECUTE { FUNCTION | PROCEDURE } <string> ( arguments )
we would need the following definitions
The text was updated successfully, but these errors were encountered: