🚧 Help finish the parser #51

psteinroe · 2023-12-05T19:10:44Z

We recently finished the implementation of the core parser algorithm. Details can be found in this blog post.

While a lot of work has been automated with procedural macros, we still need to provide the keywords for a node in get_node_properties manually. For example, a SelectStmt node has the select keyword as a property, and if there is a from_clause, a from keyword. This is implemented as

fn custom_handlers(node: &Node) -> TokenStream {
    match node.name.as_str() {
	"SelectStmt" => quote! {
	        tokens.push(TokenProperty::from(Token::Select));
	        if n.from_clause.len() > 0 {
	               tokens.push(TokenProperty::from(Token::From));
	        }
	        ...
	},
        ...
}

Requirements

Setup a rust development environment:

Install rustup
Install nightly rust via rustup-init

rustup-init --default-toolchain nightly -y

Fork the repository

Run your first test

To make it as easy as possible for contributors to get into it, we prepared a simple test suite to quickly assert the tokens for a node.

open the file crates/parser/src/codegen.rs. you will find a bunch of tests there, e.g. test_simple_select. In each test, the test_get_node_properties helper is called . The parameters are a valid sql input str, the kind of the node we want to test for, and a Vec<TokenProperty> with the properties we expect.
in a terminal, cd into crates/parser
now, run a test with debugging enabled RUST_LOG=DEBUG cargo test test_simple_select. you will see both the entire graph, as well as the node you are testing against within the logs. The test should pass, since we are expecting only the Select keyword.

Become familiar with the codebase

A lot of logic is generated using macros, which makes it harder to find the definition of struct, enums and functions. Most important in this case are SyntaxKind and TokenProperty.

SyntaxKind is an enum that holds a variant for every node and token that Postgres supports, and some custom ones that we need for parsing such as Eof (end of file). To figure out what variants exactly, you can look into the generated source of pg_query.rs to find out how nodes and tokens are defined. This is also useful to figure out what properties each node has. Open the file crates/parser/src/parse/libpg_query_node.rs and go to the definition of use pg_query::NodeEnum;. The file contains all nodes and tokens. Search for enum Token to find the latter.

TokenProperty is a struct that defines a property of a node. It is defined in the codemod for get_node_properties and holds either a string value or a kind or both. You can create a TokenProperty from a various types, e.g. TokenProperty::from(SyntaxKind::SelectStmt).

Add your own test

Let's say you want to add a test for the SelectStmt node to contain the from property when a from clause is used. We choose a simple select 1 from contact; statement as our input. Since there are many nodes within this statement, we have to pass the kind of the node that we want to test: SyntaxKind::SelectStmt. We expect two properties: Select, and From. The test now looks like:

    #[test]
    fn test_select_with_from() {
        test_get_node_properties(
            "select 1 from contact;",
            SyntaxKind::SelectStmt,
            vec![
                TokenProperty::from(SyntaxKind::Select),
                TokenProperty::from(SyntaxKind::From),
            ],
        )
    }

Note that SyntaxKind contains all nodes and token types. In this case, the implementation has already been done and the test should pass. If your sample contains a case that is not yet implemented, go to crates/codegen/src/get_node_properties and add the missing implementation to custom_handlers. You should be able to figure out what properties are relevant from the node definition generated by pg_query.rs (see above).

Any help is highly appreciated. Please let us know if you have problems setting everything up. We are very happy to support and improve this guide to maximise efficiency for contributors.

The text was updated successfully, but these errors were encountered:

karlhorky · 2023-12-08T13:50:46Z

cc @nene because of https://github.com/nene/sql-parser-cst

nicoabie · 2023-12-09T23:48:11Z

Have you guys explored the idea of using property based testing to have greater confidence in the implementation accuracy? Great work by the way just read the post!

psteinroe · 2023-12-13T13:28:08Z

Have you guys explored the idea of using property based testing to have greater confidence in the implementation accuracy?

I am currently thinking about leveraging an LLM for this. I have no practical experience with it, but in theory, I think something like this could work:

feed model with node structs and enums from pg_query.rs
tell it to go form node to node and have it come up with different statements, where each statement should result in different variations of that node
parse each statement, and feedback the model whether the statement is valid and the statement contains the target node with the target property
generate tests using the statement

and maybe event have it write get_node_properties itself:

pass the statement into our parser. feedback if error.
have the llm extend the implementation of get_node_properties to fix the error. repeat until parser succeeds.

don't know if that really works, but it would take away a lot of repetitive work.

nicoabie · 2023-12-13T14:31:55Z

Well honestly that is a novel approach, I won't be surprised if there are papers being written comparing PBT against the use of LLM.

Good things about PBT is that the sampling phase creates good coverage in all the domain of the problem and that if an error is found there is a shrinking phase that looks for the minimum example.

Those things (I believe) cannot be controlled using LLMs.

The safest approach would be PBT IMHO but I do believe the LLMs is a very good idea (problem is that you will be on your own because of moving into uncharted territory)

psteinroe · 2023-12-29T08:00:23Z

to whoever deleted the comment about combining queries (union, intercept etc), they should now be fixed: #98

psteinroe pinned this issue Dec 5, 2023

psteinroe changed the title ~~🚧 Help finishing the parser~~ 🚧 Help finish the parser Dec 5, 2023

psteinroe mentioned this issue Dec 6, 2023

add test for parsing BEGIN, ROLLBACK, COMMIT statements #46

Closed

psteinroe added help wanted Extra attention is needed good first issue Good for newcomers labels Dec 6, 2023

This was referenced Dec 10, 2023

add support for parsing "create domain" statement #61

Merged

feat: support "create schema" #63

Merged

feat: support "create view" with relpersistence #65

Merged

nicoabie mentioned this issue Dec 14, 2023

Use proptest in codegen tests #80

Closed

psteinroe mentioned this issue Dec 17, 2023

add snapshots test back to parser #82

Closed

jasonpanosso mentioned this issue Dec 27, 2023

feat: support create conversion #97

Merged

This was referenced Dec 30, 2023

add test for SELECT ... WHERE ... #101

Merged

add support for SELECT .. ORDER BY ... #102

Merged

pert5432 mentioned this issue Jan 13, 2024

feat: add support for truncate statement #113

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🚧 Help finish the parser #51

🚧 Help finish the parser #51

psteinroe commented Dec 5, 2023 •

edited

Loading

karlhorky commented Dec 8, 2023

nicoabie commented Dec 9, 2023

psteinroe commented Dec 13, 2023

nicoabie commented Dec 13, 2023

psteinroe commented Dec 29, 2023 •

edited

Loading

🚧 Help finish the parser #51

🚧 Help finish the parser #51

Comments

psteinroe commented Dec 5, 2023 • edited Loading

Requirements

Run your first test

Become familiar with the codebase

Add your own test

karlhorky commented Dec 8, 2023

nicoabie commented Dec 9, 2023

psteinroe commented Dec 13, 2023

nicoabie commented Dec 13, 2023

psteinroe commented Dec 29, 2023 • edited Loading

psteinroe commented Dec 5, 2023 •

edited

Loading

psteinroe commented Dec 29, 2023 •

edited

Loading