fix: robustly strip psql meta commands#4177
fix: robustly strip psql meta commands#4177ignatremizov wants to merge 2 commits intosqlc-dev:mainfrom
Conversation
|
@andrewmbenton please review |
|
Thanks @ignat980 this looks a much more complete solution than I was expecting |
|
@ignat980 if you open this against main I can get this merged. Just make sure to includes Andrew's commits. |
60585cd to
2181f98
Compare
|
@kyleconroy Thanks! I rebased to latest sqlc/main and changed this PR's merge-into branch as sqlc/main. Just waiting on the test CI to finish |
|
@kyleconroy bump for visibility - not sure why the CI tests haven't run, maybe you need to push something? |
|
@kyleconroy bump again ⭐ |
|
@kyleconroy Can you please merge this? It's been in approved state for 5 months now. |
|
I've merged the simple fix in #4390 If you rebase in that we can get this merged |
2181f98 to
e3bd762
Compare
|
@kyleconroy Rebased on top of #4390 (current main), but I didn’t just replay the old patch unchanged. After checking the merged simple fix more closely, I found several correctness gaps if we wanted this to be safe against real-world So the branch now does more than just “robust stripping”, it does "correct stripping":
I did start going further down the path of "full correctness", but pretty quickly that becomes "reimplement psql", which is not the scope of this PR. So I stopped at:
If we ever want to go further, the honest solution is probably not "keep accreting special cases here", but something closer to "run the SQL through a real PostgreSQL environment and reason about the resulting state after psql-side processing". If users want full arbitrary-psql-script semantics in sqlc today, that is outside the scope of this PR and they can suck a lemon. Also split out the repeated mechanical caller refactor as a separate follow-up commit so the actual behavior change stays readable. |
Replace naive PostgreSQL schema preprocessing with a single-pass state machine that correctly distinguishes top-level psql meta-commands from backslashes in valid SQL code, literals, identifiers, comments, and dollar-quoted bodies. The previous implementation would either leave pg_dump-style backslash directives in place for some schema-loading paths or strip too aggressively, breaking valid SQL containing: - Backslashes in string literals, including `E'...'` escapes and simple `standard_conforming_strings` variants - Meta-command text in comments or documentation - Dollar-quoted function bodies, including Unicode-tagged bodies - Double-quoted identifiers and identifiers containing `$` Changes: - Add engine-aware preprocessing helpers so rollback removal always applies while PostgreSQL preprocessing preserves server-side SQL, including PL/pgSQL bodies and extension/language DDL, before stripping psql meta-commands. - Replace the line-based PostgreSQL filter with a single-pass state machine that tracks single quotes, double quotes, dollar quotes, line comments, and nested block comments. - Handle escape-string prefixes, simple `standard_conforming_strings` value variants, Unicode dollar-quote tags, identifier-boundary checks, the documented `\\` separator command, and broader top-level unknown backslash directives. - Preserve SQL that follows a valid whitespace-delimited inline `\\` separator on the same psql meta-command line, matching tested `psql 17.9` behavior. - Strip semantic psql commands such as `\connect`, `\copy`, `\gexec`, `\i`, and `\ir` with warnings in parse/codegen paths, but reject them in schema-application paths where sqlc cannot reproduce their effects safely. - Reject psql conditionals (`\if`, `\elif`, `\else`, `\endif`) during preprocessing instead of flattening branches and changing SQL semantics. - Treat `standard_conforming_strings` and transaction-scoped script behavior as best-effort parsing aids rather than full psql emulation, and surface an explicit warning when preprocessing has to approximate those semantics. - Apply shared schema preprocessing and warning propagation in compiler parsing, `createdb`, `verify`, managed `vet`, and PostgreSQL sqltest seeding paths, including stable hashing of preprocessed read-only PostgreSQL fixtures. - Re-enable the `pg_dump` end-to-end fixture now that its schema parses and seeds successfully. - Add targeted regression coverage covering: * All documented psql meta-commands plus broader top-level unknown backslash directives * String literals with backslashes, escape-string prefixes, and simple `standard_conforming_strings` variants * Unicode dollar-quote tags, identifier-boundary cases, inline `\\` separator behavior, and rejected conditional directives * Semantic-command warnings and apply-mode rejections, `\copy ... from stdin`, line comments, nested block comments, quoted identifiers, helper edge cases, and schema preprocessing rollout * Bare CR normalization, CRLF preservation, and managed/PostgreSQL preprocessing behavior Performance improvements: - Pre-allocate output buffers with `strings.Builder.Grow()` - Keep parsing single-pass rather than rescanning line slices - Consolidate schema preprocessing into engine-aware helpers reused across schema-loading paths
Extract the repeated apply-time schema glob/read/preprocess/warn loop into a shared helper so command setup and PostgreSQL sqltest seeding use the same preprocessing path as the migrations work evolves. Changes: - Add `schemautil.LoadSchemasForApply()` to expand schema globs, read files, run `PreprocessSchemaForApply()`, and surface warnings through a caller-provided callback. - Update `createdb`, `verify`, and managed `vet` setup to reuse the shared loader instead of open-coding the same apply-time preprocessing loop. - Update PostgreSQL sqltest seeding helpers to reuse the shared loader for both regular and read-only database setup. - Preserve the existing read-only PostgreSQL cache-key behavior by hashing the preprocessed DDL returned by the shared loader. - Handle plain `pg_dump` schema replays in PostgreSQL sqltest helpers by dropping the default `public` schema before applying dumps that recreate it. Behavioral effect: - No intended user-visible behavior change in normal schema preprocessing. - Apply-time callers and PostgreSQL test helpers now share one codepath, reducing drift between managed-database setup and sqltest seeding. - Plain `pg_dump` fixtures continue to apply cleanly in PostgreSQL test databases even when the dump recreates `public`. Testing: - `go test ./internal/migrations ./internal/compiler ./internal/cmd/... ./internal/sqltest/... ./internal/endtoend -run 'TestReplay/base/pg_dump|TestValidSchema/endtoend-testdata/pg_dump/sqlc.json-0'`
e3bd762 to
98a37c8
Compare
Summary
This PR makes PostgreSQL schema preprocessing semantically safer for
pg_dump/psql-flavored input without turning sqlc into a partialpsqlinterpreter.It preserves real PostgreSQL SQL, strips or warns on client-side commands, rejects script semantics that cannot be supported safely, and shares the apply-time schema loading path used by command setup and PostgreSQL sqltest seeding.
Changes
PostgreSQL schema preprocessing
psqlmeta-commands outside SQL lexical context.psqlbackslash commands as a reference fixture. See PostgreSQLpsqldocsE'...'\meta ... \\ SQL...separator form and keep the trailing SQL intact.\copy ... from stdinblocks.Semantic command handling
psqlcommands such as\connect,\copy,\gexec,\i, and\irin parse/codegen paths with warnings so users get told when sqlc is ignoring client-side behavior.psqlconditionals (\if,\elif,\else,\endif) instead of pretending to support them.standard_conforming_stringshandling as a best-effort parse aid and warn when preprocessing has to approximate session/script semantics rather than emulatepsql.Rollout across schema-loading paths
generatecreatedbverifyvetpg_dumpend-to-end fixture now that its schema parses and seeds successfully again.Follow-up refactor
schemautil.LoadSchemasForApply().pg_dumpschema replays in PostgreSQL sqltest helpers by dropping the defaultpublicschema before applying dumps that recreate it.Behavioral impact
pg_dump/psql-flavored PostgreSQL schema files without corrupting valid SQL that happens to contain backslashes.psqlinterpreter; full arbitrary client-side script semantics remain out of scope here.Testing
go test ./internal/migrations ./internal/compiler ./internal/cmd/...go test ./internal/sqltest/... -run '^$'go test ./internal/endtoend -run 'TestReplay/base/pg_dump|TestValidSchema/endtoend-testdata/pg_dump/sqlc.json-0'internal/migrations/migrations.gois covered at100%Related
Co-authored-by: Andrew Benton andrew@sqlc.dev
Addresses gbarr's comment in #4082 which closes #4065