Skip to content

fix: robustly strip psql meta commands#4177

Open
ignatremizov wants to merge 2 commits intosqlc-dev:mainfrom
ignatremizov:fix/strip-psql-meta
Open

fix: robustly strip psql meta commands#4177
ignatremizov wants to merge 2 commits intosqlc-dev:mainfrom
ignatremizov:fix/strip-psql-meta

Conversation

@ignatremizov
Copy link
Copy Markdown

@ignatremizov ignatremizov commented Nov 10, 2025

Summary

This PR makes PostgreSQL schema preprocessing semantically safer for pg_dump / psql-flavored input without turning sqlc into a partial psql interpreter.

It preserves real PostgreSQL SQL, strips or warns on client-side commands, rejects script semantics that cannot be supported safely, and shares the apply-time schema loading path used by command setup and PostgreSQL sqltest seeding.

Changes

PostgreSQL schema preprocessing

  • Replace the old line-based PostgreSQL filter with a single-pass state machine that only strips top-level psql meta-commands outside SQL lexical context.
  • Keep explicit coverage of documented psql backslash commands as a reference fixture. See PostgreSQL psql docs
  • Preserve backslashes that appear in valid PostgreSQL SQL, including:
    • string literals, including E'...'
    • quoted identifiers
    • dollar-quoted bodies
    • comments and nested block comments
  • Support the valid \meta ... \\ SQL... separator form and keep the trailing SQL intact.
  • Handle Unicode dollar-quote tags, identifier-boundary checks, CR/LF normalization, and \copy ... from stdin blocks.

Semantic command handling

  • Strip semantic psql commands such as \connect, \copy, \gexec, \i, and \ir in parse/codegen paths with warnings so users get told when sqlc is ignoring client-side behavior.
  • Reject those same semantic commands in apply/live-DB paths where silently flattening them would be unsafe.
  • Reject psql conditionals (\if, \elif, \else, \endif) instead of pretending to support them.
  • Keep standard_conforming_strings handling as a best-effort parse aid and warn when preprocessing has to approximate session/script semantics rather than emulate psql.

Rollout across schema-loading paths

  • Apply the shared PostgreSQL preprocessing behavior in:
    • compiler parsing / generate
    • createdb
    • verify
    • managed vet
    • PostgreSQL sqltest seeding
  • Surface preprocessing warnings through the relevant callers instead of keeping them internal to the migration helper.
  • Re-enable the pg_dump end-to-end fixture now that its schema parses and seeds successfully again.

Follow-up refactor

  • Extract the repeated apply-time glob/read/preprocess/warn loop into schemautil.LoadSchemasForApply().
  • Reuse that loader across command setup and PostgreSQL sqltest helpers so apply-time behavior stays aligned as preprocessing evolves.
  • Preserve read-only PostgreSQL cache hashing on preprocessed DDL.
  • Handle plain pg_dump schema replays in PostgreSQL sqltest helpers by dropping the default public schema before applying dumps that recreate it.

Behavioral impact

  • sqlc is now much more tolerant of pg_dump / psql-flavored PostgreSQL schema files without corrupting valid SQL that happens to contain backslashes.
  • Parse/codegen paths are intentionally permissive with warnings when users are in footgun territory.
  • Apply/live-DB paths stay stricter and reject client-side script behavior that sqlc cannot reproduce safely.
  • This does not try to become a full psql interpreter; full arbitrary client-side script semantics remain out of scope here.

Testing

  • go test ./internal/migrations ./internal/compiler ./internal/cmd/...
  • go test ./internal/sqltest/... -run '^$'
  • go test ./internal/endtoend -run 'TestReplay/base/pg_dump|TestValidSchema/endtoend-testdata/pg_dump/sqlc.json-0'
  • internal/migrations/migrations.go is covered at 100%

Related

Co-authored-by: Andrew Benton andrew@sqlc.dev

Addresses gbarr's comment in #4082 which closes #4065

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. 🔧 golang labels Nov 10, 2025
@ignatremizov
Copy link
Copy Markdown
Author

@andrewmbenton please review

@gbarr
Copy link
Copy Markdown

gbarr commented Nov 10, 2025

Thanks @ignat980 this looks a much more complete solution than I was expecting

@kyleconroy
Copy link
Copy Markdown
Collaborator

@ignat980 if you open this against main I can get this merged. Just make sure to includes Andrew's commits.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Dec 10, 2025
@ignatremizov ignatremizov force-pushed the fix/strip-psql-meta branch 2 times, most recently from 60585cd to 2181f98 Compare December 11, 2025 08:46
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Dec 11, 2025
@ignatremizov ignatremizov changed the base branch from andrew/fix-4065 to main December 11, 2025 08:53
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Dec 11, 2025
@ignatremizov
Copy link
Copy Markdown
Author

@kyleconroy Thanks! I rebased to latest sqlc/main and changed this PR's merge-into branch as sqlc/main. Just waiting on the test CI to finish

@ignatremizov
Copy link
Copy Markdown
Author

@kyleconroy bump for visibility - not sure why the CI tests haven't run, maybe you need to push something?

@ignatremizov
Copy link
Copy Markdown
Author

@kyleconroy bump again ⭐

@atombender
Copy link
Copy Markdown

@kyleconroy Can you please merge this? It's been in approved state for 5 months now.

@kyleconroy
Copy link
Copy Markdown
Collaborator

I've merged the simple fix in #4390

If you rebase in that we can get this merged

@ignatremizov
Copy link
Copy Markdown
Author

ignatremizov commented Apr 19, 2026

@kyleconroy Rebased on top of #4390 (current main), but I didn’t just replay the old patch unchanged.

After checking the merged simple fix more closely, I found several correctness gaps if we wanted this to be safe against real-world pg_dump / psql-flavored schema input, not just the narrow \restrict / \unrestrict case.

So the branch now does more than just “robust stripping”, it does "correct stripping":

  • keeps the single-pass state machine so we only strip top-level psql meta-commands outside real SQL literal/comment/body context
  • preserves valid PostgreSQL SQL containing backslashes in:
    • string literals, including E'...'
    • quoted identifiers
    • dollar-quoted bodies
    • comments / nested block comments
  • handles the valid \meta ... \\ SQL... separator form
  • strips semantic psql commands in parse/codegen paths with warnings, so if users rely on psql-side behavior they get told they are now in footgun territory
  • rejects those same semantic commands in apply/live-DB paths, where silently flattening them would be unsafe
  • rejects psql conditionals (\if, \elif, \else, \endif) instead of pretending to support them

I did start going further down the path of "full correctness", but pretty quickly that becomes "reimplement psql", which is not the scope of this PR.

So I stopped at:

  • lexically correct stripping
  • semantically safer behavior for sqlc users
  • warnings/errors where preprocessing becomes approximate

If we ever want to go further, the honest solution is probably not "keep accreting special cases here", but something closer to "run the SQL through a real PostgreSQL environment and reason about the resulting state after psql-side processing".

If users want full arbitrary-psql-script semantics in sqlc today, that is outside the scope of this PR and they can suck a lemon.

Also split out the repeated mechanical caller refactor as a separate follow-up commit so the actual behavior change stays readable.

Replace naive PostgreSQL schema preprocessing with a single-pass state machine that correctly distinguishes top-level psql meta-commands from backslashes in valid SQL code, literals, identifiers, comments, and dollar-quoted bodies.

The previous implementation would either leave pg_dump-style backslash directives in place for some schema-loading paths or strip too aggressively, breaking valid SQL containing:
- Backslashes in string literals, including `E'...'` escapes and simple `standard_conforming_strings` variants
- Meta-command text in comments or documentation
- Dollar-quoted function bodies, including Unicode-tagged bodies
- Double-quoted identifiers and identifiers containing `$`

Changes:
- Add engine-aware preprocessing helpers so rollback removal always applies while PostgreSQL preprocessing preserves server-side SQL, including PL/pgSQL bodies and extension/language DDL, before stripping psql meta-commands.
- Replace the line-based PostgreSQL filter with a single-pass state machine that tracks single quotes, double quotes, dollar quotes, line comments, and nested block comments.
- Handle escape-string prefixes, simple `standard_conforming_strings` value variants, Unicode dollar-quote tags, identifier-boundary checks, the documented `\\` separator command, and broader top-level unknown backslash directives.
- Preserve SQL that follows a valid whitespace-delimited inline `\\` separator on the same psql meta-command line, matching tested `psql 17.9` behavior.
- Strip semantic psql commands such as `\connect`, `\copy`, `\gexec`, `\i`, and `\ir` with warnings in parse/codegen paths, but reject them in schema-application paths where sqlc cannot reproduce their effects safely.
- Reject psql conditionals (`\if`, `\elif`, `\else`, `\endif`) during preprocessing instead of flattening branches and changing SQL semantics.
- Treat `standard_conforming_strings` and transaction-scoped script behavior as best-effort parsing aids rather than full psql emulation, and surface an explicit warning when preprocessing has to approximate those semantics.
- Apply shared schema preprocessing and warning propagation in compiler parsing, `createdb`, `verify`, managed `vet`, and PostgreSQL sqltest seeding paths, including stable hashing of preprocessed read-only PostgreSQL fixtures.
- Re-enable the `pg_dump` end-to-end fixture now that its schema parses and seeds successfully.
- Add targeted regression coverage covering:
  * All documented psql meta-commands plus broader top-level unknown backslash directives
  * String literals with backslashes, escape-string prefixes, and simple `standard_conforming_strings` variants
  * Unicode dollar-quote tags, identifier-boundary cases, inline `\\` separator behavior, and rejected conditional directives
  * Semantic-command warnings and apply-mode rejections, `\copy ... from stdin`, line comments, nested block comments, quoted identifiers, helper edge cases, and schema preprocessing rollout
  * Bare CR normalization, CRLF preservation, and managed/PostgreSQL preprocessing behavior

Performance improvements:
- Pre-allocate output buffers with `strings.Builder.Grow()`
- Keep parsing single-pass rather than rescanning line slices
- Consolidate schema preprocessing into engine-aware helpers reused across schema-loading paths
Extract the repeated apply-time schema glob/read/preprocess/warn loop into a shared helper so command setup and PostgreSQL sqltest seeding use the same preprocessing path as the migrations work evolves.

Changes:
- Add `schemautil.LoadSchemasForApply()` to expand schema globs, read files, run `PreprocessSchemaForApply()`, and surface warnings through a caller-provided callback.
- Update `createdb`, `verify`, and managed `vet` setup to reuse the shared loader instead of open-coding the same apply-time preprocessing loop.
- Update PostgreSQL sqltest seeding helpers to reuse the shared loader for both regular and read-only database setup.
- Preserve the existing read-only PostgreSQL cache-key behavior by hashing the preprocessed DDL returned by the shared loader.
- Handle plain `pg_dump` schema replays in PostgreSQL sqltest helpers by dropping the default `public` schema before applying dumps that recreate it.

Behavioral effect:
- No intended user-visible behavior change in normal schema preprocessing.
- Apply-time callers and PostgreSQL test helpers now share one codepath, reducing drift between managed-database setup and sqltest seeding.
- Plain `pg_dump` fixtures continue to apply cleanly in PostgreSQL test databases even when the dump recreates `public`.

Testing:
- `go test ./internal/migrations ./internal/compiler ./internal/cmd/... ./internal/sqltest/... ./internal/endtoend -run 'TestReplay/base/pg_dump|TestValidSchema/endtoend-testdata/pg_dump/sqlc.json-0'`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lgtm This PR has been approved by a maintainer size:L This PR changes 100-499 lines, ignoring generated files. 🔧 golang

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SQLC fails for psql meta-commands like \restrict

4 participants