FIX: Fetch VARCHAR UTF-8 collation columns as SQL_C_WCHAR on Windows to prevent lossy ACP conversion #543
Open
Conversation
…to prevent lossy ACP conversion

On Windows, the ODBC Driver Manager converts SQL_C_CHAR data to the system's ANSI code page (typically CP1252) before delivering it to the application. This is lossy for characters outside CP1252: CJK/Emoji get replaced with '?' (irreversible data loss) and extended Latin characters arrive as CP1252 bytes that fail UTF-8 decoding (returned as raw bytes instead of str).

Fix: When charEncoding is 'utf-8' on Windows, fetch VARCHAR/CHAR/LONGVARCHAR columns as SQL_C_WCHAR (UTF-16LE) instead of SQL_C_CHAR. The ODBC driver converts losslessly to UTF-16LE, bypassing the lossy ACP conversion entirely.

Changes in ddbc_bindings.cpp:
- Add ShouldFetchCharAsWChar() helper (Windows-only, UTF-8 only)
- SQLGetData_wrap (fetchone path): fetch VARCHAR as SQL_C_WCHAR when active, decode via PyUnicode_FromWideChar, with LOB streaming fallback
- SQLBindColums (batch path): accept charEncoding param, bind VARCHAR into wcharBuffers as SQL_C_WCHAR when active
- FetchBatchData: route VARCHAR columns to ProcessWChar dispatcher when active, compute correct fetchBufferSize for WCHAR buffers
- FetchMany_wrap/FetchAll_wrap: pass charEncoding to SQLBindColums
- Arrow path: unchanged (uses default empty charEncoding, no WCHAR workaround)

Not affected:
- Linux/macOS (ShouldFetchCharAsWChar always returns false)
- Non-UTF-8 encodings (cp1252, latin-1, gbk, etc. use old SQL_C_CHAR path)
- NVARCHAR columns (already use SQL_C_WCHAR)
- setencoding API (write path, unrelated to fetch)

Test: add test_varchar_utf8_collation_unicode_roundtrip covering ASCII, German, Chinese, Japanese, Russian, Greek, Arabic, Emoji, French through fetchone, fetchall, and fetchmany paths.
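The fetchBufferSize change called out above boils down to two sizing rules: the old SQL_C_CHAR path over-allocates for UTF-8 expansion, while the new SQL_C_WCHAR path sizes in 2-byte SQLWCHAR units. A minimal Python sketch of the arithmetic follows; the helper names are illustrative, not the actual C++ symbols:

```python
SIZEOF_SQLWCHAR = 2  # ODBC on Windows uses UTF-16 code units

def char_fetch_buffer_bytes(column_size: int) -> int:
    # SQL_C_CHAR path: columnSize * 4 + 1 bytes, allowing for UTF-8
    # expansion (a single character may need up to 4 bytes in UTF-8).
    return column_size * 4 + 1

def wchar_fetch_buffer_bytes(column_size: int) -> int:
    # SQL_C_WCHAR path: (columnSize + 1) SQLWCHAR units, including the
    # null terminator, each 2 bytes wide.
    return (column_size + 1) * SIZEOF_SQLWCHAR

print(char_fetch_buffer_bytes(50))   # 201
print(wchar_fetch_buffer_bytes(50))  # 102
```

For a VARCHAR(50) column the WCHAR binding is actually the smaller buffer, so the workaround does not inflate batch memory estimates.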
📊 Code Coverage Report
Diff Coverage (main...HEAD, staged and unstaged changes)
Summary
mssql_python/pybind/ddbc_bindings.cpp

Lines 2064-2072

```cpp
        size_t chunkBytes = DAE_CHUNK_SIZE;
        while (offset < totalBytes) {
            size_t len = std::min(chunkBytes, totalBytes - offset);

            rc = putData((SQLPOINTER)(dataPtr + offset), static_cast<SQLLEN>(len));
            if (!SQL_SUCCEEDED(rc)) {
                LOG("SQLExecute: SQLPutData failed for "
                    "SQL_C_CHAR chunk - offset=%zu",
                    offset, totalBytes, len, rc);
```

Lines 3351-3363

```cpp
if (ShouldFetchCharAsWChar(charEncoding)) {
    // LCOV_EXCL_START - Windows-only: ShouldFetchCharAsWChar always false on Linux
    // On Windows with UTF-8, fetch LOB VARCHAR as WCHAR to avoid
    // lossy ACP conversion
    LOG("SQLGetData: Streaming LOB for column %d (SQL_C_WCHAR via "
        "UTF-8 workaround) - columnSize=%lu",
        i, (unsigned long)columnSize);
    row.append(
        FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false, charEncoding));
    // LCOV_EXCL_STOP
} else {
    LOG("SQLGetData: Streaming LOB for column %d (SQL_C_CHAR) "
        "- columnSize=%lu",
```

Lines 3369-3437

```cpp
// LCOV_EXCL_START - Windows-only: ShouldFetchCharAsWChar always false on Linux
// On Windows with UTF-8 decoding: fetch VARCHAR as SQL_C_WCHAR
// to bypass the ODBC driver's lossy ACP (e.g. CP1252) conversion.
// The ODBC driver converts losslessly to UTF-16LE for SQL_C_WCHAR.
uint64_t wcharBufSize = (columnSize + 1);  // in SQLWCHAR units
std::vector<SQLWCHAR> wdataBuffer(wcharBufSize);
SQLLEN dataLen;
ret = SQLGetData_ptr(hStmt, i, SQL_C_WCHAR, wdataBuffer.data(),
                     wcharBufSize * sizeof(SQLWCHAR), &dataLen);
if (SQL_SUCCEEDED(ret)) {
    if (dataLen > 0) {
        uint64_t numCharsInData = dataLen / sizeof(SQLWCHAR);
        if (numCharsInData <= columnSize) {
#if defined(_WIN32)
            PyObject* pyStr = PyUnicode_FromWideChar(
                reinterpret_cast<wchar_t*>(wdataBuffer.data()), numCharsInData);
#else
            PyObject* pyStr = PyUnicode_DecodeUTF16(
                reinterpret_cast<const char*>(wdataBuffer.data()),
                numCharsInData * sizeof(SQLWCHAR), NULL, NULL);
#endif
            if (pyStr) {
                row.append(py::reinterpret_steal<py::object>(pyStr));
                LOG("SQLGetData: CHAR column %d fetched as WCHAR (UTF-8 "
                    "workaround), %zu bytes -> decoded",
                    i, (size_t)dataLen);
            } else {
                PyErr_Clear();
                LOG_ERROR("SQLGetData: Failed to decode WCHAR data for "
                          "CHAR column %d",
                          i);
                row.append(py::none());
            }
        } else {
            // Buffer too small, fallback to LOB streaming
            LOG("SQLGetData: CHAR column %d WCHAR data truncated, "
                "using streaming LOB",
                i);
            row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false,
                                          charEncoding));
        }
    } else if (dataLen == SQL_NULL_DATA) {
        LOG("SQLGetData: Column %d is NULL (CHAR via WCHAR)", i);
        row.append(py::none());
    } else if (dataLen == 0) {
        row.append(py::str(""));
    } else if (dataLen == SQL_NO_TOTAL) {
        LOG("SQLGetData: SQL_NO_TOTAL for column %d (CHAR via WCHAR), "
            "falling back to LOB",
            i);
        row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false,
                                      charEncoding));
    } else if (dataLen < 0) {
        LOG("SQLGetData: Unexpected negative data length "
            "for column %d (CHAR via WCHAR) - dataLen=%ld",
            i, (long)dataLen);
        ThrowStdException("SQLGetData returned an unexpected negative "
                          "data length");
    }
} else {
    LOG("SQLGetData: Error retrieving WCHAR data for CHAR column %d "
        "- SQLRETURN=%d, returning NULL",
        i, ret);
    row.append(py::none());
}
// LCOV_EXCL_STOP
} else {
    // Allocate columnSize * 4 + 1 on ALL platforms (no #if guard).
    //
```

Lines 3975-3987

```cpp
if (fetchCharAsWChar) {
    // LCOV_EXCL_START - Windows-only: fetchCharAsWChar always false on Linux
    // On Windows with UTF-8: bind VARCHAR as SQL_C_WCHAR to
    // bypass the ODBC driver's lossy ACP conversion.
    uint64_t fetchBufferSize = columnSize + 1 /*null-terminator*/;
    buffers.wcharBuffers[col - 1].resize(fetchSize * fetchBufferSize);
    ret = SQLBindCol_ptr(
        hStmt, col, SQL_C_WCHAR, buffers.wcharBuffers[col - 1].data(),
        fetchBufferSize * sizeof(SQLWCHAR), buffers.indicators[col - 1].data());
    // LCOV_EXCL_STOP
} else {
    // Use columnSize * 4 + 1 on Linux/macOS to accommodate UTF-8
    // expansion. The ODBC driver returns UTF-8 for SQL_C_CHAR where
```

Lines 4252-4260

```cpp
case SQL_LONGVARCHAR:
    // When fetchCharAsWChar is active, VARCHAR data is in wcharBuffers
    // (bound as SQL_C_WCHAR) so use the WCHAR processor for decoding.
    if (fetchCharAsWChar) {
        columnProcessors[col] = ColumnProcessors::ProcessWChar;  // LCOV_EXCL_LINE - Windows-only
    } else {
        columnProcessors[col] = ColumnProcessors::ProcessChar;
    }
    break;
```

Lines 4514-4522

```cpp
// When UTF-8 WCHAR workaround is active on Windows,
// VARCHAR is bound as SQL_C_WCHAR (2 bytes per char).
// Account for this in memory estimation.
if (ShouldFetchCharAsWChar(charEncoding)) {
    rowSize += columnSize * sizeof(SQLWCHAR);  // LCOV_EXCL_LINE - Windows-only
} else {
    rowSize += columnSize;
}
break;
```

Lines 4979-4989

```cpp
        arrowColumnProducer->ptrValueBuffer = arrowColumnProducer->bitVal.get();
        break;
    default:
        std::ostringstream errorString;
        errorString << "Unsupported data type for Arrow batch fetch for column - "
                    << columnName.c_str() << ", Type - " << dataType << ", column ID - "
                    << (i + 1);
        LOG(errorString.str().c_str());
        ThrowStdException(errorString.str());
        break;
}
```

Lines 5241-5250

```cpp
                buffers.datetimeoffsetBuffers[idxCol].data(),
                sizeof(DateTimeOffset),
                buffers.indicators[idxCol].data());
            if (!SQL_SUCCEEDED(ret)) {
                LOG("Error fetching SS_TIMESTAMPOFFSET data for column %d",
                    idxCol + 1);
                return ret;
            }
            break;
        }
```

Lines 5292-5302

```cpp
    nullCounts[idxCol] += 1;
    continue;
} else if (indicator < 0) {
    // Negative value is unexpected, log column index, SQL type & raise exception
    LOG("Unexpected negative data length. Column ID - %d, SQL Type - %d, Data "
        "Length - %lld",
        idxCol + 1, dataType, (long long)indicator);
    ThrowStdException("Unexpected negative data length.");
}
auto dataLen = static_cast<uint64_t>(indicator);
```

Lines 5595-5604

```cpp
arrowSchemaBatchCapsule =
    py::capsule(arrowSchemaBatch.get(), "arrow_schema", [](void* ptr) {
        auto arrowSchema = static_cast<ArrowSchema*>(ptr);
        if (arrowSchema->release) {
            arrowSchema->release(arrowSchema);
        }
        delete arrowSchema;
    });
} catch (...) {
    arrowSchemaBatch->release(arrowSchemaBatch.get());
```

📋 Files Needing Attention

📉 Files with overall lowest coverage:

mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 67.9%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 73.4%
mssql_python.pybind.connection.connection.cpp: 76.2%
mssql_python.__init__.py: 77.3%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.3%
Contributor
Pull request overview
Improves Unicode correctness on Windows when fetching VARCHAR columns using SQL Server UTF-8 collations by routing UTF-8-decoded SQL_CHAR fetches through SQL_C_WCHAR (UTF-16) to avoid lossy ANSI code page (ACP) conversion in the Windows ODBC Driver Manager.
Changes:
- Added a Windows-only decision helper (`ShouldFetchCharAsWChar`) to fetch `VARCHAR`/`CHAR` as `SQL_C_WCHAR` when UTF-8 decoding is requested.
- Updated row-by-row (`SQLGetData_wrap`) and bound/batch fetch paths (`SQLBindColums`/`FetchBatchData` via `fetchmany`/`fetchall`) to apply the WCHAR workaround consistently.
- Added a regression test covering Unicode round-trip for UTF-8-collated `VARCHAR` across `fetchone`, `fetchall`, and `fetchmany`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `mssql_python/pybind/ddbc_bindings.cpp` | Implements the Windows UTF-8→WCHAR fetch workaround across all main fetch paths. |
| `tests/test_013_encoding_decoding.py` | Adds a test validating lossless Unicode round-trip for UTF-8-collated VARCHAR. |
Work Item / Issue Reference
Summary
This pull request introduces a major improvement to Unicode support for `VARCHAR` columns with UTF-8 collation on Windows, ensuring lossless round-trip of Unicode data. The main change is a new logic path that detects when UTF-8 decoding is requested and, on Windows, fetches `VARCHAR` data as `SQL_C_WCHAR` (UTF-16LE) instead of the default lossy ANSI code page (ACP) conversion. This affects all fetch paths (`fetchone`, `fetchall`, `fetchmany`) and is thoroughly tested with a new test case.

Problem
Since SQL Server 2019, `VARCHAR` columns can store full Unicode via UTF-8 collations (e.g., `Latin1_General_100_CI_AS_SC_UTF8`). However, mssql-python on Windows could not correctly read this data back.

When fetching `VARCHAR` data as `SQL_C_CHAR`, the Windows ODBC Driver Manager converts the bytes to the system's ANSI code page (typically CP1252) before delivering them to the application. This conversion is lossy:

- CJK and emoji characters are replaced with `?` (irreversible data loss)
- Extended Latin characters arrive as CP1252 bytes that fail UTF-8 decoding and are returned as raw `bytes` instead of `str`

The existing `setdecoding(SQL_CHAR, encoding="utf-8")` API only controls decoding after the ODBC driver has already performed this lossy conversion — the data is already damaged before our code sees it.

This issue does not affect Linux/macOS because the unixODBC Driver Manager converts to the system locale (typically UTF-8), which is lossless for all Unicode characters.
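Both failure modes can be reproduced outside ODBC entirely. The snippet below is a minimal simulation of the Driver Manager's ACP conversion (an illustration of the failure mode, not the driver's actual code path):

```python
# Simulate what the Windows Driver Manager does to SQL_C_CHAR data:
# re-encode to the ANSI code page (CP1252 here).
text = "Grüße 你好 😀"

# Characters without a CP1252 representation (CJK, emoji) become '?';
# this substitution is irreversible.
acp_bytes = text.encode("cp1252", errors="replace")
print(acp_bytes)  # b'Gr\xfc\xdfe ?? ?'

# Extended Latin characters survive as CP1252 bytes, but those bytes are
# not valid UTF-8, so the subsequent UTF-8 decode fails and the caller
# receives raw bytes instead of str.
try:
    acp_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8, returned to the caller as raw bytes")
```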
Before the fix (Windows)

| Inserted value | Fetched value | Result |
|---|---|---|
| `'Hello World'` | `'Hello World'` | ✅ |
| `'Grüße'` | `b'Gr\xfc\xdfe'` | ❌ raw bytes |
| `'你好世界'` | `'????'` | ❌ data loss |
| `'こんにちは'` | `'?????'` | ❌ data loss |
| `'Привет'` | `'??????'` | ❌ data loss |
| `'Hello 世界'` | `'Hello ??'` | ❌ data loss |
| `'😀😃😄😁'` | `'????????'` | ❌ data loss |
| `'café résumé'` | `b'caf\xe9 r\xe9sum\xe9'` | ❌ raw bytes |

After the fix (Windows)

| Inserted value | Fetched value | Result |
|---|---|---|
| `'Hello World'` | `'Hello World'` | ✅ |
| `'Grüße'` | `'Grüße'` | ✅ |
| `'你好世界'` | `'你好世界'` | ✅ |
| `'こんにちは'` | `'こんにちは'` | ✅ |
| `'Привет'` | `'Привет'` | ✅ |
| `'Hello 世界'` | `'Hello 世界'` | ✅ |
| `'😀😃😄😁'` | `'😀😃😄😁'` | ✅ |
| `'café résumé'` | `'café résumé'` | ✅ |

Fix
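Two details behind these tables can be checked directly in Python: the raw CP1252 bytes from the before-fix case are still decodable with the right codec (unlike the irreversible `?` substitutions), and UTF-16LE, the wire format the fix uses, round-trips every after-fix value losslessly:

```python
# 1) The raw bytes for extended Latin text are CP1252-encoded, so the
#    original string is recoverable with the right codec. The '?'
#    substitutions for CJK/emoji are not recoverable: the original
#    characters are gone before the application ever sees the data.
assert b"Gr\xfc\xdfe".decode("cp1252") == "Grüße"

# 2) UTF-16LE (what the driver delivers for SQL_C_WCHAR) can represent
#    every Unicode character, so the after-the-fix column round-trips.
for value in ["Hello World", "Grüße", "你好世界", "こんにちは",
              "Привет", "Hello 世界", "😀😃😄😁", "café résumé"]:
    assert value.encode("utf-16-le").decode("utf-16-le") == value
```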
On Windows, when UTF-8 decoding is configured for `SQL_CHAR` (the default setting), we now fetch `VARCHAR` columns as `SQL_C_WCHAR` (UTF-16LE) instead of `SQL_C_CHAR`. The ODBC driver converts losslessly to UTF-16LE since UTF-16 can represent every Unicode character, completely bypassing the lossy ACP conversion.

Unicode/encoding improvements:
- Added a `ShouldFetchCharAsWChar` helper function to detect when `VARCHAR` columns should be fetched as `SQL_C_WCHAR` (UTF-16LE) on Windows if UTF-8 decoding is requested, avoiding lossy ACP conversion. (`mssql_python/pybind/ddbc_bindings.cpp`)
- Updated all fetch paths (`SQLGetData_wrap`, `SQLBindColums`, `FetchBatchData`, `FetchMany_wrap`, `FetchAll_wrap`) to use this logic, binding and decoding `VARCHAR` as `WCHAR` where appropriate. (`mssql_python/pybind/ddbc_bindings.cpp`)

Testing:
- Added a new test, `test_varchar_utf8_collation_unicode_roundtrip`, to verify that `VARCHAR` columns with UTF-8 collation round-trip full Unicode data (including CJK, Cyrillic, Arabic, Greek, Emoji, and mixed strings) correctly across all three fetch paths (`fetchone`, `fetchall`, `fetchmany`). (`tests/test_013_encoding_decoding.py`)

How it works
The fix introduces a gate function, `ShouldFetchCharAsWChar(charEncoding)`:

- On Windows: returns `true` when `charEncoding` is `"utf-8"` (the default)
- On Linux/macOS, or for any non-UTF-8 encoding: returns `false` (no change needed)

When active, the following changes apply to each fetch path:
- `SQLGetData_wrap` (fetchone): fetches `SQL_C_WCHAR` from `SQLGetData`, decodes via `PyUnicode_FromWideChar`
- `SQLBindColums` (batch bind): binds `VARCHAR` into `wcharBuffers` as `SQL_C_WCHAR`
- `FetchBatchData` (batch process): routes through `ProcessWChar` instead of `ProcessChar`
- LOB streaming: fetches `SQL_C_WCHAR` via `FetchLobColumnData`
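The gate plus the fetchone-path dispatch on the `SQLGetData` length indicator can be sketched in Python. This is a hypothetical mirror of the C++ logic, with the platform passed explicitly so the decision is testable off-Windows:

```python
SQL_NULL_DATA = -1
SQL_NO_TOTAL = -4
SIZEOF_SQLWCHAR = 2

def should_fetch_char_as_wchar(char_encoding: str, platform: str) -> bool:
    # Mirror of the C++ gate: Windows-only, and only for UTF-8 decoding.
    return platform == "win32" and char_encoding.lower() == "utf-8"

def classify_wchar_fetch(data_len: int, column_size: int) -> str:
    # Mirror of the SQLGetData_wrap branch structure for CHAR-via-WCHAR.
    if data_len == SQL_NULL_DATA:
        return "null"            # row.append(py::none())
    if data_len == 0:
        return "empty"           # row.append(py::str(""))
    if data_len == SQL_NO_TOTAL:
        return "stream_lob"      # total length unknown: stream the LOB
    if data_len < 0:
        raise ValueError("unexpected negative data length")
    if data_len // SIZEOF_SQLWCHAR <= column_size:
        return "inline"          # decode the buffer directly
    return "stream_lob"          # truncated: fall back to LOB streaming
```

Note the truncation fallback: if the returned character count exceeds `columnSize`, the code abandons the inline buffer and re-fetches via the LOB path rather than returning clipped data.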
Interaction with `setencoding`/`setdecoding` APIs

| API call | Effect |
|---|---|
| `setdecoding(SQL_CHAR, encoding="utf-8")` (default) | WCHAR workaround active on Windows |
| `setdecoding(SQL_CHAR, encoding="cp1252")` | Old `SQL_C_CHAR` path, unchanged |
| `setdecoding(SQL_CHAR, encoding="latin-1")` | Old `SQL_C_CHAR` path, unchanged |
| `setdecoding(SQL_WCHAR, ...)` | Unaffected (`NVARCHAR` already uses `SQL_C_WCHAR`) |
| `setencoding(...)` | Unaffected (write path, unrelated to fetch) |

Impact on existing users
| Scenario | Behavior |
|---|---|
| Windows, `charEncoding="utf-8"` (default) | Now returns correct `str` instead of `bytes` or `?` |
| Windows, `charEncoding="cp1252"`, `"latin-1"`, etc. | Unchanged: old `SQL_C_CHAR` path |
| Linux/macOS | Unchanged: `ShouldFetchCharAsWChar` always returns `false` |
| `NVARCHAR` columns | Unchanged: already fetched as `SQL_C_WCHAR` |
| Arrow fetch path | Unchanged: uses default empty `charEncoding`, old path |
| `setencoding` (write/send path) | Unchanged: unrelated to fetch |

The only behavioral change is for Windows users with the default UTF-8 decoding: extended Latin characters in `VARCHAR` columns that previously returned as raw `bytes` (e.g., `b'Gr\xfc\xdfe'`) now correctly return as `str` (e.g., `'Grüße'`). This is a correctness improvement — the API contract always specified returning `str` for character columns.

Why Windows-only?
The ODBC driver manager behaves differently per platform: