gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize()#149080
gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize()#149080sethmlarson wants to merge 1 commit intopython:mainfrom
Conversation
…ze() Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters in alternating CCC order. Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com>
|
Reviewers: Note that there are pending changes from previous reviews. |
| self.assertEqual(self.db.normalize('NFC', a), b) | ||
|
|
||
| def test_long_combining_mark_run(self): | ||
| # GH-XXXXX: avoid quadratic canonical ordering. |
There was a problem hiding this comment.
- # GH-XXXXX: avoid quadratic canonical ordering.
+ # gh-149079: avoid quadratic canonical ordering.| self.assertEqual(self.db.normalize("NFKC", payload), nfc) | ||
|
|
||
| def test_combining_mark_run_fast_paths(self): | ||
| # GH-XXXXX: cover short runs and already-sorted long runs. |
There was a problem hiding this comment.
- # GH-XXXXX: cover short runs and already-sorted long runs.
+ # gh-149079: cover short runs and already-sorted long runs.|
|
||
| if (run_length > sortbuflen) { | ||
| Py_UCS4 *new_sortbuf = PyMem_Realloc(sortbuf, | ||
| run_length * sizeof(Py_UCS4)); |
There was a problem hiding this comment.
Maybe PyMem_Resize instead of calculating manually?
Lines 58 to 60 in 005555a
serhiy-storchaka
left a comment
There was a problem hiding this comment.
There is a potential for optimization, but in general LGTM. 👍
| Py_ssize_t i, o, osize; | ||
| int kind; | ||
| const void *data; | ||
| int input_kind, result_kind; |
There was a problem hiding this comment.
Why not reuse the same variable?
There was a problem hiding this comment.
IIRC, I asked to have two different variables for readability purposes. We could reuse it but when reading the code, it was cleaner when I saw the separation. But it can be reverted if you insist.
| data = PyUnicode_DATA(result); | ||
| result_kind = PyUnicode_KIND(result); | ||
| result_data = PyUnicode_DATA(result); | ||
| length = PyUnicode_GET_LENGTH(result); |
|
Ideas for optimization:
|
| for (Py_ssize_t i = start; i < end; i++) { | ||
| Py_UCS4 code = PyUnicode_READ(kind, data, i); | ||
| unsigned char combining = _getrecord_ex(code)->combining; | ||
| counts[combining]++; | ||
| if (combining < min_combining) { | ||
| min_combining = combining; | ||
| } | ||
| if (combining > max_combining) { | ||
| max_combining = combining; | ||
| } | ||
| } | ||
|
|
||
| for (Py_ssize_t i = min_combining; i <= max_combining; i++) { | ||
| Py_ssize_t count = counts[i]; | ||
| counts[i] = total; | ||
| total += count; | ||
| } |
There was a problem hiding this comment.
So we can drop the min/max stuff as I suggest privately.
| int needs_sort = 0; | ||
|
|
||
| prev = _getrecord_ex( | ||
| PyUnicode_READ(result_kind, result_data, i))->combining; |
There was a problem hiding this comment.
And here, for readability, I suggested storing PyUnicode_READ() in a temp var (same some lines below).
| @@ -0,0 +1,5 @@ | |||
| Fix a potential denial of service in :func:`unicodedata.normalize`. The | |||
| canonical ordering step of Unicode normalization used an O(n²) insertion | |||
There was a problem hiding this comment.
| canonical ordering step of Unicode normalization used an O(n²) insertion | |
| canonical ordering step of Unicode normalization used a quadratic-time insertion |
I suggested to avoid possible rendering issues (we also say "linear-time" afterwards)
Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters with a large number of inversions in combing class order.
unicodedata.normalize("NFC")canonical ordering #149079