Security: parse_url() truncates oversized URL fields due to uint16_t offsets/lengths in vendored http-parser #142

@rangersui

Description

tl;dr: httptools.parse_url() uses the vendored http-parser URL parser. That parser stores URL field offsets and lengths as uint16_t inside struct http_parser_url.field_data[]. Once a parsed field's offset or length exceeds UINT16_MAX, the stored off/len values wrap modulo 65536, and parse_url() returns truncated bytes sliced from the wrong place. I reproduced this directly against httptools.parse_url() and also end-to-end through uvicorn's httptools protocol, where the truncated result propagates into the ASGI scope.

This was previously reported in httptools as issue #43, which was closed on 2020-02-06 with the maintainer comment:

This is an upstream issue: nodejs/http-parser#481

That upstream repository is now archived and read-only, so the original escalation path is no longer actionable.


Reproduction

1. Direct reproduction with parse_url()

import httptools

print("httptools", httptools.__version__)

for n in [65534, 65535, 65536, 70000, 131071]:
    url = b"http://h/" + b"a" * n
    r = httptools.parse_url(url)
    expected = 1 + n  # "/" + n * "a"
    actual = len(r.path) if r.path else 0
    print(f"body_a={n:<6}  parsed.path_len={actual:<6}  expected={expected}")

Observed on httptools 0.7.1:

httptools 0.7.1
body_a=65534   parsed.path_len=65535   expected=65535
body_a=65535   parsed.path_len=0       expected=65536
body_a=65536   parsed.path_len=1       expected=65537
body_a=70000   parsed.path_len=4465    expected=70001
body_a=131071  parsed.path_len=0       expected=131072

For oversized values, the observed pattern is:

actual ≡ expected (mod 65536)
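A quick stdlib-only check against the numbers from the table above confirms the modular relationship:

```python
# Verify that each truncated length observed above equals the expected
# length reduced modulo 2**16 (the uint16_t wraparound).
observed = {65534: 65535, 65535: 0, 65536: 1, 70000: 4465, 131071: 0}
for n, truncated in observed.items():
    expected = 1 + n  # "/" plus n "a" bytes
    assert truncated == expected % 65536
print("wraparound pattern confirmed")
```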

I also confirmed the same root cause affects other string fields that are sliced from field_data[], such as query.

2. End-to-end reproduction through uvicorn -> ASGI scope

In a local environment with uvicorn 0.42.0, HttpToolsProtocol.on_headers_complete() still does:

parsed_url = httptools.parse_url(self.url)
raw_path = parsed_url.path
path = raw_path.decode("ascii")
...
self.scope["path"] = full_path
self.scope["raw_path"] = full_raw_path
self.scope["query_string"] = parsed_url.query or b""

This minimal app shows the truncated scope["path"]:

import asyncio, socket, uvicorn

async def app(scope, receive, send):
    if scope["type"] != "http":
        return
    body = f"path_len={len(scope['path'])}  head={scope['path'][:32]!r}".encode()
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [
            [b"content-type", b"text/plain"],
            [b"content-length", str(len(body)).encode()],
        ],
    })
    await send({"type": "http.response.body", "body": body})

async def main():
    cfg = uvicorn.Config(app, host="127.0.0.1", port=29571,
                         log_level="error", http="httptools")
    srv = uvicorn.Server(cfg)
    task = asyncio.create_task(srv.serve())
    await asyncio.sleep(0.6)

    for n in [100, 65534, 65535, 70000]:
        s = socket.socket()
        s.connect(("127.0.0.1", 29571))
        s.sendall(
            b"GET /" + b"a" * n +
            b" HTTP/1.1\r\nHost: x\r\nConnection: close\r\n\r\n"
        )
        r = b""
        while True:
            c = s.recv(65536)
            if not c:
                break
            r += c
        s.close()
        print(r.split(b"\r\n\r\n", 1)[1].strip().decode("utf-8", "replace"))

    srv.should_exit = True
    await task

asyncio.run(main())

Observed output:

path_len=101    head='/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
path_len=65535  head='/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
path_len=0      head=''
path_len=4465   head='/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'

So this is not limited to direct callers of parse_url(): when uvicorn runs with http="httptools", the truncated parse result reaches application code via scope["path"], scope["raw_path"], and scope["query_string"].


Root cause

Struct layout

The Cython declaration mirrors the vendored http-parser struct:

httptools/parser/url_cparser.pxd

struct http_parser_url_field_data:
    uint16_t off
    uint16_t len

and in vendor/http-parser/http_parser.h:

struct http_parser_url {
  uint16_t field_set;
  uint16_t port;

  struct {
    uint16_t off;
    uint16_t len;
  } field_data[UF_MAX];
};
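The 16-bit limit can be illustrated with a ctypes mirror of that struct. This is a sketch for illustration only; it assumes UF_MAX is 7 (UF_SCHEMA through UF_USERINFO in http_parser.h), and the field names match the header:

```python
import ctypes

# Python mirror of the vendored http_parser_url layout, for illustration.
class FieldData(ctypes.Structure):
    _fields_ = [("off", ctypes.c_uint16), ("len", ctypes.c_uint16)]

UF_MAX = 7  # UF_SCHEMA .. UF_USERINFO in http_parser.h

class HttpParserUrl(ctypes.Structure):
    _fields_ = [
        ("field_set", ctypes.c_uint16),
        ("port", ctypes.c_uint16),
        ("field_data", FieldData * UF_MAX),
    ]

u = HttpParserUrl()
# Storing a path length of 70001 ("/" plus 70000 "a"s) wraps silently to
# 4465 -- exactly the truncated length observed in the reproduction above.
u.field_data[3].len = 70001  # ctypes masks out-of-range unsigned values
assert u.field_data[3].len == 70001 % 0x10000 == 4465
```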

Slicing in url_parser.pyx

httptools/parser/url_parser.pyx then trusts those values:

off = parsed.field_data[<int>uparser.UF_PATH].off
ln = parsed.field_data[<int>uparser.UF_PATH].len
path = buf_data[off:off+ln]

The Cython code is not the source of the truncation; the wrapped values are already present in the C struct before slicing happens.

Affected string fields sliced from field_data[] include:

  • schema
  • host
  • path
  • query
  • fragment
  • userinfo

(port is also a uint16_t, but it is semantically different and not sliced from field_data[].)


Upstream status

  • nodejs/http-parser was archived on 2022-11-06 and is now read-only.
  • nodejs/http-parser#481 ("http_parser_parse_url fails to handle very long URLs") is still OPEN.
  • nodejs/http-parser#480 ("Fix http_parser_parse_url to handle very long URLs") is still OPEN.

So while the original bug was correctly identified as upstream in 2020, that upstream no longer has a realistic maintenance path.


This repo's own history

PR #56 / merge commit 63b5de2

The commit that switched request parsing from http-parser to llhttp explicitly left URL parsing behind:

Replaces the underlying HTTP parser with llhttp as http-parser is
no longer actively maintained. However we still have http-parser
to power our URL parsing function at the moment.

That merge commit is:

  • 63b5de2bf7c7fa498c7ecac1eb9d8f7352a00dd4
  • merged at 2021-03-30T17:05:59Z

URL parser file history on master

From a fresh clone of this repository:

$ git log --follow --oneline -- httptools/parser/url_parser.pyx
63b5de2 Swap http-parse to llhttp (#56)

$ git log --follow --oneline -- httptools/parser/url_cparser.pxd
63b5de2 Swap http-parse to llhttp (#56)

These two files were introduced in 63b5de2 and have not changed on master since then.

Two unmerged replacement branches still present in origin/

Both are visible in git branch -a on the current repository:

  • origin/rust-url
  • origin/rust-uri

Their commits are:

commit   date        branch           author       message
04d7df7  2021-04-23  origin/rust-url  Fantix King  Local draft
294ac41  2021-04-25  origin/rust-uri  Fantix King  WIP: use the http::Uri crate

Both delete httptools/parser/url_cparser.pxd and httptools/parser/url_parser.pyx and replace them with a Rust-based implementation.


FastAPI / Uvicorn propagation path

This is not limited to applications that explicitly choose httptools.

FastAPI's current documentation recommends installing FastAPI with:

pip install "fastapi[standard]"

And FastAPI's docs explicitly state that installing fastapi[standard] also gives you uvicorn[standard].

Uvicorn's own installation docs explicitly state that:

  • uvicorn[standard] installs httptools
  • when httptools is installed, Uvicorn uses it by default for HTTP/1.1 parsing

So the documented "standard" path is:

fastapi[standard]
  -> uvicorn[standard]
  -> installs httptools
  -> Uvicorn uses httptools by default for HTTP/1.1

This does not mean every FastAPI deployment is affected in practice:

  • some users install plain fastapi
  • some switch Uvicorn to h11
  • some use a different ASGI server
  • some are protected in practice by frontend URL-length limits

But it does mean this parser is part of the default path for a large set of users following the official FastAPI + Uvicorn "standard" installation flow.


Impact

I am not claiming that every application using httptools or uvicorn is automatically exploitable.

I am claiming that the following primitive exists:

  • the application stack may receive a parsed path/query that does not match the actual request-target bytes that were sent on the wire.

That primitive can matter in contexts such as:

  • path-based auth or routing decisions
  • cache-key derivation
  • proxy/backend interpretation mismatches
  • audit/log correlation

This is exactly the kind of risk called out in nodejs/http-parser#481, which notes that the parsed and actual path may differ in security-relevant code paths.


Suggested paths forward

Any of these would be better than the current silent truncation:

A. Deprecate parse_url() and document the limitation

Point users to urllib.parse.urlsplit or another maintained parser.
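For comparison, the stdlib parser stores fields as ordinary Python strings, so it has no 16-bit field limits and long paths survive intact:

```python
from urllib.parse import urlsplit

# A 70001-character path (the failing case above) round-trips unchanged.
n = 70000
parts = urlsplit("http://h/" + "a" * n)
assert len(parts.path) == 1 + n
assert parts.scheme == "http" and parts.netloc == "h"
```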

B. Patch the vendored http-parser

Carry a local fix in the vendored copy so that oversized off/len values do not silently wrap.

This is likely a small source change, but it should be evaluated carefully for:

  • ABI impact
  • packaging/build expectations
  • use-system-http-parser compatibility

C. Reject overflow explicitly

If changing the struct layout is undesirable, another improvement would be to detect overflow and raise an error instead of returning truncated fields.
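One way to implement option C without touching the struct layout would be a length check before the C parser runs. Since every field's off/len is bounded by the total URL length, a single guard on len(url) suffices. A minimal sketch (guard_url_length is a hypothetical helper, not part of the httptools API):

```python
UINT16_MAX = 0xFFFF

def guard_url_length(url: bytes) -> bytes:
    # http_parser_url cannot represent any offset/length above UINT16_MAX,
    # so refuse oversized input instead of returning wrapped slices.
    if len(url) > UINT16_MAX:
        raise ValueError(
            f"URL length {len(url)} exceeds http_parser_url's uint16_t limit"
        )
    return url

# A short URL passes through; an oversized one raises instead of truncating.
assert guard_url_length(b"http://h/ok") == b"http://h/ok"
try:
    guard_url_length(b"http://h/" + b"a" * 70000)
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError")
```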

D. Resume a replacement implementation

The existing origin/rust-url / origin/rust-uri branches suggest that this migration path was already explored.


Environment used for reproduction

  • httptools 0.7.1 from PyPI
  • current master checked out at 28d1db15eaeaab5bc7d376d2c2035d966b6e1378 (0.8.0.dev0)
  • uvicorn 0.42.0 for the end-to-end ASGI reproduction
  • Python 3.13 on Windows

The bug is not Windows-specific; the relevant behavior follows from the struct layout and parsing code.


Thanks

httptools has been load-bearing for the Python ASGI ecosystem for a long time. This report is meant to be actionable and specific: there is a concrete reproducible bug, the historical upstream escalation path is closed, and the current repository history suggests this exact corner was already recognized as unfinished when llhttp replaced http-parser for request parsing.
