Security: parse_url() truncates oversized URL fields due to uint16_t offsets/lengths in vendored http-parser #142

@rangersui

Description

tl;dr: httptools.parse_url() uses the vendored http-parser URL parser. That parser stores URL field offsets and lengths as uint16_t inside struct http_parser_url.field_data[]. Once a parsed field's offset or length exceeds UINT16_MAX, the stored off/len values wrap modulo 65536, and parse_url() returns truncated bytes sliced from the wrong place. I reproduced this directly against httptools.parse_url() and also end-to-end through uvicorn's httptools protocol, where the truncated result propagates into the ASGI scope.

This was previously reported in httptools as issue #43, which was closed on 2020-02-06 with the maintainer comment:

This is an upstream issue: nodejs/http-parser#481

That upstream repository is now archived and read-only, so the original escalation path is no longer actionable.


Reproduction

1. Direct reproduction with parse_url()

import httptools

print("httptools", httptools.__version__)

for n in [65534, 65535, 65536, 70000, 131071]:
    url = b"http://h/" + b"a" * n
    r = httptools.parse_url(url)
    expected = 1 + n  # "/" + n * "a"
    actual = len(r.path) if r.path else 0
    print(f"body_a={n:<6}  parsed.path_len={actual:<6}  expected={expected}")

Observed on httptools 0.7.1:

httptools 0.7.1
body_a=65534   parsed.path_len=65535   expected=65535
body_a=65535   parsed.path_len=0       expected=65536
body_a=65536   parsed.path_len=1       expected=65537
body_a=70000   parsed.path_len=4465    expected=70001
body_a=131071  parsed.path_len=0       expected=131072

For oversized values, the observed pattern is:

actual ≡ expected (mod 65536)
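A quick stdlib-only check against the numbers from the table above confirms the modular relationship:

```python
# Verify that each truncated length observed above equals the expected
# length reduced modulo 2**16 (the uint16_t wraparound).
observed = {65534: 65535, 65535: 0, 65536: 1, 70000: 4465, 131071: 0}
for n, truncated in observed.items():
    expected = 1 + n  # "/" plus n "a" bytes
    assert truncated == expected % 65536
print("wraparound pattern confirmed")
```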

I also confirmed the same root cause affects other string fields that are sliced from field_data[], such as query.

2. End-to-end reproduction through uvicorn -> ASGI scope

In a local environment with uvicorn 0.42.0, HttpToolsProtocol.on_headers_complete() still does:

parsed_url = httptools.parse_url(self.url)
raw_path = parsed_url.path
path = raw_path.decode("ascii")
...
self.scope["path"] = full_path
self.scope["raw_path"] = full_raw_path
self.scope["query_string"] = parsed_url.query or b""

This minimal app shows the truncated scope["path"]:

import asyncio, socket, uvicorn

async def app(scope, receive, send):
    if scope["type"] != "http":
        return
    body = f"path_len={len(scope['path'])}  head={scope['path'][:32]!r}".encode()
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [
            [b"content-type", b"text/plain"],
            [b"content-length", str(len(body)).encode()],
        ],
    })
    await send({"type": "http.response.body", "body": body})

async def main():
    cfg = uvicorn.Config(app, host="127.0.0.1", port=29571,
                         log_level="error", http="httptools")
    srv = uvicorn.Server(cfg)
    task = asyncio.create_task(srv.serve())
    await asyncio.sleep(0.6)

    for n in [100, 65534, 65535, 70000]:
        s = socket.socket()
        s.connect(("127.0.0.1", 29571))
        s.sendall(
            b"GET /" + b"a" * n +
            b" HTTP/1.1\r\nHost: x\r\nConnection: close\r\n\r\n"
        )
        r = b""
        while True:
            c = s.recv(65536)
            if not c:
                break
            r += c
        s.close()
        print(r.split(b"\r\n\r\n", 1)[1].strip().decode("utf-8", "replace"))

    srv.should_exit = True
    await task

asyncio.run(main())

Observed output:

path_len=101    head='/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
path_len=65535  head='/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
path_len=0      head=''
path_len=4465   head='/aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'

So this is not limited to direct callers of parse_url(): when uvicorn runs with http="httptools", the truncated parse result reaches application code via scope["path"], scope["raw_path"], and scope["query_string"].


Root cause

Struct layout

The Cython declaration mirrors the vendored http-parser struct:

httptools/parser/url_cparser.pxd

struct http_parser_url_field_data:
    uint16_t off
    uint16_t len

and in vendor/http-parser/http_parser.h:

struct http_parser_url {
  uint16_t field_set;
  uint16_t port;

  struct {
    uint16_t off;
    uint16_t len;
  } field_data[UF_MAX];
};
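The 16-bit limit can be illustrated with a ctypes mirror of that struct. This is a sketch for illustration only; it assumes UF_MAX is 7 (UF_SCHEMA through UF_USERINFO in http_parser.h), and the field names match the header:

```python
import ctypes

# Python mirror of the vendored http_parser_url layout, for illustration.
class FieldData(ctypes.Structure):
    _fields_ = [("off", ctypes.c_uint16), ("len", ctypes.c_uint16)]

UF_MAX = 7  # UF_SCHEMA .. UF_USERINFO in http_parser.h

class HttpParserUrl(ctypes.Structure):
    _fields_ = [
        ("field_set", ctypes.c_uint16),
        ("port", ctypes.c_uint16),
        ("field_data", FieldData * UF_MAX),
    ]

u = HttpParserUrl()
# Storing a path length of 70001 ("/" plus 70000 "a"s) wraps silently to
# 4465 -- exactly the truncated length observed in the reproduction above.
u.field_data[3].len = 70001  # ctypes masks out-of-range unsigned values
assert u.field_data[3].len == 70001 % 0x10000 == 4465
```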

Slicing in url_parser.pyx

httptools/parser/url_parser.pyx then trusts those values:

off = parsed.field_data[<int>uparser.UF_PATH].off
ln = parsed.field_data[<int>uparser.UF_PATH].len
path = buf_data[off:off+ln]

The Cython code is not the source of the truncation; the wrapped values are already present in the C struct before slicing happens.

Affected string fields sliced from field_data[] include:

  • schema
  • host
  • path
  • query
  • fragment
  • userinfo

(port is also a uint16_t, but it is semantically different and not sliced from field_data[].)


Upstream status

  • nodejs/http-parser was archived on 2022-11-06 and is now read-only.
  • nodejs/http-parser#481 ("http_parser_parse_url fails to handle very long URLs") is still OPEN.
  • nodejs/http-parser#480 ("Fix http_parser_parse_url to handle very long URLs") is still OPEN.

So while the original bug was correctly identified as upstream in 2020, that upstream no longer has a realistic maintenance path.


This repo's own history

PR #56 / merge commit 63b5de2

The commit that switched request parsing from http-parser to llhttp explicitly left URL parsing behind:

Replaces the underlying HTTP parser with llhttp as http-parser is
no longer actively maintained. However we still have http-parser
to power our URL parsing function at the moment.

That merge commit is:

  • 63b5de2bf7c7fa498c7ecac1eb9d8f7352a00dd4
  • merged at 2021-03-30T17:05:59Z

URL parser file history on master

From a fresh clone of this repository:

$ git log --follow --oneline -- httptools/parser/url_parser.pyx
63b5de2 Swap http-parse to llhttp (#56)

$ git log --follow --oneline -- httptools/parser/url_cparser.pxd
63b5de2 Swap http-parse to llhttp (#56)

These two files were introduced in 63b5de2 and have not changed on master since then.

Two unmerged replacement branches still present in origin/

Both are visible in git branch -a on the current repository:

  • origin/rust-url
  • origin/rust-uri

Their commits are:

commit   date        branch           author       message
04d7df7  2021-04-23  origin/rust-url  Fantix King  Local draft
294ac41  2021-04-25  origin/rust-uri  Fantix King  WIP: use the http::Uri crate

Both delete httptools/parser/url_cparser.pxd and httptools/parser/url_parser.pyx and replace them with a Rust-based implementation.


FastAPI / Uvicorn propagation path

This is not limited to applications that explicitly choose httptools.

FastAPI's current documentation recommends installing FastAPI with:

pip install "fastapi[standard]"

And FastAPI's docs explicitly state that installing fastapi[standard] also gives you uvicorn[standard].

Uvicorn's own installation docs explicitly state that:

  • uvicorn[standard] installs httptools
  • when httptools is installed, Uvicorn uses it by default for HTTP/1.1 parsing

So the documented "standard" path is:

fastapi[standard]
  -> uvicorn[standard]
  -> installs httptools
  -> Uvicorn uses httptools by default for HTTP/1.1

This does not mean every FastAPI deployment is affected in practice:

  • some users install plain fastapi
  • some switch Uvicorn to h11
  • some use a different ASGI server
  • some are protected in practice by frontend URL-length limits

But it does mean this parser is part of the default path for a large set of users following the official FastAPI + Uvicorn "standard" installation flow.


Impact

I am not claiming that every application using httptools or uvicorn is automatically exploitable.

I am claiming that the following primitive exists:

  • the application stack may receive a parsed path/query that does not match the actual request-target bytes that were sent on the wire.

That primitive can matter in contexts such as:

  • path-based auth or routing decisions
  • cache-key derivation
  • proxy/backend interpretation mismatches
  • audit/log correlation

This is exactly the kind of risk called out in nodejs/http-parser#481, which notes that the parsed and actual path may differ in security-relevant code paths.


Suggested paths forward

Any of these would be better than the current silent truncation:

A. Deprecate parse_url() and document the limitation

Point users to urllib.parse.urlsplit or another maintained parser.
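For comparison, the stdlib parser stores fields as ordinary Python strings, so it has no 16-bit field limits and long paths survive intact:

```python
from urllib.parse import urlsplit

# A 70001-character path (the failing case above) round-trips unchanged.
n = 70000
parts = urlsplit("http://h/" + "a" * n)
assert len(parts.path) == 1 + n
assert parts.scheme == "http" and parts.netloc == "h"
```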

B. Patch the vendored http-parser

Carry a local fix in the vendored copy so that oversized off/len values do not silently wrap.

This is likely a small source change, but it should be evaluated carefully for:

  • ABI impact
  • packaging/build expectations
  • use-system-http-parser compatibility

C. Reject overflow explicitly

If changing the struct layout is undesirable, another improvement would be to detect overflow and raise an error instead of returning truncated fields.
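One way to implement option C without touching the struct layout would be a length check before the C parser runs. Since every field's off/len is bounded by the total URL length, a single guard on len(url) suffices. A minimal sketch (guard_url_length is a hypothetical helper, not part of the httptools API):

```python
UINT16_MAX = 0xFFFF

def guard_url_length(url: bytes) -> bytes:
    # http_parser_url cannot represent any offset/length above UINT16_MAX,
    # so refuse oversized input instead of returning wrapped slices.
    if len(url) > UINT16_MAX:
        raise ValueError(
            f"URL length {len(url)} exceeds http_parser_url's uint16_t limit"
        )
    return url

# A short URL passes through; an oversized one raises instead of truncating.
assert guard_url_length(b"http://h/ok") == b"http://h/ok"
try:
    guard_url_length(b"http://h/" + b"a" * 70000)
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError")
```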

D. Resume a replacement implementation

The existing origin/rust-url / origin/rust-uri branches suggest that this migration path was already explored.


Environment used for reproduction

  • httptools 0.7.1 from PyPI
  • current master checked out at 28d1db15eaeaab5bc7d376d2c2035d966b6e1378 (0.8.0.dev0)
  • uvicorn 0.42.0 for the end-to-end ASGI reproduction
  • Python 3.13 on Windows

The bug is not Windows-specific; the relevant behavior follows from the struct layout and parsing code.


Thanks

httptools has been load-bearing for the Python ASGI ecosystem for a long time. This report is meant to be actionable and specific: there is a concrete reproducible bug, the historical upstream escalation path is closed, and the current repository history suggests this exact corner was already recognized as unfinished when llhttp replaced http-parser for request parsing.
