Continue Add scan file body endpoint by TrevisGordan · Pull Request #83 · element-hq/matrix-content-scanner-python

TrevisGordan · 2026-03-09T15:05:02Z

Continues #79 by @Half-Shot, which implements #78 — the ability to scan a file before uploading it to Matrix.

PR #79 was closed as Half-Shot shifted away from the content scanner (comment). This PR picks up the work, addresses the review feedback from @reivilibre, and restructures the implementation.

What changed from #79

Instead of adding the new scan_file functionality on top of the existing code, I refactored the scanner so that all scan paths follow the same pipeline:

decrypt (if metadata) → write to disk → _check_mimetype() → _run_scan() → cleanup

Key change: no more streaming to disk for multipart uploads. In #79, write_multipart_to_disk streamed the upload to disk, but since ~99% of files will be encrypted, the content would immediately be read back into memory for decryption and then written to disk again. The streaming benefit was entirely negated. The multipart body is now read into memory once in the servlet layer and passed as raw bytes to the scanner.

Scanner changes

scan_file_on_disk renamed to scan_content — now accepts raw bytes instead of a file path.
_do_scan — extracted shared scan logic (mimetype check, run scan, cleanup) that was previously duplicated between _scan_media and scan_file_on_disk.
Removed write_multipart_to_disk — no longer needed.
Added async file writing via aiofiles in _write_file_to_disk.

Bug fix from #79

write_multipart_to_disk was called from the servlet with a BodyPartReader, but scan_file_on_disk received the resulting file path as its file_path parameter — then passed it to _decrypt_file which expected bytes, not a path. This is fixed by the restructuring above.

Documentation

Improved docs/api.md for the scan_file endpoint:

Fixed request/response sections being conflated.
Added curl and Python usage examples.
Clarified required vs. optional fields.

Suggestions for follow-up or now

Rename scan_file endpoint to scan_upload or scan_content to avoid confusion with the internal scan_file / scan_content scanner methods.
Use UUID for temp files in scan_media — the media path is used as filename but the file is deleted after scanning anyway.
file parameter naming — confusing, but inherited from the Matrix client-server spec EncryptedFile structure.

CLAassistant · 2026-03-09T15:07:56Z

All committers have signed the CLA.

TrevisGordan · 2026-03-11T12:20:55Z

Please check out the changes and share your feedback.
Regarding the suggestion to rename the scan_file endpoint to scan_upload or scan_content to avoid confusion with the internal scan_file / scan_content scanner methods, I’d like to make that change before we merge - but I’ll wait for your input first.

dklimpel · 2026-04-14T07:13:45Z

@TrevisGordan This branch has conflicts :)

…le-body

Copilot

Pull request overview

This PR continues the work from #79/#78 by adding a POST /_matrix/media_proxy/unstable/scan_file endpoint to scan uploaded content (optionally encrypted) prior to Matrix upload, and refactors the scanner to share a single scan pipeline across download- and upload-based scan paths.

Changes:

Add /scan_file multipart endpoint to scan raw uploaded bytes with optional encryption metadata.
Refactor scanner internals to share a unified decrypt → write → mimetype-check → scan → cleanup pipeline (scan_content, _do_scan).
Update API documentation for the new endpoint and add a new error code for malformed multipart requests.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
`src/matrix_content_scanner/utils/constants.py`	Adds `MALFORMED_MULTIPART` error code for multipart parsing failures.
`src/matrix_content_scanner/servlets/scan.py`	Implements `/scan_file` multipart parsing and calls the new `scan_content()` pathway.
`src/matrix_content_scanner/servlets/__init__.py`	Adds helper to parse/validate encryption metadata from multipart `file` part.
`src/matrix_content_scanner/scanner/scanner.py`	Introduces `scan_content()` and shared `_do_scan()` pipeline; makes temp-file writing async.
`src/matrix_content_scanner/httpserver.py`	Registers the new `POST /scan_file` route.
`pyproject.toml` / `poetry.lock`	Adds `aiofile` dependency to support async file writes.
`docs/api.md`	Documents `POST /scan_file` with examples and response format.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+
+    @web_handler
+    async def handle_file(self, request: web.Request) -> Tuple[int, JsonDict]:
+        """Handles GET requests to ../scan_file"""


+                try:
+                    file_json = await field.json()
+                    if file_json is None:
+                        raise Exception("'file' field is empty")
+                except json.decoder.JSONDecodeError as e:
+                    raise ContentScannerRestError(400, ErrCode.MALFORMED_JSON, str(e))
+
+                metadata = await get_media_metadata_from_filebody(
+                    file_json, self._crypto_handler
+                )


+    Args:
+        request: The request to extract the data from.
+        crypto_handler: The crypto handler to use if we need to decrypt an Olm-encrypted
+            body.


anoadragon453 · 2026-05-28T09:29:10Z

+    metadata = _metadata_from_body(file_body, crypto_handler)
+
+    validate_encrypted_file_metadata(metadata)
+
+    # Unlike get_media_metadata_from_request, we intentionally skip extracting
+    # the file URL from the metadata because the caller already has the media content.
+
+    return metadata


The jsonschema for an encrypted file metadata requires url to exist:

matrix-content-scanner-python/src/matrix_content_scanner/utils/encrypted_file_metadata.py

Line 19 in b09cade

"required": ["v", "iv", "url", "hashes", "key"],

.

Technically it can be an empty string. But indeed, it's awkward that the field must exist. The requirement comes from the spec.

What are clients expected to do in this case?

anoadragon453 · 2026-05-28T09:24:45Z

+| Part name | Required | Type | Description |
+|-----------|----------|-a-----|-------------|
+| `body`    | **Yes**  | Binary (file content) | The raw file to scan. |
+| `file`    | No       | JSON string | Decryption metadata for an encrypted file. Follows the [`EncryptedFile`](https://spec.matrix.org/v1.2/client-server-api/#extensions-to-mroommessage-msgtypes) structure from the Matrix specification. Only needed when the file in `body` is encrypted. |


It doesn't actually break table rendering. But yes, might as well fix it.

+    @web_handler
+    async def handle_file(self, request: web.Request) -> Tuple[int, JsonDict]:
+        """Handles GET requests to ../scan_file"""
+        try:
+            reader = await request.multipart()
+        except Exception:


+    async def scan_content(
+        self, content: bytes, metadata: Optional[JsonDict] = None
+    ) -> None:
+        """Scan raw file bytes. The content is written to disk once (decrypted if
+        needed), scanned, and cleaned up.
+
+        This does not use the result cache or concurrent-request deduplication.
+
+        Args:
+            content: The raw file bytes (possibly still encrypted).
+            metadata: The metadata attached to the file (e.g. decryption key), or None
+                if the file isn't encrypted.
+
+        Raises:
+            FileDirtyError if the result of the scan said that the file is dirty.
+        """
+        exit_code = await self._do_scan(content, metadata)
+        result = exit_code == 0
+
+        cacheable = exit_code not in self._exit_codes_to_ignore
+
+        if result is False:
+            raise FileDirtyError(cacheable=cacheable)


anoadragon453

Thanks for taking this on, and apologies for the slow review. Our backlog is fairly large and this repository tends to get put on the backburner sometimes.

Lots of comments below, but nothing too fundamental.

anoadragon453 · 2026-05-28T09:01:49Z

+                    if file_json is None:
+                        raise Exception("'file' field is empty")


nit: This bit can be removed from the try / except.

anoadragon453 · 2026-05-28T09:09:50Z

            [
                web.get("/scan" + _MEDIA_PATH_REGEXP, scan_handler.handle_plain),
                web.post("/scan_encrypted", scan_handler.handle_encrypted),
+                web.post("/scan_file", scan_handler.handle_file),


The API design here is getting pretty confusing. From the naming, nothing indicates to me that /scan_encrypted will download the file from a media repo, whereas /scan_file will allow you to upload the file yourself (and that file may be encrypted or not).

What do you think about POST /scan/upload? It can still be differentiated from /scan/{serverName}/{mediaId} in configuration, and is clearer/more REST-like.

anoadragon453 · 2026-05-28T09:15:43Z

+            raise ContentScannerRestError(
+                400,
+                ErrCode.MALFORMED_MULTIPART,
+                "Request body was not a multipart body.",


Suggested change

"Request body was not a multipart body.",

"Unable to parse request body as multipart.",

Could you also add a logger.exception(e) here so the actual error doesn't get lost?

anoadragon453 · 2026-05-28T09:18:09Z

+            # Delete the file if the write fails.
+            try:
+                os.unlink(full_path)
+            except OSError:


Would be useful to log this failure.

anoadragon453 · 2026-05-28T09:20:51Z

+            self._check_mimetype(file_path)
+            # Scan the file and see if the result is positive or negative.
+            exit_code = await self._run_scan(file_path)
+            # Log the result of the scan.


Suggested change

# Log the result of the scan.

anoadragon453 · 2026-05-28T09:21:25Z

+            # Log the result of the scan.
+            logger.info("Scan has finished")
+        finally:
+            # This could be own function.


This comment seems temporary?

anoadragon453 · 2026-05-28T09:22:43Z

+        except Exception:
+            # Delete the file if the write fails.
+            try:
+                os.unlink(full_path)


Should we be using self.removal_command here?

If so, it probably makes sense to pull running it out to a helper function.

anoadragon453 · 2026-05-28T09:24:45Z

+| Part name | Required | Type | Description |
+|-----------|----------|-a-----|-------------|
+| `body`    | **Yes**  | Binary (file content) | The raw file to scan. |
+| `file`    | No       | JSON string | Decryption metadata for an encrypted file. Follows the [`EncryptedFile`](https://spec.matrix.org/v1.2/client-server-api/#extensions-to-mroommessage-msgtypes) structure from the Matrix specification. Only needed when the file in `body` is encrypted. |


It doesn't actually break table rendering. But yes, might as well fix it.

anoadragon453 · 2026-05-28T09:29:10Z

+    metadata = _metadata_from_body(file_body, crypto_handler)
+
+    validate_encrypted_file_metadata(metadata)
+
+    # Unlike get_media_metadata_from_request, we intentionally skip extracting
+    # the file URL from the metadata because the caller already has the media content.
+
+    return metadata


The jsonschema for an encrypted file metadata requires url to exist:

matrix-content-scanner-python/src/matrix_content_scanner/utils/encrypted_file_metadata.py

Line 19 in b09cade

"required": ["v", "iv", "url", "hashes", "key"],

.

Technically it can be an empty string. But indeed, it's awkward that the field must exist. The requirement comes from the spec.

What are clients expected to do in this case?

Half-Shot and others added 5 commits July 1, 2025 11:58

Add ability to scan file bodies.

bf46c2f

Add documentation

e2cb3e5

Adds restructured scan_content and removes write_multipart_to_disk

0567a1f

Adds Async file writing via aiofile removes write_multipart_to_disk

fb95f07

Updates API.docs

4b7209f

TrevisGordan requested a review from a team as a code owner March 9, 2026 15:05

TrevisGordan added 2 commits March 9, 2026 16:15

Merge branch 'main' into continue-hs-scan-file-body

3948614

downgraded lockfile

62def34

TrevisGordan added 4 commits April 28, 2026 15:27

Merge remote-tracking branch 'upstream/main' into continue-hs-scan-fi…

bb951e4

…le-body

lock update

050716d

Merge branch 'main' into continue-hs-scan-file-body

e50e321

update lock file

69e69af

anoadragon453 requested a review from Copilot May 27, 2026 16:22

Copilot started reviewing on behalf of anoadragon453 May 27, 2026 16:22 View session

Copilot AI reviewed May 27, 2026

View reviewed changes

anoadragon453 requested changes May 28, 2026

View reviewed changes

		if file_json is None:
		raise Exception("'file' field is empty")

	"Request body was not a multipart body.",
	"Unable to parse request body as multipart.",

Conversation

TrevisGordan commented Mar 9, 2026

What changed from #79

Scanner changes

Bug fix from #79

Documentation

Suggestions for follow-up or now

Uh oh!

CLAassistant commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TrevisGordan commented Mar 11, 2026 • edited by MadLittleMods Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dklimpel commented Apr 14, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

anoadragon453 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

CLAassistant commented Mar 9, 2026 •

edited

Loading

TrevisGordan commented Mar 11, 2026 •

edited by MadLittleMods

Loading