Skip to content

Continue Add scan file body endpoint #83

Open
TrevisGordan wants to merge 11 commits into
element-hq:mainfrom
TrevisGordan:continue-hs-scan-file-body
Open

Continue Add scan file body endpoint #83
TrevisGordan wants to merge 11 commits into
element-hq:mainfrom
TrevisGordan:continue-hs-scan-file-body

Conversation

@TrevisGordan
Copy link
Copy Markdown
Contributor

Continues #79 by @Half-Shot, which implements #78 — the ability to scan a file before uploading it to Matrix.

PR #79 was closed as Half-Shot shifted away from the content scanner (comment). This PR picks up the work, addresses the review feedback from @reivilibre, and restructures the implementation.

What changed from #79

Instead of adding the new scan_file functionality on top of the existing code, I refactored the scanner so that all scan paths follow the same pipeline:

decrypt (if metadata) → write to disk → _check_mimetype() → _run_scan() → cleanup

Key change: no more streaming to disk for multipart uploads. In #79, write_multipart_to_disk streamed the upload to disk, but since ~99% of files will be encrypted, the content would immediately be read back into memory for decryption and then written to disk again. The streaming benefit was entirely negated. The multipart body is now read into memory once in the servlet layer and passed as raw bytes to the scanner.

Scanner changes

  • scan_file_on_disk renamed to scan_content — now accepts raw bytes instead of a file path.
  • _do_scan — extracted shared scan logic (mimetype check, run scan, cleanup) that was previously duplicated between _scan_media and scan_file_on_disk.
  • Removed write_multipart_to_disk — no longer needed.
  • Added async file writing via aiofiles in _write_file_to_disk.

Bug fix from #79

write_multipart_to_disk was called from the servlet with a BodyPartReader, but scan_file_on_disk received the resulting file path as its file_path parameter — then passed it to _decrypt_file which expected bytes, not a path. This is fixed by the restructuring above.

Documentation

Improved docs/api.md for the scan_file endpoint:

  • Fixed request/response sections being conflated.
  • Added curl and Python usage examples.
  • Clarified required vs. optional fields.

Suggestions for follow-up or now

  • Rename scan_file endpoint to scan_upload or scan_content to avoid confusion with the internal scan_file / scan_content scanner methods.
  • Use UUID for temp files in scan_media — the media path is used as filename but the file is deleted after scanning anyway.
  • file parameter naming — confusing, but inherited from the Matrix client-server spec EncryptedFile structure.

@TrevisGordan TrevisGordan requested a review from a team as a code owner March 9, 2026 15:05
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 9, 2026

CLA assistant check
All committers have signed the CLA.

@TrevisGordan
Copy link
Copy Markdown
Contributor Author

TrevisGordan commented Mar 11, 2026

Please check out the changes and share your feedback.
Regarding the suggestion to rename the scan_file endpoint to scan_upload or scan_content to avoid confusion with the internal scan_file / scan_content scanner methods, I’d like to make that change before we merge - but I’ll wait for your input first.

@dklimpel
Copy link
Copy Markdown
Contributor

@TrevisGordan This branch has conflicts :)

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR continues the work from #79/#78 by adding a POST /_matrix/media_proxy/unstable/scan_file endpoint to scan uploaded content (optionally encrypted) prior to Matrix upload, and refactors the scanner to share a single scan pipeline across download- and upload-based scan paths.

Changes:

  • Add /scan_file multipart endpoint to scan raw uploaded bytes with optional encryption metadata.
  • Refactor scanner internals to share a unified decrypt → write → mimetype-check → scan → cleanup pipeline (scan_content, _do_scan).
  • Update API documentation for the new endpoint and add a new error code for malformed multipart requests.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/matrix_content_scanner/utils/constants.py Adds MALFORMED_MULTIPART error code for multipart parsing failures.
src/matrix_content_scanner/servlets/scan.py Implements /scan_file multipart parsing and calls the new scan_content() pathway.
src/matrix_content_scanner/servlets/__init__.py Adds helper to parse/validate encryption metadata from multipart file part.
src/matrix_content_scanner/scanner/scanner.py Introduces scan_content() and shared _do_scan() pipeline; makes temp-file writing async.
src/matrix_content_scanner/httpserver.py Registers the new POST /scan_file route.
pyproject.toml / poetry.lock Adds aiofile dependency to support async file writes.
docs/api.md Documents POST /scan_file with examples and response format.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


@web_handler
async def handle_file(self, request: web.Request) -> Tuple[int, JsonDict]:
"""Handles GET requests to ../scan_file"""
Comment on lines +93 to +102
try:
file_json = await field.json()
if file_json is None:
raise Exception("'file' field is empty")
except json.decoder.JSONDecodeError as e:
raise ContentScannerRestError(400, ErrCode.MALFORMED_JSON, str(e))

metadata = await get_media_metadata_from_filebody(
file_json, self._crypto_handler
)
Comment on lines +190 to +193
Args:
request: The request to extract the data from.
crypto_handler: The crypto handler to use if we need to decrypt an Olm-encrypted
body.
Comment on lines +199 to +206
metadata = _metadata_from_body(file_body, crypto_handler)

validate_encrypted_file_metadata(metadata)

# Unlike get_media_metadata_from_request, we intentionally skip extracting
# the file URL from the metadata because the caller already has the media content.

return metadata
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The jsonschema for an encrypted file metadata requires url to exist:

.

Technically it can be an empty string. But indeed, it's awkward that the field must exist. The requirement comes from the spec.

What are clients expected to do in this case?

Comment thread docs/api.md
Comment on lines +140 to +143
| Part name | Required | Type | Description |
|-----------|----------|-a-----|-------------|
| `body` | **Yes** | Binary (file content) | The raw file to scan. |
| `file` | No | JSON string | Decryption metadata for an encrypted file. Follows the [`EncryptedFile`](https://spec.matrix.org/v1.2/client-server-api/#extensions-to-mroommessage-msgtypes) structure from the Matrix specification. Only needed when the file in `body` is encrypted. |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't actually break table rendering. But yes, might as well fix it.

Comment on lines +70 to +75
@web_handler
async def handle_file(self, request: web.Request) -> Tuple[int, JsonDict]:
"""Handles GET requests to ../scan_file"""
try:
reader = await request.multipart()
except Exception:
Comment on lines +325 to +347
async def scan_content(
self, content: bytes, metadata: Optional[JsonDict] = None
) -> None:
"""Scan raw file bytes. The content is written to disk once (decrypted if
needed), scanned, and cleaned up.

This does not use the result cache or concurrent-request deduplication.

Args:
content: The raw file bytes (possibly still encrypted).
metadata: The metadata attached to the file (e.g. decryption key), or None
if the file isn't encrypted.

Raises:
FileDirtyError if the result of the scan said that the file is dirty.
"""
exit_code = await self._do_scan(content, metadata)
result = exit_code == 0

cacheable = exit_code not in self._exit_codes_to_ignore

if result is False:
raise FileDirtyError(cacheable=cacheable)
Copy link
Copy Markdown
Member

@anoadragon453 anoadragon453 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on, and apologies for the slow review. Our backlog is fairly large and this repository tends to get put on the backburner sometimes.

Lots of comments below, but nothing too fundamental.

Comment on lines +95 to +96
if file_json is None:
raise Exception("'file' field is empty")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This bit can be removed from the try / except.

[
web.get("/scan" + _MEDIA_PATH_REGEXP, scan_handler.handle_plain),
web.post("/scan_encrypted", scan_handler.handle_encrypted),
web.post("/scan_file", scan_handler.handle_file),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API design here is getting pretty confusing. From the naming, nothing indicates to me that /scan_encrypted will download the file from a media repo, whereas /scan_file will allow you to upload the file yourself (and that file may be encrypted or not).

What do you think about POST /scan/upload? It can still be differentiated from /scan/{serverName}/{mediaId} in configuration, and is clearer/more REST-like.

raise ContentScannerRestError(
400,
ErrCode.MALFORMED_MULTIPART,
"Request body was not a multipart body.",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Request body was not a multipart body.",
"Unable to parse request body as multipart.",

Could you also add a logger.exception(e) here so the actual error doesn't get lost?

# Delete the file if the write fails.
try:
os.unlink(full_path)
except OSError:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be useful to log this failure.

self._check_mimetype(file_path)
# Scan the file and see if the result is positive or negative.
exit_code = await self._run_scan(file_path)
# Log the result of the scan.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Log the result of the scan.

# Log the result of the scan.
logger.info("Scan has finished")
finally:
# This could be own function.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment seems temporary?

except Exception:
# Delete the file if the write fails.
try:
os.unlink(full_path)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be using self.removal_command here?

If so, it probably makes sense to pull running it out to a helper function.

Comment thread docs/api.md
Comment on lines +140 to +143
| Part name | Required | Type | Description |
|-----------|----------|-a-----|-------------|
| `body` | **Yes** | Binary (file content) | The raw file to scan. |
| `file` | No | JSON string | Decryption metadata for an encrypted file. Follows the [`EncryptedFile`](https://spec.matrix.org/v1.2/client-server-api/#extensions-to-mroommessage-msgtypes) structure from the Matrix specification. Only needed when the file in `body` is encrypted. |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't actually break table rendering. But yes, might as well fix it.

Comment on lines +199 to +206
metadata = _metadata_from_body(file_body, crypto_handler)

validate_encrypted_file_metadata(metadata)

# Unlike get_media_metadata_from_request, we intentionally skip extracting
# the file URL from the metadata because the caller already has the media content.

return metadata
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The jsonschema for an encrypted file metadata requires url to exist:

.

Technically it can be an empty string. But indeed, it's awkward that the field must exist. The requirement comes from the spec.

What are clients expected to do in this case?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants