Continue Add scan file body endpoint #83
Conversation
|
Please check out the changes and share your feedback. |
|
@TrevisGordan This branch has conflicts :) |
There was a problem hiding this comment.
Pull request overview
This PR continues the work from #79/#78 by adding a POST /_matrix/media_proxy/unstable/scan_file endpoint to scan uploaded content (optionally encrypted) prior to Matrix upload, and refactors the scanner to share a single scan pipeline across download- and upload-based scan paths.
Changes:
- Add
/scan_filemultipart endpoint to scan raw uploaded bytes with optional encryption metadata. - Refactor scanner internals to share a unified decrypt → write → mimetype-check → scan → cleanup pipeline (
scan_content,_do_scan). - Update API documentation for the new endpoint and add a new error code for malformed multipart requests.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
src/matrix_content_scanner/utils/constants.py |
Adds MALFORMED_MULTIPART error code for multipart parsing failures. |
src/matrix_content_scanner/servlets/scan.py |
Implements /scan_file multipart parsing and calls the new scan_content() pathway. |
src/matrix_content_scanner/servlets/__init__.py |
Adds helper to parse/validate encryption metadata from multipart file part. |
src/matrix_content_scanner/scanner/scanner.py |
Introduces scan_content() and shared _do_scan() pipeline; makes temp-file writing async. |
src/matrix_content_scanner/httpserver.py |
Registers the new POST /scan_file route. |
pyproject.toml / poetry.lock |
Adds aiofile dependency to support async file writes. |
docs/api.md |
Documents POST /scan_file with examples and response format. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| @web_handler | ||
| async def handle_file(self, request: web.Request) -> Tuple[int, JsonDict]: | ||
| """Handles GET requests to ../scan_file""" |
| try: | ||
| file_json = await field.json() | ||
| if file_json is None: | ||
| raise Exception("'file' field is empty") | ||
| except json.decoder.JSONDecodeError as e: | ||
| raise ContentScannerRestError(400, ErrCode.MALFORMED_JSON, str(e)) | ||
|
|
||
| metadata = await get_media_metadata_from_filebody( | ||
| file_json, self._crypto_handler | ||
| ) |
| Args: | ||
| request: The request to extract the data from. | ||
| crypto_handler: The crypto handler to use if we need to decrypt an Olm-encrypted | ||
| body. |
| metadata = _metadata_from_body(file_body, crypto_handler) | ||
|
|
||
| validate_encrypted_file_metadata(metadata) | ||
|
|
||
| # Unlike get_media_metadata_from_request, we intentionally skip extracting | ||
| # the file URL from the metadata because the caller already has the media content. | ||
|
|
||
| return metadata |
There was a problem hiding this comment.
The jsonschema for an encrypted file metadata requires url to exist:
Technically it can be an empty string. But indeed, it's awkward that the field must exist. The requirement comes from the spec.
What are clients expected to do in this case?
| | Part name | Required | Type | Description | | ||
| |-----------|----------|-a-----|-------------| | ||
| | `body` | **Yes** | Binary (file content) | The raw file to scan. | | ||
| | `file` | No | JSON string | Decryption metadata for an encrypted file. Follows the [`EncryptedFile`](https://spec.matrix.org/v1.2/client-server-api/#extensions-to-mroommessage-msgtypes) structure from the Matrix specification. Only needed when the file in `body` is encrypted. | |
There was a problem hiding this comment.
It doesn't actually break table rendering. But yes, might as well fix it.
| @web_handler | ||
| async def handle_file(self, request: web.Request) -> Tuple[int, JsonDict]: | ||
| """Handles GET requests to ../scan_file""" | ||
| try: | ||
| reader = await request.multipart() | ||
| except Exception: |
| async def scan_content( | ||
| self, content: bytes, metadata: Optional[JsonDict] = None | ||
| ) -> None: | ||
| """Scan raw file bytes. The content is written to disk once (decrypted if | ||
| needed), scanned, and cleaned up. | ||
|
|
||
| This does not use the result cache or concurrent-request deduplication. | ||
|
|
||
| Args: | ||
| content: The raw file bytes (possibly still encrypted). | ||
| metadata: The metadata attached to the file (e.g. decryption key), or None | ||
| if the file isn't encrypted. | ||
|
|
||
| Raises: | ||
| FileDirtyError if the result of the scan said that the file is dirty. | ||
| """ | ||
| exit_code = await self._do_scan(content, metadata) | ||
| result = exit_code == 0 | ||
|
|
||
| cacheable = exit_code not in self._exit_codes_to_ignore | ||
|
|
||
| if result is False: | ||
| raise FileDirtyError(cacheable=cacheable) |
anoadragon453
left a comment
There was a problem hiding this comment.
Thanks for taking this on, and apologies for the slow review. Our backlog is fairly large and this repository tends to get put on the backburner sometimes.
Lots of comments below, but nothing too fundamental.
| if file_json is None: | ||
| raise Exception("'file' field is empty") |
There was a problem hiding this comment.
nit: This bit can be removed from the try / except.
| [ | ||
| web.get("/scan" + _MEDIA_PATH_REGEXP, scan_handler.handle_plain), | ||
| web.post("/scan_encrypted", scan_handler.handle_encrypted), | ||
| web.post("/scan_file", scan_handler.handle_file), |
There was a problem hiding this comment.
The API design here is getting pretty confusing. From the naming, nothing indicates to me that /scan_encrypted will download the file from a media repo, whereas /scan_file will allow you to upload the file yourself (and that file may be encrypted or not).
What do you think about POST /scan/upload? It can still be differentiated from /scan/{serverName}/{mediaId} in configuration, and is clearer/more REST-like.
| raise ContentScannerRestError( | ||
| 400, | ||
| ErrCode.MALFORMED_MULTIPART, | ||
| "Request body was not a multipart body.", |
There was a problem hiding this comment.
| "Request body was not a multipart body.", | |
| "Unable to parse request body as multipart.", |
Could you also add a logger.exception(e) here so the actual error doesn't get lost?
| # Delete the file if the write fails. | ||
| try: | ||
| os.unlink(full_path) | ||
| except OSError: |
There was a problem hiding this comment.
Would be useful to log this failure.
| self._check_mimetype(file_path) | ||
| # Scan the file and see if the result is positive or negative. | ||
| exit_code = await self._run_scan(file_path) | ||
| # Log the result of the scan. |
There was a problem hiding this comment.
| # Log the result of the scan. |
| # Log the result of the scan. | ||
| logger.info("Scan has finished") | ||
| finally: | ||
| # This could be own function. |
There was a problem hiding this comment.
This comment seems temporary?
| except Exception: | ||
| # Delete the file if the write fails. | ||
| try: | ||
| os.unlink(full_path) |
There was a problem hiding this comment.
Should we be using self.removal_command here?
If so, it probably makes sense to pull running it out to a helper function.
| | Part name | Required | Type | Description | | ||
| |-----------|----------|-a-----|-------------| | ||
| | `body` | **Yes** | Binary (file content) | The raw file to scan. | | ||
| | `file` | No | JSON string | Decryption metadata for an encrypted file. Follows the [`EncryptedFile`](https://spec.matrix.org/v1.2/client-server-api/#extensions-to-mroommessage-msgtypes) structure from the Matrix specification. Only needed when the file in `body` is encrypted. | |
There was a problem hiding this comment.
It doesn't actually break table rendering. But yes, might as well fix it.
| metadata = _metadata_from_body(file_body, crypto_handler) | ||
|
|
||
| validate_encrypted_file_metadata(metadata) | ||
|
|
||
| # Unlike get_media_metadata_from_request, we intentionally skip extracting | ||
| # the file URL from the metadata because the caller already has the media content. | ||
|
|
||
| return metadata |
There was a problem hiding this comment.
The jsonschema for an encrypted file metadata requires url to exist:
Technically it can be an empty string. But indeed, it's awkward that the field must exist. The requirement comes from the spec.
What are clients expected to do in this case?
Continues #79 by @Half-Shot, which implements #78 — the ability to scan a file before uploading it to Matrix.
PR #79 was closed as Half-Shot shifted away from the content scanner (comment). This PR picks up the work, addresses the review feedback from @reivilibre, and restructures the implementation.
What changed from #79
Instead of adding the new
scan_filefunctionality on top of the existing code, I refactored the scanner so that all scan paths follow the same pipeline:Key change: no more streaming to disk for multipart uploads. In #79,
write_multipart_to_diskstreamed the upload to disk, but since ~99% of files will be encrypted, the content would immediately be read back into memory for decryption and then written to disk again. The streaming benefit was entirely negated. The multipart body is now read into memory once in the servlet layer and passed as raw bytes to the scanner.Scanner changes
scan_file_on_diskrenamed toscan_content— now accepts raw bytes instead of a file path._do_scan— extracted shared scan logic (mimetype check, run scan, cleanup) that was previously duplicated between_scan_mediaandscan_file_on_disk.write_multipart_to_disk— no longer needed.aiofilesin_write_file_to_disk.Bug fix from #79
write_multipart_to_diskwas called from the servlet with aBodyPartReader, butscan_file_on_diskreceived the resulting file path as itsfile_pathparameter — then passed it to_decrypt_filewhich expected bytes, not a path. This is fixed by the restructuring above.Documentation
Improved
docs/api.mdfor thescan_fileendpoint:curland Python usage examples.Suggestions for follow-up or now
scan_fileendpoint toscan_uploadorscan_contentto avoid confusion with the internalscan_file/scan_contentscanner methods.scan_media— the media path is used as filename but the file is deleted after scanning anyway.fileparameter naming — confusing, but inherited from the Matrix client-server specEncryptedFilestructure.