feat: Content Asset pipeline, release isolation, and API regression harness #918
Open
lsh-915 wants to merge 28 commits into
Author
Reviewer note: This PR is large mainly because it adds the FastAPI task/login/ws wrapper modules and the
The fix was verified with a real Docker API run:
Co-authored-by: Cursor <cursoragent@cursor.com>
Author
Update for reviewers:
Author
Security update added: P0 API security hardening has been pushed in commit bdb2c80. Summary:
Validation:
Remaining scope boundaries:
Author
Hi maintainers, just checking in on this PR. PR #918 is ready for review and is currently mergeable. The scope is frozen; I will only respond to review feedback from here. Could you please help review it when convenient, and approve/run the GitHub Actions workflow if fork PR checks require maintainer approval? For context:
Thank you!
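1. Background

This PR delivers the T017-5 Content Asset mainline on the `fix/douyin-api-output-utf8` branch. Through the post-T018 release isolation pass, it splits the runtime, WebUI, CDP, ASR scaffolding, tests, docs, and CI toolchain into independent commits that can be reviewed and pushed separately. Final-Fix-1 then fills in the API behavior patches, runtime documentation, and task records that had been left out.

Core goal: establish a standardized data pipeline from search → comments/titles/scripts → `content_asset`, and provide stable API/Web preview, export, and deployment contracts.

2. What this PR delivers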
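T017-5 Content Asset

- `search_result`, `comments_raw` / `comments_clean`, `search_title_clean`, `script_sources` / `script_raw` / `script_clean`
- `content_asset` builder and primary-result selection semantics (`content_asset.csv` > `content_asset.jsonl` > legacy merged CSV)
- `result` / `preview` / `export` contract and the task-id merge workflow
- `CONTENT_ASSET.md`, acceptance notes, and the T017-5 task record
- The HEAD/API import problem caused by the missing `api/utils.py`

Post-T018 release isolation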
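- docker-compose, port alignment, `deploy.sh`, systemd service template
- `crawl_comments_v2.py` and login/browser/CDP connection stabilization
- `douyin_scraper/tests` regression framework + `api/tests.py` route tests
- `pyproject` toolchain hunk, local runtime directories in `.gitignore`

Final-Fix-1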
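- The `get_result` directory response now includes `csv_files` / `jsonl_files`
- `README.md` / `api/README.md`: run and deployment instructions completed
- `--no-sandbox` and related flags, applied only when `CHROME_PATH` is explicitly configured
- `docs/tasks/T017-1-CDP-Fix.md` process record archived

3. Key features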
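- `POST /scrape/merge` accepts `search_task_id` plus optional `comments_task_id` / `scripts_task_id`
- `result`, `preview`, and GET `export` share the same selector
- `export` keeps the legacy seven-field CSV/TXT; GET `export` retains the full schema and the BOM
- `crawl_comments_v2.py` is wired into the production pipeline; the old `crawl_comments.py` has been removed
- `files`, `csv_files`, `jsonl_files`

4. Ports and runtime conventions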
- `DY_API_PORT`
- `DY_API_PUBLIC_PORT` / `DY_API_BASE_URL`
- `WEB_DEV_PORT`
- `DY_CHROME_PORT` / `DY_CHROME_CDP_URL`

Do not use the following default values:

Local development example:
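A minimal sketch, not taken from the PR itself, of how a client script might resolve the API and CDP endpoints from the variables listed in this section. It assumes that a URL variable (`DY_API_BASE_URL`, `DY_CHROME_CDP_URL`) takes precedence over its port counterpart and that no implicit default is ever applied; both rules are assumptions made for illustration, and `resolve_url` is a hypothetical helper. The ports 18080 and 19222 are the values mentioned in the test plan.

```python
import os
from typing import Mapping


def resolve_url(env: Mapping[str, str], url_var: str, port_var: str,
                host: str = "127.0.0.1") -> str:
    """Return an endpoint URL from explicit configuration only.

    Assumed rule (not confirmed by the PR): the URL variable wins over the
    port variable, and a missing configuration is an error rather than a
    silent fallback to some default port.
    """
    url = env.get(url_var, "").strip()
    if url:
        return url.rstrip("/")

    port = env.get(port_var, "").strip()
    if not port:
        raise KeyError(f"set {url_var} or {port_var} explicitly")
    if not port.isdigit() or not 0 < int(port) < 65536:
        raise ValueError(f"{port_var} must be a TCP port, got {port!r}")
    return f"http://{host}:{int(port)}"


if __name__ == "__main__":
    # Example values only; real deployments read them from os.environ.
    example_env = {"DY_API_PUBLIC_PORT": "18080", "DY_CHROME_PORT": "19222"}
    print(resolve_url(example_env, "DY_API_BASE_URL", "DY_API_PUBLIC_PORT"))
    print(resolve_url(example_env, "DY_CHROME_CDP_URL", "DY_CHROME_PORT"))
    # To use the process environment instead: resolve_url(os.environ, ...)
```

Failing loudly on a missing variable is a deliberate choice in this sketch: it keeps a script from quietly connecting to a port that the conventions above rule out.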
5. Testing and verification

- `douyin_scraper/tests` + `api/tests.py`
- `docker compose config`
- `py_compile` (api / scraper / platform)
- `api/utils.py` import fix
- `requirements.txt` install + pytest

6. Out of scope
This PR does not include:

- A major refactor of the `pyproject` project identity or main dependencies (`mediacrawler` → `douyin_scraper`, etc.)

7. Risks and caveats
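- Batch transcription of real content into `script_raw` / `script_clean` still needs acceptance in a follow-up task.
- `--no-sandbox` is enabled only when `CHROME_PATH` / `PLAYWRIGHT_CHROMIUM_EXECUTABLE_PATH` is set; it does not affect the default local Playwright behavior.
- `csv_files` / `jsonl_files` return absolute path strings, so consumers need to treat these paths as sensitive.
- Validated with `docker compose config`, but not covered by a CI docker-build; acceptance of the image build is not claimed as complete.

8. Suggested follow-up tasks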
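- Boundary tests for the `pending_asr` state
- Unify the `pyproject` toolchain and the project identity contract (only if a refactor is actually planned)

9. Commit range summary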
T017-5 Content Asset (mainline)
T018 isolation pass (8 commits)
Final-Fix-1 (5 commits)
Test plan
- `result` / `preview` / `export`
- 18080, CDP 19222
- `douyin_scraper/tests` + `api/tests.py`
- `/ui/` with embedded build
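The primary-result precedence stated in section 2 (`content_asset.csv` > `content_asset.jsonl` > legacy merged CSV) can be illustrated with a short sketch. This is not the PR's selector: the function name `select_primary_result` and the legacy file name `merged.csv` are hypothetical, and the real selector may look at more than file existence.

```python
from pathlib import Path
from typing import Optional

# Precedence taken from the PR description; the first existing file wins.
PRIMARY_CANDIDATES = ("content_asset.csv", "content_asset.jsonl")


def select_primary_result(task_dir: Path,
                          legacy_csv_name: str = "merged.csv") -> Optional[Path]:
    """Pick the main result file of a task directory.

    `legacy_csv_name` is a placeholder for the legacy merged CSV, whose
    actual name is not given in the PR description.
    """
    for name in (*PRIMARY_CANDIDATES, legacy_csv_name):
        candidate = task_dir / name
        if candidate.is_file():
            return candidate
    return None  # nothing to preview or export yet
```

Routing `result`, `preview`, and GET `export` through one function of this shape is one way to keep all three views pointed at the same file, which is the property the shared selector is described as providing.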