
feat: Content Asset pipeline, release isolation, and API regression harness#918

Open
lsh-915 wants to merge 28 commits into NanmiCoder:main from lsh-915:fix/douyin-api-output-utf8

lsh-915 commented Jun 16, 2026

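1. Background

This PR, on the fix/douyin-api-output-utf8 branch, delivers the T017-5 Content Asset mainline. The post-T018 release isolation pass then splits the runtime, WebUI, CDP, ASR scaffolding, tests, docs, and CI toolchain into independent commits that can be reviewed and pushed separately. Final-Fix-1 afterwards fills in the missed API behavior patches, runtime docs, and task archives.

Core goal: establish a standardized data pipeline from search → comments / titles / script text → content_asset, and provide a stable contract for API/Web preview, export, and deployment.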

2. What this PR delivers

T017-5 Content Asset

  • Standardized task workspace outputs: search_result, comments_raw / comments_clean, search_title_clean, script_sources / script_raw / script_clean
  • content_asset builder and primary-result selection semantics (content_asset.csv > content_asset.jsonl > legacy merged CSV)
  • API: result / preview / export contract and the task-id merge workflow
  • Web: Content Asset preview and download UX
  • Docs: CONTENT_ASSET.md, acceptance notes, and the T017-5 task archive
  • Fix: HEAD/API import failure caused by the missing api/utils.py
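
The primary-result priority listed above (content_asset.csv > content_asset.jsonl > legacy merged CSV) can be sketched roughly as follows. This is a minimal illustration, not the PR's actual selector: the function name `select_primary_result` is hypothetical, and treating any other `*.csv` in the directory as the "legacy merged CSV" is an assumption, since the description does not name that file.

```python
from pathlib import Path
from typing import Optional

# Priority order taken from the PR description. The legacy merged CSV has no
# fixed name there, so any other *.csv is treated as the legacy fallback
# (an assumption of this sketch).
PREFERRED_NAMES = ("content_asset.csv", "content_asset.jsonl")


def select_primary_result(outputs_dir: Path) -> Optional[Path]:
    """Return the primary result file for a task, or None if nothing matches."""
    for name in PREFERRED_NAMES:
        candidate = outputs_dir / name
        if candidate.is_file():
            return candidate
    # Fallback: first legacy CSV in name order, so the choice is deterministic.
    legacy = sorted(p for p in outputs_dir.glob("*.csv") if p.is_file())
    return legacy[0] if legacy else None
```

Keeping the priority in one function is what would let result, preview, and GET export agree on the same file, which is the property the description calls out.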

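Post-T018 release isolation

  • Runtime / deployment: Dockerfile, docker-compose, port alignment, deploy.sh, systemd service template
  • api/webui: embedded frontend production build artifacts
  • CDP: crawl_comments_v2.py plus login/browser/CDP connection stabilization
  • ASR: Whisper dependency scaffolding (no real batch transcription)
  • Tests: douyin_scraper/tests regression framework + api/tests.py route tests
  • Docs: architecture diagram, sequence diagram, v7 lessons-learned document
  • CI: test-only workflow, Makefile, minimal pyproject toolchain hunk, .gitignore entries for local runtime directories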

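Final-Fix-1

  • get_result directory responses now include csv_files / jsonl_files
  • README.md / api/README.md: run and deployment instructions completed
  • Chromium launch arguments for Docker containers (--no-sandbox and others, effective only when CHROME_PATH is explicitly configured)
  • docs/tasks/T017-1-CDP-Fix.md: process notes archived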
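
The conditional Chromium launch arguments described above might look something like the sketch below. The helper name `container_chromium_args` is hypothetical. The environment variable names are the ones mentioned in this PR description (CHROME_PATH here, plus PLAYWRIGHT_CHROMIUM_EXECUTABLE_PATH in the risk notes). Only `--no-sandbox` is shown because the description does not enumerate the other flags.

```python
import os
from typing import List, Mapping, Optional

# Variable names come from the PR description; the helper itself is hypothetical.
EXECUTABLE_ENV_VARS = ("CHROME_PATH", "PLAYWRIGHT_CHROMIUM_EXECUTABLE_PATH")


def container_chromium_args(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return extra launch args only when a Chromium path is explicitly set."""
    env = os.environ if env is None else env
    if any(env.get(name) for name in EXECUTABLE_ENV_VARS):
        # An explicit executable path is treated as the "running in a
        # container" signal; the sandbox is disabled only in that case.
        return ["--no-sandbox"]
    # Default local Playwright behavior stays untouched.
    return []
```

Gating on an explicit path rather than on "am I in Docker" detection keeps the default local behavior unchanged, which matches the stated intent.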

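3. Key features

| Capability | Description |
| --- | --- |
| Content Asset merge | POST /scrape/merge accepts search_task_id plus optional comments_task_id / scripts_task_id |
| Primary result selection | result, preview, and GET export share the same selector |
| Export compatibility | POST export keeps the legacy seven-field CSV/TXT; GET export keeps the full schema and BOM |
| Task isolation | Each scrape operation runs in its own workspace and supports status/result/cleanup |
| CDP comment collection | crawl_comments_v2.py is wired into the main pipeline; the old crawl_comments.py has been removed |
| Directory result listing | When result is a directory, the response returns files, csv_files, jsonl_files |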
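
As a rough illustration of the merge request described above, the sketch below builds a body with the required search_task_id and the two optional ids. The field names come from this PR description. The helper name `build_merge_payload`, the choice of a JSON body, and the example task ids are assumptions, not confirmed details of the API.

```python
import json
from typing import Dict, Optional


def build_merge_payload(
    search_task_id: str,
    comments_task_id: Optional[str] = None,
    scripts_task_id: Optional[str] = None,
) -> Dict[str, str]:
    """Build a request body for POST /scrape/merge (JSON body is assumed)."""
    if not search_task_id:
        raise ValueError("search_task_id is required")
    payload = {"search_task_id": search_task_id}
    # Optional ids are omitted entirely rather than sent as null.
    if comments_task_id:
        payload["comments_task_id"] = comments_task_id
    if scripts_task_id:
        payload["scripts_task_id"] = scripts_task_id
    return payload


# Example with placeholder task ids (hypothetical values).
body = json.dumps(build_merge_payload("search-task-id", comments_task_id="comments-task-id"))
```

Whether the real endpoint prefers omitted keys or explicit nulls for the optional ids is not stated in the description, so a client should check the API docs before relying on either.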

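4. Ports and runtime conventions

| Service | Port | Notes |
| --- | --- | --- |
| API (inside container) | 8000 | DY_API_PORT |
| API (host access) | 18080 | DY_API_PUBLIC_PORT / DY_API_BASE_URL |
| Web dev server | 15173 | WEB_DEV_PORT |
| Chrome / CDP | 19222 | DY_CHROME_PORT / DY_CHROME_CDP_URL |

Do not use the following default values:

  • Host API 8000 (container-internal only, not the host entry point)
  • Web dev 3000
  • CDP 9222 (the project standardizes on 19222)

Local development example: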

uvicorn api.main:app --host 0.0.0.0 --port 18080
# WebUI: http://localhost:18080/ui/
# API Docs: http://localhost:18080/docs
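
For reference, the port conventions above could be resolved from the environment along these lines. The variable names and default values are the ones listed in this section. The helpers `resolve_ports` and `api_base_url` are hypothetical and only illustrate the fallback logic, not how the project actually reads its configuration.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults as listed in the port table of this PR description.
DEFAULT_PORTS = {
    "DY_API_PORT": 8000,          # API inside the container
    "DY_API_PUBLIC_PORT": 18080,  # API as seen from the host
    "WEB_DEV_PORT": 15173,        # Web dev server
    "DY_CHROME_PORT": 19222,      # Chrome / CDP
}


def resolve_ports(env: Optional[Mapping[str, str]] = None) -> Dict[str, int]:
    """Read each port from the environment, falling back to the project default."""
    env = os.environ if env is None else env
    ports = {}
    for name, default in DEFAULT_PORTS.items():
        raw = env.get(name, "")
        ports[name] = int(raw) if raw.strip() else default
    return ports


def api_base_url(ports: Mapping[str, int], host: str = "localhost") -> str:
    """Host-facing base URL, matching the local development example."""
    return f"http://{host}:{ports['DY_API_PUBLIC_PORT']}"
```

A non-numeric override raises ValueError from `int()`, which is arguably preferable to silently falling back to a port the operator did not ask for.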

5. Testing and validation

| Check | Result |
| --- | --- |
| Content Asset / API / selector regression | PASS |
| douyin_scraper/tests + api/tests.py | 79 passed |
| Web production build | PASS |
| docker compose config | PASS (no real Docker build was run) |
| py_compile (api / scraper / platform) | PASS |
| api/utils.py import fix | PASS |
| CI workflow | test-only (requirements.txt install + pytest) |

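6. Out of scope

This PR does not include:

  • Acceptance of real batch ASR / Whisper model transcription
  • Acceptance of full batch comment collection in a real CDP environment
  • CI lint / security / docker-build jobs
  • pre-commit ruff formatting baseline
  • Major refactoring of the pyproject project identity or main dependencies (mediacrawler, douyin_scraper, etc.)
  • Large-scale unrelated refactoring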

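7. Risks and caveats

  1. CDP / real collection: the path has been stabilized and passes unit/integration tests, but this PR does not promise a PASS for real batch collection in production; it needs a separate smoke test against a local Chrome CDP (port 19222).
  2. ASR: only the Whisper dependency scaffolding is provided; real batch transcription for script_raw / script_clean must be accepted in a follow-up task.
  3. Docker Chromium: --no-sandbox is enabled only when CHROME_PATH / PLAYWRIGHT_CHROMIUM_EXECUTABLE_PATH is set, and does not affect the default local Playwright behavior.
  4. CI scope: CI currently runs pytest only, without ruff/bandit/docker build; a green CI after merge does not mean a lint baseline exists.
  5. API directory results: csv_files / jsonl_files return absolute path strings, so consumers must treat them as path-sensitive.
  6. Docker build: this PR passes docker compose config validation, but it is not covered by a CI docker-build job and makes no claim that image build acceptance is complete.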
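
To make the path-sensitivity caveat in item 5 concrete, here is a small sketch of a directory listing that returns absolute path strings under the files / csv_files / jsonl_files keys, together with a consumer-side helper that strips the server-specific prefix. Both function names are hypothetical, and the real get_result response may carry additional fields.

```python
from pathlib import Path
from typing import Dict, List


def list_result_directory(result_dir: Path) -> Dict[str, List[str]]:
    """List files in a result directory as absolute path strings."""
    root = result_dir.resolve()
    files = sorted(str(p) for p in root.iterdir() if p.is_file())
    return {
        "files": files,
        "csv_files": [f for f in files if f.endswith(".csv")],
        "jsonl_files": [f for f in files if f.endswith(".jsonl")],
    }


def to_relative(paths: List[str], root: Path) -> List[str]:
    """Consumer-side helper: drop the server-specific directory prefix."""
    base = root.resolve()
    return [Path(p).relative_to(base).as_posix() for p in paths]
```

A consumer running outside the container cannot open a path such as /app/workspaces/... directly, so it would typically keep only the relative part and fetch the content through the API.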

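8. Suggested follow-up tasks

| Priority | Task |
| --- | --- |
| P1 | Real CDP environment smoke test: end-to-end comment collection for a single video |
| P1 | ASR batch transcription acceptance and pending_asr state boundary tests |
| P2 | CI expansion: ruff lint baseline + one-time formatting pass |
| P2 | CI docker-build job (verify the Dockerfile builds) |
| P3 | Hook up pre-commit ruff hooks |
| P3 | Unify the pyproject toolchain and project identity contract (if a refactor is actually planned) |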

9. Commit scope summary

T017-5 Content Asset (mainline)

be8fbcec feat(scraper): standardize search outputs in task workspaces
00a4946c feat(scraper): add raw and cleaned comment outputs
8bee152f feat(scraper): add cleaned search title outputs
2d7afaf8 feat(scraper): add script source raw and clean outputs
55199ea5 feat(content-asset): add content asset builder
adcc7f29 feat(api): expose content asset result preview and export
11722ba1 feat(web): establish Vite frontend baseline
31ceb012 chore(web): align frontend API and dev ports
f88a0ff4 feat(web): add content asset preview and download UX
1f74686d docs(content-asset): document schema workflow and acceptance
5f4bce09 docs(tasks): archive T017-5 release isolation notes
906266bb fix(api): add missing API utility helpers

T018 isolation (8 commits)

98168e9c chore(runtime): align ports and deployment scaffolding
04437124 chore(webui): refresh embedded frontend build artifacts
689e2fec fix(douyin): stabilize CDP comment collection
da7abe68 chore(asr): add whisper dependency scaffolding
68b7f5b1 test: add regression test harness
0e85d029 docs: reconcile architecture and operational notes
135e8af2 test(api): add stable API route regression tests
4d8f2e8b ci: add stable regression workflow

Final-Fix-1 (5 commits)

45cac095 feat(api): list csv and jsonl files in task result
746a440e docs: add local runtime port notes
86397860 docs(api): expand operational guide
99149c46 chore(runtime): add container chromium launch args
69d27ecd docs(tasks): archive T017-1 CDP fix notes

Test plan

  • Review Content Asset API contract (result / preview / export)
  • Confirm port defaults: API host 18080, CDP 19222
  • CI pytest job passes (douyin_scraper/tests + api/tests.py)
  • WebUI loads at /ui/ with embedded build
  • CDP smoke (optional, manual): single-video comment collection on port 19222

lsh-915 requested a review from NanmiCoder as a code owner June 16, 2026 02:54
dosubot (bot) added the size:XXL label (This PR changes 1000+ lines, ignoring generated files.) Jun 16, 2026

lsh-915 commented Jun 16, 2026

Reviewer note:

This PR is large mainly because it adds the FastAPI task/login/ws wrapper modules and the douyin_scraper wrapper package needed by the API runtime. The actual hotfix scope is focused on two verified issues:

  1. --output_dir did not reach the actual JSONL writer, causing task workspace result files to remain empty.
  2. Chinese keywords were corrupted into ???? in the subprocess/CLI path.

The fix was verified with a real Docker API run:

  • /scrape/search completed successfully
  • result path: /app/workspaces/46191c11c9d5/outputs/search_result.jsonl
  • file size: 23K
  • rows: 14
  • source_keyword: 靠边停车
  • /scrape/data/preview/{task_id} returned rows successfully

lsh-915 and others added 25 commits June 22, 2026 09:26
Co-authored-by: Cursor <cursoragent@cursor.com>
lsh-915 changed the title from "fix: stabilize Douyin API task output and UTF-8 keyword handling" to "feat: Content Asset pipeline, release isolation, and API regression harness" Jun 22, 2026
…ut-utf8

Co-authored-by: Cursor <cursoragent@cursor.com>

lsh-915 commented Jun 22, 2026

Update for reviewers:

  • Merge conflicts with upstream/main have been resolved.
  • PR #918 is now MERGEABLE.
  • Conflict resolution was limited to README.md, cmd_arg/arg.py, and tools/cdp_browser.py.
  • Re-ran local validation after conflict resolution:
    • python -m py_compile cmd_arg\arg.py tools\cdp_browser.py: PASS
    • python -m pytest douyin_scraper\tests api\tests.py -q: 79 passed
  • GitHub currently reports no visible checks for this fork PR, so the local test record above is provided for review context.
  • Scope boundaries remain unchanged: CDP bulk real-environment validation, ASR batch transcription, Docker image build, lint/security jobs, and ruff formatting baseline are intentionally left for follow-up tasks.

lsh-915 commented Jun 22, 2026

Security update added:

P0 API security hardening has been pushed in commit bdb2c80.

Summary:

  • Added minimal API key authentication using DY_API_KEY / API_KEY.
  • Protected write APIs, task queries, login-related routes, WebSocket task access, and destructive operations.
  • Restricted CORS defaults to local development origins only.
  • Kept local no-key development compatible, while Docker/Compose can require DY_API_AUTH_REQUIRED=1.
  • Hardened delete/cleanup/reset path handling for task workspaces.
  • Updated frontend API-key propagation, including download and WebSocket access.
  • Added API auth/CORS tests.
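
The shared API-key check summarized above could be sketched as below. This is an illustration under assumptions, not the code in commit bdb2c80: the function names are hypothetical, giving DY_API_KEY precedence over API_KEY is a guess, and treating DY_API_AUTH_REQUIRED=1 with no configured key as "deny" is one possible reading of the summary.

```python
import hmac
import os
from typing import Mapping, Optional


def configured_key(env: Mapping[str, str]) -> Optional[str]:
    """Return the configured API key, preferring DY_API_KEY (assumed order)."""
    return env.get("DY_API_KEY") or env.get("API_KEY") or None


def is_authorized(provided: Optional[str], env: Optional[Mapping[str, str]] = None) -> bool:
    """Decide whether a request carrying `provided` as its key may proceed."""
    env = os.environ if env is None else env
    expected = configured_key(env)
    if expected is None:
        # No key configured: stay compatible with local development unless
        # authentication is explicitly required.
        return env.get("DY_API_AUTH_REQUIRED") != "1"
    if provided is None:
        return False
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

In a FastAPI app a check like this would usually sit in a dependency that reads the key from a request header, but the header name used by this PR is not stated, so it is left out here.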

Validation:

  • python py_compile: PASS
  • api/tests.py: 36 passed
  • douyin_scraper/tests + api/tests.py: 89 passed
  • React production build: PASS
  • docker compose config: PASS
  • git diff --check: PASS

Remaining scope boundaries:

  • This is shared API-key protection, not a user permission system.
  • Public deployment still requires HTTPS, reverse proxy hardening, rate limiting, key rotation, and trusted CORS origins.
  • Redis alignment, legacy API cleanup, full test-suite governance, logging lifecycle cleanup, and frontend bundle splitting remain follow-up tasks.

lsh-915 commented Jun 23, 2026

Hi maintainers, just checking in on this PR.

PR #918 is ready for review and is currently mergeable. The scope is frozen; I will only respond to review feedback from here.

Could you please help review it when convenient, and approve/run the GitHub Actions workflow if fork PR checks require maintainer approval?

For context:

Thank you!
