Add TwelveLabs Marengo embedding integration for video vector search #3156
mohit-twelvelabs wants to merge 1 commit into
Adds an opt-in deeplake.integrations.twelvelabs.MarengoEmbedder that generates 512-dim multimodal embeddings (text and video) via the TwelveLabs Marengo model, for building video vector stores in Deep Lake. Mirrors the lazy-SDK-import pattern used by the labelbox integration, so twelvelabs remains an optional dependency. Includes a guide page and tests (no-network unit tests plus a TWELVELABS_API_KEY-gated live test).
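The lazy-SDK-import pattern mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `lazy_import` and `OptionalSdkEmbedder` are hypothetical names, and the install hint is an assumption.

```python
import importlib
from types import ModuleType


def lazy_import(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency on first use, with a helpful error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this integration. "
            f"Install it with: {install_hint}"
        ) from exc


class OptionalSdkEmbedder:
    """Hypothetical embedder: the SDK is imported only in the constructor."""

    def __init__(self, sdk_module: str = "twelvelabs") -> None:
        # Importing here, not at module top level, keeps the SDK optional:
        # importing the parent package never fails if the SDK is missing.
        self._sdk = lazy_import(sdk_module, f"pip install {sdk_module}")
```

With this shape, users who never construct the embedder never need the SDK installed, and those who do get an actionable error instead of a bare `ModuleNotFoundError`.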
📝 Walkthrough
A new opt-in TwelveLabs Marengo integration adds package exports, Marengo embedding code for text and video, validation tests, and a Deep Lake guide covering ingestion and cosine-similarity search.
Changes
TwelveLabs Marengo integration
Sequence Diagram(s)
sequenceDiagram
participant MarengoEmbedder
participant TwelveLabsClient as "TwelveLabs client"
participant TwelveLabsAPI as "TwelveLabs API"
participant DeepLakeSQL as "Deep Lake SQL"
MarengoEmbedder->>TwelveLabsClient: embed_videos(public video URLs)
TwelveLabsClient->>TwelveLabsAPI: create async embed tasks
TwelveLabsAPI-->>TwelveLabsClient: visual-text segment vectors
MarengoEmbedder->>TwelveLabsClient: embed_text(natural-language query)
TwelveLabsClient-->>MarengoEmbedder: query vector
MarengoEmbedder->>DeepLakeSQL: COSINE_SIMILARITY over embedding column
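The final step in the diagram ranks stored segment vectors by cosine similarity to the query vector. A minimal pure-Python sketch of that ranking, independent of Deep Lake and of the TwelveLabs SDK (the helper names `cosine_similarity` and `top_k` are illustrative, not part of this PR):

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], segments: Sequence[Sequence[float]], k: int = 5
) -> List[Tuple[int, float]]:
    """Return (segment index, score) pairs, highest similarity first."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(segments)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

This mirrors what `ORDER BY COSINE_SIMILARITY(...) DESC LIMIT 5` expresses in the query; in practice the database performs the ranking.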
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/deeplake/integrations/twelvelabs/marengo.py`:
- Around line 52-56: The docstring example in marengo.py builds the query vector
incorrectly because embed_text returns a list of vectors, not a flat vector.
Update the example to index the first embedded result from
embedder.embed_text(...) before joining values, matching the approach used in
the TwelveLabs guide, so the generated SQL uses a flat ARRAY[...] of floats
instead of nested list text.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 9eb2d520-55a9-485e-9192-bfdf06f0d688
📒 Files selected for processing (4)
- docs/docs/guide/twelvelabs.md
- python/deeplake/integrations/twelvelabs/__init__.py
- python/deeplake/integrations/twelvelabs/marengo.py
- python/deeplake/integrations/twelvelabs/test_marengo.py
>>> # Embed a text query at search time and run vector search.
>>> query = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish"))
>>> view = ds.query(
...     f"SELECT * ORDER BY COSINE_SIMILARITY(embedding, ARRAY[{query}]) DESC LIMIT 5"
... )
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Docstring example builds a malformed query vector. embed_text returns List[List[float]] (one vector per input string), so iterating directly yields the inner vector (a list), and str(c) stringifies that list rather than each float — producing ARRAY[[...]] instead of ARRAY[0.1,0.2,...]. The guide at docs/docs/guide/twelvelabs.md (Line 65) correctly indexes [0] first.
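The difference can be reproduced without the SDK. In this sketch, `embed_text` is a stand-in that only mimics the `List[List[float]]` return shape described above; the three-element vector is made up for illustration.

```python
from typing import List


def embed_text(text: str) -> List[List[float]]:
    """Stand-in with the reviewed return shape: one vector per input string."""
    return [[0.1, 0.2, 0.3]]


vectors = embed_text("a chef plating a dish")

# Iterating the outer list stringifies the whole inner vector.
buggy = ",".join(str(c) for c in vectors)
# Indexing [0] first iterates the floats of the single query vector.
fixed = ",".join(str(c) for c in vectors[0])

print(f"ARRAY[{buggy}]")  # ARRAY[[0.1, 0.2, 0.3]]
print(f"ARRAY[{fixed}]")  # ARRAY[0.1,0.2,0.3]
```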
🐛 Proposed fix
- >>> query = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish"))
+ >>> query = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish")[0])
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- >>> # Embed a text query at search time and run vector search.
- >>> query = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish"))
- >>> view = ds.query(
- ...     f"SELECT * ORDER BY COSINE_SIMILARITY(embedding, ARRAY[{query}]) DESC LIMIT 5"
- ... )
+ >>> # Embed a text query at search time and run vector search.
+ >>> query = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish")[0])
+ >>> view = ds.query(
+ ...     f"SELECT * ORDER BY COSINE_SIMILARITY(embedding, ARRAY[{query}]) DESC LIMIT 5"
+ ... )
Hi! I'm Mohit, I work at TwelveLabs (@mohit-twelvelabs).
🚀 🚀 Pull Request
Impact
Description
This adds an opt-in TwelveLabs Marengo embedding integration for building multimodal video vector stores in Deep Lake.
Marengo maps text, images, audio, and video into the same 512-dimensional vector space. Since Deep Lake already stores raw video alongside embeddings and runs cosine-similarity search client-side over an `Embedding` column, the two pair naturally: embed videos at ingestion, embed natural-language queries at search time, and search a single `types.Embedding(512)` column.
New surface (under the existing `deeplake.integrations` namespace):
`embed_text` follows the same calling convention as the `embedding_function` used throughout the Deep Lake RAG guides (string or list in, list of vectors out), so it drops into existing pipelines.
Things to be aware of
- `deeplake.integrations.twelvelabs` is imported on demand, exactly like the existing `labelbox` integration.
- `twelvelabs` is an optional dependency, lazily imported inside the constructor (same `# type: ignore` lazy-import pattern as `integrations/labelbox`), so it is not added to Deep Lake's required dependencies.
- `docs/docs/guide/twelvelabs.md` and tests at `python/deeplake/integrations/twelvelabs/test_marengo.py`.
Testing
- `TWELVELABS_API_KEY`-gated live test asserts `embed_text` returns 512-dim vectors (single and batched).
- `embed.tasks` create/wait/retrieve flow.
- `black` passes on all changed files; `mypy` reports no errors in the new module.
Additional Context
You can grab a free API key at https://twelvelabs.io — there's a generous free tier.
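For readers who want to see the calling convention described above (string or list in, list of vectors out) without an API key, here is a no-network sketch. `FakeEmbedder` is a hypothetical stand-in that returns constant vectors; it is not the `MarengoEmbedder` from this PR and makes no claims about its constructor or internals.

```python
from typing import List, Union

EMBEDDING_DIM = 512  # dimensionality stated in the PR description


class FakeEmbedder:
    """Stand-in for illustration only: constant vectors, no network calls."""

    def embed_text(self, text: Union[str, List[str]]) -> List[List[float]]:
        # Accept either a single string or a list of strings.
        texts = [text] if isinstance(text, str) else list(text)
        # Always return a list of vectors, one per input string.
        return [[0.0] * EMBEDDING_DIM for _ in texts]


embedder = FakeEmbedder()
single = embedder.embed_text("a chef plating a dish")
batch = embedder.embed_text(["a chef plating a dish", "a dog on a beach"])

print(len(single), len(single[0]))  # 1 512
print(len(batch), len(batch[0]))    # 2 512
```

Because a single string still yields a list containing one vector, callers that need a flat query vector index `[0]` on the result.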
Summary by CodeRabbit
New Features
Bug Fixes
Tests