
Add TwelveLabs Marengo embedding integration for video vector search #3156

Open
mohit-twelvelabs wants to merge 1 commit into activeloopai:main from mohit-twelvelabs:feat/twelvelabs-integration

Conversation

@mohit-twelvelabs mohit-twelvelabs commented Jun 25, 2026

Hi! I'm Mohit, I work at TwelveLabs (@mohit-twelvelabs).

🚀 🚀 Pull Request

Impact

  • Enhancement/New feature (adds functionality without impacting existing logic)

Description

This adds an opt-in TwelveLabs Marengo embedding integration for building multimodal video vector stores in Deep Lake.

Marengo maps text, images, audio, and video into the same 512-dimensional vector space. Since Deep Lake already stores raw video alongside embeddings and runs cosine-similarity search client-side over an Embedding column, the two pair naturally: embed videos at ingestion, embed natural-language queries at search time, and search a single types.Embedding(512) column.
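
To make the search step concrete, here is a minimal pure-Python sketch of cosine-similarity ranking over stored vectors. The toy 3-dimensional vectors and the helper names `cosine_similarity` and `rank` are illustrative assumptions and not part of this PR; Marengo vectors are 512-dimensional, and Deep Lake's own implementation may differ.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def rank(
    query: Sequence[float], rows: List[Sequence[float]], limit: int = 5
) -> List[Tuple[int, float]]:
    """Return (row index, score) pairs sorted by descending similarity."""
    scored = [(i, cosine_similarity(query, row)) for i, row in enumerate(rows)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy stand-ins for stored video embeddings.
stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
top = rank([1.0, 0.1, 0.0], stored, limit=2)
```

Because text and video share one vector space, the same ranking applies whether the query vector came from a sentence or from another clip.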

New surface (under the existing deeplake.integrations namespace):

from deeplake import types
from deeplake.integrations.twelvelabs import MarengoEmbedder

# Assumes `ds` is an open Deep Lake dataset and `urls` is a list of public video URLs.
embedder = MarengoEmbedder()                      # reads TWELVELABS_API_KEY
ds.add_column("embedding", types.Embedding(embedder.dim))   # dim == 512
ds.append({"url": urls, "embedding": embedder.embed_videos(urls)})

# embed_text returns a list of vectors; take the first and flatten it into a SQL array literal.
q = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish")[0])
ds.query(f"SELECT *, COSINE_SIMILARITY(embedding, ARRAY[{q}]) AS score ORDER BY score DESC LIMIT 5")

embed_text follows the same calling convention as the embedding_function used throughout the Deep Lake RAG guides (string or list in, list of vectors out), so it drops into existing pipelines.
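
The calling convention described above (string or list in, list of vectors out) can be sketched with a stand-in function. `fake_embed_text` and its dummy values are hypothetical and exist only to show the input normalization and output shape; the real method calls the TwelveLabs API and returns 512-dimensional vectors.

```python
from typing import List, Union


def fake_embed_text(texts: Union[str, List[str]], dim: int = 4) -> List[List[float]]:
    """Hypothetical stand-in: returns one dim-length vector per input string."""
    if isinstance(texts, str):
        texts = [texts]  # a bare string is treated as a batch of one
    # Deterministic dummy values; a real embedder would call the model here.
    return [[float(len(t) % 7 + i) for i in range(dim)] for t in texts]


single = fake_embed_text("a chef plating a dish")
batch = fake_embed_text(["a chef plating a dish", "a dog catching a ball"])
```

Note that a single string still yields a list containing one vector, so callers index `[0]` to get the flat vector.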

Things to be aware of

  • Opt-in and non-breaking. Nothing is auto-imported and no defaults change; deeplake.integrations.twelvelabs is imported on demand, exactly like the existing labelbox integration.
  • twelvelabs is an optional dependency, lazily imported inside the constructor (same # type: ignore lazy-import pattern as integrations/labelbox), so it is not added to Deep Lake's required dependencies.
  • Added a guide page at docs/docs/guide/twelvelabs.md and tests at python/deeplake/integrations/twelvelabs/test_marengo.py.
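
As a rough illustration of the lazy-import idea in the bullets above, the sketch below defers importing an optional module until a client object is constructed. `load_optional` and `LazyClient` are hypothetical names, and the standard-library `json` module stands in for the optional SDK so the example runs anywhere; the PR's actual constructor may be organized differently.

```python
import importlib
from types import ModuleType


def load_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import module_name on demand, raising a helpful error if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this integration. {install_hint}"
        ) from exc


class LazyClient:
    """Hypothetical wrapper that imports its SDK only when constructed."""

    def __init__(self, module_name: str = "json") -> None:
        # 'json' stands in for the optional SDK so the sketch runs anywhere.
        self._sdk = load_optional(module_name, "Install it with pip.")


client = LazyClient()
```

Importing the module that defines `LazyClient` never touches the optional dependency; only instantiation does, which is what keeps the dependency out of the required set.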

Testing

  • No-network unit tests cover the missing-API-key error path and the empty/errored-response guard.
  • A TWELVELABS_API_KEY-gated live test asserts embed_text returns 512-dim vectors (single and batched).
  • Verified live against the TwelveLabs API: a Marengo text-embedding call returns a 512-dim float vector. Video wiring uses the SDK's embed.tasks create/wait/retrieve flow.
  • black passes on all changed files; mypy reports no errors in the new module.
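
The split between no-network tests and an environment-gated live test could look roughly like the sketch below. It uses the standard-library `unittest` module for self-containment, and `resolve_api_key` is a hypothetical helper; the tests in this PR may use a different framework and different names.

```python
import os
import unittest
from typing import Optional

ENV_VAR = "TWELVELABS_API_KEY"


def resolve_api_key(explicit: Optional[str] = None) -> str:
    """Hypothetical helper: prefer an explicit key, else read the environment."""
    key = explicit or os.environ.get(ENV_VAR)
    if not key:
        raise ValueError(f"No API key found; set {ENV_VAR} or pass one explicitly.")
    return key


class ApiKeyTests(unittest.TestCase):
    def test_missing_key_raises(self) -> None:
        # No network needed: temporarily hide the variable and expect an error.
        saved = os.environ.pop(ENV_VAR, None)
        try:
            with self.assertRaises(ValueError):
                resolve_api_key()
        finally:
            if saved is not None:
                os.environ[ENV_VAR] = saved

    @unittest.skipUnless(os.environ.get(ENV_VAR), "live test needs an API key")
    def test_live_placeholder(self) -> None:
        # A real live test would call the embedder here and check vector length.
        self.assertTrue(resolve_api_key())


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApiKeyTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Gating on the environment variable lets the suite pass in CI without credentials while still exercising the live path for anyone who has a key.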

Additional Context

You can grab a free API key at https://twelvelabs.io — there's a generous free tier.

Summary by CodeRabbit

  • New Features

    • Added a new guide for using TwelveLabs Marengo embeddings with Deep Lake for multimodal video search.
    • Introduced support for generating embeddings from text and public video URLs, including batched inputs.
    • Exposed a new TwelveLabs embedding integration for easier use in applications.
  • Bug Fixes

    • Improved validation and error messaging when an API key or embedding data is missing.
  • Tests

    • Added coverage for API key handling, embedding output shape, batching, and empty-result cases.

Adds an opt-in deeplake.integrations.twelvelabs.MarengoEmbedder that
generates 512-dim multimodal embeddings (text and video) via the
TwelveLabs Marengo model, for building video vector stores in Deep Lake.
Mirrors the lazy-SDK-import pattern used by the labelbox integration, so
twelvelabs remains an optional dependency. Includes a guide page and
tests (no-network unit tests plus a TWELVELABS_API_KEY-gated live test).
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Review Change Stack

📝 Walkthrough

A new opt-in TwelveLabs Marengo integration adds package exports, Marengo embedding code for text and video, validation tests, and a Deep Lake guide covering ingestion and cosine-similarity search.

Changes

TwelveLabs Marengo integration

| Layer / File(s) | Summary |
|---|---|
| Marengo API and validation: python/deeplake/integrations/twelvelabs/marengo.py, python/deeplake/integrations/twelvelabs/__init__.py, python/deeplake/integrations/twelvelabs/test_marengo.py | MarengoEmbedder adds the Marengo constants, API-key loading, 512-dim output shape, segment validation, text embedding, package re-exports, and tests for missing keys, empty segments, and vector dimensions. |
| Video workflow guide: python/deeplake/integrations/twelvelabs/marengo.py, docs/docs/guide/twelvelabs.md | embed_videos adds async video embedding, and the new guide documents setup, dataset creation, video ingestion, committing data, and cosine-similarity search with embed_text. |

Sequence Diagram(s)

sequenceDiagram
  participant MarengoEmbedder
  participant TwelveLabsClient as "TwelveLabs client"
  participant TwelveLabsAPI as "TwelveLabs API"
  participant DeepLakeSQL as "Deep Lake SQL"
  MarengoEmbedder->>TwelveLabsClient: embed_videos(public video URLs)
  TwelveLabsClient->>TwelveLabsAPI: create async embed tasks
  TwelveLabsAPI-->>TwelveLabsClient: visual-text segment vectors
  MarengoEmbedder->>TwelveLabsClient: embed_text(natural-language query)
  TwelveLabsClient-->>MarengoEmbedder: query vector
  MarengoEmbedder->>DeepLakeSQL: COSINE_SIMILARITY over embedding column

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

🐇 I hopped through Marengo’s vector trail,
With 512 dreams that never fail.
I sniffed a query, thumped the ground,
And Deep Lake answers all around!
✨ Thump-thump, the embeddings sing.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly names the new TwelveLabs Marengo embedding integration and its video vector search use case. |
| Description check | ✅ Passed | The PR description covers impact, implementation details, testing, and context; only the 'Things to worry about' section is missing. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/deeplake/integrations/twelvelabs/marengo.py`:
- Around line 52-56: The docstring example in marengo.py builds the query vector
incorrectly because embed_text returns a list of vectors, not a flat vector.
Update the example to index the first embedded result from
embedder.embed_text(...) before joining values, matching the approach used in
the TwelveLabs guide, so the generated SQL uses a flat ARRAY[...] of floats
instead of nested list text.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 9eb2d520-55a9-485e-9192-bfdf06f0d688

📥 Commits

Reviewing files that changed from the base of the PR and between f432041 and 6e01f38.

📒 Files selected for processing (4)
  • docs/docs/guide/twelvelabs.md
  • python/deeplake/integrations/twelvelabs/__init__.py
  • python/deeplake/integrations/twelvelabs/marengo.py
  • python/deeplake/integrations/twelvelabs/test_marengo.py

Comment on lines +52 to +56
>>> # Embed a text query at search time and run vector search.
>>> query = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish"))
>>> view = ds.query(
... f"SELECT * ORDER BY COSINE_SIMILARITY(embedding, ARRAY[{query}]) DESC LIMIT 5"
... )

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Docstring example builds a malformed query vector. embed_text returns List[List[float]] (one vector per input string), so iterating directly yields the inner vector (a list), and str(c) stringifies that list rather than each float — producing ARRAY[[...]] instead of ARRAY[0.1,0.2,...]. The guide at docs/docs/guide/twelvelabs.md (Line 65) correctly indexes [0] first.
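
The shape mismatch described here can be reproduced without the API. In this sketch, `stub_embed_text` is a hypothetical stand-in that returns the same nested shape (one vector per input string) with three dummy floats in place of 512 real ones.

```python
from typing import List


def stub_embed_text(text: str) -> List[List[float]]:
    """Stand-in with the shape described above: one vector per input string."""
    return [[0.1, 0.2, 0.3]]


result = stub_embed_text("a chef plating a dish")

# Iterating the outer list yields the inner vector, so str() formats a whole list.
nested = ",".join(str(c) for c in result)

# Indexing [0] first yields the floats themselves.
flat = ",".join(str(c) for c in result[0])

bad_sql = f"ARRAY[{nested}]"   # ARRAY[[0.1, 0.2, 0.3]]
good_sql = f"ARRAY[{flat}]"    # ARRAY[0.1,0.2,0.3]
```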

🐛 Proposed fix
-        >>> query = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish"))
+        >>> query = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish")[0])
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  >>> # Embed a text query at search time and run vector search.
- >>> query = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish"))
+ >>> query = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish")[0])
  >>> view = ds.query(
  ... f"SELECT * ORDER BY COSINE_SIMILARITY(embedding, ARRAY[{query}]) DESC LIMIT 5"
  ... )


2 participants