Add TwelveLabs Marengo embedding integration for video vector search by mohit-twelvelabs · Pull Request #3156 · activeloopai/deeplake

mohit-twelvelabs · 2026-06-25T20:37:15Z

Hi! I'm Mohit, I work at TwelveLabs (@mohit-twelvelabs).

🚀 🚀 Pull Request

Impact

Enhancement/New feature (adds functionality without impacting existing logic)

Description

This adds an opt-in TwelveLabs Marengo embedding integration for building multimodal video vector stores in Deep Lake.

Marengo maps text, images, audio, and video into the same 512-dimensional vector space. Since Deep Lake already stores raw video alongside embeddings and runs cosine-similarity search client-side over an Embedding column, the two pair naturally: embed videos at ingestion, embed natural-language queries at search time, and search a single types.Embedding(512) column.
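To make the search step concrete, here is a minimal sketch of client-side cosine-similarity ranking. It uses plain Python and 3-dimensional toy vectors in place of the 512-dimensional Marengo output. `cosine_similarity` and `top_k` are illustrative names for this sketch, not Deep Lake or TwelveLabs APIs.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], rows: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (row_index, score) pairs for the k most similar rows."""
    scored = [(i, cosine_similarity(query, row)) for i, row in enumerate(rows)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for stored video embeddings and an embedded text query.
video_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query_vector = [1.0, 0.1, 0.0]
ranked = top_k(query_vector, video_vectors, k=2)
```

Because text and video share one vector space, the same scoring function applies regardless of which modality produced each vector.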

New surface (under the existing deeplake.integrations namespace):

from deeplake import types
from deeplake.integrations.twelvelabs import MarengoEmbedder

embedder = MarengoEmbedder()                      # reads TWELVELABS_API_KEY
ds.add_column("embedding", types.Embedding(embedder.dim))   # dim == 512
ds.append({"url": urls, "embedding": embedder.embed_videos(urls)})

q = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish")[0])
ds.query(f"SELECT *, COSINE_SIMILARITY(embedding, ARRAY[{q}]) AS score ORDER BY score DESC LIMIT 5")

embed_text follows the same calling convention as the embedding_function used throughout the Deep Lake RAG guides (string or list in, list of vectors out), so it drops into existing pipelines.
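The calling convention described above can be sketched with a stand-in function. `fake_embed_text` is hypothetical and makes no network call: it derives deterministic placeholder vectors from a hash, only to show the input and output shapes (a single string is treated as a batch of one).

```python
import hashlib
from typing import List, Union

DIM = 512  # the dimension stated in this PR


def fake_embed_text(texts: Union[str, List[str]]) -> List[List[float]]:
    """Hypothetical stand-in: string or list in, list of vectors out."""
    if isinstance(texts, str):
        texts = [texts]  # normalize a single string to a batch of one
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Cycle through the 32 digest bytes to fill DIM floats in [0, 1].
        vectors.append([digest[i % len(digest)] / 255.0 for i in range(DIM)])
    return vectors


single = fake_embed_text("a chef plating a dish")
batch = fake_embed_text(["a chef plating a dish", "a dog catching a frisbee"])
```

Note that even the single-string call returns a list of vectors, so a caller that wants one flat vector has to index `[0]`.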

Things to be aware of

Opt-in and non-breaking. Nothing is auto-imported and no defaults change; deeplake.integrations.twelvelabs is imported on demand, exactly like the existing labelbox integration.
twelvelabs is an optional dependency, lazily imported inside the constructor (same # type: ignore lazy-import pattern as integrations/labelbox), so it is not added to Deep Lake's required dependencies.
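For readers unfamiliar with the lazy-import pattern mentioned above, a generic sketch follows. `LazySdkClient`, its parameters, and the module name `not_a_real_sdk_module` are hypothetical and are not taken from this PR or from the labelbox integration.

```python
import importlib
from types import ModuleType


class LazySdkClient:
    """Hypothetical wrapper that imports its SDK only when constructed."""

    def __init__(self, module_name: str, install_hint: str) -> None:
        try:
            # The import runs here rather than at module import time, so the
            # dependency stays optional for anyone who never builds a client.
            self.sdk: ModuleType = importlib.import_module(module_name)
        except ImportError as exc:
            raise ImportError(
                f"'{module_name}' is required for this integration. "
                f"Install it with: {install_hint}"
            ) from exc


# 'json' ships with the standard library, so this succeeds anywhere.
client = LazySdkClient("json", "no install needed")

# A module that is not installed produces an actionable error message.
try:
    LazySdkClient("not_a_real_sdk_module", "pip install not-a-real-sdk-module")
    error_message = ""
except ImportError as exc:
    error_message = str(exc)
```

The trade-off is that a missing dependency surfaces at construction time instead of at import time, which is why the error message should say how to install it.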
Added a guide page at docs/docs/guide/twelvelabs.md and tests at python/deeplake/integrations/twelvelabs/test_marengo.py.

Testing

No-network unit tests cover the missing-API-key error path and the empty/errored-response guard.
A TWELVELABS_API_KEY-gated live test asserts embed_text returns 512-dim vectors (single and batched).
Verified live against the TwelveLabs API: a Marengo text-embedding call returns a 512-dim float vector. Video wiring uses the SDK's embed.tasks create/wait/retrieve flow.
black passes on all changed files; mypy reports no errors in the new module.
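The two no-network paths and the key-gated live test could look roughly like the sketch below. The helper names (`resolve_api_key`, `first_vector`), the use of `ValueError`, and the error messages are assumptions for illustration; the PR's actual tests may be structured differently.

```python
import os
import unittest
from typing import List, Mapping, Optional, Sequence

DIM = 512  # the dimension stated in this PR


def resolve_api_key(explicit: Optional[str], env: Mapping[str, str]) -> str:
    """Prefer an explicit key, then fall back to TWELVELABS_API_KEY."""
    key = explicit or env.get("TWELVELABS_API_KEY")
    if not key:
        raise ValueError("No API key: pass one or set TWELVELABS_API_KEY.")
    return key


def first_vector(segments: Optional[Sequence[Sequence[float]]]) -> List[float]:
    """Guard against empty responses and wrong-sized vectors."""
    if not segments:
        raise ValueError("Embedding response contained no segments.")
    vector = list(segments[0])
    if len(vector) != DIM:
        raise ValueError(f"Expected {DIM} dimensions, got {len(vector)}.")
    return vector


class GuardTests(unittest.TestCase):
    def test_missing_key(self) -> None:
        with self.assertRaises(ValueError):
            resolve_api_key(None, {})

    def test_empty_segments(self) -> None:
        with self.assertRaises(ValueError):
            first_vector([])

    @unittest.skipUnless(
        os.environ.get("TWELVELABS_API_KEY"), "live test needs an API key"
    )
    def test_live_placeholder(self) -> None:
        # A real live test would call the API here; this sketch does not.
        self.assertTrue(resolve_api_key(None, os.environ))
```

Passing the environment in as a mapping keeps the missing-key test independent of whatever is set on the machine running it.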

Additional Context

You can grab a free API key at https://twelvelabs.io — there's a generous free tier.

Summary by CodeRabbit

New Features
- Added a new guide for using TwelveLabs Marengo embeddings with Deep Lake for multimodal video search.
- Introduced support for generating embeddings from text and public video URLs, including batched inputs.
- Exposed a new TwelveLabs embedding integration for easier use in applications.
Bug Fixes
- Improved validation and error messaging when an API key or embedding data is missing.
Tests
- Added coverage for API key handling, embedding output shape, batching, and empty-result cases.

Adds an opt-in deeplake.integrations.twelvelabs.MarengoEmbedder that generates 512-dim multimodal embeddings (text and video) via the TwelveLabs Marengo model, for building video vector stores in Deep Lake. Mirrors the lazy-SDK-import pattern used by the labelbox integration, so twelvelabs remains an optional dependency. Includes a guide page and tests (no-network unit tests plus a TWELVELABS_API_KEY-gated live test).

CLAassistant · 2026-06-25T20:37:23Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

coderabbitai · 2026-06-25T20:37:34Z

📝 Walkthrough

A new opt-in TwelveLabs Marengo integration adds package exports, Marengo embedding code for text and video, validation tests, and a Deep Lake guide covering ingestion and cosine-similarity search.

Changes

TwelveLabs Marengo integration

| Layer / File(s) | Summary |
| --- | --- |
| Marengo API and validation: `python/deeplake/integrations/twelvelabs/marengo.py`, `python/deeplake/integrations/twelvelabs/__init__.py`, `python/deeplake/integrations/twelvelabs/test_marengo.py` | `MarengoEmbedder` adds the Marengo constants, API-key loading, 512-dim output shape, segment validation, text embedding, package re-exports, and tests for missing keys, empty segments, and vector dimensions. |
| Video workflow guide: `python/deeplake/integrations/twelvelabs/marengo.py`, `docs/docs/guide/twelvelabs.md` | `embed_videos` adds async video embedding, and the new guide documents setup, dataset creation, video ingestion, committing data, and cosine-similarity search with `embed_text`. |

Sequence Diagram(s)

sequenceDiagram
  participant MarengoEmbedder
  participant TwelveLabsClient as "TwelveLabs client"
  participant TwelveLabsAPI as "TwelveLabs API"
  participant DeepLakeSQL as "Deep Lake SQL"
  MarengoEmbedder->>TwelveLabsClient: embed_videos(public video URLs)
  TwelveLabsClient->>TwelveLabsAPI: create async embed tasks
  TwelveLabsAPI-->>TwelveLabsClient: visual-text segment vectors
  MarengoEmbedder->>TwelveLabsClient: embed_text(natural-language query)
  TwelveLabsClient-->>MarengoEmbedder: query vector
  MarengoEmbedder->>DeepLakeSQL: COSINE_SIMILARITY over embedding column

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

🐇 I hopped through Marengo’s vector trail,
With 512 dreams that never fail.
I sniffed a query, thumped the ground,
And Deep Lake answers all around!
✨ Thump-thump, the embeddings sing.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly names the new TwelveLabs Marengo embedding integration and its video vector search use case. |
| Description check | ✅ Passed | The PR description covers impact, implementation details, testing, and context; only the 'Things to worry about' section is missing. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/deeplake/integrations/twelvelabs/marengo.py`:
- Around line 52-56: The docstring example in marengo.py builds the query vector
incorrectly because embed_text returns a list of vectors, not a flat vector.
Update the example to index the first embedded result from
embedder.embed_text(...) before joining values, matching the approach used in
the TwelveLabs guide, so the generated SQL uses a flat ARRAY[...] of floats
instead of nested list text.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between f432041 and 6e01f38.

📒 Files selected for processing (4)

docs/docs/guide/twelvelabs.md
python/deeplake/integrations/twelvelabs/__init__.py
python/deeplake/integrations/twelvelabs/marengo.py
python/deeplake/integrations/twelvelabs/test_marengo.py

coderabbitai · 2026-06-25T20:43:12Z

+        >>> # Embed a text query at search time and run vector search.
+        >>> query = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish"))
+        >>> view = ds.query(
+        ...     f"SELECT * ORDER BY COSINE_SIMILARITY(embedding, ARRAY[{query}]) DESC LIMIT 5"
+        ... )


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Docstring example builds a malformed query vector. embed_text returns List[List[float]] (one vector per input string), so iterating directly yields the inner vector (a list), and str(c) stringifies that list rather than each float — producing ARRAY[[...]] instead of ARRAY[0.1,0.2,...]. The guide at docs/docs/guide/twelvelabs.md (Line 65) correctly indexes [0] first.

🐛 Proposed fix

- >>> query = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish"))
+ >>> query = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish")[0])
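The difference between the two lines can be reproduced without the SDK. In this sketch `fake_embed_text` is a hypothetical stand-in that returns one short vector per input, mirroring the `List[List[float]]` return shape described in the finding.

```python
from typing import List


def fake_embed_text(text: str) -> List[List[float]]:
    """Hypothetical stand-in returning one vector per input string."""
    return [[0.1, 0.2, 0.3]]


vectors = fake_embed_text("a chef plating a dish")

# Buggy: iterates over the outer list, so each c is a whole vector.
buggy = ",".join(str(c) for c in vectors)

# Fixed: index the first result, then iterate over its floats.
fixed = ",".join(str(c) for c in vectors[0])

buggy_sql = f"ARRAY[{buggy}]"  # "ARRAY[[0.1, 0.2, 0.3]]"
fixed_sql = f"ARRAY[{fixed}]"  # "ARRAY[0.1,0.2,0.3]"
```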

📝 Committable suggestion


Suggested change

- >>> # Embed a text query at search time and run vector search.
- >>> query = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish"))
- >>> view = ds.query(
- ...     f"SELECT * ORDER BY COSINE_SIMILARITY(embedding, ARRAY[{query}]) DESC LIMIT 5"
- ... )
+ >>> # Embed a text query at search time and run vector search.
+ >>> query = ",".join(str(c) for c in embedder.embed_text("a chef plating a dish")[0])
+ >>> view = ds.query(
+ ...     f"SELECT * ORDER BY COSINE_SIMILARITY(embedding, ARRAY[{query}]) DESC LIMIT 5"
+ ... )


mohit-twelvelabs wants to merge 1 commit into activeloopai:main from mohit-twelvelabs:feat/twelvelabs-integration
