
[SYSTEMDS-??] Modality Alignment, Contrastive Learning, PDF and Transcript Loader#2459

Closed
b-enedict wants to merge 2929 commits into apache:main from b-enedict:feat/alignment-operator

Conversation

@b-enedict

Summary

This PR introduces new functionality for multimodal learning in Scuro, including a contrastive learning operator, a modality alignment operator, and additional data loaders.

Changes

Contrastive Learning Operator

  • Constructs modality pairs via a Cartesian product
  • Uses a user-defined function to label pairs as positive or negative
  • Enables dynamic generation of contrastive samples
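The pairing scheme above can be sketched as follows (sample data and function names are hypothetical, not Scuro's actual API):

```python
from itertools import product

def build_contrastive_pairs(modality_a, modality_b, label_fn):
    """Build (sample_a, sample_b, label) triples from the Cartesian
    product of two modalities; label_fn decides positive/negative."""
    return [(a, b, label_fn(a, b)) for a, b in product(modality_a, modality_b)]

# Hypothetical samples tagged with an instance id; pairs from the same
# instance are labeled positive (1), all others negative (0).
audio = [("a0", 0), ("a1", 1)]
video = [("v0", 0), ("v1", 1)]
same_instance = lambda a, b: int(a[1] == b[1])

pairs = build_contrastive_pairs(audio, video, same_instance)
# 2x2 samples -> 4 pairs, 2 of them positive
```

Because the label function is user-defined, positives can be derived from instance ids, time windows, or any other criterion.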

Modality Alignment Operator

  • Aligns previously unaligned modalities using feature-based similarity (e.g., ORB, perceptual hashing)
  • Outputs a matching between a primary and secondary modality
  • Matching is applied after representation learning and before fusion
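One of the similarity options, perceptual hashing, can be sketched with a simplified average hash (a toy stand-in for illustration, not the PR's actual p-hash or ORB code):

```python
import numpy as np

def average_hash(img, hash_size=8):
    """Toy perceptual hash: block-average a grayscale image down to
    hash_size x hash_size, then threshold each block against the mean."""
    bh, bw = img.shape[0] // hash_size, img.shape[1] // hash_size
    small = img[:bh * hash_size, :bw * hash_size] \
        .reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    return (small > small.mean()).flatten()

def hamming(h1, h2):
    """Distance between two hashes: number of differing bits."""
    return int(np.sum(h1 != h2))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
# Brightness scaling preserves the ordering of block means, so the
# hash (and therefore the match) is unaffected.
assert hamming(average_hash(img), average_hash(img * 0.5)) == 0
```

A real implementation would also handle resizing and grayscale conversion; ORB-based alignment instead matches local keypoint descriptors between frames.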

Data Loaders

  • PDF loader: converts document pages into NumPy arrays for OpenCV processing
  • Audio loader: converts audio to text using faster-whisper

e-strauss and others added 30 commits March 31, 2025 15:22
The current inner class DMLGateWayListener implements functions from the GatewayServerListener interface that are never invoked by the GatewayServer (since the GatewayServer, which also implements GatewayServerListener, does not implement these methods). Furthermore, DMLGateWayListener previously called System.exit(), which I think is not correct, since it breaks the proper shutdown of the GatewayServer. Finally, this commit adds a new unit test that checks the functionality of the DMLGateWayListener. While merging, we verified that the additions did not introduce any regressions in startup and shutdown of the Python API.

Closes apache#2243
Currently, the format of our .asf.yaml sends emails to everyone on all commits. According to https://issues.apache.org/jira/browse/INFRA-26700, the error is in our project. This commit therefore removes the outputdir, as specified in: https://github.com/apache/infrastructure-asfyaml?tab=readme-ov-file#jekyll-cms
0) hard-coded server names / properties
1) windows line endings
2) memory configurations
3) datagen scripts
4) missing l2svm script
The perftest runMSVM_10k_1k_dense_k5 reproducibly failed on serializing
the parfor body program for a remote Spark parfor job, because there
were remaining Spark instructions (checkpoints). The reason was that
the forced CP compilation before such remote jobs was only applied in
a subset of cases, and special hop properties were not correctly cleaned
up. This patch fixes the general issue.
This patch fixes a perftest performance regression of
runMSVM_10k_1k_dense_k5, which ran in 36s instead of a few seconds in
earlier releases. The reason was unnecessary Spark context creation
during parfor optimization. We now handle these cluster info requests
more carefully, which avoids the unnecessary Spark context creation
and reduces the total runtime back to 5.9s.
This patch introduces a pruning technique on the cleaning pipeline
returned by the top-K cleaning. We identify a smaller yet equally
effective subset of primitives for all top-performing pipelines,
which optimizes their scoring performance.

Closes apache#2251
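The pruning idea can be sketched roughly as follows (hypothetical primitive names and a simplified criterion; the actual top-K cleaning implementation may differ):

```python
def prune_primitives(pipelines, scores, top_k=2):
    """Keep only the union of primitives used by the top-k scoring
    pipelines; primitives that never appear in a top performer are
    pruned from the search space, reducing scoring cost."""
    ranked = sorted(zip(pipelines, scores), key=lambda p: p[1], reverse=True)
    kept = set()
    for pipeline, _ in ranked[:top_k]:
        kept.update(pipeline)
    return kept

# Hypothetical candidate pipelines with their cross-validation scores.
pipelines = [["impute", "outlier", "scale"],
             ["impute", "scale"],
             ["dedup", "outlier"],
             ["dedup"]]
scores = [0.91, 0.89, 0.55, 0.40]
kept = prune_primitives(pipelines, scores, top_k=2)
# only primitives from the two best pipelines survive
```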
This patch adds the embedding layer as a built-in operator in our nn/layers library.
The functionality is similar to torch.nn.Embedding
(https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html).
The layer receives indices as input, which refer to entries of an embedding
dictionary, and returns an embedding matrix where row i is the embedding
vector indices[i] of the embedding dictionary.
This layer is used in every transformer architecture: the indices
usually come from a tokenizer, and the embedding matrix is the input
to the actual transformer model.

Closes apache#2237
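The lookup semantics can be illustrated in NumPy (this mirrors the torch.nn.Embedding forward pass, not the DML implementation):

```python
import numpy as np

vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(42)
E = rng.standard_normal((vocab_size, embed_dim))  # embedding dictionary

indices = np.array([3, 1, 3, 7])  # e.g., token ids from a tokenizer
X = E[indices]                    # row i of X is E[indices[i]]

assert X.shape == (len(indices), embed_dim)
assert np.array_equal(X[0], X[2])  # same token id -> same embedding row
```

The backward pass scatter-adds the output gradient rows back into the rows of E selected by the indices.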
This patch fixes invalid left-hand-side as well as combined left- and
right-hand-side broadcasting in the new ampute builtin function. We now
have proper error handling in the hop to guide script developers, as
broadcasts can only be used from the right-hand side.
This patch also fixes various issues where the new error handling was too
strict because temporarily invalid hop configurations exist (e.g.,
in tests as well as while setting the outer config).
Invalid broadcasting and outer operations are to be fixed separately.
Following the stricter error handling for binary broadcast operations,
we found a number of issues in existing builtin scripts. In CP, the
runtime compensates for such incorrect operations (redirecting to
outer operations), but in Spark we cannot, because different RDD
operations are used before we see the block dimensions. This patch
accordingly also fixes other existing issues.
This patch adds the missing hoisting of DML function calls
(which always need to bind to variables) from basic if
predicates, for convenience and in order to prevent
unexpected errors. Furthermore, this patch simplifies the
existing DML-bodied ampute() builtin by using this feature
as well as calling the existing sigmoid() instead of a custom one.
This commit changes the memory estimates of many of the hops
to include the data type, indicating whether the output is a frame or a
matrix. It is included because I had to make one modification to a
unary op in BWARE, and in general it does not hurt to add it to many of
the op types (even if they do not return frame types at the moment).

Closes apache#2221
This patch fixes a compilation issue where certain if-branches were
lost during the hoisting of function calls from if statements. The issue
did not show up because, for the test instances, function calls in
if statements triggered rewrites of predicate constant folding and
branch removal. We also resolved all remaining FIXMEs in the new
ampute() builtin function.
dependabot Bot and others added 20 commits March 28, 2026 16:27
Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 5.5.3 to 6.0.0.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](codecov/codecov-action@v5.5.3...v6.0.0)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-version: 6.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Optimize dense matrix mult for transposed inputs

This introduces specialized kernels for dense matrix multiplication
involving transposed inputs (t(A)%*%B, A%*%t(B), t(A)%*%t(B)).
Previously, these operations required an explicit intermediate transpose
step, which caused unnecessary runtime overhead.

The new kernels perform the operations in place or via tiled
transposition, avoiding the allocation cost of the full transposed
intermediate.

Performance benchmarks on 100x100 dense matrices show significant
speedups, especially for t(A)%*%B and t(A)%*%t(B); the kernels can also
be tested with higher dimensions.

Closes apache#2425.
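Why no explicit transpose is needed can be illustrated in NumPy for the t(A)%*%B case (a sketch of the access pattern only, not the actual Java kernels): streaming over the rows of A and B and accumulating rank-1 updates never touches a transposed copy of A.

```python
import numpy as np

def matmult_tA_B(A, B):
    """Compute t(A) %*% B without materializing t(A): accumulate the
    outer products of matching rows of A and B (row-major friendly)."""
    C = np.zeros((A.shape[1], B.shape[1]))
    for k in range(A.shape[0]):    # stream over rows of both inputs
        C += np.outer(A[k], B[k])  # rank-1 update, no transpose copy
    return C

rng = np.random.default_rng(0)
A, B = rng.random((100, 8)), rng.random((100, 5))
assert np.allclose(matmult_tA_B(A, B), A.T @ B)
```

Real kernels would additionally block the loops into cache-friendly tiles instead of doing one full rank-1 update per row.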
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6 to 7.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](docker/build-push-action@v6...v7)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [docker/metadata-action](https://github.com/docker/metadata-action) from 5 to 6.
- [Release notes](https://github.com/docker/metadata-action/releases)
- [Commits](docker/metadata-action@v5...v6)

---
updated-dependencies:
- dependency-name: docker/metadata-action
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [docker/setup-buildx-action](https://github.com/docker/setup-buildx-action) from 3 to 4.
- [Release notes](https://github.com/docker/setup-buildx-action/releases)
- [Commits](docker/setup-buildx-action@v3...v4)

---
updated-dependencies:
- dependency-name: docker/setup-buildx-action
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Closes apache#2412.

Co-authored-by: bakiberkay <baki.b.uzel@campus.tu-berlin.de>
This patch introduces a new unimodal optimization strategy. The DAGs are executed in parallel via a memory-aware node scheduler and executor.
The scheduler tracks the dependencies between the nodes and records the memory used by the system in order to run only those nodes that fit within the memory limits. Every representation needs to provide a memory estimate so the scheduler knows which representation nodes can be scheduled. The memory estimates consist of a CPU and a GPU estimate, and the scheduler is aware of both.

(The hyperparameter tuner and multimodal optimizer still need to be adapted to this new approach; the tests are therefore currently disabled.)
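A minimal sketch of such memory-aware scheduling (hypothetical node names, CPU estimates only, and waves as a stand-in for truly parallel execution; the actual scheduler also tracks GPU estimates):

```python
def schedule(nodes, deps, estimates, budget):
    """Greedy memory-aware wave scheduling: each wave starts every node
    whose dependencies are finished and whose memory estimate still
    fits within the remaining budget."""
    done, waves = set(), []
    while len(done) < len(nodes):
        free, wave = budget, []
        for n in nodes:
            if n in done or any(d not in done for d in deps.get(n, [])):
                continue                    # finished or not yet runnable
            if estimates[n] <= free:
                wave.append(n)              # fits -> schedule in this wave
                free -= estimates[n]
        if not wave:
            raise RuntimeError("no runnable node fits the memory budget")
        waves.append(wave)
        done.update(wave)
    return waves

# Hypothetical representation DAG with memory estimates per node.
nodes = ["load", "bert", "resnet", "fuse"]
deps = {"bert": ["load"], "resnet": ["load"], "fuse": ["bert", "resnet"]}
estimates = {"load": 1, "bert": 4, "resnet": 3, "fuse": 2}
waves = schedule(nodes, deps, estimates, budget=8)
# "bert" and "resnet" fit the budget together and run in the same wave
```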
This patch adds a new loader for PDF files that converts all pages of a document into NumPy arrays processable by OpenCV.
Furthermore, it adds a loader that loads an audio file and converts it into a transcript using faster-whisper.
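The page-to-array step can be sketched as follows (with a stand-in page instead of actual PDF rasterization; a real loader would obtain pages from a rasterizer such as pdf2image, which is an assumption here, not necessarily the PR's dependency):

```python
import numpy as np

def page_to_cv_array(page_rgb):
    """Make a rasterized page usable by OpenCV: ensure a contiguous
    uint8 array and flip RGB -> BGR, since OpenCV expects BGR order."""
    arr = np.ascontiguousarray(page_rgb[:, :, ::-1])
    return arr.astype(np.uint8, copy=False)

# Stand-in for one rasterized page (height 32, width 64, RGB).
page = np.zeros((32, 64, 3), dtype=np.uint8)
page[..., 0] = 255                  # pure red in RGB
cv_page = page_to_cv_array(page)
assert cv_page.shape == (32, 64, 3)
assert tuple(cv_page[0, 0]) == (0, 0, 255)  # red again, but in BGR order
```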
This patch introduces a modality alignment operator to match previously
unaligned data based on feature similarity.

The operator computes similarities (e.g., ORB descriptors or perceptual
hashing) between a primary and a secondary modality and determines an
optimal matching. The implementation includes an abstract alignment
interface and concrete methods for ORB-based and p-hash-based image
alignment.

Instead of producing reordered modalities, the operator outputs a
matching that is applied after representation learning and before
fusion. This ensures consistent ordering and equal-length modalities for
downstream processing.
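Deriving a matching from pairwise similarities can be sketched with a greedy assignment (an illustrative strategy; the actual operator may resolve the assignment differently):

```python
import numpy as np

def greedy_match(similarity):
    """Derive a matching from a (primary x secondary) similarity matrix
    by repeatedly taking the globally best remaining pair. Returns the
    matched secondary index for each primary index (-1 if unmatched)."""
    sim = similarity.astype(float).copy()
    match = np.full(sim.shape[0], -1)
    for _ in range(min(sim.shape)):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        match[i] = j
        sim[i, :] = -np.inf  # consume the row ...
        sim[:, j] = -np.inf  # ... and the column
    return match

# Hypothetical similarities (e.g., from perceptual-hash distances).
sim = np.array([[0.9, 0.1, 0.2],
                [0.2, 0.8, 0.1],
                [0.3, 0.3, 0.7]])
match = greedy_match(sim)
assert match.tolist() == [0, 1, 2]
```

The resulting index mapping is what gets applied to reorder the secondary modality after representation learning and before fusion.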
…pairing

This patch introduces a new operator to Scuro for building contrastive
learning pipelines with greater flexibility in handling input modalities.

Previously, contrastive pairs had to be structurally aligned in a
preprocessing step before being used in Scuro. This limited the ability
to work with independently transformed or dynamically generated
modalities.

The new operator constructs contrastive pairs via a Cartesian product of
modalities and optionally extends them with additional modalities that
are already aligned. The resulting combinations are evaluated using a
user-defined function to determine whether a pair represents a positive
or negative sample. Based on this evaluation, the operator outputs both
the assigned label and the corresponding modality pair.

This design enables dynamic label generation and supports scenarios where
modalities are windowed, reshuffled, or transformed differently. It also
allows flexible fusion of modalities after contrastive pairing, improving
the expressiveness of contrastive learning workflows.

Limitations: The Cartesian product can introduce significant
computational overhead for large modality sets, which may require further
optimization.
@b-enedict b-enedict force-pushed the feat/alignment-operator branch from 76b705c to 8b18bd0 Compare April 23, 2026 08:47
@b-enedict b-enedict closed this Apr 23, 2026
@b-enedict b-enedict deleted the feat/alignment-operator branch April 23, 2026 08:50
@github-project-automation github-project-automation Bot moved this from In Progress to Done in SystemDS PR Queue Apr 23, 2026
