
[SYSTEMDS-??] Modality Alignment, Contrastive Learning, PDF and Transcript Loader#2459

Closed
b-enedict wants to merge 2929 commits into apache:main from b-enedict:feat/alignment-operator

Conversation

@b-enedict

Summary

This PR introduces new functionality for multimodal learning in Scuro, including a contrastive learning operator, a modality alignment operator, and additional data loaders.

Changes

Contrastive Learning Operator

  • Constructs modality pairs via a Cartesian product
  • Uses a user-defined function to label pairs as positive or negative
  • Enables dynamic generation of contrastive samples
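The pairing scheme above can be sketched as follows (sample data and function names are hypothetical, not Scuro's actual API):

```python
from itertools import product

def build_contrastive_pairs(modality_a, modality_b, label_fn):
    """Build (sample_a, sample_b, label) triples from the Cartesian
    product of two modalities; label_fn decides positive/negative."""
    return [(a, b, label_fn(a, b)) for a, b in product(modality_a, modality_b)]

# Hypothetical samples tagged with an instance id; pairs from the same
# instance are labeled positive (1), all others negative (0).
audio = [("a0", 0), ("a1", 1)]
video = [("v0", 0), ("v1", 1)]
same_instance = lambda a, b: int(a[1] == b[1])

pairs = build_contrastive_pairs(audio, video, same_instance)
# 2x2 samples -> 4 pairs, 2 of them positive
```

Because the label function is user-defined, positives can be derived from instance ids, time windows, or any other criterion.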

Modality Alignment Operator

  • Aligns previously unaligned modalities using feature-based similarity (e.g., ORB, perceptual hashing)
  • Outputs a matching between a primary and secondary modality
  • Matching is applied after representation learning and before fusion
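One of the similarity options, perceptual hashing, can be sketched with a simplified average hash (a toy stand-in for illustration, not the PR's actual p-hash or ORB code):

```python
import numpy as np

def average_hash(img, hash_size=8):
    """Toy perceptual hash: block-average a grayscale image down to
    hash_size x hash_size, then threshold each block against the mean."""
    bh, bw = img.shape[0] // hash_size, img.shape[1] // hash_size
    small = img[:bh * hash_size, :bw * hash_size] \
        .reshape(hash_size, bh, hash_size, bw).mean(axis=(1, 3))
    return (small > small.mean()).flatten()

def hamming(h1, h2):
    """Distance between two hashes: number of differing bits."""
    return int(np.sum(h1 != h2))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
# Brightness scaling preserves the ordering of block means, so the
# hash (and therefore the match) is unaffected.
assert hamming(average_hash(img), average_hash(img * 0.5)) == 0
```

A real implementation would also handle resizing and grayscale conversion; ORB-based alignment instead matches local keypoint descriptors between frames.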

Data Loaders

  • PDF loader: converts document pages into NumPy arrays for OpenCV processing
  • Audio loader: converts audio to text using faster-whisper

e-strauss and others added 30 commits March 31, 2025 15:22
The current inner class DMLGateWayListener implements functions from the GatewayServerListener interface that are never invoked by the GatewayServer (since the GatewayServer, which also implements GatewayServerListener, does not implement these methods). Furthermore, DMLGateWayListener previously called System.exit(), which I think is not correct, since it breaks the proper shutdown of the GatewayServer. Finally, this commit adds a new unit test that checks the functionality of the DMLGateWayListener. While merging, we verified that the additions did not introduce any regressions in startup and shutdown of the Python API.

Closes apache#2243
Currently, the format of our .asf.yaml sends emails to everyone on all commits. According to https://issues.apache.org/jira/browse/INFRA-26700, the error is in our project. This commit therefore removes the outputdir, as specified in: https://github.com/apache/infrastructure-asfyaml?tab=readme-ov-file#jekyll-cms
0) hard-coded server names / properties
1) windows line endings
2) memory configurations
3) datagen scripts
4) missing l2svm script
The perftest runMSVM_10k_1k_dense_k5 reproducibly failed on serializing
the parfor body program for a remote Spark parfor job, because there
were remaining Spark instructions (checkpoints). The reason was that
the forced CP compilation before such remote jobs was only applied in
a subset of cases, and special hop properties were not correctly cleaned
up. This patch fixes the general issue.
This patch fixes a perftest performance regression of
runMSVM_10k_1k_dense_k5, which ran in 36s instead of a few seconds in
earlier releases. The reason was unnecessary Spark context creation
during parfor optimization. We now handle these cluster info requests
more carefully, which avoids the unnecessary Spark context creation
and reduces the total runtime back to 5.9s.
This patch introduces a pruning technique on the cleaning pipeline
returned by the top-K cleaning. We identify a smaller yet equally
effective subset of primitives for all top-performing pipelines,
which optimizes their scoring performance.

Closes apache#2251
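The pruning idea can be sketched roughly as follows (hypothetical primitive names and a simplified criterion; the actual top-K cleaning implementation may differ):

```python
def prune_primitives(pipelines, scores, top_k=2):
    """Keep only the union of primitives used by the top-k scoring
    pipelines; primitives that never appear in a top performer are
    pruned from the search space, reducing scoring cost."""
    ranked = sorted(zip(pipelines, scores), key=lambda p: p[1], reverse=True)
    kept = set()
    for pipeline, _ in ranked[:top_k]:
        kept.update(pipeline)
    return kept

# Hypothetical candidate pipelines with their cross-validation scores.
pipelines = [["impute", "outlier", "scale"],
             ["impute", "scale"],
             ["dedup", "outlier"],
             ["dedup"]]
scores = [0.91, 0.89, 0.55, 0.40]
kept = prune_primitives(pipelines, scores, top_k=2)
# only primitives from the two best pipelines survive
```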
This patch adds the embedding layer as a built-in operator in our nn/layers library.
The functionality is similar to torch.nn.Embedding
(https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html).
The layer receives indices as input, which refer to entries of an embedding
dictionary, and returns an embedding matrix where row i is the embedding
vector indices[i] of the embedding dictionary.
This layer is used in every transformer architecture: the indices
usually come from a tokenizer, and the embedding matrix is the input
to the actual transformer model.

Closes apache#2237
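The lookup semantics can be illustrated in NumPy (this mirrors the torch.nn.Embedding forward pass, not the DML implementation):

```python
import numpy as np

vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(42)
E = rng.standard_normal((vocab_size, embed_dim))  # embedding dictionary

indices = np.array([3, 1, 3, 7])  # e.g., token ids from a tokenizer
X = E[indices]                    # row i of X is E[indices[i]]

assert X.shape == (len(indices), embed_dim)
assert np.array_equal(X[0], X[2])  # same token id -> same embedding row
```

The backward pass scatter-adds the output gradient rows back into the rows of E selected by the indices.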
This patch fixes invalid left-hand-side as well as combined left- and
right-hand-side broadcasting in the new ampute builtin function. We now
have proper error handling in the hop to guide script developers, as
broadcasts can only be used from the right-hand side.
This patch also fixes various issues where the new error handling was too
strict because temporarily invalid hop configurations exist (e.g.,
in tests as well as while setting the outer config).
Invalid broadcasting and outer operations are to be fixed separately.
Following the stricter error handling for binary broadcast operations,
we found a number of issues in existing builtin scripts. In CP, the
runtime compensates for such incorrect operations (redirecting to
outer operations), but in Spark we cannot, because different RDD
operations are used before we see the block dimensions. This patch
accordingly also fixes other existing issues.
This patch adds the missing hoisting of DML function calls
(which always need to bind to variables) from basic if
predicates, for convenience and in order to prevent
unexpected errors. Furthermore, this patch simplifies the
existing DML-bodied ampute() builtin by using this feature
as well as calling the existing sigmoid() instead of a custom one.
This commit changes the memory estimates of many of the hops
to include the data type, indicating whether the output is a frame or a
matrix. It is included because I had to make one modification to a
unary op in BWARE, and in general it does not hurt to add it to many of
the op types (even if they do not return frame types at the moment).

Closes apache#2221
This patch fixes a compilation issue where certain if-branches were
lost during the hoisting of function calls from if statements. The issue
did not show up because, for the test instances, function calls in
if statements triggered rewrites of predicate constant folding and
branch removal. We also resolved all remaining FIXMEs in the new
ampute() builtin function.
dependabot Bot and others added 20 commits March 28, 2026 16:27
Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 5.5.3 to 6.0.0.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](codecov/codecov-action@v5.5.3...v6.0.0)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-version: 6.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Optimize dense matrix mult for transposed inputs

This introduces specialized kernels for dense matrix multiplication
involving transposed inputs (t(A)%*%B, A%*%t(B), t(A)%*%t(B)).
Previously, these operations required an explicit intermediate transpose
step, which caused unnecessary runtime overhead.

The new kernels perform the operations in place or via tiled
transposition, avoiding the allocation cost of the full transposed
intermediate.

Performance benchmarks on 100x100 dense matrices show significant
speedups, especially for t(A)%*%B and t(A)%*%t(B); the kernels can also
be tested with higher dimensions.

Closes apache#2425.
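Why no explicit transpose is needed can be illustrated in NumPy for the t(A)%*%B case (a sketch of the access pattern only, not the actual Java kernels): streaming over the rows of A and B and accumulating rank-1 updates never touches a transposed copy of A.

```python
import numpy as np

def matmult_tA_B(A, B):
    """Compute t(A) %*% B without materializing t(A): accumulate the
    outer products of matching rows of A and B (row-major friendly)."""
    C = np.zeros((A.shape[1], B.shape[1]))
    for k in range(A.shape[0]):    # stream over rows of both inputs
        C += np.outer(A[k], B[k])  # rank-1 update, no transpose copy
    return C

rng = np.random.default_rng(0)
A, B = rng.random((100, 8)), rng.random((100, 5))
assert np.allclose(matmult_tA_B(A, B), A.T @ B)
```

Real kernels would additionally block the loops into cache-friendly tiles instead of doing one full rank-1 update per row.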
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6 to 7.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](docker/build-push-action@v6...v7)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [docker/metadata-action](https://github.com/docker/metadata-action) from 5 to 6.
- [Release notes](https://github.com/docker/metadata-action/releases)
- [Commits](docker/metadata-action@v5...v6)

---
updated-dependencies:
- dependency-name: docker/metadata-action
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [docker/setup-buildx-action](https://github.com/docker/setup-buildx-action) from 3 to 4.
- [Release notes](https://github.com/docker/setup-buildx-action/releases)
- [Commits](docker/setup-buildx-action@v3...v4)

---
updated-dependencies:
- dependency-name: docker/setup-buildx-action
  dependency-version: '4'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Closes apache#2412.

Co-authored-by: bakiberkay <baki.b.uzel@campus.tu-berlin.de>
This patch introduces a new unimodal optimization strategy. The DAGs are executed in parallel via a memory-aware node scheduler and executor.
The scheduler tracks the dependencies between the nodes and records the memory used by the system in order to run only those nodes that fit within the memory limits. Every representation needs to provide a memory estimate so the scheduler knows which representation nodes can be scheduled. The memory estimates consist of a CPU and a GPU estimate, and the scheduler is aware of both.

(The hyperparameter tuner and multimodal optimizer still need to be adapted to this new approach; the tests are therefore currently disabled.)
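A minimal sketch of such memory-aware scheduling (hypothetical node names, CPU estimates only, and waves as a stand-in for truly parallel execution; the actual scheduler also tracks GPU estimates):

```python
def schedule(nodes, deps, estimates, budget):
    """Greedy memory-aware wave scheduling: each wave starts every node
    whose dependencies are finished and whose memory estimate still
    fits within the remaining budget."""
    done, waves = set(), []
    while len(done) < len(nodes):
        free, wave = budget, []
        for n in nodes:
            if n in done or any(d not in done for d in deps.get(n, [])):
                continue                    # finished or not yet runnable
            if estimates[n] <= free:
                wave.append(n)              # fits -> schedule in this wave
                free -= estimates[n]
        if not wave:
            raise RuntimeError("no runnable node fits the memory budget")
        waves.append(wave)
        done.update(wave)
    return waves

# Hypothetical representation DAG with memory estimates per node.
nodes = ["load", "bert", "resnet", "fuse"]
deps = {"bert": ["load"], "resnet": ["load"], "fuse": ["bert", "resnet"]}
estimates = {"load": 1, "bert": 4, "resnet": 3, "fuse": 2}
waves = schedule(nodes, deps, estimates, budget=8)
# "bert" and "resnet" fit the budget together and run in the same wave
```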
This patch adds a new loader for PDF files that converts all pages of a document into NumPy arrays processable by OpenCV.
Furthermore, it adds a loader that loads an audio file and converts it into a transcript using faster-whisper.
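The page-to-array step can be sketched as follows (with a stand-in page instead of actual PDF rasterization; a real loader would obtain pages from a rasterizer such as pdf2image, which is an assumption here, not necessarily the PR's dependency):

```python
import numpy as np

def page_to_cv_array(page_rgb):
    """Make a rasterized page usable by OpenCV: ensure a contiguous
    uint8 array and flip RGB -> BGR, since OpenCV expects BGR order."""
    arr = np.ascontiguousarray(page_rgb[:, :, ::-1])
    return arr.astype(np.uint8, copy=False)

# Stand-in for one rasterized page (height 32, width 64, RGB).
page = np.zeros((32, 64, 3), dtype=np.uint8)
page[..., 0] = 255                  # pure red in RGB
cv_page = page_to_cv_array(page)
assert cv_page.shape == (32, 64, 3)
assert tuple(cv_page[0, 0]) == (0, 0, 255)  # red again, but in BGR order
```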
This patch introduces a modality alignment operator to match previously
unaligned data based on feature similarity.

The operator computes similarities (e.g., ORB descriptors or perceptual
hashing) between a primary and a secondary modality and determines an
optimal matching. The implementation includes an abstract alignment
interface and concrete methods for ORB-based and p-hash-based image
alignment.

Instead of producing reordered modalities, the operator outputs a
matching that is applied after representation learning and before
fusion. This ensures consistent ordering and equal-length modalities for
downstream processing.
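Deriving a matching from pairwise similarities can be sketched with a greedy assignment (an illustrative strategy; the actual operator may resolve the assignment differently):

```python
import numpy as np

def greedy_match(similarity):
    """Derive a matching from a (primary x secondary) similarity matrix
    by repeatedly taking the globally best remaining pair. Returns the
    matched secondary index for each primary index (-1 if unmatched)."""
    sim = similarity.astype(float).copy()
    match = np.full(sim.shape[0], -1)
    for _ in range(min(sim.shape)):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        match[i] = j
        sim[i, :] = -np.inf  # consume the row ...
        sim[:, j] = -np.inf  # ... and the column
    return match

# Hypothetical similarities (e.g., from perceptual-hash distances).
sim = np.array([[0.9, 0.1, 0.2],
                [0.2, 0.8, 0.1],
                [0.3, 0.3, 0.7]])
match = greedy_match(sim)
assert match.tolist() == [0, 1, 2]
```

The resulting index mapping is what gets applied to reorder the secondary modality after representation learning and before fusion.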
…pairing

This patch introduces a new operator to Scuro for building contrastive
learning pipelines with greater flexibility in handling input modalities.

Previously, contrastive pairs had to be structurally aligned in a
preprocessing step before being used in Scuro. This limited the ability
to work with independently transformed or dynamically generated
modalities.

The new operator constructs contrastive pairs via a Cartesian product of
modalities and optionally extends them with additional modalities that
are already aligned. The resulting combinations are evaluated using a
user-defined function to determine whether a pair represents a positive
or negative sample. Based on this evaluation, the operator outputs both
the assigned label and the corresponding modality pair.

This design enables dynamic label generation and supports scenarios where
modalities are windowed, reshuffled, or transformed differently. It also
allows flexible fusion of modalities after contrastive pairing, improving
the expressiveness of contrastive learning workflows.

Limitations: The Cartesian product can introduce significant
computational overhead for large modality sets, which may require further
optimization.
@b-enedict b-enedict force-pushed the feat/alignment-operator branch from 76b705c to 8b18bd0 Compare April 23, 2026 08:47
@b-enedict b-enedict closed this Apr 23, 2026
@b-enedict b-enedict deleted the feat/alignment-operator branch April 23, 2026 08:50
@github-project-automation github-project-automation Bot moved this from In Progress to Done in SystemDS PR Queue Apr 23, 2026
