refactor: initialize translation pipeline from config #44

Merged
ClemDoum merged 1 commit into main from refactor(translation-worker)/from-config
Jun 8, 2026
@ClemDoum ClemDoum commented May 28, 2026

⚠️ breaking (translated content changed from map to list)

Description

Make the translation worker API more generic by allowing multiple translation pipeline implementations.
Initialize the translation pipeline from a config object, and decouple the instantiation of pipeline components from the loading of language-specific resources:

json_config = '{"sentence_splitter": {"model": "ARGOS"}, "translator": {"model": "ARGOS"}}'
config = TranslationConfig.model_validate_json(json_config)
translator = config.to_translator()
sentence_splitter = config.to_sentence_splitter()

with translator.load(source="en", target="es"), sentence_splitter.load(language="en"):
    ...

Fixed the translation format to be consistent with the translations produced by the ES translator.

Changes

datashare-python

Added

  • added DatashareLanguage to reflect DS language formatting and validation (uppercase language names)
  • added IETFLanguage to support locales
  • defined Language = DatashareLanguage | IETFLanguage
  • added Translation to reflect the translation format in DS

Fixed

  • changed Document.content_translated from a dict[str, str] to a list[Translation]
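
Since this is the breaking change, callers holding data in the old shape may need a conversion step. The sketch below shows one possible migration. It rests on assumptions: the `Translation` fields shown here are hypothetical (check the real class for the actual ones), the old map is assumed to be keyed by target language, and `translations_from_map` is an illustrative helper that does not exist in the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    # Hypothetical fields, chosen for illustration only.
    source_language: str
    target_language: str
    translator: str
    content: str


def translations_from_map(
    old: dict[str, str], *, source_language: str, translator: str
) -> list[Translation]:
    """Convert a legacy {target_language: content} map into a list of Translation.

    Assumes the old dict was keyed by target language; the source language and
    translator name are not recoverable from the map, so the caller supplies them.
    """
    return [
        Translation(
            source_language=source_language,
            target_language=target_language,
            translator=translator,
            content=content,
        )
        for target_language, content in old.items()
    ]
```

A list of records can hold several translations for the same target language (for example from different translators), which a map keyed by language cannot.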

translation-worker

Added

  • added the Translator and SentenceSplitter abstractions and made the Argos components inherit from them
  • updated the implementation to allow initializing a translation pipeline from a TranslationConfig
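
To make the abstraction and the config-driven initialization more concrete, here is a rough standard-library sketch of the pattern. It is not the worker's implementation: `SketchConfig`, `UppercaseTranslator`, the `translate` method, and the registry are hypothetical, and the real `TranslationConfig` is parsed with `model_validate_json` rather than `json.loads`. Only the `Translator` name, `to_translator()`, and the `load(source=..., target=...)` context manager come from the PR description. A `SentenceSplitter` would follow the same shape.

```python
import json
from abc import ABC, abstractmethod
from contextlib import AbstractContextManager, contextmanager
from dataclasses import dataclass


class Translator(ABC):
    """Instantiation is cheap; language-specific resources live inside load()."""

    @abstractmethod
    def load(self, source: str, target: str) -> AbstractContextManager[None]:
        """Hold resources for one language pair for the duration of a `with` block."""

    @abstractmethod
    def translate(self, sentences: list[str]) -> list[str]:
        """Hypothetical method: translate sentences with the loaded resources."""


class UppercaseTranslator(Translator):
    """Toy implementation; a real one would load model weights in load()."""

    def __init__(self) -> None:
        self._pair: tuple[str, str] | None = None

    @contextmanager
    def load(self, source: str, target: str):
        self._pair = (source, target)
        try:
            yield
        finally:
            self._pair = None  # release resources when the block exits

    def translate(self, sentences: list[str]) -> list[str]:
        if self._pair is None:
            raise RuntimeError("call load() before translate()")
        return [sentence.upper() for sentence in sentences]


# Hypothetical registry mapping a config "model" value to an implementation.
_TRANSLATORS: dict[str, type[Translator]] = {"UPPERCASE": UppercaseTranslator}


@dataclass(frozen=True)
class SketchConfig:
    translator_model: str

    @classmethod
    def from_json(cls, raw: str) -> "SketchConfig":
        return cls(translator_model=json.loads(raw)["translator"]["model"])

    def to_translator(self) -> Translator:
        return _TRANSLATORS[self.translator_model]()
```

Keeping resource loading in a context manager means the same component instance can be reused across language pairs, with each pair's resources released when its `with` block exits.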

Changed

  • refactored batching to allow multiple workers to process batches from the same source language. Running multiple batch translations inside the same worker is no longer allowed. For parallel processing we rely solely on CUDA batch processing and horizontal scaling
  • refactored translation in a publisher/consumer fashion, where the publisher translates batches and populates an asyncio Queue whenever a translation buffer is full. The consumer consumes the queue and concurrently writes translations to ES
  • improved logging
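
The publisher/consumer flow described above could look roughly like the sketch below. It is built on assumptions: `translate_batch` is a placeholder for the real translation call, the list `sink` stands in for Elasticsearch writes, the sentinel-based shutdown is one option among several, and running the translation in a thread via `asyncio.to_thread` is a choice made for this example rather than something the PR states.

```python
import asyncio

_DONE = object()  # sentinel telling the consumer that no more items will come


def translate_batch(batch: list[str]) -> list[str]:
    """Placeholder for the real, blocking translation call."""
    return [sentence.upper() for sentence in batch]


async def publish(
    batches: list[list[str]], queue: asyncio.Queue, buffer_size: int
) -> None:
    buffer: list[str] = []
    for batch in batches:
        # Run the blocking call in a thread so the consumer keeps making progress.
        buffer.extend(await asyncio.to_thread(translate_batch, batch))
        if len(buffer) >= buffer_size:
            await queue.put(buffer)  # hand over a full buffer
            buffer = []
    if buffer:
        await queue.put(buffer)  # flush whatever is left
    await queue.put(_DONE)


async def consume(queue: asyncio.Queue, sink: list[str]) -> None:
    while (item := await queue.get()) is not _DONE:
        await asyncio.sleep(0)  # stand-in for an awaited bulk write to ES
        sink.extend(item)


async def run(batches: list[list[str]], buffer_size: int = 4) -> list[str]:
    # A bounded queue applies backpressure if writes are slower than translation.
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    sink: list[str] = []
    await asyncio.gather(publish(batches, queue, buffer_size), consume(queue, sink))
    return sink
```

Calling `asyncio.run(run(batches))` drives both coroutines until the publisher has flushed its last buffer and the consumer has seen the sentinel.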

@ClemDoum ClemDoum force-pushed the refactor(translation-worker)/from-config branch 7 times, most recently from 078f655 to 0396302 June 1, 2026 10:24
@ClemDoum ClemDoum force-pushed the refactor(translation-worker)/from-config branch from 0396302 to 5f441e0 June 1, 2026 10:28
@ClemDoum ClemDoum marked this pull request as ready for review June 1, 2026 11:01
ClemDoum commented Jun 1, 2026

Addresses: #26

@ClemDoum ClemDoum merged commit 41d44cd into main Jun 8, 2026
8 checks passed
@ClemDoum ClemDoum deleted the refactor(translation-worker)/from-config branch June 8, 2026 11:35