⚙️ Distributed Task Scheduler

A high-performance, distributed task scheduling system designed to handle delayed jobs, background processing, retries, and real-time job orchestration at scale. Inspired by systems like Celery, Sidekiq, and Quartz, this project demonstrates key concepts in distributed systems, backend engineering, and system design—aligned with industry expectations.

🧾 Abstract

This project presents the design and development of a Distributed Task Scheduler that orchestrates task execution based on Directed Acyclic Graphs (DAGs). Built using Spring Boot, Kafka, Redis, and Kubernetes, the system ensures reliable, scalable, and fault-tolerant execution of interdependent tasks across distributed nodes. The scheduler supports real-time execution, failure recovery, retry mechanisms, and dependency management. It is suitable for cloud-native workloads such as data pipelines, CI/CD processes, and backend automation workflows.

❗ Problem Statement

Modern enterprise systems require execution of complex workflows that involve multiple interdependent tasks across distributed environments. Traditional schedulers lack scalability, dynamic task control, or failover handling. This project addresses the problem by developing a DAG-based distributed task scheduler that efficiently manages task dependencies, execution order, and fault tolerance, providing a modular, scalable, and reusable solution for orchestrating cloud-native operations.

🔧 Examples of Traditional Schedulers & Their Limitations

Traditional Scheduler	Limitations Compared to Your Project
Cron / crontab (Linux)	- No support for task dependencies - No retry/failure handling - No distribution across nodes - Hard-coded/static schedule
Quartz Scheduler	- Mostly designed for monolithic or single-node systems - Limited support for distributed fault-tolerant execution - Lacks native DAG task orchestration
Spring @Scheduled	- Embedded in single-instance Spring apps - No parallel or distributed task execution - No dependency resolution
Windows Task Scheduler	- Single-machine only - Not cloud-native - No concept of task graphs or fault recovery
Airflow (Pre-optimized)	- Airflow itself is DAG-based and powerful, but not lightweight for embedded use - Setup overhead for small or custom environments

✅ What this Project Improves

DAG awareness → Executes tasks only after dependencies complete
Distributed execution → Tasks can run across nodes (via Kafka or REST)
Resilience → Uses Redis to track states; retries failed tasks
Scalability → Can handle many parallel workflows dynamically
Modular → Pluggable and lightweight compared to heavy-duty schedulers like Airflow

In essence, this project fills the gap where traditional tools are either too simple (like cron) or too heavy (like Airflow) by offering a middle-ground: lightweight, DAG-aware, distributed, and scalable.

Traditional schedulers such as cron, Quartz, and Spring's @Scheduled are widely used for basic task automation but fall short when it comes to handling complex, interdependent workflows in a distributed environment. These tools typically lack support for dynamic task control, dependency resolution, failover handling, and scalable execution across multiple nodes. For example, cron and Quartz operate on static time-based triggers without awareness of task dependencies, while Spring @Scheduled is limited to single-instance execution within a monolithic application. Even more advanced solutions like Apache Airflow offer DAG-based scheduling but are often heavyweight and require significant setup, making them less suitable for lightweight or embedded use cases. This project addresses these limitations by developing a DAG-based Distributed Task Scheduler that offers a modular, scalable, and fault-tolerant approach to orchestrating cloud-native operations. By leveraging technologies like Spring Boot, Redis, Kafka, and Kubernetes, it enables efficient task orchestration with built-in support for retries, execution tracking, and dynamic dependency resolution.

🔬 Research Scope (for Distributed Task Scheduler)

The project covers an exploratory and empirical study in the field of distributed systems and cloud-native task orchestration. It investigates the design and implementation of a fault-tolerant, DAG-based task scheduling system using modern backend technologies like Spring Boot, Kafka, Redis, and Kubernetes. The system is applicable to real-world scenarios such as workflow automation, CI/CD pipelines, and data engineering job execution. The research may target a general industry area—specifically cloud infrastructure and DevOps automation—and can also be adapted for a specific enterprise use case like internal microservices coordination or multi-tenant job execution systems.

📊 Data Requirements

This project primarily utilizes secondary data in the form of system-generated task metadata, execution logs, and DAG structures. The scheduler does not rely on external real-world datasets but instead works with internally defined task definitions, job dependencies, and runtime status data. The data includes:

Task metadata: Task ID, task name, type, command to execute, retry count, timeout, etc.
DAG structure: Representation of task dependencies (edges and nodes)
Job metadata: Job ID, user ID (if multi-tenant), job status (pending, running, failed, complete)
Execution logs: Timestamps, outcomes, retries, error messages
Redis Data: Stores task state, queues, and interim progress
Kafka Events: Used for task communication and status propagation across services

Thus, the project uses a combination of simulated primary data (created by the scheduler itself during runtime) and secondary system-level data, which are both crucial for evaluating task execution, fault tolerance, and performance analysis.

📊 Table: Data Requirements for Distributed Task Scheduler

Data Type	Description	Source	Usage
Task Metadata	Task ID, name, type, command, retries, timeout	Defined by user or system config	Used to schedule and execute individual tasks
DAG Structure	Nodes (tasks) and Edges (dependencies between tasks)	User-defined / System-generated	Determines execution order based on dependencies
Job Metadata	Job ID, job name, user ID, status (pending/running/failed/completed)	Internally generated	Tracks overall workflow execution
Execution Logs	Start time, end time, success/failure status, error messages	Captured during runtime	Used for debugging and audit
Redis Data Store	Task states, queues, retry counters, heartbeat/status values	System-level	Enables distributed state management
Kafka Events	Task execution commands, status updates, retry triggers	System-generated	Enables asynchronous and decoupled communication

📁 Sample Mock Data for Distributed Task Scheduler

🧩 1. Task Metadata

{
  "taskId": "task_101",
  "taskName": "SendEmail",
  "type": "email",
  "command": "python send_email.py",
  "retries": 3,
  "timeout": 60
}

🔗 2. DAG Structure

{
  "jobId": "job_202",
  "tasks": ["task_101", "task_102", "task_103"],
  "dependencies": {
    "task_102": ["task_101"],
    "task_103": ["task_102"]
  }
}

📋 3. Job Metadata

{
  "jobId": "job_202",
  "jobName": "DailyReportPipeline",
  "userId": "user_1",
  "status": "running"
}

📝 4. Execution Logs

{
  "taskId": "task_101",
  "startTime": "2025-04-20T10:00:00Z",
  "endTime": "2025-04-20T10:00:05Z",
  "status": "success",
  "logs": "Email sent successfully to 300 users."
}

🧠 5. Redis Entry (Sample)

{
  "key": "job_202:task_101:status",
  "value": "completed"
}

🔄 6. Kafka Event

{
  "eventType": "TASK_COMPLETE",
  "jobId": "job_202",
  "taskId": "task_101",
  "status": "success",
  "timestamp": "2025-04-20T10:00:05Z"
}

🧾 Project Overview

Title: Design and Implementation of a Distributed Task Scheduler for Scalable Background Job Processing
Objective: To build a scalable, fault-tolerant system for managing time-based and background jobs in a distributed environment.

This project was developed as part of my academic coursework in the MCA program, with a focus on distributed systems and backend engineering.

🔥 Core Features

🧠 Job queuing and priority support
⏳ Delayed and recurring job scheduling
🔁 Retry with exponential backoff
💥 Dead-letter queue for failed jobs
🧵 Multi-threaded worker execution
🚨 Real-time monitoring and job status tracking
✅ Idempotency and fault tolerance
🐳 Dockerized and ready for deployment

🚀 Advanced Enhancements

🔗 Job dependencies and chaining
📩 Webhook & email notifications on job completion/failure
🔐 JWT-based API authentication
📈 Prometheus + Grafana dashboards for metrics
📦 CI/CD integration with GitHub Actions
⚙️ Configurable job scheduler profiles
☸️ Kubernetes deployment support
⚠️ Dynamic pause/resume of workers
🧪 Load testing scripts for benchmarking

💻 Tech Stack

Layer	Technology
Language	Java 17
Framework	Spring Boot
Messaging	Kafka / RabbitMQ
Storage	Redis / PostgreSQL
Orchestration	Docker / Docker Compose
Monitoring	Prometheus + Grafana
CI/CD	GitHub Actions
Deployment	Kubernetes (Minikube)

🏗️ Architecture

Client → REST API (Spring Boot)
             ↓
        Kafka Queue
             ↓
        Task Processor (Worker Pool)
             ↓
   - Execute Task Logic
   - Retry or DLQ on Failure
             ↓
     Update Task Status in Redis

🎨 UI

Dashboard

Tasks

DAG Visualizer

DLQ

Other screens | Work in progress

📦 Project Modules

api-service: REST endpoints to submit tasks
scheduler-core: Core scheduling logic and cron handling
task-consumer: Kafka/RabbitMQ consumer and executor
monitoring: Prometheus metrics and health checks
notifier-service: Sends webhook/email notifications
config-server: Loads job configs dynamically
k8s-deployment: YAMLs for container orchestration

🚀 Getting Started

Prerequisites

Java 17+
Docker & Docker Compose
Apache Kafka or RabbitMQ

Setup

# Clone the repo
git clone https://github.com/your-username/distributed-task-scheduler.git
cd distributed-task-scheduler

# Start required services
docker-compose up -d

# Run the application
./mvnw spring-boot:run

🧪 Example API Usage

POST /tasks/schedule

{
  "type": "email",
  "payload": { "to": "user@example.com", "subject": "Hello" },
  "delaySeconds": 30,
  "retries": 3
}

📊 Monitoring

Prometheus metrics at /actuator/prometheus
Grafana dashboard visualizing:
- Job status count
- Execution time
- Retry and failure rate

📚 Relevance

This project showcases expertise in:

Distributed Systems & Message Queues
Cloud-Native Backend Architecture
Fault-Tolerant, Scalable Services
Observability and DevOps Practices
Real-World Microservices Development

🙋‍♂️ Author

Raj Kishor Maharana – Full stack developer | Exploring Distributed Systems

LinkedIn • Email

📜 License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 127 Commits
.idea		.idea
docs		docs
express-cron-client		express-cron-client
gradle/wrapper		gradle/wrapper
infrastructure		infrastructure
runner		runner
webhook-mock-server		webhook-mock-server
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
Distributed_Task_Scheduler_MCA_Project.docx		Distributed_Task_Scheduler_MCA_Project.docx
LICENSE		LICENSE
Project Check List.md		Project Check List.md
Project Tagging and Release.md		Project Tagging and Release.md
README.md		README.md
build.gradle		build.gradle
gradlew		gradlew
gradlew.bat		gradlew.bat
settings.gradle		settings.gradle

Folders and files

Latest commit

History

Repository files navigation

⚙️ Distributed Task Scheduler

🧾 Abstract

❗ Problem Statement

🔧 Examples of Traditional Schedulers & Their Limitations

✅ What this Project Improves

🔬 Research Scope (for Distributed Task Scheduler)

📊 Data Requirements

📊 Table: Data Requirements for Distributed Task Scheduler

📁 Sample Mock Data for Distributed Task Scheduler

🧩 1. Task Metadata

🔗 2. DAG Structure

📋 3. Job Metadata

📝 4. Execution Logs

🧠 5. Redis Entry (Sample)

🔄 6. Kafka Event

🧾 Project Overview

🔥 Core Features

🚀 Advanced Enhancements

💻 Tech Stack

🏗️ Architecture

🎨 UI

📦 Project Modules

🚀 Getting Started

Prerequisites

Setup

🧪 Example API Usage

📊 Monitoring

📚 Relevance

🙋‍♂️ Author

📜 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages