1 change: 1 addition & 0 deletions README.md
@@ -196,6 +196,7 @@ There are two ways of interacting with AliECS:
* Kubernetes
* [Operator controller](/control-operator/README.md)
* [Testing manifests](/control-operator/ecs-manifests/kubernetes-ecs.md)
* [ECS bridge to Kubernetes](/docs/kubernetes_ecs.md)
* Resources
* T. Mrnjavac et al., [AliECS: A New Experiment Control System for the ALICE Experiment](https://doi.org/10.1051/epjconf/202429502027), CHEP23

10 changes: 10 additions & 0 deletions common/controlmode/controlmode.go
@@ -39,6 +39,8 @@ const (
FAIRMQ
BASIC
HOOK
KUBECTL_DIRECT
KUBECTL_FAIRMQ
)

func (cm ControlMode) String() string {
@@ -51,6 +53,10 @@ func (cm ControlMode) String() string {
return "basic"
case HOOK:
return "hook"
case KUBECTL_DIRECT:
return "kubectl_direct"
case KUBECTL_FAIRMQ:
return "kubectl_fairmq"
}
return "direct"
}
@@ -71,6 +77,10 @@ func (cm *ControlMode) UnmarshalText(b []byte) error {
*cm = BASIC
case "hook":
*cm = HOOK
case "kubectl_direct":
*cm = KUBECTL_DIRECT
case "kubectl_fairmq":
*cm = KUBECTL_FAIRMQ
default:
*cm = DIRECT
}
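The new modes round-trip through `String` and `UnmarshalText`. As a sketch, a minimal self-contained re-implementation of the pattern (for illustration only; the real type lives in `common/controlmode`):

```go
package main

import "fmt"

// Minimal re-implementation of the ControlMode pattern from
// common/controlmode, reduced to the modes relevant here.
type ControlMode int

const (
	DIRECT ControlMode = iota
	KUBECTL_DIRECT
	KUBECTL_FAIRMQ
)

func (cm ControlMode) String() string {
	switch cm {
	case KUBECTL_DIRECT:
		return "kubectl_direct"
	case KUBECTL_FAIRMQ:
		return "kubectl_fairmq"
	}
	return "direct"
}

// UnmarshalText falls back to DIRECT for unknown inputs,
// mirroring the default case in the diff above.
func (cm *ControlMode) UnmarshalText(b []byte) error {
	switch string(b) {
	case "kubectl_direct":
		*cm = KUBECTL_DIRECT
	case "kubectl_fairmq":
		*cm = KUBECTL_FAIRMQ
	default:
		*cm = DIRECT
	}
	return nil
}

func main() {
	var cm ControlMode
	_ = cm.UnmarshalText([]byte("kubectl_fairmq"))
	fmt.Println(cm) // kubectl_fairmq
}
```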
29 changes: 21 additions & 8 deletions control-operator/README.md
@@ -1,63 +1,76 @@
# operator
// TODO(user): Add simple overview of use/purpose

Folder with operators for Task and Environment deployment.

## Description
// TODO(user): An in-depth paragraph about your project and overview of use

In order to deploy Task and Environment workflows to the k8s cluster, you need controllers and operators
managing custom CRDs that define the ALICE custom workload. This folder defines and implements all the moving parts, together with a Makefile
to build, deploy, and install the CRDs and operators.

## Getting Started
You’ll need a Kubernetes cluster to run against. You can use [KIND](https://sigs.k8s.io/kind) to get a local cluster for testing, or run against a remote cluster.
**Note:** Your controller will automatically use the current context in your kubeconfig file (i.e. whatever cluster `kubectl cluster-info` shows).

You’ll need a Kubernetes cluster to run against. You can use [KIND](https://sigs.k8s.io/kind) to get a local cluster for testing, or run against a remote cluster. The author had the most success with K3s ([see](/docs/kubernetes_ecs.md)).
**Note:** Your controller will automatically use the current context in your kubeconfig file (usually `~/.kube/config`), i.e. whatever cluster `kubectl cluster-info` shows.

### Running on the cluster

The following commands show basic use of the Makefile. However, this is not an exhaustive list.

1. Install Instances of Custom Resources:

```sh
kubectl apply -f config/samples/
```

2. Build and push your image to the location specified by `IMG`:
1. Build and push your image to the location specified by `IMG`:

```sh
make docker-build docker-push IMG=<some-registry>/operator:tag
```

3. Deploy the controller to the cluster with the image specified by `IMG`:
1. Deploy the controller to the cluster with the image specified by `IMG`:

```sh
make deploy IMG=<some-registry>/operator:tag
```

### Uninstall CRDs

To delete the CRDs from the cluster:

```sh
make uninstall
```

### Undeploy controller

Undeploy the controller from the cluster:

```sh
make undeploy
```

## Contributing

// TODO(user): Add detailed information on how you would like others to contribute to this project

### How it works

This project aims to follow the Kubernetes [Operator pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/).

It uses [Controllers](https://kubernetes.io/docs/concepts/architecture/controller/),
which provide a reconcile function responsible for synchronizing resources until the desired state is reached on the cluster.

### Test It Out

1. Install the CRDs into the cluster:

```sh
make install
```

2. Run your controller (this will run in the foreground, so switch to a new terminal if you want to leave it running):
1. Run your controller (this will run in the foreground, so switch to a new terminal if you want to leave it running):

```sh
make run
@@ -66,6 +79,7 @@ make run
**NOTE:** You can also run this in one step by running: `make install run`

### Modifying the API definitions

If you are editing the API definitions, generate the manifests such as CRs or CRDs using:

```sh
@@ -91,4 +105,3 @@ distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

3 changes: 2 additions & 1 deletion core/task/scheduler.go
@@ -1432,7 +1432,8 @@ func makeTaskForMesosResources(
cmd.Env = append(cmd.Env, fmt.Sprintf("%s=%d", "OCC_CONTROL_PORT", controlPort))
}

if cmd.ControlMode == controlmode.FAIRMQ {
if cmd.ControlMode == controlmode.FAIRMQ ||
cmd.ControlMode == controlmode.KUBECTL_FAIRMQ {
cmd.Arguments = append(cmd.Arguments, "--control-port", strconv.FormatUint(controlPort, 10))
}

11 changes: 8 additions & 3 deletions core/task/task.go
@@ -286,7 +286,9 @@ func (t *Task) BuildTaskCommand(role parentRole) (err error) {
if class.Control.Mode == controlmode.BASIC ||
class.Control.Mode == controlmode.HOOK ||
class.Control.Mode == controlmode.DIRECT ||
class.Control.Mode == controlmode.FAIRMQ {
class.Control.Mode == controlmode.FAIRMQ ||
class.Control.Mode == controlmode.KUBECTL_DIRECT ||
class.Control.Mode == controlmode.KUBECTL_FAIRMQ {
var varStack map[string]string

// First we get the full varStack from the parent role, and
@@ -393,7 +395,8 @@ func (t *Task) BuildTaskCommand(role parentRole) (err error) {
}
}

if class.Control.Mode == controlmode.FAIRMQ {
if class.Control.Mode == controlmode.FAIRMQ ||
class.Control.Mode == controlmode.KUBECTL_FAIRMQ {
// FIXME read this from configuration
// if the task class doesn't provide an id, we generate one ourselves
if !utils.StringSliceContains(cmd.Arguments, "--id") {
@@ -635,7 +638,9 @@ func (t *Task) BuildPropertyMap(bindMap channel.BindMap) (propMap controlcommand

// For FAIRMQ tasks, we append FairMQ channel configuration
if class.Control.Mode == controlmode.FAIRMQ ||
class.Control.Mode == controlmode.DIRECT {
class.Control.Mode == controlmode.DIRECT ||
class.Control.Mode == controlmode.KUBECTL_DIRECT ||
class.Control.Mode == controlmode.KUBECTL_FAIRMQ {
for _, inbCh := range channel.MergeInbound(parent.CollectInboundChannels(), class.Bind) {
// We get the FairMQ-formatted propertyMap from the inbound channel spec
var chanProps controlcommands.PropertyMap
19 changes: 11 additions & 8 deletions core/task/taskclass/class.go
@@ -123,7 +123,6 @@ func (c *Class) UnmarshalYAML(unmarshal func(interface{}) error) (err error) {
}
}
return

}

func (c *Class) MarshalYAML() (interface{}, error) {
@@ -154,13 +153,17 @@ func (c *Class) MarshalYAML() (interface{}, error) {
Command: c.Command,
}

if c.Control.Mode == controlmode.FAIRMQ {
aux.Control.Mode = "fairmq"
} else if c.Control.Mode == controlmode.BASIC {
aux.Control.Mode = "basic"
} else {
aux.Control.Mode = "direct"
}
// if c.Control.Mode == controlmode.FAIRMQ {
// aux.Control.Mode = "fairmq"
// } else if c.Control.Mode == controlmode.BASIC {
// aux.Control.Mode = "basic"
// } else if c.Control.Mode == controlmode.KUBECTL {
// aux.Control.Mode = "kubectl"
// } else {
// aux.Control.Mode = "direct"
// }

aux.Control.Mode = c.Control.Mode.String()
> **Collaborator:** There is a change in behaviour for hooks though, no? Before they were getting `direct` instead of `hook`, which actually smells like a bug, but perhaps something is relying on it?
>
> **@justonedev1 (Author, Mar 20, 2026):** This is probably my misunderstanding, as I thought that it is a bug to implicitly change hook to direct, especially when we have a hook task that is created only if `controlmode.HOOK` is present.
> see


return aux, nil
}
76 changes: 76 additions & 0 deletions docs/kubernetes_ecs.md
@@ -0,0 +1,76 @@
# ECS with Kubernetes

> ⚠️ **Warning**
> All Kubernetes work is at the prototype stage.

## Kubernetes Cluster
> **Collaborator:** My notes on what was generally missing here (or I just didn't see it):
> * `make install` to register the task CRD
> * make sure the executor has access to a `~/.kube/config`
> * `kubectl` does not react well to the stfsender env var `http_proxy=""`, which is surrounded by additional quotes in its manifest, making `kubectl apply` fail. Easy fix: `http_proxy=`.
>
> **Collaborator:** Ah, and one has to remember to create an image pull secret for the controller after `make deploy`, because only then does the namespace already exist.
>
> **@justonedev1 (Author, Apr 28, 2026):** I think that some of those comments belong not in kubernetes_ecs.md but in the docs in control-operator. But those are fair comments. The reason why I didn't add them at the time of creation is that I didn't encounter them, because k3s behaved a bit differently (it can automatically read `~/.docker/config.json`, among other things). But I will add those comments to the control-operator docs, maybe with a link pointing there from kubernetes_ecs.md.
>
> **@justonedev1 (Author, Apr 28, 2026):** I don't know what you mean by the http_proxy comment. Can you tell me what exactly I should add and what you did? I don't think I had to do anything about that.
>
> **Collaborator:**
> > I think that some of those comments belong in ...
>
> Yes, very likely; at some point I became a bit lost about which PR is which :D
>
> > I don't know what you mean by the http_proxy comment. ...
>
> I mean this line: https://github.com/AliceO2Group/ControlWorkflows/blob/15b05c6ea90cf61b322a921c9640300ac981ff4e/tasks/stfsender.yaml#L28
> These env vars are somehow passed to the manifest and cause some quoting/escaping issue; I don't remember exactly.


While prototyping we used several Kubernetes distributions, namely [`kind`](https://kind.sigs.k8s.io/), [`minikube`](https://minikube.sigs.k8s.io/docs/) and [`k3s`](https://k3s.io/),
in both local and remote cluster deployments. We used OpenStack for remote deployment.
Follow the guides of the individual distributions to create the desired cluster setup.
`k3s` is recommended for running this prototype, as it is a lightweight,
easily installed distribution which is also [`CNCF`](https://www.cncf.io/training/certification/) certified.

All `k3s` settings were left at their defaults except one: the locked-in-memory size. Use `ulimit -l` to learn
the limit for the current user, and set `LimitMEMLOCK` inside the k3s systemd service config
to the correct value. Right now the `flp` user has unlimited size (`LimitMEMLOCK=infinity`).
This config is necessary because even if you run Pods with the privileged security context
under user `flp`, Kubernetes still sets limits according to its internal settings and doesn't
respect the Linux settings.
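As a sketch (assuming a systemd-managed k3s service; the unit name and drop-in path may differ on your system), a drop-in like the following could set the limit:

```ini
# /etc/systemd/system/k3s.service.d/memlock.conf
# Raise the locked-in-memory limit for the k3s service.
[Service]
LimitMEMLOCK=infinity
```

After creating the drop-in, reload systemd and restart the service (`systemctl daemon-reload`, then `systemctl restart k3s`).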

Another prerequisite we currently expect on the target nodes
is the ability to run Pods with privileged permissions and under the user `flp`.
This means the machine has to have the `flp` user set up the same way as
if you had done the installation with [`o2-flp-setup`](https://alice-flp.docs.cern.ch/Operations/Experts/system-configuration/utils/o2-flp-setup/).

## Task Controller

The following text assumes that a Task Controller from `control-operator` is running
in your K8s cluster and that the Task CRD is installed.
You can find details about its usage in the [documentation](/control-operator/README.md).

## Running tasks (`KubectlTask`)

ECS is set up to run tasks through Mesos on all required hosts bare-metal, with active
task management (see [`ControllableTask`](/executor/executable/controllabletask.go))
and OCC gRPC communication. When running a Docker task through ECS, we could easily
wrap the command to be run in a Docker container with the proper settings
([see](/docs/running_docker.md)). This is, however, not possible for Kubernetes
workloads, as the Pods are "hidden" inside the cluster. So we plan
to deploy our own Task Controller, which will connect to and drive the
OCC state machine of the required tasks. Thus we need to create a custom
proof-of-concept way to communicate with the Kubernetes cluster from the Mesos executor.

The reason why we don't call the Kubernetes cluster directly from the ECS core
is that ECS does a lot of heavy lifting while deploying and monitoring
workloads, and generates a lot of configuration which
is not trivial to replicate manually. However, if we create a class
able to deploy one task into Kubernetes and monitor its
state, we can replicate the `ControllableTask` workflow and leave ECS
mostly intact for now, saving a lot of work and letting us focus on prototyping
the Kubernetes operator pattern.

Thus [`KubectlTask`](/executor/executable/kubectltask.go) was created. This class
is written as a wrapper around the `kubectl` utility to manage the Kubernetes cluster.
It is based on the following `kubectl` commands:

* `apply` => `kubectl apply -f manifest.yaml` - deploys the resource described in the given manifest
* `delete` => `kubectl delete -f manifest.yaml` - deletes the resource from the cluster
* `patch` => `kubectl patch -f exampletask.yaml --type='json' -p='[{"op": "replace", "path": "/spec/state", "value": "running"}]'` - changes the state of a resource inside the cluster
* `get` => `kubectl get -f manifest.yaml -o jsonpath='{.spec.state}'` - queries an exact field of a resource (`state` in the example) inside the cluster.

These four commands allow us to deploy and monitor the status of the deployed
resource without needing to interact with it directly. Note, however, that `KubectlTask`
expects the resource to be the CRD [Task](/control-operator/api/v1alpha1/task_types.go).

In order to activate `KubectlTask` you need to change the YAML template
inside the `ControlWorkflows` directory. Namely:

* add the path to the kubectl manifest as the first argument in the `.command.arguments` field
* change `.control.mode` to either `kubectl_direct` or `kubectl_fairmq`

You can find a working template in `control-operator/ecs-manifests/control-workflows/*-kube.yaml`.

Working kubectl manifests can be found in `control-operator/ecs-manifests/kubernetes-manifests`.
See `*test.yaml` for concrete manifests deployable with `kubectl apply`; the rest
are templates with variables to be filled in, in a `${var}` format. `KubectlTask`
fills these variables from env vars.